Bot Filtering & Irrelevant Domains
ezPAARSE applies several filters to reduce the noise caused by irrelevant log lines, which can make up 80-90% of the lines in a typical log file.
Excluding the Robots’ Accesses
By default, ezPAARSE uses a list of IP addresses to recognize accesses made by robots (spiders, indexing robots, etc.).
The log lines generated by such accesses are rejected.
To extend the robots list, create a file in the exclusions folder. Its name must start with robots. and the file must contain one IP address per line.
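As a minimal sketch of preparing such a file, the Python helper below validates each address and writes one per line. The function name write_exclusion_file, the file name robots.local.txt, and the relative path to the exclusions folder are illustrative assumptions, not part of ezPAARSE; only the robots. prefix and the one-address-per-line layout come from the description above.

```python
import ipaddress
from pathlib import Path


def write_exclusion_file(folder, name, addresses):
    """Write one validated IP address per line to folder/name and return the path."""
    lines = []
    for raw in addresses:
        # ipaddress.ip_address raises ValueError for anything that is not a valid IP
        lines.append(str(ipaddress.ip_address(raw.strip())))
    path = Path(folder) / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

For example, write_exclusion_file("exclusions", "robots.local.txt", ["192.0.2.10", "198.51.100.7"]) would produce a two-line file. Validating up front is a design choice of this sketch: a malformed address fails loudly here rather than sitting unnoticed in the list.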
Excluding Arbitrary IP Addresses
It is also possible to filter out specific IP addresses. A typical use case is a training session whose accesses you do not want to count: you declare the IP addresses of the computers used during the session. The matching lines are rejected into a separate file from the robot rejects.
To extend this list, create a file in the exclusions folder. Its name must start with hosts. and the file must contain one IP address per line.
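The distinction between the two kinds of rejects can be pictured with the following Python sketch. It is not ezPAARSE's implementation: the helper names load_entries and classify_host, the bucket labels, and the choice to check the robots list before the hosts list are all assumptions made for illustration.

```python
from pathlib import Path


def load_entries(folder, prefix):
    """Collect non-empty lines from every file in folder whose name starts with prefix."""
    entries = set()
    for path in sorted(Path(folder).glob(prefix + "*")):
        if path.is_file():
            for line in path.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    entries.add(line.strip())
    return entries


def classify_host(ip, robots, hosts):
    """Return the bucket for an access: 'robot', 'excluded-host', or 'kept'."""
    if ip in robots:
        return "robot"          # would go to the robots reject file
    if ip in hosts:
        return "excluded-host"  # would go to the separate hosts reject file
    return "kept"
```

With robots = load_entries("exclusions", "robots.") and hosts = load_entries("exclusions", "hosts."), each log line's client address falls into exactly one bucket, which mirrors the idea that the two reject files stay separate.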
Irrelevant Domains
ezPAARSE can ignore accesses to proxied domains that are of no interest to you. Because they are not relevant to the electronic resource usage you need to trace, they are simply filtered out and counted as ignored. This can help lower the rate of rejected lines.
To extend this list of irrelevant domains, create a file in the exclusions folder and give it a name starting with the domains. prefix. It must contain one domain per line.
By default, ezPAARSE already ships with three exclusion files:
domains.default.txt: containing a list of domains related to Google
domains.cdn.txt: containing a list of Content Delivery Networks (CDN) subdomains
domains.static.txt: containing a list of subdomains serving static resources (mostly images)
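To illustrate how a domain list of this kind might be applied, here is a hedged Python sketch. The function names parse_domain_list and count_ignored are hypothetical, and the sketch assumes exact, case-insensitive host matching; how ezPAARSE actually compares domains (for instance, whether subdomains match) is not stated here.

```python
from urllib.parse import urlsplit


def parse_domain_list(text):
    """Return the set of domains in an exclusion file body, one domain per line."""
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def count_ignored(urls, ignored_domains):
    """Return the URLs to keep and the number ignored because their host is listed."""
    kept, ignored = [], 0
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in ignored_domains:
            ignored += 1  # filtered out and counted as ignored, not as rejected
        else:
            kept.append(url)
    return kept, ignored
```

Keeping a separate counter for ignored accesses reflects the point made above: ignored lines are accounted for on their own rather than inflating the rejected total.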