Bot Filtering & Irrelevant Domains

ezPAARSE uses various filters to reduce the noise generated by the many irrelevant log lines present in a typical logfile, which can represent up to 80-90% of the lines.

Excluding Robot Accesses

By default, a list of IP addresses is used to recognize and exclude accesses made by robots (spiders, indexing robots, etc.).

The log lines generated by such accesses are thus rejected.

To extend the robots list, create a file and place it in the exclusions folder. Its name must start with robots., and it must contain one IP address per line.
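As a sketch (the exclusions path and the filename suffix custom.txt are illustrative, and the IP addresses are placeholders), such a file could be created like this:

```shell
# Create a custom robots exclusion file; the name must start with "robots."
mkdir -p exclusions
cat > exclusions/robots.custom.txt <<'EOF'
192.168.12.34
10.0.0.42
EOF
```

Any number of such robots.* files can coexist alongside the ones shipped by default.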

Excluding Arbitrary IP Addresses

It is also possible to filter out specific IP addresses. The typical use case is a training session whose accesses you don't want to count: you declare the IP addresses of the computers used during the session. The matched lines are rejected in a file separate from the robots rejects.

To extend this list, create a file in the exclusions folder whose name starts with hosts., containing one IP address per line.
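For example (the suffix training.txt and the IP addresses below are placeholders), the workstations used during a training session could be declared as follows:

```shell
# Exclude specific workstations; the filename must start with "hosts."
mkdir -p exclusions
printf '%s\n' 192.168.1.101 192.168.1.102 192.168.1.103 > exclusions/hosts.training.txt
```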

Irrelevant Domains

Accesses to proxied domains that are of no interest to you can be ignored by ezPAARSE. Because they are not relevant to the electronic resource usage you need to trace, they are simply filtered out and counted as ignored. This helps reduce the rate of rejected lines.

To extend this list of irrelevant domains, create a file in the exclusions folder and name it with the domains. prefix. It must contain one domain per line.
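As an illustration (the suffix local.txt and the domain names are placeholders), a custom domains file could look like this:

```shell
# Ignore accesses to domains that are not electronic resources;
# the filename must start with "domains." and list one domain per line.
mkdir -p exclusions
printf '%s\n' intranet.example.org cdn.example.net > exclusions/domains.local.txt
```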

By default, three exclusion files are already provided:

  • domains.default.txt: containing a list of domains related to Google
  • domains.cdn.txt: containing a list of Content Delivery Networks (CDN) subdomains
  • domains.static.txt: containing a list of subdomains serving static resources (mostly images)