Job reports

ezPAARSE generates an execution report every time it processes a log file. The various sections of this report are documented below.

  • General: contains general information related to the processing
  • Rejects: lists all rejects, how many there are, and the links to the files containing the rejected lines
  • Statistics: provides the first global figures
  • Alerts: lists the active alerts
  • Notifications: lists the email addresses of the recipients of processing notifications
  • Deduplicating: settings of the algorithm used for deduplication
  • Files: lists the processed log files
  • First consultation event: content of the first access event

There is also a special file called domains.miss.csv, located at the root of the /ezpaarse directory, where unknown domains are stored (deduplicated and sorted). This file persists between processing jobs. See below for details.

General

Job-Date 2014-06-16T14:55:04+02:00
Processing date
Job-Done true
Has the processing correctly completed?
Job-Duration 4 m 22 s
Processing duration
Job-ID 6f601540-f555-11e3-b477-758199fa5dc1
Unique Identifier for the processing
Rejection-Rate 96.74 %
Rate of rejected lines (i.e. unknown domains, duplicates, etc.) among the relevant lines
URL-Traces http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/job-traces.log
Access to the execution traces for the processing
client-user-agent Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/33.0.1750.152 Chrome/33.0.1750.152 Safari/537.36
ezPAARSE-version ezPAARSE 2.3.0
geolocalization all
Requested geo-location fields
git-branch master
git-last-commit 429e61bf29e80326b09958b0a68a01c0ae3add91
git-tag 1.7.0
input-first-line rate-limited-proxy-72-14-199-16.google.com - - [19/Nov/2013:00:11:05 +0100] "GET http://gate1.inist.fr:50162/login?url=http://www.nature.com/rss/feed?doi=10.1038/465529d HTTP/1.1" 302 0
First log line found in a submitted log file
input-format-literal %h %l %u %t "%r" %s %b (ezproxy)
Format used to identify the elements found in a log file
input-format-regex ^([a-zA-Z0-9\.\-]+(?:, ?[a-zA-Z0-9\.\-]+)*) ([a-zA-Z0-9\-]+|\-) ([a-zA-Z0-9@\.\-_%,=]+) \[([^\]]+)\] "[A-Z]+ ([^ ]+) [^ ]+" ([0-9]+) ([0-9]+)$
Regular expression corresponding to the given format for log lines
nb-denied-ecs 104
Number of denied consultation events (access to non-subscribed resources)
nb-ecs 14224
Total number of consultation events found in the log file
nb-lines-input 792049
Number of log lines found in the file given as input
on-campus-accesses 6549
Total number of on-campus consultation events
process-speed 3019 lignes/s
Processing speed
enhancement-errors 0
Number of consultation events that could not be enriched because of MongoDB errors
result-file-ecs http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1
URL for accessing the result file
url-denied-ecs http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/denied-ecs.csv
URL for accessing the file containing denied consultations (for non-subscribed resources)
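
The relationship between input-format-regex and input-first-line can be checked by hand. The following is a minimal Python sketch, not part of ezPAARSE: it applies the regular expression from the sample report to the sample first line. The field labels in FIELD_NAMES are hypothetical names chosen here to follow the order of the literal format (%h %l %u %t "%r" %s %b); ezPAARSE may name them differently.

```python
import re

# Regular expression copied from input-format-regex in the sample report.
INPUT_FORMAT_REGEX = (
    r'^([a-zA-Z0-9\.\-]+(?:, ?[a-zA-Z0-9\.\-]+)*) ([a-zA-Z0-9\-]+|\-) '
    r'([a-zA-Z0-9@\.\-_%,=]+) \[([^\]]+)\] "[A-Z]+ ([^ ]+) [^ ]+" '
    r'([0-9]+) ([0-9]+)$'
)

# Line copied from input-first-line in the sample report.
FIRST_LINE = (
    'rate-limited-proxy-72-14-199-16.google.com - - '
    '[19/Nov/2013:00:11:05 +0100] '
    '"GET http://gate1.inist.fr:50162/login?url='
    'http://www.nature.com/rss/feed?doi=10.1038/465529d HTTP/1.1" 302 0'
)

# Hypothetical labels, one per capture group, in the order of the format.
FIELD_NAMES = ("host", "identd", "login", "datetime", "url", "status", "size")


def parse_line(line):
    """Return the captured fields as a dict, or None if the line does not match."""
    match = re.match(INPUT_FORMAT_REGEX, line)
    if match is None:
        return None
    return dict(zip(FIELD_NAMES, match.groups()))


fields = parse_line(FIRST_LINE)
for name in FIELD_NAMES:
    print(f"{name:9s} {fields[name]}")
```

A line for which parse_line returns None plays the role of a line with an unknown format in this sketch.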

Rejects

nb-lines-duplicate-ecs 1893
Number of deduplicated access events (following the COUNTER algorithm)
nb-lines-ignored 351891
Number of ignored lines (not relevant)
nb-lines-ignored-domains 4
Number of lines for which the domain has been ignored (i.e. declared in EZPAARSE_IGNORED_DOMAINS)
nb-lines-pkb-miss-ecs 2107
Number of lines with an unknown vendor identifier
nb-lines-unknown-domains 335068
Number of lines with an unknown domain
nb-lines-unknown-formats 1891
Number of lines with an unknown format
nb-lines-unordered-ecs 0
Number of lines that are out of chronological order (chronological order is required for deduplication)
nb-lines-unqualified-ecs 86974
Number of unqualified lines (because they don't contain enough information)
url-duplicate-ecs http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/lines-duplicate-ecs.log
URL to the file containing the deduplicated lines
url-ignored-domains http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/lines-ignored-domains.log
URL to the file containing the lines with an ignored domain
url-pkb-miss-ecs http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/lines-pkb-miss-ecs.log
URL to the file containing the lines with an unknown vendor identifier
url-unknown-domains http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/lines-unknown-domains.log
URL to the file containing the lines with an unknown domain (i.e. no parser has been triggered by ezPAARSE)
url-unknown-formats http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/lines-unknown-formats.log
URL to the file containing the lines with an unknown format
url-unordered-ecs
URL to the file containing the lines with a chronological anomaly
url-unqualified-ecs http://localhost:59599/6f601540-f555-11e3-b477-758199fa5dc1/lines-unqualified-ecs.log
URL to the file containing the lines with too little information
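
To see at a glance which kind of reject dominates a job, the counters above can be ranked by size. This is a small Python sketch using the figures of the sample report; the dictionary layout and the rank_counters helper are illustrative assumptions, not an ezPAARSE API. Shares are expressed relative to nb-lines-input and are not meant to reproduce the Rejection-Rate value.

```python
# Counters copied from the Rejects section of the sample report.
REJECT_COUNTERS = {
    "nb-lines-duplicate-ecs": 1893,
    "nb-lines-ignored": 351891,
    "nb-lines-ignored-domains": 4,
    "nb-lines-pkb-miss-ecs": 2107,
    "nb-lines-unknown-domains": 335068,
    "nb-lines-unknown-formats": 1891,
    "nb-lines-unordered-ecs": 0,
    "nb-lines-unqualified-ecs": 86974,
}

# nb-lines-input, copied from the General section of the sample report.
NB_LINES_INPUT = 792049


def rank_counters(counters, total_lines):
    """Return (name, count, percentage of total_lines), largest count first."""
    ranked = [
        (name, count, 100.0 * count / total_lines)
        for name, count in counters.items()
    ]
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked


ranked = rank_counters(REJECT_COUNTERS, NB_LINES_INPUT)
for name, count, share in ranked:
    print(f"{name:26s} {count:8d} {share:6.2f} %")
```

In this sample, ignored lines and lines with an unknown domain account for most of the input, which is consistent with the alert on unknown domains reported further down.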

Statistics

mime-HTML 4540
Number of access events for each of the main mime-types (names prefixed with mime-)
mime-MISC 3612
mime-PDF 6072
platform-acs 538
Number of access events for each recognized platform (names made of the prefix platform- followed by the platform's short name)
platform-ar 97
platform-bioone 15
platform-bmc 75
platform-cup 22
platform-edp 27
platform-hw 1740
platform-jstor 9
platform-mal 97
platform-metapress 27
platform-npg 3132
platform-sd 5255
platform-springer 1675
platform-wiley 1515
platforms 14
Number of distinct platforms recognized during the processing
rtype-ABS 1142
Number of access events for each of the main resource types (names prefixed with rtype-)
rtype-ARTICLE 9991
rtype-BOOK 218
rtype-BOOKSERIE 23
rtype-BOOK_SECTION 314
rtype-TOC 2536
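
The three families of counters above can be cross-checked against each other. The Python sketch below uses the figures of the sample report and a hypothetical sum_by_prefix helper. In this sample, each family adds up to nb-ecs (14224) and the number of platform- counters equals platforms (14); this is an observation about the sample, not a documented guarantee.

```python
# Counters copied from the Statistics section of the sample report.
STATS = {
    "mime-HTML": 4540, "mime-MISC": 3612, "mime-PDF": 6072,
    "platform-acs": 538, "platform-ar": 97, "platform-bioone": 15,
    "platform-bmc": 75, "platform-cup": 22, "platform-edp": 27,
    "platform-hw": 1740, "platform-jstor": 9, "platform-mal": 97,
    "platform-metapress": 27, "platform-npg": 3132, "platform-sd": 5255,
    "platform-springer": 1675, "platform-wiley": 1515,
    "platforms": 14,
    "rtype-ABS": 1142, "rtype-ARTICLE": 9991, "rtype-BOOK": 218,
    "rtype-BOOKSERIE": 23, "rtype-BOOK_SECTION": 314, "rtype-TOC": 2536,
}

# nb-ecs, copied from the General section of the sample report.
NB_ECS = 14224


def sum_by_prefix(counters, prefix):
    """Sum the counters whose name starts with the given prefix."""
    return sum(v for k, v in counters.items() if k.startswith(prefix))


def count_by_prefix(counters, prefix):
    """Count the counters whose name starts with the given prefix."""
    return sum(1 for k in counters if k.startswith(prefix))


# The trailing hyphen keeps the "platforms" counter out of the platform- family.
for prefix in ("mime-", "platform-", "rtype-"):
    total = sum_by_prefix(STATS, prefix)
    print(f"{prefix:10s} total={total} matches nb-ecs: {total == NB_ECS}")

print("distinct platforms:", count_by_prefix(STATS, "platform-"))
```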

Alerts

active-alerts unknown-domains
List of alerts that can be thrown
alert-1 www.ncbi.nlm.nih.gov is unknown but represents 64% of the log lines
Alert content

Notifications

mailto someone@somewhere.com
Recipient(s) of the mail sent at the end of the processing
mail-status success
Status of the mail sending.

Deduplicating

activated true
fieldname-C session
fieldname-I host
fieldname-L login
strategy CLI
window-html 10
Number of seconds used for the deduplication timeframe of HTML consultations (i.e. consultations of a resource with the same ID within that timeframe are grouped together in a single event, cf. COUNTER)
window-misc 30
Number of seconds used for the deduplication timeframe of MISC consultations
window-pdf 30
Number of seconds used for the deduplication timeframe of PDF consultations
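
The way these settings interact can be illustrated with a simplified Python sketch. It is not the ezPAARSE implementation and rests on several assumptions: the strategy string is read as a priority order (C, then L, then I) for choosing the field that identifies the user; events arrive in chronological order; the timeframe is measured from the previous consultation of the same resource by the same user; and the first event of each group is the one that is kept. The event dictionaries and the helper names are hypothetical.

```python
# Settings copied from the Deduplicating section of the sample report.
WINDOWS = {"HTML": 10, "MISC": 30, "PDF": 30}            # seconds, per mime
FIELDNAMES = {"C": "session", "L": "login", "I": "host"}
STRATEGY = "CLI"


def user_key(event):
    """Assumption: use the first field, in strategy order, that has a value."""
    for letter in STRATEGY:
        value = event.get(FIELDNAMES[letter])
        if value:
            return (letter, value)
    return None


def deduplicate(events):
    """Drop events repeating the same (user, unitid, mime) within the window.

    Events are assumed to be sorted by timestamp (seconds).
    """
    last_seen = {}
    kept = []
    for event in events:
        user = user_key(event)
        window = WINDOWS.get(event["mime"])
        if user is None or window is None:
            kept.append(event)          # nothing to compare against: keep it
            continue
        key = (user, event["unitid"], event["mime"])
        previous = last_seen.get(key)
        last_seen[key] = event["timestamp"]
        if previous is not None and event["timestamp"] - previous <= window:
            continue                    # treated as a duplicate in this sketch
        kept.append(event)
    return kept


events = [
    {"timestamp": 100, "login": "alice", "host": "h1", "unitid": "a1", "mime": "PDF"},
    {"timestamp": 105, "login": "bob", "host": "h2", "unitid": "a1", "mime": "PDF"},
    {"timestamp": 120, "login": "alice", "host": "h1", "unitid": "a1", "mime": "PDF"},
    {"timestamp": 160, "login": "alice", "host": "h1", "unitid": "a1", "mime": "PDF"},
]
kept = deduplicate(events)
print([e["timestamp"] for e in kept])
```

In this example the second consultation by alice falls within 30 seconds of her first one and is dropped, while the consultation by bob and the one 40 seconds later are kept.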

Files

1 fede.bibliovie.ezproxy.2013.11.19.log.gz

First consultation event

date 2013-11-19
datetime 2013-11-19T00:11:57+01:00
domain www.nature.com
geoip-addr
GeoIP Address extracted from the IP address of the consulting host
geoip-city
City, extracted from the IP address of the consulting host
geoip-coordinates
Coordinates (longitude and latitude) extracted from the IP address of the consulting host
geoip-country
Country code extracted from the IP address of the consulting host
geoip-family
geoip-host
GeoIP Host extracted from the IP address of the consulting host
geoip-latitude
geoip-longitude
geoip-region
host test.proxad.net
Original host
login MYLOGIN
Login used for accessing the resource
mime MISC
Mime-type of the resource, as recognized by the parser
platform npg
Short name for the consulted platform (ie name of the parser used to analyse the resource's URL)
rtype TOC
Resource type for the consulted resource, as recognized by the parser
size 40054
HTTP Request size
status 200
HTTP code sent by the server when the resource is accessed
timestamp 1384816317
title_id siteindex
Vendor identifier, as determined by the parser
unitid siteindex
Unique identifier for the resource, as determined by the parser (used for deduplicating identical resources)
url http://www.nature.com:80/siteindex/index.html
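
The date, datetime and timestamp fields describe the same instant. Assuming that timestamp is a Unix time in seconds (which is consistent with the sample values), the following Python sketch converts it back to the datetime shown above; the timestamp_to_iso helper is illustrative, and the +01:00 offset is taken from the sample datetime.

```python
from datetime import datetime, timedelta, timezone


def timestamp_to_iso(timestamp, utc_offset_hours):
    """Format a Unix timestamp (seconds) as ISO 8601 in a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(timestamp, tz=tz).isoformat()


# Values copied from the sample event.
iso = timestamp_to_iso(1384816317, 1)
print(iso)        # 2013-11-19T00:11:57+01:00
print(iso[:10])   # 2013-11-19
```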

Unknown Domains

The domains.miss.csv file persists between processing jobs. It stores the unknown domains (i.e. domains for which no parser is triggered), deduplicated and sorted. If URLs present in that file belong to a provider's platform that should be analysed by ezPAARSE, check the Analogist platform analysis website to see whether the platform is already listed; the site also indicates how advanced its analysis is.