Process API

ezPAARSE is a RESTful (Representational State Transfer) application.

Sending logs

The main route for ezPAARSE is the root of the web service. The GET method returns the log submission form, and the POST method submits logs. Submitted files are parsed, and the resulting access events are streamed back in the response.

| PATH | GET | POST |
|------|-----|------|
| / | Submission form | Parses a log file |

The detailed POST request

POST / HTTP/1.1

Parameters (headers)

Parameters are passed as HTTP headers. See the parameters list.

Body

Log lines generated by a proxy server.

See the EZProxy documentation and the Squid documentation.

Response to a POST request

Status code

  • 200 OK: the logs have been successfully processed.
  • 400 Bad Request: a request element makes the processing of logs impossible.
  • 406 Not Acceptable: encoding or output format not supported.

Headers

  • Job-ID: unique identifier associated with the current processing job.
  • Job-Report: URL of the detailed processing report, including all of the headers sent by ezPAARSE.
  • ezPAARSE-Status: return code, set if an error is raised.
  • ezPAARSE-Status-Message: message explaining the return code.
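A client can combine the HTTP status code with the two ezPAARSE-Status headers to report what happened. The following is a minimal sketch in Python, not part of ezPAARSE itself: the function name `describe_outcome` is hypothetical, and it assumes the two status headers may be missing from a failed response.

```python
def describe_outcome(status_code, headers):
    """Summarize a response from its HTTP status and ezPAARSE-Status headers.

    `headers` is any mapping with a .get() method. A plain dict must use the
    exact header names; http.client.HTTPMessage matches case-insensitively.
    """
    if status_code == 200:
        return "ok"
    # Assumption: the ezPAARSE-Status headers may be absent, so use fallbacks.
    code = headers.get("ezPAARSE-Status", "unknown")
    message = headers.get("ezPAARSE-Status-Message", "no message provided")
    return f"HTTP {status_code}: ezPAARSE status {code} ({message})"
```

For instance, `describe_outcome(406, {})` yields a message built from the fallback values. The real return codes and their meanings can be obtained from the /info/codes route.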

Headers containing the URLs for accessing the logs:

  • Job-Traces: traces for the current ezPAARSE job (the verbosity level can be modified with the Traces-Level header).
  • Lines-Unknown-Formats: lines for which the format has not been recognized.
  • Lines-Ignored-Domains: lines for which the domain is ignored.
  • Lines-Unknown-Domains: lines for which the domain is not associated with a parser.
  • Lines-Unqualified-ECs: lines that generated access events with too little information. [(More details)](../features/qualification.html)
  • Lines-PKB-Miss-ECs: lines that generated identifiers that can’t be found in the PKB for the corresponding platform.
  • Lines-Duplicate-ECs: lines filtered out by the double-click detection algorithm.
  • Lines-Unordered-ECs: lines rejected because they were not chronologically ordered.
  • Lines-Robots-ECs: lines generated by non-human agents (robots, crawlers, spiders, etc.).
  • Lines-Ignored-Hosts: lines that were filtered based on their IP address.
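Since each of these headers carries a URL, a client may want to gather them into one structure once the response headers have arrived. The sketch below is one possible way to do that in Python. The header names come from the lists above; the function name `collect_job_links` and the sample values in the usage note are made up for illustration.

```python
# Header names as listed in this section.
JOB_HEADERS = ("Job-ID", "Job-Report")
LOG_HEADERS = (
    "Job-Traces",
    "Lines-Unknown-Formats",
    "Lines-Ignored-Domains",
    "Lines-Unknown-Domains",
    "Lines-Unqualified-ECs",
    "Lines-PKB-Miss-ECs",
    "Lines-Duplicate-ECs",
    "Lines-Unordered-ECs",
    "Lines-Robots-ECs",
    "Lines-Ignored-Hosts",
)


def collect_job_links(headers):
    """Return the job-related headers found in `headers`, keyed by name.

    HTTP header names are case-insensitive, so the lookup is done on
    lower-cased names. Headers that are missing or empty are skipped.
    """
    lowered = {name.lower(): value for name, value in headers.items()}
    links = {}
    for name in JOB_HEADERS + LOG_HEADERS:
        value = lowered.get(name.lower())
        if value:
            links[name] = value
    return links
```

Called with `{"job-id": "abc", "Content-Type": "text/csv"}`, the function keeps only the `Job-ID` entry and ignores headers that are unrelated to the job.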

Body

CSV or JSON containing all of the generated access events.

Access event example:

{
  "host": "1234567d6b8dd5dddc87939c4a407987",
  "login": "IDEXEMPLE",
  "date": "2011-12-31T10:42:42+01:00",
  "url": "http://www.une-adresse.com/exemple.php?id=16",
  "status": "200",
  "size": "0",
  "domain": "www.une-adresse.com",
  "type": "PDF",
  "issn": "1111-1111"
}
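When JSON output is requested, the events can be processed with any JSON library. The Python sketch below counts events per field value. It assumes the response body is a single JSON array of objects shaped like the example above, which this page does not state explicitly; the function name `count_by_field` is hypothetical.

```python
import json
from collections import Counter


def count_by_field(json_body, field="domain"):
    """Count access events by the value of one field.

    Assumption: `json_body` is a JSON array of access-event objects.
    Events lacking the field are counted under the empty string.
    """
    events = json.loads(json_body)
    return Counter(event.get(field, "") for event in events)
```

For example, `count_by_field(body, field="type")` would give the number of events per resource type, under the same assumption about the body.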

Request examples

curl -X POST http://127.0.0.1:59599 --no-buffer --data-binary @file.log -v
curl -X POST --proxy "" --no-buffer --data-binary @test/dataset/sd.2012-11-30.log  http://127.0.0.1:59599 -v
curl -X POST --proxy "" --no-buffer -H "Accept: application/json" --data-binary @test/dataset/sd.2012-11-30.log  http://127.0.0.1:59599 -v
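The same request can be sent from a script. Here is a rough Python equivalent of the last curl command, using only the standard library. The function names are illustrative, and the default address is the one used in the curl examples. Unlike curl with `--no-buffer`, this sketch reads the whole file and the whole response into memory, so it is only suitable for small log files.

```python
import urllib.request


def build_log_request(log_bytes, base_url="http://127.0.0.1:59599",
                      accept="application/json"):
    """Build (without sending) a POST request carrying raw log lines."""
    return urllib.request.Request(
        base_url.rstrip("/") + "/",
        data=log_bytes,
        method="POST",
        headers={"Accept": accept},
    )


def submit_logs(path, **options):
    """Send a log file to ezPAARSE and return (status, headers, body)."""
    with open(path, "rb") as log_file:
        request = build_log_request(log_file.read(), **options)
    # An empty ProxyHandler ignores proxy settings, like curl's --proxy "".
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request) as response:
        return response.status, dict(response.headers), response.read()
```

Note that `urllib` raises `urllib.error.HTTPError` for 4xx responses, so the 400 and 406 cases have to be handled with a `try`/`except` around `submit_logs`.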

Access the traces and rejects

While ezPAARSE processes a request (a job), it generates files that describe its activity. They can be accessed using the unique identifier assigned to the job.

| PATH | Information given |
|------|-------------------|
| /{jobID}/job-traces.log | Traces of the internal process. Only of interest when something has gone wrong. |
| /{jobID}/job-report.(json\|html) | Report aggregating data on the job: number of rejected lines, reject rate, date and duration of the job, etc. Use /{jobID}/job-report.html?standalone=1 to generate a standalone HTML report. |
| /{jobID}/lines-unknown-formats.log | Lines for which the format was not recognized because they do not match the format given in the input parameters. |
| /{jobID}/lines-ignored-domains.log | Lines for which the domain is ignored. |
| /{jobID}/lines-unknown-domains.log | Lines for which the domain is not associated with a parser. |
| /{jobID}/lines-unqualified-ecs.log | Lines that generated access events with too little information. [(More details)](../features/qualification.html) |
| /{jobID}/lines-pkb-miss-ecs.log | Lines that generated identifiers that can't be found in the PKB for the corresponding platform. |
| /{jobID}/lines-duplicate-ecs.log | Lines filtered out by the COUNTER double-click detection algorithm. |
| /{jobID}/lines-unordered-ecs.log | Lines rejected because they were not chronologically ordered. |
| /{jobID}/lines-robots-ecs.log | Lines generated by non-human agents (robots, crawlers, spiders, etc.). |
| /{jobID}/lines-ignored-hosts.log | Lines that were filtered based on their IP address. |

  • jobID: unique identifier assigned to the job.
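The paths above can be assembled from a job identifier as in the following sketch. The file names are those listed in this section; the function name `job_file_url`, the default address (taken from the curl examples) and the job identifier in the usage note are placeholders.

```python
# File names available under /{jobID}/, as listed in this section.
JOB_FILES = (
    "job-traces.log",
    "job-report.json",
    "job-report.html",
    "lines-unknown-formats.log",
    "lines-ignored-domains.log",
    "lines-unknown-domains.log",
    "lines-unqualified-ecs.log",
    "lines-pkb-miss-ecs.log",
    "lines-duplicate-ecs.log",
    "lines-unordered-ecs.log",
    "lines-robots-ecs.log",
    "lines-ignored-hosts.log",
)


def job_file_url(job_id, filename, base_url="http://127.0.0.1:59599",
                 standalone=False):
    """Build the URL of one file generated for a job."""
    if filename not in JOB_FILES:
        raise ValueError(f"unknown job file: {filename}")
    url = f"{base_url.rstrip('/')}/{job_id}/{filename}"
    # The standalone option only applies to the HTML report.
    if standalone and filename == "job-report.html":
        url += "?standalone=1"
    return url
```

With a made-up identifier, `job_file_url("abc123", "job-report.html", standalone=True)` returns `http://127.0.0.1:59599/abc123/job-report.html?standalone=1`.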

General information

These routes provide general information, such as the list of platforms or the types of access events. They respond only to the GET method.

| PATH | Information given |
|------|-------------------|
| /info/platforms | Lists the available platforms |
| /info/rid | Lists the resource identifiers |
| /info/rtype | Lists the resource types |
| /info/mime | Lists the resource formats (or MIME types) |
| /info/codes | Lists the application's return codes and their meaning |
| /info/codes/{code} | Returns the meaning of one return code |
| /info/form-predefined | Lists the predefined parameters for the advanced options in the form |
| /info/usage | General usage statistics |
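As an illustration, the information routes can be addressed with a small helper like the one below. It only builds URLs and makes no assumption about the response format, which is not described here. The name `info_url` and the default address are placeholders.

```python
# Route names available under /info/, as listed in this section.
INFO_ROUTES = ("platforms", "rid", "rtype", "mime", "codes",
               "form-predefined", "usage")


def info_url(route, code=None, base_url="http://127.0.0.1:59599"):
    """Build the URL of an information route, to be queried with GET."""
    if route not in INFO_ROUTES:
        raise ValueError(f"unknown info route: {route}")
    # Only /info/codes accepts a return code as an extra path segment.
    if code is not None and route != "codes":
        raise ValueError("a code can only be given for the 'codes' route")
    url = f"{base_url.rstrip('/')}/info/{route}"
    if code is not None:
        url += f"/{code}"
    return url
```

The resulting URL can then be fetched with any HTTP client, for example `urllib.request.urlopen(info_url("platforms"))`.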

Administration

These routes are used to administer ezPAARSE. Most of them can also be used through the application’s admin page. They require authentication, except for /register.

| PATH | Method | Usage |
|------|--------|-------|
| /register | POST | Creates the first account as administrator. Fails if one or more users already exist. Parameters: username, password |
| /platforms/status | GET | Reports on the platforms' state. Returns: uptodate or outdated |
| /platforms/status | PUT | Updates the platforms. The body must contain uptodate |
| /users | GET | Returns the list of local users |
| /users/ | POST | Creates a local user. Parameters: username, password |
| /users/{username} | DELETE | Deletes a local user |
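To show what a call to one of these routes might look like, the sketch below builds the PUT request that updates the platforms. This page does not say which authentication scheme is used, so HTTP Basic authentication is an assumption here, as are the function name and the default address. The request is built but not sent.

```python
import base64
import urllib.request


def build_platform_update_request(username, password,
                                  base_url="http://127.0.0.1:59599"):
    """Build a PUT /platforms/status request whose body is 'uptodate'."""
    # Assumption: the server accepts HTTP Basic credentials.
    credentials = base64.b64encode(
        f"{username}:{password}".encode("utf-8")
    ).decode("ascii")
    return urllib.request.Request(
        base_url.rstrip("/") + "/platforms/status",
        data=b"uptodate",
        method="PUT",
        headers={"Authorization": "Basic " + credentials},
    )
```

The request object can be passed to `urllib.request.urlopen` once the authentication scheme has been checked against the running instance.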