Platforms

Prerequisites for the development: git(hub) user’s guide

As a developer, you first have to sign up to Github to be able to contribute to ezPAARSE and:

  • write a new parser
  • maintain an existing parser
  • contribute to improvements in the core of ezPAARSE’s code

You then need to know a few git commands:

#get a local version of the github
git clone https://github.com/ezpaarse-project/ezpaarse.git

#if you already had the project, update it
cd ezpaarse/
git pull

#edit a file
cd ezpaarse/
echo "// my modification" >> ./app.js

#get an overview of which files have been modified/added/deleted (before a commit)
git status

#compare the local modifications with the local repository, line by line, before saving the modifications
git diff

#send the modifications to you local repository
git commit ./app.js -m "a comment explaining my modification""

#display the list of commits
git log

#add a new file
touch ./myexamplefile
git add ./myexamplefile
git commit ./myexamplefile -m "add an example file"

#send the (commmitted) modifications to the distant repository (authorization from the distant repo needed).
git push

Note: unless you have a privileged access (from the ezPAARSE team), you must first “fork” the ezPAARSE github repository in order to work on that copy. Once you are satisfied with your changes, you can submit your work to the team by sending a “pull request”. Your job will be reviewed and integrated by the team, if no problem is detected. The team then provides write access rights to regular contributors, in order to facilitate contributions.

How does a parser work?

A parser takes the form of an executable parser.js file accompanied by a description file manifest.json and a validation structure (contained in the test directory, see below).

The executable can take two kinds of input:

  • a stream of URLs to analyze (one per line)
  • or stream of JSON objects (one per line) with the --json option, each containing:
    • the URL to analyze
    • other information to qualify the access event (such as the size of the download)

The parser outputs a stream of JSON objects, one per line and for each analyzed URL. The object may be empty if the parser doesn’t find anything.

Its usage is documented when you call it with the --help option. An example parser is available.

Usage Examples

echo "http://www.sciencedirect.com:80/science/bookseries/00652296" | ./parser.js
#{"unitid":"00652296","print_identifier":"0065-2296","title_id":"00652296","rtype":"BOOKSERIE","mime":"MISC"}

echo '{ "url": "http://www.sciencedirect.com:80/science/bookseries/00652296", "status": 200 }' | ./parser.js --json
#{"unitid":"00652296","print_identifier":"0065-2296","title_id":"00652296","rtype":"BOOKSERIE","mime":"MISC"}

Writing a Parser

Parsers are written in Javascript. A good knowledge of the language is not really necessary to write a parser. Most of the code being outsourced in a common file for all parsers, only the URL analysis function must be adapted, making the code short and relatively simple. Most parsers still require a basic knowledge of regular expressions.

Writing a new parser thus consists of:

  • creating the manifest.json file (see below)
  • creating the test file, according to what is documented on the plaftform analysis page
  • creating the parser so that its output is conform to the test file (see below)
  • launch the validation tests (see below)

Once the parser passes the tests, it can be integrated into the github repo.

A skeleton can be used as a starting point. The directory structure and the files can be automatically generated by launching the platform-init command.

Some tools are available online to help you in the the writing of regular expressions.

A detailed procedure is available on AnalogIST.

Testing Parsers

Each parser comes with the necessary to be tested: one or more files located in the test subdirectory. These files are in CSV format and follow the platform.version.csv pattern.

The test principle is represented by the following diagram: Parsers' test

For each row, values from columns prefixed with in- are sent to the parser, and the result is compared with the values coming from columns prefixed with out-. These must be strictly identical.

More details about the identifiers returned by the parsers are available on this page.

When the parser only takes an input URL (ie. no other fields prefixed with in-), it is possible to manually run the test file with the following command (from the directory the platform):

#platform.version.csv is the test file
cat test/platform.version.csv | ../../bin/csvextractor --fields="in-url" -c --noheader | ./parser.js

The tests are now integrated into ezPAARSE’s platforms folder (instead of coming with the core of ezPAARSE). It means that ezPAARSE doesn’t need to be running as the parsers’ tests are now independent. Before you launch the platforms’ tests, you need to setup the environment:

cd platforms/
make install

Then you can either test all parsers:

make test

or test only a selection, by naming them (use the shortnames). For example, if you want to test the parsers for Nature and ScienceDirect:

make test sd npg

See the ezpaarse-platforms README for more information.

Description of a parser

The parser is described by a manifest.json file, located in its root directory. This file contains the following information:

  • name: the short name of the parser, used as a prefix to the file names. Care should be taken not to use a name already used.
  • version: the current version of the parser. It is used in the validation file names parsers.
  • longname: the long name of the parser, used in the documentation.
  • describe: the description for the parser, can be a paragraph.
  • docurl: the URL of the documentation on the analogist website (must end by /).
  • domains: an array of domains that the parser can handle.
  • pkb-domains: if the platform has a PKB and if domains are present, this field is the column that contains them,

Vendors’ PKBs management principles

Knowledge bases are used to:

  • match the identifiers found on publishers platforms (which can be specific and proprietary) and standardized identifiers (like ISSNs)
  • include the titles of accessed resources in the results

Knowledge bases are saved as text file KBART format and are specific to each platform. The platform_AllTitles.txt file contains the mappings between identifiers of a specific platform and ISSN (or other standardized identifier). The KBART field called title_id is used to establish this correspondence with the print_identifier field (for paper resources) or online_identifier (electronic resources). The list of KBART fields and their meaning is available.

Knowledge bases are loaded by ezPAARSE and their structure must be previously controlled by the pkbvalidator tool

More on AnalogIST

Running a specific test

For testing a specific feature, use mocha and give the path of the test file as a parameter. For example, to test custom formats:

. ./bin/env
mocha ./test/custom-formats-test

For running only one feature test, use mocha and give the path of the test file and the two-digit number (@xx) of the test (with -g) as parameters.

For example, running only the second test of personalized formats will look like:

. ./bin/env
mocha ./test/custom-formats-test -g @02