Wednesday, July 31, 2019

2019-03-27: Install ParsCit on Ubuntu

ParsCit is a citation parser developed jointly by Pennsylvania State University and the National University of Singapore. Over the past ten years, it has been compared with many other citation parsing tools and is still widely used. Although Neural ParsCit has been developed, its implementation is still not as easy to use as ParsCit. In particular, PDFMEF encapsulates ParsCit as its default citation parser.

However, many people have found that installing ParsCit is not straightforward. This is partly because it is written in Perl and the instructions on the ParsCit website are not 100% accurate. In this blog post, I describe the installation procedure for ParsCit on an Ubuntu 16.04.6 LTS desktop. Installation on CentOS should be similar. The instructions do not cover Windows.

The following steps assume we install ParsCit under /home/username/github.
  1. Download the source code from https://github.com/knmnyn/ParsCit and unzip it.
    $ unzip ParsCit-master.zip
  2. Install a C++ compiler
    $ sudo apt install g++
    To test it, write a simple program hello.cc, then compile and run it
    $ g++ -o hello hello.cc
    $ ./hello
  3. Install Ruby
    $ sudo apt install ruby-full
    To test it, run
    $ ruby --version
  4. Perl usually comes with the default Ubuntu installation. To test it, run
    $ perl --version
  5. Install the Perl modules. First, start the CPAN shell
    $ perl -MCPAN -e shell
    Accept the default settings until the CPAN prompt appears:
    cpan[1]>
    Then install the packages one by one
    cpan[1]> install Class::Struct
    cpan[2]> install Getopt::Long
    cpan[3]> install Getopt::Std
    cpan[4]> install File::Basename
    cpan[5]> install File::Spec
    cpan[6]> install FindBin
    cpan[7]> install HTML::Entities
    cpan[8]> install IO::File
    cpan[9]> install POSIX
    cpan[10]> install XML::Parser
    cpan[11]> install XML::Twig
    Accept the default settings when prompted
    cpan[12]> install XML::Writer
    cpan[13]> install XML::Writer::String
  6. Install crfpp (version 0.51) from source.
    1. Get into the crfpp directory
      $ cd crfpp/
    2. Unzip the tar file
      $ tar xvf crf++-0.51.tar.gz
    3. Get into the CRF++ directory
      $ cd CRF++-0.51/
    4. Configure
      $ ./configure
    5. Compile
      $ make
      This WILL fail with an error like the one below
      path.h:26:52: error: 'size_t' has not been declared
           void calcExpectation(double *expected, double, size_t) const;
                                                          ^
      Makefile:375: recipe for target 'node.lo' failed
      make[1]: *** [node.lo] Error 1
      make[1]: Leaving directory '/home/jwu/github/ParsCit-master/crfpp/CRF++-0.51'
      Makefile:240: recipe for target 'all' failed
      make: *** [all] Error 2
      This is likely caused by two include lines (stdlib.h and iostream) missing from node.cpp and path.cpp. Add these two lines before the other include statements, so that the beginning of each file looks like
      #include "stdlib.h"
      #include <iostream>
      #include <cmath>
      #include "path.h"
      #include "common.h"

      Then run ./configure and make again.
    6. Rebuild CRF++
      $ make clean
      $ make
      This should rebuild crf_test and crf_learn.
  7. Copy the executables to where ParsCit expects to find them.
    $ cp crf_learn crf_test ..
    $ cd .libs
    $ cp -Rf * ../../.libs
  8. Test ParsCit. Under the bin/ directory, run
    $ ./citeExtract.pl -m extract_all ../demodata/sample2.txt
    $ ./citeExtract.pl -i xml -m extract_all ../demodata/E06-1050.xml
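
The manual edit in step 6.5 can also be scripted. The following is only a minimal sketch, assuming it is run from inside the CRF++-0.51/ directory. The function name add_missing_includes is made up for illustration and is not part of ParsCit or CRF++. It inserts the two lines before the first #include it finds in each file and skips files that already contain them.

```shell
#!/bin/sh
# Sketch: insert the two missing include lines before the first
# existing #include in a file. add_missing_includes is a hypothetical
# helper name, not part of ParsCit or CRF++.
add_missing_includes() {
    file=$1
    # Skip files that were already patched so the script can be re-run.
    if grep -q '^#include "stdlib.h"' "$file"; then
        return 0
    fi
    awk '
        !patched && /^#include/ {
            print "#include \"stdlib.h\""
            print "#include <iostream>"
            patched = 1
        }
        { print }
    ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}

# Patch the two files named in step 6.5, if present in the current directory.
for f in node.cpp path.cpp; do
    if [ -f "$f" ]; then
        add_missing_includes "$f"
    fi
done
```

After running it, the first include lines of node.cpp and path.cpp should match the listing in step 6.5, and ./configure and make can be run again.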
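
If the test in step 8 fails, a missing prerequisite is one possible cause. The sketch below checks for the tools from steps 2 to 4 and the Perl modules from step 5. The module list is copied from step 5; the function name missing_perl_modules is hypothetical, and the script only reports what it cannot find, it installs nothing.

```shell
#!/bin/sh
# Sketch: report tools and Perl modules from the steps above that are
# not available. missing_perl_modules is a hypothetical helper name.
MODULES="Class::Struct Getopt::Long Getopt::Std File::Basename File::Spec
FindBin HTML::Entities IO::File POSIX XML::Parser XML::Twig XML::Writer
XML::Writer::String"

missing_perl_modules() {
    for m in "$@"; do
        # perl -M<module> -e 1 exits non-zero if the module fails to load.
        perl -M"$m" -e 1 >/dev/null 2>&1 || echo "$m"
    done
}

# Check the tools from steps 2 to 4.
for tool in g++ ruby perl; do
    command -v "$tool" >/dev/null 2>&1 || echo "missing tool: $tool"
done

# $MODULES is left unquoted on purpose so it splits into one word per module.
missing=$(missing_perl_modules $MODULES)
if [ -n "$missing" ]; then
    echo "missing Perl modules:"
    echo "$missing"
else
    echo "all Perl modules found"
fi
```

Any module it lists can then be installed from the CPAN shell as in step 5.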
