Wednesday, May 29, 2013

2013-05-29 mcurl - Command Line Memento Client

The Memento protocol works in two directions:
  • Server implementation: the server complies with the Memento protocol, so it can read the "Accept-Datetime" header, perform content negotiation in the datetime dimension, and return the memento nearest to the requested datetime. Successful examples include the Internet Archive Wayback Machine, the British Library Wayback Machine, and DBpedia.
  • Client implementation: the user needs a tool that sets the requested URI together with the preferred datetime in the past. Current tools include the Firefox add-on MementoFox, the British Library Memento Service, and the Memento Browser for Android and iPhone.
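To make the server-side step in the first bullet concrete, here is a minimal Python sketch of "return the memento nearest to the requested datetime". The archive holdings and the archive.example URIs are made up for illustration, and a real TimeGate may use a different selection rule; this only shows the basic idea.

```python
from datetime import datetime, timezone


def nearest_memento(requested, mementos):
    """Return the (datetime, uri) pair whose datetime is closest to `requested`."""
    return min(mementos, key=lambda m: abs(m[0] - requested))


# Hypothetical archive holdings for one original resource (illustration only).
holdings = [
    (datetime(2009, 11, 1, tzinfo=timezone.utc),
     "http://archive.example/20091101/http://ipl.org"),
    (datetime(2010, 2, 3, tzinfo=timezone.utc),
     "http://archive.example/20100203/http://ipl.org"),
    (datetime(2010, 6, 15, tzinfo=timezone.utc),
     "http://archive.example/20100615/http://ipl.org"),
]

# The datetime the client would send in its Accept-Datetime header.
requested = datetime(2010, 2, 5, 14, 28, tzinfo=timezone.utc)
best = nearest_memento(requested, holdings)
print(best[1])
```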
Today, we are pleased to announce mcurl, a command-line Memento client. mcurl is a wrapper for the Unix curl command that is capable of doing content negotiation in the datetime dimension with Memento TimeGates. mcurl supports all curl parameters in addition to new Memento-related parameters.

Users can already use plain curl to do content negotiation in the datetime dimension by passing the "Accept-Datetime" header with the -H argument and connecting directly to the TimeGate. However, mcurl offers more features than that:
  • TimeGate identification: with mcurl, the user only needs to specify the datetime and the URI. mcurl has its own default TimeGate, which the user can override. mcurl can also read the TimeGate from the Link response header returned by the URI.
  • Handling redirection: mcurl implements the HTTP redirection retrieval policy described in Section 4 of the Memento Internet Draft, version 7.
  • Embedded resources rewriting: mcurl provides two modes for embedded resources. In Strict mode, mcurl accepts the embedded resource URIs from the web archive; in Thorough mode, mcurl repeats the content negotiation for each embedded resource URI to get the best (nearest) resource. Thus, using the Memento Aggregator, mcurl can construct the page from multiple archives.
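The TimeGate identification step above depends on reading a Link response header. The following Python sketch shows one way to pull the rel="timegate" target out of such a header. It is not mcurl's Perl code: the function name find_timegate and the sample header are assumptions for illustration, and the regular expressions cover only the common `<uri>; param="value"` shape rather than the full Link header grammar.

```python
import re

# One link-value: <uri> followed by zero or more ;name=value parameters.
# Quoted values may contain commas (e.g. datetime="Fri, 05 Feb 2010 ...").
LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^",;\s]+))*)'
)
REL_PARAM = re.compile(r'\brel\s*=\s*"?([^";,]+)"?')


def find_timegate(link_header):
    """Return the first URI whose rel value contains 'timegate', else None."""
    for uri, params in LINK_VALUE.findall(link_header):
        rel = REL_PARAM.search(params)
        # A rel value may hold several space-separated relation types.
        if rel and "timegate" in rel.group(1).split():
            return uri
    return None


# Hypothetical Link header, shaped like the ones Memento servers return.
sample = (
    '<http://a.example/m1>; rel="memento"; '
    'datetime="Fri, 05 Feb 2010 14:28:00 GMT", '
    '<http://a.example/tg/http://ipl.org>; rel="timegate"'
)
print(find_timegate(sample))
```

If no timegate relation is present, the function returns None, which is the point where a client would fall back to its default TimeGate.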
For example, a curl command to get a memento for http://ipl.org near Fri, 05 Feb 2010 can be written as follows:

curl -H "Accept-Datetime: Fri, 05 Feb 2010 14:28:00 GMT" \
  http://mementoproxy.cs.odu.edu/aggr/timegate/http://ipl.org
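For readers who prefer a script to a shell one-liner, the same request can be sketched in Python with only the standard library. The helper name build_timegate_request is an assumption for illustration. The snippet only constructs the request; actually sending it (for example with urllib.request.urlopen) is left out so the sketch runs without network access.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def build_timegate_request(timegate, original_uri, when):
    """Build a request asking `timegate` for `original_uri` near `when`."""
    # The original URI is appended to the TimeGate base, as in the curl call.
    url = timegate.rstrip("/") + "/" + original_uri
    # usegmt=True yields the "... GMT" form used in the examples of this post.
    accept_datetime = format_datetime(when, usegmt=True)
    return Request(url, headers={"Accept-Datetime": accept_datetime})


when = datetime(2010, 2, 5, 14, 28, tzinfo=timezone.utc)
req = build_timegate_request(
    "http://mementoproxy.cs.odu.edu/aggr/timegate/", "http://ipl.org", when
)
print(req.full_url)
print(req.header_items())
```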

If you look closely at the returned page, you will find that the embedded resources came from the live web instead of the web archive. This happens because the current Wayback Machine Memento implementation doesn't rewrite embedded resources. mcurl solves this problem easily:



perl ./mcurl.pl -L --mode thorough \
  --datetime 'Fri, 05 Feb 2010 14:28:00 GMT' \
  --replacedump dump.txt http://ipl.org
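To give a feel for what "repeating the content negotiation for each embedded resource" involves, here is a rough Python sketch. It is not mcurl's implementation (which is Perl and uses HTML::Parser): the class and function names are hypothetical, only img, script, and stylesheet link tags are considered, and the sketch merely plans which TimeGate URI would be asked for each resource instead of fetching anything.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class EmbeddedResourceCollector(HTMLParser):
    """Collect absolute URIs of images, scripts, and stylesheets in a page."""

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.resources.append(urljoin(self.base_uri, attrs["src"]))
        elif (tag == "link" and attrs.get("href")
              and "stylesheet" in (attrs.get("rel") or "").lower()):
            self.resources.append(urljoin(self.base_uri, attrs["href"]))


def thorough_plan(html, base_uri, timegate):
    """Map each embedded resource URI to the TimeGate URI to negotiate with."""
    collector = EmbeddedResourceCollector(base_uri)
    collector.feed(html)
    return {uri: timegate.rstrip("/") + "/" + uri for uri in collector.resources}


# Tiny made-up page; the anchor tag is not an embedded resource and is skipped.
page = (
    '<html><head><link rel="stylesheet" href="/css/main.css"></head>'
    '<body><img src="logo.png"><script src="http://cdn.example/app.js"></script>'
    '<a href="/about">About</a></body></html>'
)
plan = thorough_plan(page, "http://ipl.org/",
                     "http://mementoproxy.cs.odu.edu/aggr/timegate/")
for original, negotiate_at in plan.items():
    print(original, "->", negotiate_at)
```

Each planned TimeGate URI would then be requested with the same Accept-Datetime value as the page itself, which is how a page can end up assembled from several archives.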


Environment setup
mcurl is written in Perl; Perl version 5 or later is required. curl version 7.15.5 and the HTML::Parser package are also required.

Memento related Parameters
mcurl supports a range of Memento-related parameters that let users set their preferred datetime, TimeGate, and embedded resources mode.
  • -tm, --timemap <link|rdf>: To select the type of TimeMap; it may be link or html.
  • -tg, --timegate <uri[,uri]>: To select the preferred TimeGates.
  • -dt, --datetime <date in RFC 822 format>: To select the date in the past (for example, Thu, 31 May 2007 20:35:00 GMT).
  • --mode <thorough|fast>: To specify the mcurl embedded resource policy; the default value is thorough.
  • --debug: To enable the debug mode to display more results.
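Since the --datetime value has to be a well-formed date, it can be useful to check it before calling mcurl from a script. The sketch below uses Python's standard email.utils parser for this. The helper name parse_accept_datetime is an assumption, the check handles plain dates only, and it says nothing about how mcurl itself validates its input.

```python
from email.utils import parsedate_to_datetime


def parse_accept_datetime(value):
    """Return a datetime for an RFC 822 style date string, or None if invalid."""
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None


good = parse_accept_datetime("Thu, 31 May 2007 20:35:00 GMT")
bad = parse_accept_datetime("Sun, 23 July xxxxxxxxxxxxxxxx")
print(good)
print(bad)
```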
Download
mcurl is available in a GitHub repository. Three files are required: mcurl.pl, MementoThread.pm, and MementoParser.pm.

Usage Examples
In this section, we list some usage examples that explain the behavior of mcurl.
  1. Calling an original resource with the default timegate
    mcurl.pl -I -L --debug --datetime 'Sun, 23 July 2006 12:00:00 GMT' http://www.cnn.com
    Expected results: it will do the content negotiation in the datetime dimension, using the default timegate when required.

  2. Calling timemap in link format with the default timegate
    mcurl.pl -I -L --debug --timemap link http://www.cnn.com
    Expected results: it will download the timemap in application/link-format, using the default timegate.

  3. Calling an original resource with a specific timegate
    mcurl.pl -I -L --debug --timegate 'http://mementoproxy.lanl.gov/aggr/timegate/' http://www.cnn.com
    Expected results: it will do the content negotiation in the datetime dimension and get the last memento, using the specified timegate when required.

  4. Calling an original resource with a specific timegate
    mcurl.pl -I -L --debug --datetime 'Sun, 23 July 2006 12:00:00 GMT' --timegate 'http://mementoproxy.lanl.gov/aggr/timegate/' http://www.cnn.com
    Expected results: it will do the content negotiation in the datetime dimension, using the specified timegate when required.

  5. Calling timemap in link format with a specific timegate
    mcurl.pl -I -L --debug --timemap link --timegate 'http://mementoproxy.lanl.gov/aggr/timegate/' http://www.cnn.com
    Expected results: it will download the timemap in application/link-format, using the specified timegate when required.

  6. Calling an original resource that will respond with timegate in response headers
    mcurl.pl -I -L --debug --datetime "Thu, 23 July 2009 12:00:00 GMT" http://lanlsource.lanl.gov/hello
    Expected results: it will do the content negotiation in the datetime dimension; the site will provide a timegate, which will override the default timegate.

  7. Calling an original resource (R1) that has a redirection (R2), (R1) has valid mementos
    mcurl.pl -I -L --debug --datetime "Thu, 23 July 2009 12:00:00 GMT" http://www.zeit.de/
    Expected results: it will do the content negotiation in the datetime dimension for R2.

  8. Calling an original resource (R1) that has a redirection (R2), (R1) does NOT have valid mementos
    mcurl.pl -I -L --debug --datetime "Thu, 23 July 2009 12:00:00 GMT" http://lanlsource.lanl.gov
    Expected results: it will do the content negotiation in the datetime dimension using R2.

  9. Calling an original resource that has a timegate redirection
    mcurl.pl -I -L --debug --datetime "Mon, 23 July 2007 12:00:00 GMT" http://lanlsource.lanl.gov/hello
    Expected results: it will do the content negotiation in the datetime dimension; the site will provide a timegate, which will override the default timegate. The timegate /tg/ redirects to /ta/.

  10. Calling an original resource that has a timegate redirection
    mcurl.pl -I -L --debug --datetime "Sat, 23 July 2011 12:00:00 GMT" http://lanlsource.lanl.gov/hello
    Expected results: it will do the content negotiation in the datetime dimension; the site will provide a timegate, which will override the default timegate. The timegate /tg/ redirects to /ts/.

  11. Calling an original resource with Acceptable time period
    mcurl.pl -I -L --debug --datetime 'Thu, 23 July 2009 12:00:00 GMT; -P5MT5H;+P5MT6H' http://www.cs.odu.edu
    Expected results: it will do the content negotiation in the datetime dimension with the specified time period, which has valid mementos, using the default timegate when required.

  12. Calling an original resource with NOT Acceptable time period
    mcurl.pl -I -L --debug --datetime 'Thu, 23 July 2009 12:00:00 GMT; -P5MT5H;+P5MT6H' http://www.cs.odu.edu
    Expected results: it will do the content negotiation in the datetime dimension with the specified time period, which does not have any valid mementos, using the default timegate when required.

  13. Calling an original resource with invalid Accept-Datetime header
    mcurl.pl -I --debug --datetime 'Sun, 23 July xxxxxxxxxxxxxxxx' http://www.cnn.com
    Response code: 400

  14. Overriding the discovered timegate with a specific one
    mcurl.pl -I -L --debug --datetime "Sat, 23 July 2011 12:00:00 GMT" --timegate 'http://mementoproxy.cs.odu.edu/aggr/timegate' --override http://lanlsource.lanl.gov/hello

  15. Using the --replacedump switch to dump the replacements for the embedded resources to an external file for further analysis
    mcurl.pl -L --mode thorough --datetime "Fri, 03 Dec 2010 12:00:00 GMT" --replacedump cnnreplace.txt http://www.cnn.com

  16. Accessing the DBpedia archive
    mcurl.pl -L --mode thorough --datetime "Fri, 03 Dec 2010 12:00:00 GMT" http://dbpedia.org/page/Brisbane
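
Several of the examples above differ only in which timegate ends up being used. The Python sketch below captures one reading of that precedence, inferred from the expected results in this list rather than from the mcurl source: a timegate discovered in the response headers wins over the default, a timegate given with --timegate is used when none is discovered, and --override forces the given one. The function name and the DEFAULT_TIMEGATE placeholder are hypothetical.

```python
# Placeholder only; the post does not state mcurl's actual default TimeGate.
DEFAULT_TIMEGATE = "http://timegate.example/default/"


def choose_timegate(discovered=None, specified=None, override=False,
                    default=DEFAULT_TIMEGATE):
    """Pick a TimeGate following the precedence inferred from the examples."""
    if specified and override:
        # --override: the user-specified TimeGate beats the discovered one.
        return specified
    if discovered:
        # A TimeGate advertised by the site replaces the default.
        return discovered
    # Otherwise fall back to the user-specified TimeGate, then the default.
    return specified or default


site_tg = "http://lanlsource.lanl.gov/tg/"  # hypothetical discovered value
user_tg = "http://mementoproxy.cs.odu.edu/aggr/timegate"

print(choose_timegate())
print(choose_timegate(discovered=site_tg))
print(choose_timegate(discovered=site_tg, specified=user_tg, override=True))
```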
----
Ahmed AlSum
