Tuesday, September 2, 2014

2014-09-02: WARCMerge: Merging Multiple WARC files into a single WARC file

WARCMerge is a new tool for organizing WARC files. As the name suggests, it merges multiple WARC files into a single one. In web archiving, WARC files can be generated by well-known web crawlers such as Heritrix and Wget, or by state-of-the-art tools like WARCreate/WAIL and Webrecorder.io, which were developed to support personal web archiving. WARC files contain records not only for HTTP responses and metadata elements but also for all of the original HTTP requests. Given those WARC files, any replay tool (e.g., the Wayback Machine) can be used to reconstruct and display the original web pages. Note that a single WARC file may contain records related to different web sites; in other words, multiple web sites can be archived in the same WARC file.

This Python program runs in three different modes. In the first mode, the program sequentially reads records one by one from different WARC files and combines them into a new file, to which an extra metadata element is added to indicate when the merge occurred. Furthermore, a new “warcinfo” record is placed at the beginning of each resulting WARC file. This record contains information about the WARCMerge program along with date and time metadata.
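To make the idea of a leading “warcinfo” record concrete, here is a minimal sketch of how such a record could be serialized according to the WARC 1.0 record layout. It is not taken from WARCMerge's source: the function name build_warcinfo_record, the "software" and "format" field values, and the choice of headers are all assumptions made for illustration.

```python
import time
import uuid


def build_warcinfo_record(filename, software="WARCMerge (illustrative value)"):
    """Return one serialized 'warcinfo' record as bytes.

    Hypothetical helper: the field values below are placeholders and are
    not copied from WARCMerge's real output.
    """
    # WARC-Date uses a UTC timestamp such as 2014-09-02T12:00:00Z.
    now = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

    # The block of a warcinfo record is a list of "name: value" lines.
    block = (
        "software: {}\r\n"
        "format: WARC File Format 1.0\r\n"
    ).format(software).encode("utf-8")

    headers = [
        "WARC/1.0",
        "WARC-Type: warcinfo",
        "WARC-Date: " + now,
        "WARC-Filename: " + filename,
        "WARC-Record-ID: <urn:uuid:{}>".format(uuid.uuid4()),
        "Content-Type: application/warc-fields",
        "Content-Length: {}".format(len(block)),
    ]
    # A blank line (CRLF CRLF) separates the headers from the block.
    header_bytes = ("\r\n".join(headers) + "\r\n\r\n").encode("utf-8")

    # Every WARC record is terminated by two CRLF pairs after the block.
    return header_bytes + block + b"\r\n\r\n"
```

Writing the bytes returned by build_warcinfo_record("merged.warc") to the start of a new file, followed by the records copied from the source files, would give the overall layout described above.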

The second mode is very similar to the first; the only difference is the source of the WARC files. In the first mode the source files are read from a specified directory, while in the second mode an explicit list of WARC files is provided.
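The difference between the two modes comes down to how the list of input files is built. The sketch below shows one way to do that with the standard library. The function names are hypothetical, and the ".warc" suffix filter and the sorted ordering are assumptions for this example rather than documented WARCMerge behavior. Walking subdirectories matches Example 1 below, where files from two subdirectories are merged.

```python
import os


def warcs_from_directory(input_dir):
    """Mode 1 (sketch): every *.warc file under input_dir, including subdirectories."""
    found = []
    for root, _dirs, files in os.walk(input_dir):
        for name in files:
            if name.endswith(".warc"):
                found.append(os.path.join(root, name))
    # Sorting is only for reproducible results in this sketch.
    return sorted(found)


def warcs_from_list(paths):
    """Mode 2 (sketch): keep only the listed paths that are existing *.warc files."""
    return [p for p in paths if p.endswith(".warc") and os.path.isfile(p)]
```

Either function returns a plain list of paths, so the merging step that follows can be identical in both modes.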


In the third mode, an existing WARC file is appended to the end of another WARC file. In this case, only one metadata element (WARC-appended-by-WARCMerge) is added to each “warcinfo” record found in the first file.

Finally, regardless of the mode, WARCMerge always validates the resulting WARC files and ensures that the size of each resulting file does not exceed the maximum size limit. (The limit can be changed in the program's source code by assigning a new value to the variable MaxWarcSize.)

Download WARCMerge:
  • WARCMerge's source code is available on GitHub, or by running the following command:
                   git clone https://github.com/maturban/WARCMerge.git

Dependencies:
  • Tested on Linux Ubuntu 12.04
  • Requires Python 2.7+
  • Requires Java to run JWAT-Tools for validating WARC files

Running WARCMerge:

As described above, WARCMerge can be run in three different modes; see the three examples below. Adding the option '-q' makes the program run in quiet mode, in which it does not display any messages.


Example 1: Merging WARC files (found in "input-directory") into new WARC file(s):

%python WARCMerge.py ./collectionExample/ my-output-dir 

   Merging the following WARC files:
    ----------------------------------:
   [ Yes ]./collectionExample/world-cup/20140707174317773.warc 
   [ Yes ]./collectionExample/warcs/20140707160258526.warc 
   [ Yes ]./collectionExample/warcs/20140707160041872.warc 
   [ Yes ]./collectionExample/world-cup/20140707183044349.warc 

   Validating the resulting WARC files:
    ----------------------------------: 
     - [ valid ] my-output-dir/WARCMerge20140806040712197944.warc

Example 2: Merging all listed WARC files into new WARC file(s):

%python WARCMerge.py  1.warc  2.warc  ./dir1/3.warc  ./warc/4.warc mydir

    Merging the following WARC files:
    ----------------------------------: 
    [ Yes ] ./warc/4.warc
    [ Yes ] ./1.warc
    [ Yes ] ./dir1/3.warc
    [ Yes ] ./2.warc 
    
    Validating the resulting WARC files:
    ----------------------------------:
    - [ valid ] mydir/WARCMerge20140806040546699431.warc

Example 3: Appending a WARC file to another WARC file. The option '-a' is used here to make sure that any change to the destination file is intentional:

%python WARCMerge.py -a ./test/src/1.warc ./test/dest/2.warc

      The resulting (./test/dest/2.warc) is valid WARC file

If a user enters incorrect command-line arguments, the following usage message is shown:

usage: WARCMerge [[-q] -a <source-file> <dest-file> ]                 
                             [[-q] <input-directory> <output-directory> ]
                             [[-q] <file1> <file2> <file3> ... <output-directory> ]  
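The usage message above implies a simple dispatch between the three modes. The following sketch shows how such arguments could be classified. It is an illustration only: the function name parse_args, the returned mode labels, and the rule that a single existing directory selects the directory mode are assumptions, not WARCMerge's actual parsing logic.

```python
import os


def parse_args(argv):
    """Classify arguments (without the program name) into one of three modes.

    Returns (mode, quiet, sources, destination) or raises ValueError.
    Hypothetical helper written to mirror the usage message.
    """
    args = list(argv)
    # The usage message places the optional '-q' first.
    quiet = bool(args) and args[0] == "-q"
    if quiet:
        args = args[1:]

    # Append mode: -a <source-file> <dest-file>
    if args and args[0] == "-a":
        if len(args) != 3:
            raise ValueError("append mode needs <source-file> <dest-file>")
        return "append", quiet, [args[1]], args[2]

    if len(args) < 2:
        raise ValueError("expected input(s) followed by an output directory")

    # The last argument is the output directory in both remaining modes.
    sources, destination = args[:-1], args[-1]
    if len(sources) == 1 and os.path.isdir(sources[0]):
        return "directory", quiet, sources, destination
    return "list", quiet, sources, destination
```

For instance, parse_args(["-a", "./test/src/1.warc", "./test/dest/2.warc"]) is classified as the append mode, while a list of several files followed by a directory name falls into the list mode.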
WARCMerge can be useful as an independent component, or it can be integrated with other existing tools. For example, WARCreate is a Google Chrome extension that helps users create WARC files for web pages visited in the browser. Instead of leaving users with hundreds of such WARC files, WARCMerge brings them all together in one place.

-M. Aturban

2014-09-04 Edit:

One feature of the tool that we neglected to mention in the original post is the ability to create multiple merged WARC files. We can set the maximum desired WARC file size, and if the WARCs to be merged exceed that limit, multiple merged WARCs will be created.

Example: We want to merge 4 WARC files with sizes 165 KB, 600 KB, 680 KB, and 900 KB, respectively, and have set a MaxWarcSize of 1 MB.

% python WARCMerge.py ./smallCollectionExample/about-warcs/20140707160041872.warc ./smallCollectionExample/about-warcs/20140707160258526.warc ./smallCollectionExample/world-cup-2014/20140707174317773.warc ./smallCollectionExample/world-cup-2014/20140707183044349.warc ./warcs

Merging the following WARC files: 
----------------------------------: 
[Yes]./smallCollectionExample/world-cup-2014/20140707174317773.warc
[Yes]./smallCollectionExample/about-warcs/20140707160258526.warc
[Yes]./smallCollectionExample/about-warcs/20140707160041872.warc
[Yes]./smallCollectionExample/world-cup-2014/20140707183044349.warc

Validating the resulting WARC files: 
----------------------------------: 
[valid]  ./warcs/WARCMerge20140904030816942537.warc
[valid]  ./warcs/WARCMerge20140904030817019383.warc
[valid]  ./warcs/WARCMerge20140904030817129653.warc
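One way to see why four inputs can produce three outputs is a greedy, in-order grouping: keep adding files to the current output until the next one would push it past the limit, then start a new output. The sketch below applies that idea to the sizes from the example. Whether WARCMerge uses exactly this strategy is an assumption; the function name and the file names a.warc to d.warc are placeholders, since the post does not say which size belongs to which file, and the small overhead of any added “warcinfo” record is ignored.

```python
def group_by_max_size(sizes, max_size):
    """Greedily pack (name, size) pairs, in order, into groups of at most max_size.

    A single file larger than max_size still gets a group of its own.
    """
    groups, current, current_size = [], [], 0
    for name, size in sizes:
        # Start a new group if adding this file would exceed the limit.
        if current and current_size + size > max_size:
            groups.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        groups.append(current)
    return groups


KB = 1024
inputs = [("a.warc", 165 * KB), ("b.warc", 600 * KB),
          ("c.warc", 680 * KB), ("d.warc", 900 * KB)]
groups = group_by_max_size(inputs, 1024 * KB)
print(groups)  # [['a.warc', 'b.warc'], ['c.warc'], ['d.warc']]
```

Under these assumptions the first two files (765 KB together) share one output and the other two each get their own, which is consistent with the three merged files listed above.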

2 comments:

  1. From the WARC spec: "to allow the concatenation of WARC files into a larger valid WARC file, it is allowable for 'warcinfo' records to appear in the middle of a WARC file."

    The spec allows simple concatenating of WARC files. Editing the warcinfo isn't necessary. Just cat *.warc > big.warc. Done. Fully in line with the spec.

    Adding a new warcinfo as you imply when you state "a new “warcinfo” record is placed at the beginning of the resulting WARC file(s)" is questionable as a warcinfo applies to any subsequent record until end of file or another warcinfo is encountered. To quote the spec again: "A 'warcinfo' record describes the records that follow it, up through end of file, end of input, or until next 'warcinfo' record."

    Thus it wouldn't really apply to any of the records since the next record would likely be the warcinfo from the first WARC file.

    If there is a use case that the spec doesn't cover, I would very much like to know what it is. You don't mention what the underlying point of this tool is.

    Editing the warcinfo to indicate concatenation also seems unnecessary if not as blatantly in violation of the spec.

    1. Thank you for the valuable feedback, Kristinn. Yes, you are correct that cat can be used to concatenate WARC files.

      We were considering the original WARC files to be artifacts of an archiving process that took place at a particular time with a particular crawler. Thus, we considered the merging of multiple WARC files to be creating a new artifact, so we wanted to note that with the insertion of the new warcinfo record and metadata. If either inserting a metadata field or inserting a new warcinfo record is in violation of the spec, we can certainly make changes to the tool to follow the spec.

      One feature of the tool that we neglected to mention in the blog post is the ability to create multiple merged WARC files. We can set the maximum desired WARC file size and if the WARCs to be merged exceed that limit, multiple merged WARCs will be created.

      Example: We want to merge 4 WARC files with sizes 165 KB, 600 KB, 680 KB, and 900 KB, respectively, and have set a MaxWarcSize of 1 MB.

      % python WARCMerge.py ./smallCollectionExample/about-warcs/20140707160041872.warc ./smallCollectionExample/about-warcs/20140707160258526.warc ./smallCollectionExample/world-cup-2014/20140707174317773.warc ./smallCollectionExample/world-cup-2014/20140707183044349.warc ./warcs

      Merging the following WARC files:
      ----------------------------------:
      [Yes]./smallCollectionExample/world-cup-2014/20140707174317773.warc
      [Yes]./smallCollectionExample/about-warcs/20140707160258526.warc
      [Yes]./smallCollectionExample/about-warcs/20140707160041872.warc
      [Yes]./smallCollectionExample/world-cup-2014/20140707183044349.warc

      Validating the resulting WARC files:
      ----------------------------------:
      [valid] ./warcs/WARCMerge20140904030816942537.warc
      [valid] ./warcs/WARCMerge20140904030817019383.warc
      [valid] ./warcs/WARCMerge20140904030817129653.warc

      We can add this example to the blog post for completeness.

      One example of an anticipated use case is with our browser extension, WARCreate. This extension creates a single WARC for each webpage that is archived. In an extended archiving session, many small WARC files would be created. The WARCMerge tool would allow the user to combine the whole directory of files into a smaller set of larger WARC files with a single command.

      We also hope to deploy WARCMerge with our WAIL suite of archiving tools in the future. The native Python implementation of WARCMerge will make it easier to integrate with WAIL on non-Unix platforms.
