This Python program runs in three different modes. In the first mode, the program reads records sequentially from different WARC files and combines them into a new file, adding an extra metadata element that indicates when the merge occurred. A new “warcinfo” record is also placed at the beginning of each resulting WARC file. This record contains information about the WARCMerge program, together with date and time metadata.
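To make the leading “warcinfo” record concrete, here is a minimal sketch of how such a record could be serialized using only the standard library. It follows the general WARC 1.0 record layout (version line, named header fields, a blank line, the block, then two CRLFs). The function name `build_warcinfo_record` and the single `software` field in the block are assumptions for illustration; the fields WARCMerge actually writes may differ.

```python
import time
import uuid


def build_warcinfo_record(software, timestamp=None):
    """Serialize a minimal warcinfo record as bytes (illustrative sketch).

    `timestamp` is seconds since the epoch; None means "now".
    """
    # Block in application/warc-fields form: one "name: value" pair per line.
    # Only a 'software' field is written here (an assumption).
    block = ("software: %s\r\n" % software).encode("utf-8")
    warc_date = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(timestamp))
    headers = [
        "WARC/1.0",
        "WARC-Type: warcinfo",
        "WARC-Date: " + warc_date,
        "WARC-Record-ID: <urn:uuid:%s>" % uuid.uuid4(),
        "Content-Type: application/warc-fields",
        # Content-Length counts the block only, not the header lines.
        "Content-Length: %d" % len(block),
    ]
    head = "\r\n".join(headers).encode("utf-8")
    # Blank line after the headers, two CRLFs after the block.
    return head + b"\r\n\r\n" + block + b"\r\n\r\n"
```

Writing the bytes returned by `build_warcinfo_record("WARCMerge")` before any copied records would give the output file a warcinfo record as its first record.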
The second mode is very similar to the first; the only difference is the source of the WARC files. In the first mode the source files come from a specified directory, while in the second mode an explicit list of WARC files is provided.
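The difference between the first two modes can be sketched as a single input-resolution step. This is a hypothetical helper, not WARCMerge's own code: the name `collect_warc_files`, the rule "a single directory argument means directory mode", and the `.warc` extension filter are all assumptions.

```python
import os


def collect_warc_files(sources):
    """Resolve the input arguments to a list of WARC file paths.

    One directory argument  -> every *.warc file inside it, sorted by name.
    Anything else           -> the arguments are used as an explicit list.
    """
    if len(sources) == 1 and os.path.isdir(sources[0]):
        directory = sources[0]
        candidates = [os.path.join(directory, name)
                      for name in os.listdir(directory)]
        # Assumed filter: regular files with an uncompressed .warc extension.
        return sorted(path for path in candidates
                      if os.path.isfile(path)
                      and path.lower().endswith(".warc"))
    # Explicit list: keep the caller's order unchanged.
    return list(sources)
```

Once the inputs are resolved this way, both modes can share the same merging loop.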
In the third mode, an existing WARC file is appended to the end of another WARC file. In this case, only one metadata element (WARC-appended-by-WARCMerge) is added to each “warcinfo” record found in the first file.
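The header change made in append mode can be sketched as follows. The helper works on the raw header lines of one record and adds the WARC-appended-by-WARCMerge field only when the record is a warcinfo record. The function name `add_appended_marker` is hypothetical, and the value stored in the field is passed in by the caller because the description above does not say what WARCMerge writes there.

```python
def add_appended_marker(header_block, value):
    """Add WARC-appended-by-WARCMerge to a warcinfo record's headers.

    `header_block` is the bytes of one record's header lines, CRLF-separated,
    without the blank line that ends the header section.
    """
    lines = header_block.split(b"\r\n")
    # Collect the value(s) of the WARC-Type header, case-insensitively.
    types = [line.split(b":", 1)[1].strip().lower()
             for line in lines
             if line.lower().startswith(b"warc-type:")]
    if types != [b"warcinfo"]:
        return header_block  # not a warcinfo record: leave it untouched
    lines.append(b"WARC-appended-by-WARCMerge: " + value.encode("ascii"))
    return b"\r\n".join(lines)
```

Because a record's Content-Length covers only the block that follows the headers, adding a header line this way does not require recomputing it.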
Finally, regardless of the mode, WARCMerge always performs error checks: it validates the resulting WARC files and ensures that the size of each resulting file does not exceed the maximum size limit. (The limit can be changed in the program's source code by assigning a new value to the variable MaxWarcSize.)
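The size check can be illustrated with a small sketch. Only the variable name MaxWarcSize comes from the description above; the value assigned to it here, the helper name `fits_within_limit`, and the byte-based comparison are assumptions.

```python
import os

# Assumed value for illustration; the real default is set in WARCMerge's source.
MaxWarcSize = 500 * 1024 * 1024  # bytes


def fits_within_limit(target_path, incoming_size, max_size=MaxWarcSize):
    """Return True if adding `incoming_size` bytes keeps the file within the limit."""
    # A target file that does not exist yet counts as empty.
    current = os.path.getsize(target_path) if os.path.exists(target_path) else 0
    return current + incoming_size <= max_size
```

A merging loop could call this before writing each record and, when it returns False, start a new output file, which would explain why a merge can produce more than one resulting WARC file.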
- WARCMerge's source code is available on GitHub and can be downloaded by running the following command:
git clone https://github.com/maturban/WARCMerge.git
- Tested on Linux Ubuntu 12.04
- Requires Python 2.7+
- Requires Java to run Jwattool for validating WARC files
- Requires the warc Python library to work with WARC files and WARC records
As described above, WARCMerge can be run in three different modes, illustrated by the three examples below (adding the option '-q' runs the program in quiet mode, in which it does not display any messages):