Thursday, September 10, 2015

2015-09-10: CDXJ: An Object Resource Stream Serialization Format

I have been working on an IIPC funded project of profiling various web archives to summarize their holdings. The idea is to generate statistical measures of the holdings of an archive under various lookup keys where a key can be a partial URI such as Top Level Domain (TLD), registered domain name, entire domain name along with any number of sub-domain segments, domain name and a few segments from the path, a given time, a language, or a combination of two or more of these. Such a document (or archive profile) can be used answer queries like "how many *.edu Mementos are there in a given archive?", "how many copies of the pages are there in an archive that fall under netpreserve.org/projects/*", or "number of copies of *.cnn.com/* pages of 2010 in Arabic language". The archive profile can also be used to determine the overlap between two archives or visualize their holdings in various ways. Early work of this research was presented at the Internet Archive during the IIPC General Assembly 2015 and later it was published at:
Sawood Alam, Michael L. Nelson, Herbert Van de Sompel, Lyudmila L. Balakireva, Harihar Shankar, and David S. H. Rosenthal, Web Archive Profiling Through CDX Summarization, Proceedings of TPDL 2015.
One of many challenges to solve in this project was to come up with a suitable profile serialization format that has the following properties:
  • scalable to handle terabytes of data
  • facilitates fast lookup of keys
  • allows partial key lookup
  • allows arbitrary split and merge
  • allows storage of arbitrary number of attributes without changing the existing data
  • supports link data semantics (not essential, but good to have)
We were initially leaning towards JSON format (with JSON-LD for linked data) because it has wide language and tool support and it is expressive like XML, but less verbose than XML. However, in the very early stage of our experiments we realized that it has scale issues. JSON, XML, and YAML (a more human readable format) are all single root node document formats, which means a single document serialized in any of these formats can not have multiple starting nodes; they all must be children of a single root node. This means it has to be fully loaded in the memory which can be the bottleneck in the case of big documents. Although there are streaming algorithms to parse XML or JSON, they are slow and usually only suitable for cases when an action is to be taken while parsing the document as opposed to using them for frequent lookup of the keys and values. Additionally, JSON and XML are not very fault tolerant i.e., a single malformed character may result in making the entire document fail to be parsed. Also, because of the single root node, splitting and merging of the documents is not easy.

We also thought about using simple and flat file formats such as CSV, ARFF, or CDX (a file format used in indexing WARC files for replay). These flat formats allow sorting that can facilitate fast binary lookup of keys and the files can be split in arbitrary places or multiple files with the same fields (in the same order) can be merged easily. However, the issue with these formats is that they do not support nesting and every entry in them must have the same attributes. Additionally the CDX has limited scope of extension as all the fields are already described and reserved.

Finally, we decided to merge the good qualities from CDX and JSON to come up with a serialization format that fulfills our requirements listed above. We call it CDXJ (or CDX-JSON). Ilya Kreymer first introduced this format in PyWB, but there was no formal description of the format. We are trying to formally introduce it and make some changes that makes it extendable so that it can be utilized by the web archiving community as well as broader web communities. The general description of the format is, "a plain file format that stores key-value pairs per line in which the keys are strings that are followed by their corresponding value objects where the values are any valid JSON with the exception that the JSON block does not contain any new line characters in it (encoded newline "\n" is allowed)." Here is an example:
@context ["http://oduwsdl.github.io/contexts/arhiveprofiles"]
@id {"uri": "http://archive.org/"}
@keys ["surt_uri", "year"]
@meta {"name": "Internet Archive", "year": 1996}
@meta {"updated_at": "2015-09:03T13:27:52Z"}
com,cnn)/world - {"urim": {"min": 2, "max": 9, "total": 98}, "urir": 46}
uk,ac,rpms)/ - {"frequency": 241, "spread": 3}
uk,co,bbc)/images 2013 {"frequency": 725, "spread": 1}

Lines starting with @ sign signify special purpose keys and they make these lines to appear together at the top of the file when sorted. The first line of the above example with the key @context provides context to the keywords used in the rest of the document. The value of this entry can be an array of contexts or an object with named keys. In the case of an array, all the term definitions from all the contexts will be merged in the global namespace (resolving name conflicts will be the responsibility of the document creator) while in the case of a named object it will serve like the XML Namespace.

The second entry @id holds an object that identifies the document itself and established relationship with other documents such as a parent/sibling when it is split. The @keys entry specifies the name of the key fields in the data section as an array of the field names in the order they appear (such as the primary key name appears first then the secondary key, and so on). To add more information about the keys, each element of the @keys array can have an object. All the lines except the special keys (@-prefixed) must have the exact number of fields as described in the @keys entry. Missing fields in the keys must have the special placeholder character "-".

The @meta entries describe the aboutness of the resource and other metadata. Multiple entries of the same special keys (that start with an @ sign) should be merged at the time of consuming the document. Splitting them in multiple lines increases the readability and eases the process of updates. This means the two @meta lines can be combined in a single line or split into three different lines each holding "name", "year", and "updated_at" separately. The policy to resolve the conflicts in names when merging such entries should be defined per key basis as suitable. These policies could be "skip", "overwrite", "append" (specially for the values that are arrays), or some other function to derive new value(s).

The latter three lines are the data entries in which the first one starts with a key com,cnn)/world (which is the SURT for of the http://www.cnn.com/world) followed by a nested data structure (in JSON format) that holds some statistical distribution of the archive holdings under that key. The next line holds different style of statistics (to illustrate the flexibility of the format) for a different key. The last line illustrates a secondary key in which the primary keys is the SURT form of a URI followed by the a secondary key that further divides the distribution yearly.

Now, let's reproduce the above example in JSON-LD, YAML, and XML respectively for comparison:
{
  "@context": "http://oduwsdl.github.io/contexts/arhiveprofiles",
  "@id": "http://archive.org/",
  "meta": {
    "name": "Internet Archive",
    "year": 1996,
    "updated_at": "2015-09:03T13:27:52Z"
  },
  "surt_uri": {
    "com,cnn)/world": {
      "urim": {
        "min": 2,
        "max": 9,
        "total": 98
      },
      "urir": 46
    },
    "uk,ac,rpms)/": {
      "frequency": 241,
      "spread": 3
    },
    "uk,co,bbc)/images": {
      "year": {
        "2013": {
          "frequency": 725,
          "spread": 1
        }
      }
    }
  }
}
---
  @context: "http://oduwsdl.github.io/contexts/arhiveprofiles"
  @id: "http://archive.org/"
  meta: 
    name: "Internet Archive"
    year: 1996
    updated_at: "2015-09:03T13:27:52Z"
  surt_uri: 
    com,cnn)/world: 
      urim: 
        min: 2
        max: 9
        total: 98
      urir: 46
    uk,ac,rpms)/: 
      frequency: 241
      spread: 3
    uk,co,bbc)/images: 
      year: 
        2013: 
          frequency: 725
          spread: 1
<?xml version="1.0" encoding="UTF-8"?>
<profile xmlns="http://oduwsdl.github.io/contexts/arhiveprofiles">
  <id>http://archive.org/</id>
  <meta>
    <name>Internet Archive</name>
    <year>1996</year>
    <updated_at>2015-09:03T13:27:52Z</updated_at>
  </meta>
  <data>
    <record surt-uri="com,cnn)/world">
      <urim>
        <min>2</min>
        <max>9</max>
        <total>98</total>
      </urim>
      <urir>46</urir>
    </record>
    <record surt-uri="uk,ac,rpms)/">
      <frequency>241</frequency>
      <spread>3</spread>
    </record>
    <record surt-uri="uk,co,bbc)/images" year="2013">
      <frequency>725</frequency>
      <spread>1</spread>
    </record>
  </data>
</profile>

The WAT format, commonly used in the web archiving community also uses JSON fragments as values for each entry separately to deal with the single root document issue, but it does not restrict the use of new-line character. As a consequence, sorting the file line-wise is not allowed, which affects the lookup speed. In contrast, CDXJ files can be sorted (like CDX files) which allows binary search in the files on the disk and prove very efficient in lookup heavy applications.

We have presented the earlier thoughts to seek feedback on the CDXJ serialization format at Stanford University during IIPC GA 2015. The slides of the talk are available at:


Going forward we are proposing to split the syntax and semantics of the format in separate specifications where the overall syntax of the file is defined as a base format while further restrictions and semantics such as adding meaning to the keys, making certain entries mandatory, giving meaning to the terms, enforcing specific sort order and defining the scope of the usage for the document are described in a separate derived specification. This practice is analogous to the XML which defines the basic syntax of the format and other XML-based formats such as XHTML or  Atom add semantics to it.

A generic format for this purpose can be defined as Object Resource Stream (ORS) that registers ".ors" file extension and "application/ors" media type. Then CDXJ extends from that to add semantics to it (as described above) which registers ".cdxj" file extension and "application/cdxj+ors" media type.

Object Resource Stream (ORS)

The above railroad diagram illustrates the grammar of the ORS format. Every entry in this format acquires one line. Empty lines are allowed that should be skipped during the consumption of the file. Apart from the empty lines, every line starts with a string key, followed by a single-line JSON block as the value. The keys are allowed to have optional leading or trailing spaces (SPACE or TAB characters) for indentation, readability, or alignment purposes, but should be skipped when consuming the file. Keys can be empty strings which means values (JSON blocks) can be present without being associated with any keys. Quoting keys is not mandatory, but if necessary one can use double quotes for the purpose. Quoted string keys will preserve any leading or trailing spaces inside quotes. None of the keys or values are allowed to break the line (new lines should be encoded if any) as the line break starts a new entry. As mentioned before, it is allowed to have a value block without a corresponding key, but not the opposite. Since the opening square and curly brackets indicate the start of the JSON block, hence it is necessary to escape them (as well as the escape and double quote characters) if they appear in the keys, and optionally their closing pairs should also be escaped. An ORS parser should skip malformed lines and continue with the remaining document. Optionally the malformed entries can be reported as warnings.

CDXJ (or CDX-JSON)

The above railroad diagram illustrates the grammar of the CDXJ format. CDXJ is a subset of ORS as it introduces few extra restriction in the syntax that are not present in the ORS grammar. In the CDXJ format the definition of the key string is strict as it does not allow leading spaces before the key or empty string as the key. If there are spaces in the CDXJ key string, it is considered a compound key where every space separated segment has an independent meaning. Apart from the @-prefixed special keys, every key must have the same number of space separated fields and empty fields use the placeholder "-". CDXJ only allows a single SPACE character to be used as the delimiter between the parts of the compound key. It also enforces a SPACE character to separate the key from the JSON value block. As opposed to the ORS, CDXJ does not allow TAB character as the delimiter. Since the keys cannot be empty strings in CDXJ, there must be a non-empty key associated with every value in it. Additionally, the CDXJ format also prohibits empty lines. These restrictions are introduced in the CDXJ to encourage its use as sorted files to facilitate binary search on the disk. When sorting CDXJ files, byte-wise sorting is encouraged for greater interoperability (this can be achieved on Unix-like operating systems by setting an environment variable LC_ALL=C). On the semantics side CDXJ introduces optional @-prefixed special keys to specify metadata, the @keys key to specify the field names of the data entries, and the @id and the @context keys to provision linked-data semantics inspired by JSON-LD.

Applications

There are many applications where a stream of JSON block is being used or can be used. Some of the applications even enforce the single line JSON restriction and optionally prefix the JSON block with associated keys. However, the format is not formally standardized and it is often called JSON for the sake of general understanding. The following are some example applications of the ORS or CDXJ format:
  • Archive Profiler generates profiles of the web archives in CDXJ format. An upcoming service will consume profiles in the CDXJ format to produce a probabilistic rank ordered list of web archives with the holdings of a given URI.
  • PyWB accepts (and encourages the usage of) CDXJ format for the archive collection indexes and the built-in collection indexer allows generating CDXJ indexes.
  • MemGator is a Memento aggregator that I built. It can be used as a command line tool or run as a web service. The tool generates TimeMaps in CDXJ format along with Link and JSON formats. The CDXJ format response is sorted by datetime as the key and it makes it very easy and efficient to consume the data chronologically or using text processing tools to perform filtering based on partial datetime.
  • 200 million Reddit link (posts) corpus that I collected and archived recently (it will be made publicly available soon) in CDXJ format (where the key is the link id), while 1.65 billion Reddit comments corpus is available in a format that conforms ORS format (although it is advertised as series of JSON blocks delimited by new lines).
  • A very popular container technology Docker and a very popular log aggeragation and unification service Fluentd are using a data format that conforms the above described specification of ORS. Docker calls their logging driver JSON which actually generates a stream of single-line JSON blocks that can also have the timestamp prefix with nano-second precision as the key for each JSON block. Fluentd log is similar, but it can have more key fields as prefix to each line of JSON block.
  • NoSQL databases including key-value store, tuple store, data structure server, object database, and wide column store implementations such as Redis and CouchDB can use ORS/CDXJ format to import and export their data from and to the disk.
  • Services that provide data streams and support JSON format such as Twitter and Reddit can leverage ORS (or CDXJ) to avoid unnecessary wrapper around the data to encapsulate the under a root node. This will allow immediate consumption of the stream of the data as it arrives to the client, without waiting for the end of the stream.
In conclusion, we proposed a generic Object Resource Stream (ORS) data serialization format that is composed of single line JSON values with optional preceding string keys per line. For this format we proposed the file extension ".ors" and the media-type "application/ors". Additionally, we proposed a derivative format of ORS as CDXJ with additional semantics and restrictions. For the CDXJ format we proposed the file extension ".cdxj" and the media-type "application/cdxj+ors". The two formats ORS and CDXJ can be very helpful in dealing with endless streams of structured data such as server logs, Twitter feed, and key-value stores. These formats allow arbitrary information in each entry (like schema-less NoSQL databses) as opposed to the fixed-field formats such as spreadsheets or relational databases. Additionally, these formats are text processing tool friendly (such as sort, grep, and awk etc.) which makes them very useful and efficient for file based data store. We have also recognized that the proposed formats are already in use on the Web and have proved their usefulness. However, they are neither formally defined nor given a separate media-type.

--
Sawood Alam

1 comment:

  1. Hi Sawood,

    I definitely like having a more formal spec for CDXJ, and I like the "@" metadata addition.

    I an not sure that there is a need for the more general ORS, unless there other communities that would like to be involved in and willing to adopt a more generic standard.

    It may be more beneficial to just focus on CDXJ for web archiving / url analysis, and use "application/cdxj". The exact semantics are already pretty open-ended for CDXJ.

    (A quick search for line-delimited JSON formats reveals:

    http://ndjson.org/
    http://jsonlines.org/

    The addition that ORS would bring, a key field, is only useful for situations where lexicographic ordering is important, and I don't know how widespread that would be)

    Ilya



    ReplyDelete