Friday, October 19, 2018

Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a straightforward job, given that Python has long had sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here I share some tricks I learned. They are not meant to be a complete list, but the solutions are general, so they can serve as starting points for future XML parsing jobs.

  • CDATA. CDATA stands for Character Data and is seen in the values of many XML fields. Strings inside a CDATA section are not parsed; in other words, they are kept exactly as they are, including markup. One example is
    <![CDATA[  <message> Welcome to TutorialsPoint </message>  ]]>
  • Encoding. Encoding is a pain in text processing. The problem is that there is no way to know what encoding a text uses before opening and reading it (at least in Python). So we must sniff it by trying to open and read the file with a candidate encoding. If the encoding is wrong, the program usually throws an error; in that case, we try another possible encoding. The "file" command in Linux gives encoding information, so I know there are 2 encodings in the ACM DL XML files: ASCII and ISO-8859.
  • HTML entities, such as &auml;. The only 5 built-in entities in XML are &quot;, &amp;, &apos;, &lt;, and &gt;. Any other entity should be defined in a DTD file to state what it means. For example, the DBLP.xml file comes with a DTD file. The ACM DL XML should have associated DTD files (proceedings.dtd and periodicals.dtd), but they are not in my dataset.
The following snippet of Python code solves all three problems above and gives me the correct parsing results.

    import io
    import logging
    from bs4 import BeautifulSoup

    encodings = ['ISO-8859-1', 'ascii']
    for e in encodings:
        try:
            fh =['xmlfile'], 'r', encoding=e)
  # a wrong encoding raises UnicodeDecodeError
        except UnicodeDecodeError:
            logging.debug('got unicode error with %s, trying a different encoding' % e)
        else:
            logging.debug('opening the file with encoding: %s' % e)
            break
    f =['xmlfile'], encoding=e)
    soup = BeautifulSoup(, 'html.parser')

Note that we use instead of the Python built-in open(), and that we open the file twice: the first time only to check the encoding, and the second time to pass the whole file to a handle before it is parsed by BeautifulSoup. I found BeautifulSoup better for handling XML parsing than lxml, not just because it is easier to use but also because it lets you pick the parser. Note that I chose html.parser instead of the lxml parser, because the lxml parser was not able to parse all entries (for some unknown reason), a problem also reported by other users on Stack Overflow.
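When the DTD defining non-built-in entities is missing, one workaround is to unescape HTML entities with Python's standard library. A minimal sketch (the sample string is hypothetical, not from the ACM data):

```python
import html

# &auml; and &ouml; are HTML entities, not among XML's five built-ins;
# html.unescape() resolves them without needing a DTD
s = 'G&auml;rtner &amp; S&ouml;hne'
print(html.unescape(s))  # → Gärtner & Söhne
```

This only helps for entities that happen to be standard HTML names; entities defined solely in the missing DTD would still be unresolved.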

Thursday, October 11, 2018

2018-10-11: iPRES 2018 Trip Report

September 24th marked the beginning of iPRES 2018 in Boston, MA, for which both Shawn Jones and I traveled from New Mexico to present our accepted papers: Measuring News Similarity Across Ten U.S. News Sites, The Off-Topic Memento Toolkit, and The Many Shapes of Archive-It.

iPRES ran paper and workshop sessions in parallel, so I will focus on the sessions I was able to attend. However, this year the organizers created and shared collaborative notes for all sessions with all attendees, to help those who couldn't attend individual sessions. All the presentation materials and associated papers were also made available via Google Drive.

Day 1 (September 24, 2018): Workshops & Tutorials

On the first day of iPRES, attendees gathered at the Joseph B. Martin Conference Center at Harvard Medical School to pick up their registration lanyards and iPRES swag.

Afterwards, there were workshops and tutorials to enjoy throughout the day. Registrants needed to sign up early to get into these workshops. Many different topics, listed on the Open Science Framework event page, were available for attendees to choose from. Shawn and I chose to attend:
  • Archiving Email: Strategies, Tools, Techniques. A tutorial by: Christopher John Prom and Tricia Patterson.
  • Human Scale Web Collecting for Individuals and Institutions (Webrecorder Workshop). A workshop by: Anna Perricci.
Our first session, on Archiving Email, consisted of talks and small-group discussions on various topics and tools for archiving email. It started with talks on the adoption of email preservation systems in organizations. During our group discussion, we found that few organizations have email preservation systems. I found the research ideas and topics stemming from these talks very interesting, especially the prospect of studying natural language in email content.
Many of the difficulties of archiving email unsurprisingly revolve around privacy: actually requesting and acquiring emails from users, discovering and disclosing sensitive information inside emails, and other ethical decisions involved in preserving email.

Email preservation also has the challenge of curating at scale. As one can imagine, going through millions of emails in a collection is time-consuming and redundant, which requires the development of new tools to combat these challenges.
This workshop also exposed many interesting tools to use for archiving and exploring emails including:

Many different workflows for archiving email and also using the aforementioned tools for archiving emails were explained thoroughly at the end of the session. These workflows covered migrations with different tools, accessing disk images of stored emails and attachments via emulation, and bit-level preservation.

Following the email archiving session we continued on for the Human Scale Web Collecting for Individuals and Institutions session presented by Anna Perricci from the Webrecorder team.

Having used Webrecorder before, I was very excited for this session. Anna walked through the process of registering and starting your first collection. She explained how to start sessions and how collections are formed as easily as clicking different links on a website. Webrecorder can handle JavaScript replay very efficiently: for example, videos streamed from a website like Vine or YouTube are recorded from a user's perspective and are then available for replay later in time. Other examples included automated scrolling through Twitter feeds and capturing interactive news stories from the New York Times.
During the presentation Anna showed Webrecorder's capability of extracting mementos from other web archives to potentially repair missing content. For example, it managed to take CNN mementos from the Internet Archive from after November 1, 2016 and fix their replay by aggregating resources from other web archives and the live web, although this could also be potentially harmful. This is an example of Time Travel Reconstruct, implemented in pywb.

Ilya Kreymer presented the use of Docker containers for emulating different browser environments and how they could play an important role in replaying specific content like Flash. He demonstrated various tools available as open source on GitHub, including pywb, Webrecorder WARC player, warcio, and warcit.
Ilya also teased Webrecorder's Auto Archiver prototype, a system that understands how Scalar websites work and can anticipate URI patterns and other behaviors of these platforms. Auto Archiver automates the capture of many different web resources on a website, including video and other sources.
Webrecorder Scalar automation demo for a Scalar website

To finish the first day, attendees were transported to a reception hosted at the MIT Samberg Conference Center accompanied by a great view of Boston.

Day 2 (September 25, 2018): Paper Presentations and Lightning Talks

To start the day attendees gathered for the plenary session which was opened by a statement from Chris Bourg.

Eve Blau then continued the session by presenting Urban Intermedia: City, Archive, Narrative, the capstone project of a Mellon Foundation grant, the Harvard Mellon Urban Initiative. It is a collaborative effort across multiple institutions, spanning architecture, design, and the humanities. Using multimedia and visual constructs, it looked at processes and practices that shape geographical boundaries, focusing on blind spots in:
  • Planned / unplanned - informal processes
  • Migration / mobility, patterns, modalities of inclusion & exclusion
  • Dynamic of nature & technology, urban ecologies
After the keynote I hurried over to open the Web Preservation session with my paper on Measuring News Similarity Across Ten U.S. News Sites. I explained our methodology for selecting archived news sites, the tool top-news-selectors we created for mining archived news, how the similarity of news collections was calculated, the events that peaked in similarity, and how the U.S. election was recognized as a significant event among many of the news sites.

Following my presentation, Shawn Jones presented his paper The Off-Topic Memento Toolkit. Shawn's presentation focused on the many different use cases of Archive-It, and then detailed how many of these collections can go off-topic: for example, pages missing resources at a point in time, content drift that causes different languages to be included in a collection, site redesigns, etc. This led to the development of the Off-Topic Memento Toolkit, which detects off-topic mementos inside a collection by collecting each memento and assigning it a score, testing multiple different measures. The study showed that Word Count had the highest accuracy and the best F1 score for detecting off-topic mementos.

Shawn also presented his paper The Many Shapes of Archive-It. He explained how to understand Archive-It collections using their content, metadata (Dublin Core and custom fields), and collection structure, as well as the issues that come with these methods. Using 9,351 collections from Archive-It as data, Shawn explained the concept of growth curves for collections, which compare seed count, memento count, and memento-datetime. Using different classifiers, Shawn showed that the structural features of a collection can predict its semantic category, with Random Forest found to be the best classifier.

Following lunch, I headed to the amphitheater to see Dragan Espenschied's short paper presentation Fencing Apparently Infinite Objects. Dragan questioned how objects, here synonymous with a file or a collection of files, are bounded in digital preservation. He used the concept of "performative boundaries" to explain the different potentials of an object: bound, blurry, and boundless, illustrated with software examples like early-2000s Microsoft Word (bound), Apple's QuickTime (blurry), and Instagram (boundless). He shared productive approaches for future replay of these objects:

  • Emulation of auxiliary machines
  • Synthetic stub services or simulations
  • Capture network traffic and re-enact on access 

Dragan Espenschied presenting on Apparently Infinite Objects
The next presentation was Digital Preservation in Contemporary Visual Art Workflows by Laura Molloy, who presented remotely. The presentation made the case that digital preservation of one's own work is usually not taught at art school, and it should be, since digital technologies, in a variety of formats, are widely used for creating art today. When asked about digital preservation, various artists answered:
“It’s not the kind of thing that gets taught in art school, is it?”
“You don’t need to be trained in [using and preserving digital objects]. It’s got to be instinctive and you just need to keep it very simple. Those technical things are invented by IT guys who don’t have any imagination.” 
The third presentation was by Morgane Stricot on her short paper Open the museum's gates to pirates: Hacking for the sake of digital art preservation. Morgane explained that software dependency is a large threat to digital art and that supporting media archaeology is required to preserve some forms of digital art. Backups of older operating systems (OS) on disk help avoid incompatibility issues. She also detailed how copyright restrictions make some software, for example older versions of Mac OS, difficult to find, and that many pirates as well as "harmless hackers" have cracks to regain access to these OS environments, while some remain unsalvageable.
The final paper presentation was by Claudia Roeck on her long paper Evaluation of preservation strategies for an interactive, software-based artwork with complex behavior, using the case study Horizons (2008) by Geert Mul. Claudia explored possible preservation strategies for software, such as reprogramming in a different programming language, migration of the software, virtualization, and emulation, as well as the significant properties that determine which qualities one would want to preserve. Using Horizons as an example, she determined that reprogramming was the suitable option, though she noted there was no clear winner for the best mid-term preservation strategy for the work.
For the rest of the day, lightning talks were held and the room became packed with viewers. Some of the talks introduced preservation games to be held the next day, such as Save my Bits, Obsolescence, the Digital Preservation Storage Criteria Game, and more. Ilya, from Webrecorder, gave a lightning talk demoing the new Auto Archiver prototype for Webrecorder.

After the proceedings another fantastic reception was held, this time at the Harvard Art Museum.

Harvard Art Museum at night

Day 3 (September 26, 2018): Minute Madness, Poster Sessions, and Awards 

This day opened with a review of iPRES's achievements and challenges over the past 15 years, with a panel composed of William Kilbride, Eld Zierau, Cal Lee, and Barbara Sierman. Achievements included the innovation of new research as well as the courage to share and collaborate among peers with similar research interests, which led to iPRES's adoption of cross-domain preservation in libraries, archives, and digital art. Challenges include archivists deciding what to do with past and future data, and conforming to the OAIS standard.
After talking about the past 15 years, it was time to talk about the next 15, with a panel composed of Corey Davis, Micky Lindlar, Sally Vermaaten, and Paul Wheatley. This panel discussed how to make future meetings accessible to more attendees, including possible organizational models to emulate for regional meetings, such as code4lib and NDSR. There were also suggestions for updates to the Code of Conduct and the value it should hold for the future.
After the discussion panels it was time for minute madness. I had seen videos of this before, but it was the first time I had seen it in person, and I found it somewhat theatrical: most presenters had to pitch their research in a minute so we would visit them later during the poster session, while some of them put on a show, like Remco van Veenendaal. The topics ranged from workflow integration, new portals for preserving digital materials, and code ethics to timelines detailing file formats.

After the minute madness attendees wandered around to view the posters available. The first poster I visited conveniently was referencing work from our WSDL group!
Another interesting poster consisted of research into file format usage over time.
I was also surprised at the number of tools and technologies in some of the new preservation platforms for government agencies that have emerged, like Vitam, the French government's IT program for digital archiving.

Vitam poster presentation for their digital archiving architecture
Following the poster sessions I was back at the paper presentations, where Tomasz Miksa presented his long paper Defining requirements for machine-actionable data management plans. The talk covered machine-actionable data management plans (maDMPs): living documents automated by information collection and notification systems. He showed how currently formatted data management plans could be transformed to reuse existing standards such as Dublin Core and PREMIS.
Alex Green then went on to present her short paper Using blockchain to engender trust in public digital archives. She explained that archivists alter, migrate, normalize, and sometimes otherwise change digital files, but there is little proof that a researcher receives an authentic copy of a digital file. The ARCHANGEL project proposes to use blockchain to verify the integrity and provenance of these files. It is still unknown whether blockchain will prevail as a lasting technology, as it is still very new. David Rosenthal wrote a review of this paper on his blog.
I then went on to the Storage Organization and Integrity session to see the long paper presentation Checksums on Modern Filesystems, or: On the virtuous consumption of CPU cycles by Alex Garnett and Mike Winter. The talk focused on computing checksums on files to detect bit rot in digital objects and compared different approaches for verifying bit-level preservation. It showed that data integrity can be achieved when computer hardware is dedicated to digital preservation, for example with filesystems using ZFS. This work bridges digital preservation practices and high-performance computing for detecting bit rot.

After this presentation I stayed for the short paper presentation The Oxford Common File Layout by David Wilcox. The Oxford Common File Layout (OCFL) is an effort to define a shared approach to file hierarchy for long-term preservation. The goals of this layout are structure at scale, being easily ready for migrations while minimizing file transfers, and being manageable by many different applications. With a set of defined principles, such as the ability to log transactions on digital objects, a draft spec is planned for release sometime at the end of 2018.
This day closed with the award ceremony for best poster, short papers, and long papers. My paper, Measuring News Similarity Across Ten U.S. News Sites, was nominated for best long paper but did not prevail as the winner. The winners were as follows:
  • Best short paper: PREMIS 3 OWL Ontology: Engaging Sets of Linked Data
  • Best long paper: The Rescue of the Danish Bits - A case study of the rescue of bits and how the digital preservation community supported it  by Eld Zierau
  • Best poster award: Precise & Persistent Web Archive References by Eld Zierau

Day 4 (September 27, 2018): Conference Wrap-up

The final day of iPRES 2018 was composed of paper presentations, discussion panels, community discussions, and games. I chose to attend the paper presentations.

The first paper presentation I viewed was Between creators and keepers: How HNI builds its digital archive by Ania Molenda. Over 4 million documents were recorded to track progressive thinking in Dutch architecture. When converting and pushing these materials into a digital archive, many issues were observed, such as duplicate materials, file formats with complex dependencies, the time and effort needed to digitize the multitude of documents, and knowledge about accessing these documents lost over time with no standards in place.

Afterwards I watched the presentation on Data Recovery and Investigation from 8-inch Floppy Disk Media: Three Use Cases by Abigail Adams. It covered the acquisition of three different floppy disk collections dating from 1977 to 1989! The presentation introduced me to some foreign hardware, software, and encodings required for attempting to recover data from floppy disk media, as well as a workflow for data recovery from these floppies.

The last paper presentation I viewed was Email Preservation at Scale: Preliminary Findings Supporting the Use of Predictive Coding by Joanne Kaczmarek and Brent West. Having already been to the email preservation workshop, I was excited for this presentation and was not let down. Using 20 GB of publicly available emails, they applied two different methods, a capstone approach and a predictive coding approach, for discovering sensitive content inside emails. With predictive coding, which uses machine learning to train on and classify documents, they showed preliminary results indicating that automatic email classification is capable of handling email at scale.

As a final farewell, attendees were handed bags of tulip buds and told this:
"An Honorary Award will be presented to the people with the best tulip pictures."
It seems William Kilbride, among others, has already got a leg up on the competition.
This marks the end of my first academic conference as well as my first visit to Boston, Massachusetts. It was an enjoyable experience with a lot of exposure to diverse research fields in digital preservation. I look forward to submitting work to this conference again and hearing about future research in the realm of digital preservation.

Resources for iPRES 2018:

Wednesday, October 10, 2018

2018-10-10: Americans More Open Than Asians to Sharing Personal Information on Twitter: A Paper Review

Mat Kelly reviews "A Personal Privacy Preserving Framework..." by Song et al. at SIGIR 2018.

Americans are more open to share personal aspects on the Web than Asians. — Song et al. 2018

I recently read a paper published at SIGIR 2018 by Song et al. titled "A Personal Privacy Preserving Framework: I Let You Know Who Can See What" (PDF). The title alone captivated my interest with the above claim deep within the text.

The authors' goal was to reduce users' privacy risks on social networks by determining who could see what sort of information they posted. They did so by establishing boundary regulations from a summary of the literature and associating them with 32 categories, each corresponding to a personal aspect of a user, broken down into 8 groups spanning personal attributes to life milestones. The authors then fed a list of keywords to the Twitter Search Service for each category they established. From this taxonomy they created a model to uncover personal aspects from users' posts. Their model, TOKEN (a forced acronym of laTent grOup multi-tasK lEarniNg), allowed the authors to create guidelines for information disclosure by users into four kinds of social circles and to generate a data set consisting of a rich set of privacy-oriented features (available here).

The authors noted that users' private tweets are very sparse, so they used the Twitter service to gather posts matching the categories in their taxonomy, collecting just over 269k tweets. To reduce noise in the collection, the authors filtered out tweets containing URLs that did not reference the users' respective other social media posts. Retweets and tweets of fewer than 50 characters were also excluded; the authors did not justify this exclusion.

To establish a ground truth, the authors used Amazon Mechanical Turk to have each post annotated with their selected categories. Turkers whose annotations did not agree at least 80% with the authors' sample were excluded from the results. This procedure resulted in just over 11k posts being labeled. To determine inter-worker reliability, the authors employed Fleiss' kappa (PDF of the 1969 paper), adapting for the potential variance in label count per post by reducing to a binary classification, and found moderate agreement (a Fleiss coefficient of 0.43).
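Fleiss' kappa itself is straightforward to compute from a matrix of per-item category counts. A minimal sketch (the toy count matrix is my own illustration, not the authors' data):

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters assigning item i to category j;
    every item must be rated by the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    # observed agreement: per-item pairwise agreement, averaged over items
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # chance agreement from the overall category proportions
    totals = [sum(row[j] for row in counts) for j in range(len(counts[0]))]
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    return (p_bar - p_e) / (1 - p_e)

# perfect agreement on two items across two categories yields kappa = 1.0
print(fleiss_kappa([[3, 0], [0, 3]]))  # → 1.0
```

A coefficient of 0.43 falls in the band conventionally read as "moderate agreement".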

The authors then extracted a set of privacy-oriented linguistic features using Linguistic Inquiry Word Count (LIWC), a Privacy Dictionary (per Vasalou et al.'s 2011 JASIST work), Sentiment Analysis (via Stanford's NLP classifier), Sentence2Vector (with each tweet a sentence), and an ad hoc meta-feature approach. The aforementioned final approach considered the presence of hashtags, slang words, images, emojis, and user mentions. Slang, here, was identified using the Internet Slang Dictionary.

Following this analysis, the authors established a prediction component by first formulating a predictive model inter-relating each of the 32 "tasks" within the 8 "groups". The authors anticipated that tasks within the same group would share relevant features; e.g., "places planning to go" and "current location" would share common features within the location group of their taxonomy. From this initial formulation they established the matrix L, whose columns represent the latent features, and S, whose rows represent the weights of the features in L.

To solve for L and S, the authors optimized one variable while fixing the other in each iteration. To determine L, they took the derivative of their objective function (their Equation 5, see paper) with respect to L to produce a linear system with a vector B, representing the stacking of columns into a single matrix, and A, a definite and invertible matrix. Computing S with L fixed is a bit more mathematically complex, so I will leave it as an exercise for the interested reader.
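Stripped of the authors' regularizers, this alternating scheme looks like alternating least squares for a factorization W ≈ L S: fix one factor, solve a linear system for the other, and repeat. A simplified sketch of the general idea (my own illustration, not the paper's objective):

```python
import numpy as np

def alternating_ls(W, rank, iters=25, seed=0):
    # approximate W ≈ L @ S by alternating least squares
    rng = np.random.default_rng(seed)
    S = rng.standard_normal((rank, W.shape[1]))
    for _ in range(iters):
        # fix S, solve the least-squares problem for L
        L = np.linalg.lstsq(S.T, W.T, rcond=None)[0].T
        # fix L, solve for S
        S = np.linalg.lstsq(L, W, rcond=None)[0]
    return L, S
```

On an exactly low-rank W, the reconstruction L @ S recovers W up to numerical precision; the paper's version adds regularization terms that couple the tasks within each group.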


...there is still a societal consensus that certain information is more private than the others (sic) from a general societal view. — A. Islam et al. 2014

The authors used Mechanical Turk to build guidelines regarding disclosure norms in different circles. This was performed on two selections of Turkers limited to the respective geographies of the U.S. and Asia. The authors note that 99% of the Asian participants were Indian. An anticipated real-world goal of the authors was, for example, that when a user posts a tweet containing information on a health condition, the privacy setting would be set to share it only with her family members. This, I felt, would be an odd recommendation given:

  1. The corpus was of publicly available tweets.
  2. Twitter does not currently have a means of limiting who may see a tweet akin to services like Facebook.

This drastically reduces the usefulness of the recommendation, I feel, in the context of the medium observed.


The authors sought to detect privacy leakage by comparing the precision of TOKEN using the S@K and P@K metrics, as they had previously done in Song et al. 2015 (IJCAI). Here, S@K represents the mean probability that a correct interest is captured within the top K recommended categories, and P@K the proportion of the top K recommendations that are correct. They used a grid search strategy with 10-fold cross-validation to obtain the optimal parameters.
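As defined here, the two metrics can be sketched directly (the helper names and toy data are my own, not the paper's):

```python
def success_at_k(truth, ranked, k):
    # fraction of items whose top-k recommendations contain a correct category
    hits = [any(r in t for r in preds[:k]) for t, preds in zip(truth, ranked)]
    return sum(hits) / len(hits)

def precision_at_k(truth, ranked, k):
    # mean proportion of the top-k recommendations that are correct
    props = [sum(r in t for r in preds[:k]) / k for t, preds in zip(truth, ranked)]
    return sum(props) / len(props)

truth = [{'health'}, {'location'}]
ranked = [['health', 'work'], ['work', 'location']]
print(success_at_k(truth, ranked, 2))    # → 1.0
print(precision_at_k(truth, ranked, 2))  # → 0.5
```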

Using S@K and P@K where K was set to 1, 3, and 5, the authors found LIWC to be most representative of the characterization of users' privacy features as compared to the aforementioned Privacy Dictionary, Sentence2Vector, etc. approaches. They attributed this to LIWC's inclusion of pronouns and verb tense that provide references and temporal hints.

In applying these feature configurations to their corpus, the authors noticed that timestamps played an important role in identifying private information leakage, so they took a detour to explore this cursorily. Based on the patterns found (pictured below), various activities peak at certain times of day, e.g., drug and alcohol tweets around "20pm" (sic). It is unclear from the paper whether this was applied to both the U.S. and the Asian results. Further, the multiple-plot display with variable inter-plot y-axis scales produces a deceptive result that the authors do not address.

Plots per Song et al. show temporal patterns.

To finally validate their model compared to S@K and P@K, they used SVM, MTL_Lasso,

A different set of categories was compared to show similarities in sharing comfort between Americans and Asians. Of these (healthcare treatments, health conditions, passing away, specific complaints, home address, current location, contact information, and places planning to go), American Turkers were much more restrictive about sharing with the outside world, whereas Asian Turkers exhibited a similar and relatively conservative sentiment about sharing. From this, the authors concluded:

Americans are more open to share personal aspects on the Web than Asians. — Song et al. 2018

Take Home

I found this study interesting despite some of the methodological problems and derived conclusions. As I mentioned, the inability to regulate who sees tweets when posting (à la Facebook) affects the nature of the tweet, with a likely bias toward the tweeter being less concerned about privacy. The authors did not mention whether each Turker was asked if they personally used Twitter, or whether the Turkers were told that the texts they judged were tweets and not just "messages posted online". If that context was excluded, those judging the tweets may have been unsuitable to do so. I would hope to see an expanded version of this study (say, posted to arXiv) with more comprehensive results, as the authors stated space was a limitation, but there was no indication of one.

—Mat (@machawk1)