Monday, December 17, 2018

2018-12-17: CoQA Challenge: Machine Reading Competition Recent Result

CoQA is a dataset containing more than 127,000 questions with answers, collected from more than 8,000 conversations. Each conversation is a series of questions and answers about a passage. One example passage is below:

Once upon a time, in a barn near a farm house, there lived a little white kitten named Cotton. Cotton lived high up in a nice warm place above the barn where all of the farmer's horses slept. But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters. All of her sisters were cute and fluffy, like Cotton. But she was the only white one in the bunch. The rest of her sisters were all orange with beautiful white tiger stripes like Cotton's mommy. Being different made Cotton quite sad. She often wished she looked like the rest of her family. So one day, when Cotton found a can of the old farmer's orange paint, she used it to paint herself like them. When her mommy and sisters found her they started laughing. 

"What are you doing, Cotton?!" 

"I only wanted to be more like you". 

Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. 

"Don't ever do that again, Cotton!" they all cried. "Next time you might mess up that pretty white fur of yours and we wouldn't want that!" 

Then Cotton thought, "I change my mind. I like being special".

This reads like a picture-book story, which gives a sense of the kind of text current machine reading systems can handle. The sample questions and their answers are:
Q  What color was Cotton?
A  white || a little white kitten named Cotton
A  white || white kitten named Cotton
A  white || white 
A  white || white kitten named Cotton.
Q  Where did she live?
A  in a barn || in a barn near a farm house, there lived a little white kitten
A  in a barn ||  in a barn near a farm house, there lived a little white kitten named Cotton
A  in a barn || in a barn
A  in a barn near ||  in a barn near a farm house, there lived a little white kitten named Cotton.
Q  Did she live alone?
A  no || Cotton wasn't alone
A  no || But Cotton wasn't alone
A  No ||  wasn't alone
A  no ||  But Cotton wasn't alone in her little home above the barn, oh no. She shared her hay bed with her mommy and 5 other sisters.

Note that a question can have multiple answers because each one is based on a different span quoted from the story. The quoted spans serve as explanations or justifications of the answers.
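To make the answer/evidence pairing concrete, here is a minimal sketch of how one conversation turn could be represented and how the quoted evidence can be located in the passage. The `Turn` class, the `rationale_span` helper, and the shortened passage are my own illustrative assumptions; they are not the official CoQA data format or evaluation code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Turn:
    """One question in a conversation (hypothetical structure)."""
    question: str
    answer: str      # free-form answer
    rationale: str   # evidence text quoted from the passage


def rationale_span(passage: str, turn: Turn) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the evidence, or None."""
    evidence = turn.rationale.strip()
    start = passage.find(evidence)
    if start == -1:
        return None
    return (start, start + len(evidence))


# Shortened excerpt of the story above, for illustration only.
PASSAGE = (
    "Once upon a time, in a barn near a farm house, there lived a "
    "little white kitten named Cotton. But Cotton wasn't alone in her "
    "little home above the barn, oh no."
)

TURNS = [
    Turn("What color was Cotton?", "white", "a little white kitten named Cotton"),
    Turn("Where did she live?", "in a barn", "in a barn"),
    Turn("Did she live alone?", "no", "Cotton wasn't alone"),
]

spans = [rationale_span(PASSAGE, t) for t in TURNS]
for turn, span in zip(TURNS, spans):
    print(turn.question, "->", turn.answer, "| evidence at", span)
```

The free-form answer ("no") need not appear in the passage at all, while the evidence must; that is why the two are stored separately in this sketch.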

Until November 2018, the best model was an ensemble called SDNet, developed by Microsoft, with an overall accuracy of about 79%. In December 2018, iFlyTek and HIT (Harbin Institute of Technology) overtook it, achieving an overall accuracy of about 80% with a single model. iFlyTek is a Chinese IT company and HIT is a research institute in China. Both the SDNet model and the iFlyTek model adopt Google's BERT. The Stanford NLP group is at #8 with an accuracy of 65%. AllenAI is at #4, behind Microsoft's single model, with an accuracy of 75%. These results represent the best current performance of QA systems. The SDNet system is described in a paper on arXiv.

For the most recent results, please see the front page of the competition website.

The following is copied directly from the competition website.
The unique features of CoQA include 1) the questions are conversational; 2) the answers can be free-form text; 3) each answer also comes with an evidence subsequence highlighted in the passage; and 4) the passages are collected from seven diverse domains. CoQA has a lot of challenging phenomena not present in existing reading comprehension datasets, e.g., coreference and pragmatic reasoning.

Jian Wu

Friday, December 14, 2018

2018-12-14: New Insight to Big Data: Trip to IEEE Big Data 2018

IEEE Big Data 2018 was held at the Westin Seattle Hotel from December 10 to December 13, 2018. More than 1,100 people registered. Acceptance rates varied between 13% and 24%, with an average of 19%. I had a poster accepted, titled “CiteSeerX-2018: A Cleansed Multidisciplinary Scholarly Big Dataset”, co-authored with C. Lee Giles, two of his graduate students (Bharath and Shaurya), and an undergraduate student who produced preliminary results (Jianyu Mao). I attended the conference on Day 2 and Day 3 and left the conference hotel after the keynote on Day 3.

Insights from Personal Meetings
The most important reason to attend conferences is to meet old friends and make new ones. Old friends I met include Kyle Williams (Microsoft Bing), Mu Qiao (IBM, chair of I&G track), Yang Song (Google AI, co-chair of I&G track), Manlin Li (Google Cloud), and Madian Khabsa (Apple Siri).

Kyle introduced his recent project on recommendations inferred from dialogs. He also committed to giving an invited talk for my IR class in the Spring semester.
Mu mentioned his project on anomaly detection in time-series data.
Yang talked about his previous work on CiteSeerX and Microsoft Academic Search. He said that one big obstacle to people using MAS (and all other digital library search engines) is that none of them is comparable to Google Scholar in terms of completeness. The reason was simple: people want to see higher citation counts for their papers. He suggested that I switch my focus to mining information from the text that is not made available by publishers.
Madian told me that although I may think nobody uses Siri, there are still quite a lot of usage logs. One reason Siri is far from perfect is its relatively small team compared with those at Google and Microsoft. He also said that it is a good time to apply for academic jobs these days because industry pays far more than universities, which attracts the best AI PhDs away from academia.

I also introduced myself to Aidong Zhang, an NSF IIS director. Apparently, she knows Yaohang Li and Jing He well. I sent my CV to her. I also met Huaglory Tianfield and Liqiang Wang at the University of Central Florida.

Insights from Keynote Speakers
The two keynote speakers I liked best were Blaise Aguera y Arcas from Google AI (who is actually Yang Song's boss) and Xuedong Huang from Microsoft.

Blaise’s talk started from the first NN paper by McCulloch & Pitts (1943), now cited 16k+ times according to Google Scholar. He reviewed the development of AI since 2006, the year when deep learning researchers started to attend CS conferences. He talked about Jeff Dean, the director of Google Brain, and the recent paper by Bonawitz et al. (2016). He pointed out the recent progress on Federated Learning, that is, learning deep neural networks from decentralized data. Finally, he made a very good point: a successful application depends not only on the model but also on the data. He gave an example of a project that attempts to predict sexuality from facial features. These features strongly depend on the shooting angle of the photograph, so the model makes wrong predictions. On the other hand, a work on predicting criminality using facial features of standard ID photographs achieves a very accurate result.

Xuedong Huang’s talk was also comprehensive. He focused on the impact of big data on natural language processing, using Microsoft products as case studies. One of the most encouraging results is that Microsoft has developed effective real-time translation tools that can facilitate team meetings held in different languages. It implies that if TTS (text-to-speech) becomes very sophisticated, people may not need to learn a foreign language anymore. He also reminded people that big data is a vehicle, not the final destination. Knowledge is the final destination. He also admitted that current techniques for denoising data are not sophisticated.

The other keynote speeches did not impress me much. I always feel that although it is OK for keynote speakers to talk about their own research or products, they should try to take a higher-level view, surveying the problems the community is interested in, rather than focusing on a few very narrowly defined problems with too much jargon and too many definitions and math equations.

Impressive Talks
I selectively went to presentations and posters. My impression was that streaming data, temporal data, and anomaly detection have become more and more popular. Below are some talks I was particularly interested in.

BigSR: real-time expressive RDF stream reasoning on modern Big Data platforms (Session L9: Recommendation Systems and Stream Data Management)
The motivation is to use a semantics-based method to facilitate anomaly detection. This was the first time I had heard of Apache Flink. BigSR and Ray are promising replacements for Spark. I just took a Spark training session by PSC last week. Now there are systems faster than Spark!

Unsupervised Threshold Autoencoder to Analyze and Understand Sentence Elements (Annual Workshop on Big Data Analytics)
The author was working on a multiclass classification problem using an autoencoder. He found that the performance of the model depends on some hyperparameters, such as the number of hidden layers and/or neurons. I commented that this was an artifact of his relatively small training set (44k). With unlimited training data, the differences between model architectures may diminish. The author did not explain very well how he handles the class imbalance among training samples in different categories.

Forecasting and Anomaly Detection on Application Metrics using LSTM (In Intelligent Data Mining Workshop)
The two challenges are (1) interpretability (explaining the cause of an anomaly) and (2) rarity (how rare the abnormal sample is). The author uses Pegasos, an algorithm for solving non-linear classification with SVMs.

Multi-layer Embedding Neural Architecture with External Memory for Large-Scale Text Categorization: Mississippi state. (In Intelligent Data Mining Workshop)
The authors attempt to capture long-range correlations by storing more memory in LSTM nodes. The idea looks intuitive, but I am skeptical about (1) how useful it is for scholarly data, as the model was trained on news articles, and (2) whether the overhead is significant when classifying big data.

A machine learning based NL question and answering system for healthcare data search using complex queries (In health data workshop)
The author attempts to classify all incoming questions into 6 categories. Although this particular model looks simplistic (the author admits he has scalability issues), it may be a good idea to map all questions into a narrow range of questions. This greatly reduces dimensionality and may be a useful form of summarization.

Conference Organization, Transportation, and the City of Seattle
The organization was very good. The registration was very expensive ($700). The conference was well sponsored by Baidu and another Chinese company. One impressive part of this conference was a hackathon, which asked participants to solve a practical problem in 24 hours. I think JCDL should do something like this. The results may not be the best, but it pushes participants to think intensively within a very limited time window.

The conference center is located in Downtown Seattle. Transportation is super convenient, with bus, light rail, and monorail stations nearby that reach any place of interest. Pike Place, where the first Starbucks store is located, is a 10-minute walk. There are many restaurants with gourmet food from all over the world. I stayed at the Mediterranean Inn, 1 mile from the center, which is still within walking distance. The Expedia combo (hotel + flight) cost me $850 for a 3-night hotel stay and a round-trip flight from ORF to SEA.

Seattle is a beautiful city. It rains lightly almost all the time in this season, so local people like to wear waterproof hoodies. People are nice. I got a chance to visit the University of Washington Library, whose reading room is often compared to the Hogwarts school in Harry Potter.

Jian Wu

2018-12-14: CNI Fall 2018 Trip Report

Mat Kelly reports on his recent trip to Washington, DC for the CNI Fall 2018 meeting.

I (Mat Kelly, @machawk1) attended my first CNI (#cni18f) meeting on December 10-11, 2018, an atypical venue for a PhD student, and am reporting my trip experience (also see previous trip reports from Fall 2017, Spring 2017, Spring 2016, Fall 2015, and Fall 2009).

Dr. Nelson (@phonedude_mln) and I left Norfolk, VA for DC, having wondered beforehand whether the roads would be clear after an unseasonably significant snowstorm the night before (they were).

The conference was split up into eight sessions with up to 7 separate presentations being given concurrently in each session, which required attendees to choose one presentation per session. Between sessions was a break, which allowed for networking and informal discussions. The eight sessions I chose to attend were:

  1. Collaboration by Design: Library as Hub for Creative Problem Solving Space
  2. From Prototype to Production: Turning Good Ideas into Useful Library Services
  3. First Steps in Research Data Management Under Constraints of a National Security Laboratory
  4. Blockchain: What's Not To Like?
  5. The State of Digital Preservation: A Snapshot of Triumphs, Gaps, and Open Research Questions
  6. What Is the Future of Libraries in Academic Research?
  7. Blockchain Can Not Be Used To Verify Replayed Archived Web Pages
  8. Building Infrastructure and Services for Open Access to Research

Also be sure to check out Dale Askey's report of the CNI 2018 Fall Membership Meeting. With so many concurrent sessions, he had a different experience.

Day One

In the opening plenary prior to the concurrent sessions, Cliff Lynch described his concerns with suggestions that blockchain is a panacea for data problems. He discounted blockchain's practicality as a solution to most problems to which it is applied. He expressed concern about, but also enthusiasm for, the use of machine learning (ML); however, he stated his wariness of ML's alignment with AI. Without training sets, he noted further, ML does not do well. He also noted that if there is bias in training data, the classifiers learn to internalize the biases.

Cliff continued by briefly discussing the rollout of 5G: while it will create competition in home cable-based Internet access, it will not fix the digital divide. Those that don't currently have access will likely not gain access with the introduction of 5G. He went on to describe his concerns over IoT devices, the emulation of old systems, and the security implications of reintroducing old, unpatched software.

He then mentioned the upcoming sunsetting of The Digital Preservation Network (DPN) and how their handling of the phase-out process is a good example of what is at stake in terms of good behavior (cf. "we're shutting down in 2 weeks"). DPN's approach is very systematic in that they are going through their holdings and figuring out where the contents need to be returned, where other copies of these contents are being held, etc. Relatedly, he also mentioned the recently announced formalization of a succession plan by CLOCKSS for when the time comes that the organization ceases operation.

Continuing, Cliff referenced Lisa Spiro's Plenary Talk at Personal Digital Archiving (PDA) 2018 this past April (which WS-DL attended in 2017, 2013, 2012, and 2011) and the dialogue that occurred following Hurricane Harvey on enabling and archiving the experiences of those affected. He emphasized that as exemplified in the cases of natural disasters like the hurricane, the recent wildfires in California, etc., we are in an increasing state of denial about how increasingly valuable the collections on our sites are.

Cliff also referenced recent developments in scholarly communication with respect to open access, namely the raising of the technical bar by requiring the deposit of articles with a strict DTD, as prescribed by the European Plan S. The plan requires researchers who receive state funding to publish their work in open repositories or journals. He mentioned that for large open-access journals like PLoS and "big commercial players", doing so is not much of a problem compared to the hardship that will be endured by smaller "labor of love" journals like those administered using OJS. He also lamented the quantification of impact measurement in non-reproducible ways and the potential long-term implications of using measures like this. In contrast, he noted that when journal editors get together and change the rules to reflect desired behaviors in a community, they can be a powerful force for change. He used the example of genomic journals requiring data submission to GenBank prior to any consideration of submission.

After touching on a few more points, Cliff welcomed the attendees, and thus began the concurrent sessions.

Collaboration by Design: Library as Hub for Creative Problem Solving Space

The first session I attended was very interactive. The three speakers (Elliot Felix, Julia Maddox, and Mary Ann Mavrinac) gave a high-level overview of the iZone system as deployed at the University of Rochester. They first asked the attendees to visit a site or send a text message with their thoughts on the role of libraries and their needs; the presentation screen then enumerated the responses as they came in, in real time.

The purpose of the iZone system, as they described it, was to serve as a collaboration hub for innovation where students can explore ideas of social or community benefit. The system seemed open-ended, but its organization was helpful to students who "didn't have a methodology to do research or didn't know how to form teams."

Though the iZone team tried to encourage an "entrepreneurial" mindset, their vision statement intentionally did not include the word, as they found that students did not like its connotations. The presenters then had the audience members fill out a sort of Mad Lib, supplied on the attendees' seats, stating:

For __audience__ who __motivation__ we deliver __product/service__ with __unique characteristic__ that benefit __benefit__.

Most of those who supplied a response followed a similar pattern of offering students some service for the benefit of whatever the library at their institution offered. Of course, the iZone representatives provided their own, relating to offering a "creative space for problem solving".

Describing another barrier with students, the presenters borrowed a tactic that Bob McKim used on the first day of class when he was still teaching: having students draw their neighbor on a sheet of paper for 20 seconds. Having the audience at CNI do this was meant to demonstrate that "we fear judgement of peers" and "Throughout our education and upbringing, we worry about society's reaction to creative thoughts and urges, no matter how untamed they may be."

This process was an example of how they (at iZone) would help all students to become resilient, creative problem solvers.

Slides for this presentation are available (PDF)

From Prototype to Production: Turning Good Ideas into Useful Library Services

After a short break, I attended the session presented by Andrew K. Pace (@andrewkpace) of OCLC and Holly Tomren (@htomren) of Temple University Libraries. Andrew described a workflow that OCLC Research has been using to ensure that prototypes that are created and experimented with do not end up sitting in the research department without going to production. He described their work on two prototype-to-production projects: IIIF integration into a digital discovery environment and a prototype for digital discovery of linked data in Wikibase. His progression from prototyping to production consisted of 5 steps:

  1. Creating a product vision statement
  2. Justifying the project using a "lean canvas" to determine effort behind a good idea.
  3. "Put the band together" consisting of assembling those fit to do the prototyping with "stereophonic fidelity" (here he cited Helen Keller with "Happiness is best attained through fidelity to a worthy purpose")
  4. Setting team expectations using the OCLC Community Center (after failing to effectively use a listserv) to concretely declare a finishing date that they had to stick to, so that research staff would not have to manage the project after completion.
  5. Accepting the outcome with a Fail Fast & Hard approach, stating that "you should be disappointed if you wind up with something that looks exactly like you expected to build from the start."

Holly then spoke of her experience at Temple piloting the PASSAGE project (the above Wikibase project) from May to September 2018. An example use case in their pilot was asking users to annotate the Welsh translation of James and the Giant Peach. One question asked was which properties should be associated with the original work and which with the translation.

Another example involved a portrait of Caroline Still Anderson from Temple University Libraries' Charles L. Blockson Afro-American Collection and deliberating on attributes like "depicts" rather than "subject" in describing the work. In a discussion with the community, they sought to clarify the distinction between the photo itself and the subject in the photo. "Which properties belong to which entity?", they asked, noting the potential loss of context if you did not click through. To further emphasize this point, she discussed a photo titled "Civil Rights demonstration at Girard College August 3, 1965", where a primary attribute of "Philadelphia" would remove too much context compared with more descriptive attributes of the subject in the photo like "Event: demonstration" and "Location: Girard College".

These sorts of descriptions, Holly summarized, needed a cascading, inheritance style of description relative to other entities. This experience was much different from her familiarity with using MARC records to describe entities.

First Steps in Research Data Management Under Constraints of a National Security Laboratory

Martin Klein (@mart1nkle1n) and Brian Cain (@briancain101) of Los Alamos National Laboratory (LANL) presented next, with Martin initially highlighting a 2013 OSTP memorandum stating that all federal agencies with over $100 million in R&D are required to store their data and make it publicly accessible to search, retrieve, and analyze. LANL, being one of 17 national labs under the US Department of Energy with $12 billion in R&D funding (much greater than $100 million), was required to comply.

Brian highlighted a series of interviews at other institutions as well as in-depth interviews about data at their own institution. Responses to these interviews expressed a desire for a centralized storage solution to replace storing data locally and to provide more assurance of its location "after the postdoc has left".

Martin described an unnecessarily laborious process of having to burn data to a DVD, walk it to their review and release system (RASSTI), and then, once the review was complete, physically walk the approval to a second location. He reported that this was a "Humungous pain" and thus "lots of people don't do this even though they should". He noted that the lab has started initiatives to look into where money goes, tracing it from the starting point of an idea to funding, papers, patents, etc.

He went on to describe the model used by the Open Science Framework (OSF) to bring together portability measures the researchers at LANL were already used to. Based on OSF, they created "Nucleus", a scholarly commons to connect the entire cycle and act as the glue that sits in the middle of research workflow and productivity portals. Nucleus can connect to storage solutions like GitLab and other authentication systems (or whatever their researchers are used to and want to reuse) to act as a central means of access. As a prototype, Martin's group established an ownCloud instance to demonstrate the power of a sync-n-share solution for their lab. The intention is for Nucleus to make the process of submitting datasets to RASSTI, obtaining approval, and complying much less laborious.

Blockchain: What's Not To Like?

David Rosenthal presented the final session of the day that I attended, which was much anticipated given its promotion in the official CNI blog post. As is convention, Rosenthal's presentation consisted of a near-literal reading of his (then-) upcoming blog post with an identical title. Go and read that to do his very interesting and opinionated presentation justice.

As a very high-level summary, Rosenthal emphasized the attractiveness but misapplication of blockchain with respect to usage in libraries. He repeatedly referred to Satoshi Nakamoto's revolutionary idea of decentralizing consensus, which is often the problematic part of these sorts of systems. The application of the ideas, though, he summarized, and the side effects (e.g., potential manipulation of consensus, high trading latency) of Bitcoin as an exemplification of blockchain highlighted the real-world use case and the corresponding issues.

Rosenthal repeatedly mentioned the pump-and-dump schemes that allow for price manipulation and to "create money out of thin air". Upon completion of his talk and some less formal, opinionated thoughts on Austrian-led efforts for promotion of blockchain/Bitcoin (through venues of universities, companies, etc.), Dr. Nelson asked "Where are we in 5 years?"

Rosenthal answered with his prediction: "Cryptocurrency has been devaluing for a year. It is hard to sustain a belief that cryptocurrencies will continue 'going up'; miners are getting kicked out of mining. This is a death spiral. If it gets to this level, someone can exploit it. This has happened to small altcoins. You can see instances of using Amazon computing power to mount attacks".

Day 1 of CNI finished with a reception consisting of networking and some decent crab cakes. In a gesture of cosmic unconsciousness, Dr. Nelson preferred the plates of shrimp.

Day Two

Day two of the CNI Fall 2018 meeting started with breakfast and one of the four sessions of the day.

The State of Digital Preservation: A Snapshot of Triumphs, Gaps, and Open Research Questions

The first session I attended was presented by Roger C. Schonfeld (@rschon) & Oya Rieger (@OyaRieger) of Ithaka S+R (of recent DSHR infamy), who reported on a recent open-ended study with "21 subject experts" to identify outstanding perspectives and issues in digital preservation. Oya noted that the interviewees were not necessarily a representative sample.

Oya referenced her report, published in October 2018 and titled "The State of Digital Preservation in 2018", which highlights the landscape of Web archiving, among other things, and how to transition the preserved bits for future use. In the report she (summarily) asked:

  1. What is working well now?
  2. What are your thoughts on how the community is preparing for new content types and formats?
  3. Are you aware of any new research in practices and their impact?
  4. What areas need further attention?
  5. If you were writing a new preservation grant, what would be your focus?

From the paper, she noted that there are evolving priorities in research libraries, which are already spread thin. She questioned whether digital preservation is a priority for libraries' overall role in the community. Oya referenced the recent Harper's article, "Toward an ethical archive of the web" with a thought-provoking pull quote of "When no one is likely to lay eyes on a particular post or web page ever again, can it really be considered preserved?"

What Is the Future of Libraries in Academic Research?

Tom Hickerson, John Brosz (@jbrosz), and Suzanne Goopy of the University of Calgary presented next, noting that academic research has changed and asking whether libraries have adapted. Through support of the Mellon Foundation, their group explored a multitude of projects, which John enumerated. They sought to develop a new model for working with campus scholars using a research platform as well as providing equipment to augment the library's technical offerings.

Suzanne, a self-described "library user", described Story Map ECM (Empathic Cultural Mapping), which helps identify small data in big data and vice versa. This system merges personal stories of newcomers to Calgary using a map to show (for example) how an adjustment to bus routes in Calgary can affect a newcomer's health.

Tom closed the session by emphasizing the need to support a diversity of research endeavors through a research platform that offers economy of scale instead of one-off solutions. Of the 12 projects that John described, he stated, there was only one instance where the researchers asked for a resource that the library had to try to subscribe to, emphasizing the under-utilized availability of library resources. Even in this case, he mentioned, it was an unconventional example of access. "By having a common association with a research project", he continued, "these various silos of activity have developed new relationships with each other and strengthened our collegial involvement."

Blockchain Can Not Be Used To Verify Replayed Archived Web Pages

WS-DL's own Dr. Michael L. Nelson (@phonedude_mln) presented the second of two sessions relating to blockchain that I attended at CNI, greeting the attendees with Explosions in the Sky (see related posts). He started with a recent blog post in which Peter Todd claimed to "Carbon Date (almost) the Entire Internet", along with its caveat stating "In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshot". Todd's approach is more applicable to ensuring that IA holdings like music recordings have not been altered (Nelson stated that "It's great to know that your Grateful Dead recording has not been modified") but is not as useful for validating Web pages. The fixity techniques Todd used are too naive to be applicable to Web archiving.

Nelson then demonstrated this using a memento of a Web page recording his travel log over the years. When this page is retrieved with curl from different archives, each reports a different content length because of how the content is amended at replay time. This served as a base example of the runtime manipulation of a page without any external resources. However, most pages contain embedded resources, JavaScript that can manipulate the HTML, etc., which cause an increasing level of variability in this content length, the content preserved, and the content served at time of replay.
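The content-length point can be illustrated with a small sketch. Everything here is an assumption for illustration: the `replay` function, the archive prefixes, and the injected banners are hypothetical and do not reflect how any particular archive rewrites pages. The sketch only shows that a naive hash of the replayed bytes differs between archives even when the preserved raw content is identical.

```python
import hashlib

# Hypothetical raw capture of a page, identical in both archives.
RAW_HTML = (
    '<html><body><a href="http://example.com/log">Travel log</a>'
    "</body></html>"
)


def replay(raw: str, archive_prefix: str, banner: str) -> str:
    """Simulate replay-time amendment: rewrite links and inject a banner."""
    rewritten = raw.replace('href="http://', 'href="' + archive_prefix + "http://")
    return rewritten.replace("<body>", "<body>" + banner)


def digest(html: str) -> str:
    """Naive fixity check: SHA-256 over the served bytes."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


replay_a = replay(RAW_HTML, "https://archive-a.example/web/2018/",
                  "<div>Archive A banner</div>")
replay_b = replay(RAW_HTML, "https://archive-b.example/",
                  "<!-- replayed by B -->")

print("raw length:      ", len(RAW_HTML))
print("replay A length: ", len(replay_a), digest(replay_a)[:12])
print("replay B length: ", len(replay_b), digest(replay_b)[:12])
print("same hash?       ", digest(replay_a) == digest(replay_b))
```

Hashing the raw capture instead of the replayed version would avoid this particular mismatch, but as the talk goes on to show, embedded resources and JavaScript introduce further variability.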

As a potential solution, Nelson facetiously suggested that the whole page with all embedded resources could be loaded, a snapshot taken, and the snapshot hashed; however, he demonstrated a simple example where a rotating image changed at runtime via JavaScript would indicate a change in presentation despite no change in contents, so discarded this approach.

Nelson then highlighted work relating to temporal violations in the archive where, because of the difference in time of preservation of embedded resources, pages that never existed are presented as the historical record of the live Web.

The problem even when viewing the same memento over time shows that what one sees at time in an archive may be different later -- hardly a characteristic of what would expect from an "archive". As an example, Nelson replay a memento of the raw and rewritten versions for the homepage of the future losers of the 2018 Peach Bowl (at the URI By doing so 35 times between November 2017 and October 2018, Nelson noted variance of the very same memento, even when stable (e.g., images failed to load) as well as an archival outage due to a subsequently self-reported upgrade. Nelson found that in 11 months, 11% of the URLs they surveyed disappeared or changed. This conclusion was supported by observing 16,627 URI-Ms over that time frame and observing from 17 different archives an 87.92% result of the hash of a page being two different values within that time frame. The conclusion: You cannot replay replay twice the same archived page (with a noted apology to Heraclitus).

As a final analogy, Nelson played a video of a scene from Monty Python and the Holy Grail, alluding to the guards as the archive and the king as the user.

Building Infrastructure and Services for Open Access to Research

The final session was presented by Jefferson Bailey (@jefferson_bail), Jason Priem (@jasonpriem, presenting remotely via Zoom), and Nick Shockey (@nshockey). Jason first described his motivations and efforts in creating Unpaywall, which seeks to create an open database of scholarly articles. He emphasized that their work is open source and was delighted to see the reuse of their early prototypes by Plum Analytics. All data that Unpaywall collects is available through their data APIs, which serve about 1 million calls per day and are "well used and fast".

Jason emphasized that the organization behind Unpaywall (Impactstory) is a non-profit, so it cannot be "acquired" in the traditional sense. Unpaywall seeks to be 98% accurate in the level of open access in returned results and works with publishers and authors to increase the degree of openness of a work when it is unsatisfactory.

He and the co-owner of Impactstory published a paper titled "The state of OA: a large-scale analysis of the prevalence and impact of Open Access articles" that categorized these degrees of open access and quantified the current state of open access articles in the scholarly record. Some of these articles from the 1920s, he stated, were listed as Open Access even though the concept did not exist at the time. He predicted that within 8 years, 80% of articles will be Open Access based on current trends. Unpaywall also has a freely available browser extension.

Jefferson described Internet Archive's efforts at preservation in general with projects like GifCities, a search engine on top of archived Geocities for all GIFs contained within, and a collection of every PowerPoint in military domains (about 60,000 in number). Relating to the other presenters' objectives, he provided the IA one-liner objective to "Build a complete, use-oriented, highly available graph and archive of every publicly access scholarly article with bibliographic metadata and full-text, enhanced with public identifier metadata, linked with associated data/blog/etc, with a priority on long-tail, at-risk publications and distributed, machine-readable access."

He mentioned that a lot of groups (inclusive of Unpaywall) are doing work in aggregating Open Access articles. They are taking three approaches toward facilitating this:

  • Top-down: using lists, IDs, etc. to target harvesting
  • Middle-sideways: integrating with OA public systems and platforms
  • Bottom-up: using open source tools, algorithms, and machine learning to identify extant works, assess quality of preservation, and identify additional materials

Jefferson referenced IA's use of Grobid for metadata extraction. Through their focus on the not-so-well archived, they found 2 million long-tail articles that have DOIs but are not archived. Of those found, 2 out of 3 articles were duplicates. With these removed, IA currently has about 10 million articles in their collection. Their goal is to build a huge knowledge graph of what is archived, what is out there, etc. Once they have that, they can build services on top of it.

Nick presented last of the group and first mentioned that he was filling in for Joe McArthur (@Mcarthur_Joe). Nick introduced Open Access Button, which provides free alternatives to paywalled articles with a single click. If they are unable to, their service "finds a way to make the article open access for you". They recently switched from a model of user tools to institutional tooling (with a library focus). Their tools, as Nick reported, were able to find Open Access versions for 23.2% of ILL requests using Open Access or Unpaywall. They are currently building a way to deposit an article when an Open Access version is not available, using a simple drag-and-drop procedure after notifying authors. This tool can also be embedded in institutions' Web pages so that authors can more easily contribute further Open Access works.

Slides for this presentation are available (PDF).

Closing Plenary

And with that session (and a break), Cliff Lynch began the closing plenary of CNI by introducing Patricia Flatley Brennan, director of the National Library of Medicine (NLM). She initially described NLM's efforts to create trust in data. "What does a library do?" she said. "We fundamentally create trust. The substrate is data."

She noted that the NLM is best known for its products and services like PubMed, the MEDLINE database, the Visible Human Project, etc.

"There has never been a greater need for trusted, secure, accessible, valued information in this world," she said. "Libraries, data scientists, and networked information specialists are essential to the future." Despite the "big fence" around the physical NLM campus in Bethesda, the library is open for visits. She described a refactoring of PubMed via PubMed Labs to create a relevance-based ranking tool instead of reverse temporal order. This would also entail a new user interface. Both of these improvements were informed by the observation that 80% of the people who launch a PubMed search never go to the second page.

...and finally

Upon completion of the primary presentation and prior to audience questions, Dr. Nelson and I left to beat the DC traffic back to Norfolk. Patricia's slides are promised to be available soon from CNI, and I will include them in this post once they are.

Overall, CNI was an interesting meeting and quite different from those I am used to attending. The heavier, less technical focus was an interesting perspective and made me even more aware that there is quite a lot done in libraries of which I have only a high-level idea. As a PhD student, in Computer Science no less, I am grateful for the rare opportunity to see the presentations in person when I have only ever been able to view them via Twitter from afar. Beyond this post, I have also taken extensive notes on many topics that I plan to explore in the near future to make myself aware of current work and research going on at other institutions.

—Mat (@machawk1)

Monday, December 3, 2018

2018-12-03: Using Wikipedia to build a corpus, classify text, and more

Wikipedia is an online encyclopedia, available in 301 different languages, and constantly updated by volunteers. Wikipedia is not only an encyclopedia, but it also has been used as an ontology to build a corpus, classify entities, cluster documents, create an annotation, recommend documents to a user, etc. Below, I review some of the significant publications in these areas.
Using Wikipedia as a corpus:
Wikipedia has been used to create corpora that can be used for text classification or annotation. In “Named entity corpus construction using Wikipedia and DBpedia ontology” (LREC 2014), YoungGyum Hahm et al. created a method that uses Wikipedia, DBpedia, and SPARQL queries to generate a named entity corpus. The method used in this paper can be applied in any language.
Fabian Suchanek used Wikipedia, WordNet, and Geonames to create an ontology called YAGO, which contains over 1.7 million entities and 15 million facts. The paper “YAGO: A large ontology from Wikipedia and WordNet” (Web Semantics 2008) describes how this dataset was created.
Using Wikipedia to classify entities:
In the paper “Entity extraction, linking, classification, and tagging for social media: a Wikipedia-based approach” (VLDB Endowment 2013), Abhishek Gattani et al. created a method that accepts text from social media, such as Twitter, and then extracts important entities, matches each entity to Wikipedia links, filters, classifies the text, and then creates tags for the text. The data used is called a knowledge base (KB). Wikipedia was used as a KB and its graph structure was converted into a taxonomy. For example, if we have the tweet “Obama just gave a speech in Hawaii”, then the entity extraction selects the two tokens “Obama” and “Hawaii”. Each of the resulting tokens is then paired with a Wikipedia link. This step is called entity linking. Finally, the classification and tagging of the tweet are set to “US politics, President Obama, travel, Hawaii, vacation”, which is referred to as social tagging. The actual process to go from tweet to tag takes ten steps. The overall architecture is shown in Figure 1.
  1. Preprocess: detect the language (English), and select nouns and noun phrases
  2. Extract pairs of (string, Wiki link): the text in the tweet is matched to Wikipedia links, and each resulting pair of (string, Wikipedia link) is called a mention
  3. Filter and score mentions: remove certain pairs and score the rest
  4. Classify and tag tweet: use mentions to classify and tag the tweet
  5. Extract mention features
  6. Filter mentions
  7. Disambiguate: select between topics, e.g., should “apple” be categorized as a fruit or a technology?
  8. Score mentions
  9. Classify and tag tweet: use mentions to classify and tag the tweet
  10. Apply editorial rules
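The extraction and linking steps at the start of this list can be illustrated with a toy Python sketch. This is not the authors' implementation: the miniature knowledge base, its page titles, and the function name are hypothetical, and a real system would also filter, disambiguate, and score mentions as the later steps describe.

```python
import re

# Hypothetical miniature knowledge base: surface string -> Wikipedia page title.
TOY_KB = {
    "obama": "Barack_Obama",
    "hawaii": "Hawaii",
}


def extract_mentions(tweet, kb=TOY_KB):
    """Pair every token found in the KB with its page title (a "mention")."""
    tokens = re.findall(r"[A-Za-z]+", tweet)
    return [(tok, kb[tok.lower()]) for tok in tokens if tok.lower() in kb]


mentions = extract_mentions("Obama just gave a speech in Hawaii")
print(mentions)
```

A dictionary lookup stands in for the matching step only; it cannot tell apart two entities that share a surface string, which is why the full pipeline needs a disambiguation step.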
The dataset used in this paper was described in “Building, maintaining, and using knowledge bases: a report from the trenches” (SIGMOD 2013) by Omkar Deshpande et al. In addition to using Wikipedia, the Web and social context were used to tag the tweet more accurately. After collecting tweets, they gather web context for each tweet, which means following the link included in the tweet, if one exists, and extracting its content, title, and other information. Then entity extraction is performed, followed by linking, classification, and tagging. Next, the tagged tweet is used to create a social context of the user, hashtag, and web domains. This information is saved and used for new tweets that need to be tagged. They also used the web and social context for each node in the KB, and this is saved for future usage.
Abhik Jana et al. added Wikipedia links to the keywords in scientific abstracts in “WikiM: Metapaths Based Wikification of Scientific Abstracts” (JCDL 2017). This method helps readers determine whether they are interested in reading the full article. Their first step was to detect important keywords in the abstract, which they call mentions, using tf-idf. Then a list of candidate Wikipedia links, which they call candidate entries, was selected for each mention. The candidate entries were ranked based on similarity. Finally, the single candidate entry with the highest similarity score was selected for each mention.
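The mention-detection step can be sketched with a plain tf-idf ranking in Python. This is a minimal illustration rather than the WikiM implementation: the smoothed idf formula, the tiny corpus, and the function name are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, top_k=3):
    """Rank the terms of one document by tf-idf against a small corpus."""
    n_docs = len(corpus_tokens)
    tf = Counter(doc_tokens)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus_tokens if term in doc)
        # Smoothed idf (an assumption; other variants are common).
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


abstract = "we cluster documents using wikipedia concepts and wikipedia categories".split()
corpus = [
    abstract,
    "we study neural networks".split(),
    "we evaluate documents and queries".split(),
]

print(tfidf_keywords(abstract, corpus, top_k=1))
```

Terms that are frequent in the abstract but rare elsewhere rise to the top, which is what makes them plausible candidates for linking.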
Using Wikipedia to cluster documents:
Xiaohua Hu et al. used Wikipedia in clustering documents in “Exploiting Wikipedia as External Knowledge for Document Clustering” (KDD 2009). In this work, documents are enriched with Wikipedia concepts and category information. Both exact concept matches and related concepts are included. Then similar documents are combined based on document content, the added Wikipedia content, and the added category information. This method was used on three datasets: TDT2, LA Times, and 20-newsgroups. Different methods were used to cluster the documents:
  1. Clustering based on the word vector
  2. Clustering based on the concept vector
  3. Clustering based on the category vector
  4. Clustering based on the combination of word vector and concept vector
  5. Clustering based on the combination of word vector and category vector
  6. Clustering based on the combination of concept vector and category vector
  7. Clustering based on the combination of word vector, concept vector, and category vector
They found that with all three datasets, clustering based on word and category vector (method #5) and clustering based on word, concept, and category vector (method #7) always had the best results.
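One way to picture the combined representations is as a weighted sum of per-vector cosine similarities, which a clustering algorithm could then use as its document similarity. The Python sketch below assumes that formulation; the weights, the sample documents, and the function names are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_similarity(doc_a, doc_b, weights):
    """Weighted sum of cosine similarities over the chosen vector types."""
    return sum(w * cosine(doc_a[name], doc_b[name]) for name, w in weights.items())


# Hypothetical documents: no shared words, but the same Wikipedia category.
doc_a = {"word": {"goal": 1.0, "striker": 1.0}, "category": {"Sports": 1.0}}
doc_b = {"word": {"pitcher": 1.0, "inning": 1.0}, "category": {"Sports": 1.0}}

print(combined_similarity(doc_a, doc_b, {"word": 1.0}))
print(combined_similarity(doc_a, doc_b, {"word": 0.5, "category": 0.5}))
```

The two documents look unrelated on words alone, but the category vector pulls them together, which is the intuition behind enriching documents with Wikipedia information.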
Using Wikipedia to annotate documents:
Wikipedia was used to annotate documents, such as in the paper “Wikipedia as an ontology for describing Documents” (ICWSM 2008) by Zareen Saba Syed et al. Wikipedia text and links were used to identify topics related to some terms in a given document. In this work, three methods were tested: using the article text, using the article text and categories with spreading activation, and using the article text and links with spreading activation. However, the accuracy of the work depends on factors such as whether a Wikipedia page links to a non-relevant article, the presence of links between related concepts, and the extent to which a concept appears in Wikipedia.
Using Wikipedia to create recommendations:
Wiki-Rec uses Wikipedia to create semantically based recommendations. This technique is discussed in the paper “Wiki-rec: A semantic-based recommendation system using Wikipedia as an ontology” (ISDA 2010) by Ahmed Elgohary et al. They predicted terms common to a set of documents. In this work, the user reads a document and evaluates it. Then, using Wikipedia, all the concepts in the document are annotated and stored. After that, the user's profile is updated based on the new information. By matching the user's profile with other users' profiles that contain similar interests, a list of recommended documents is presented to the user. The overall system model is shown in Figure 2.
Using Wikipedia to match ontologies:
Other work, such as “WikiMatch - Using Wikipedia for Ontology Matching” (OM 2012) by Sven Hertling and Heiko Paulheim, used Wikipedia information to determine if two ontologies are similar, even if they are in different languages. In this work, the Wikipedia search engine is used to get articles related to a term. Then, for those articles, all language links are retrieved. Two concepts are compared by comparing the articles' titles. However, this approach is time-consuming because it requires querying Wikipedia.
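The title comparison can be illustrated in Python with a simple set-overlap score. This sketch assumes Jaccard overlap as the comparison, which may differ from the measure used in the paper, and the title sets below are hypothetical stand-ins for what the Wikipedia search and language links would return.

```python
def title_overlap(titles_a, titles_b):
    """Jaccard overlap between two sets of article titles."""
    a, b = set(titles_a), set(titles_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical title sets gathered for three ontology concepts,
# each prefixed with the language edition it came from.
car_en = {"en:Car", "de:Auto", "fr:Automobile"}
auto_de = {"de:Auto", "en:Car", "fr:Automobile"}
bank_en = {"en:Bank", "de:Bank", "fr:Banque"}

print(title_overlap(car_en, auto_de))
print(title_overlap(car_en, bank_en))
```

Because the language links are part of each set, a concept labeled in English and one labeled in German can still match when they lead to the same articles.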
In conclusion, Wikipedia is not only an information source, it has also been used as a corpus to classify entities, cluster documents, annotate documents, create recommendations, and match ontologies.
-Lulwah M. Alkwai

2018-12-03: Acidic Regression of WebSatchel

Mat Kelly reviews WebSatchel, a browser-based personal preservation tool.

Shawn Jones (@shawnmjones) recently made me aware of a personal tool to save copies of a Web page using a browser extension called "WebSatchel". The service is somewhat akin to the offerings of browser-based tools like Pocket (now bundled with Firefox after a 2017 acquisition) among many other tools. Many of these types of tools use a browser extension that allows the user to send a URI to a service that creates a server-side snapshot of the page. This URI delegation procedure aligns with Internet Archive's "Save Page Now", which we have discussed numerous times on this blog. In comparison, our own tool, WARCreate, saves "by-value".

With my interest in any sort of personal archiving tool, I downloaded the WebSatchel Chrome extension, created a free account, signed in, and tried to save the test page from the Archival Acid Test (which we created in 2014). My intention in doing this was to evaluate the preservation capabilities of the tool-behind-the-tool, i.e., that which is invoked when I click "Save Page" in WebSatchel. I was shown this interface:

Note the thumbnail of the screenshot captured. The red square in the 2014 iteration of the Archival Acid Test (retained at the same URI-R for posterity) is indicative of a user interacting with the page for the content to load and thus be accessible for preservation. With respect to only evaluating the tool's capture ability, the red in the thumbnail may not be indicative of the capture. A repeat of this procedure to ensure that I "surfaced" the red square on the live web (i.e., interacted with the page before telling WebSatchel to grab it) resulted in a thumbnail where all squares were blue. As expected, this may be indicative that WebSatchel is using the browser's screenshot extension API at the time of URI submission rather than creating a screenshot of their own capture. The limitation of the screenshot to the viewport (rather than the whole page) also indicates this.


I then clicked the "Open Save Page" button and was greeted with a slightly different result. This capture resided at

curling that URI results in an inappropriately used HTTP 302 status code that appears to indicate a redirect to a login page.

$ curl -I
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 19:44:59 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Content-Type: text/html

Note the lack of scheme in the Location header. RFC2616 (HTTP/1.1) Section 14.30 requires the location to be an absolute URI (per RFC3986 Section 4.3). In an investigation to legitimize their hostname-leading redirect pattern, I also checked the more current RFC7231 Section 7.1.2, which revises the value of the Location response header to be a URI reference in the spirit of RFC3986. This updated HTTP/1.1 RFC allows for relative references, as already done in practice prior to RFC7231. WebSatchel's Location pattern causes browsers to interpret their hostname as a relative redirect per the standards, causing a redirect to

$ curl -I
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 20:13:04 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5

...and repeated recursively until the browser reports "Too Many Redirects".
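
The relative-reference behavior can be reproduced with Python's standard `urllib.parse.urljoin`, which resolves a reference against a base URI in the manner of RFC3986. The hostnames and paths below are hypothetical placeholders, not WebSatchel's actual URIs; the sketch only shows how a Location value that starts with a hostname but has no scheme is treated as a relative path.

```python
from urllib.parse import urljoin


def resolve_location(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


# Hypothetical capture URL and scheme-less Location value.
url = "https://archive.example/captures/123"
location = "archive.example/login"

# Follow the "redirect" a few times, as a browser would.
hops = []
for _ in range(3):
    url = resolve_location(url, location)
    hops.append(url)
    print(url)
```

Each hop appends the hostname to the path again instead of reaching the login page. A network-path reference such as `//archive.example/login`, or a full absolute URI, would resolve to the intended location.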

Interacting with the Capture

Despite the redirect issue, interacting with the capture retains the red square. In the case where all squares were blue on the live Web, the aforementioned square was red when viewing the capture. In addition to this, two of the "Advanced" tests (advanced relative to 2014 crawler capability, not particularly new to the Web at the time) were missing, representative of an iframe (without anything CORS-related behind the scenes) and an embedded HTML5 object (using the standard video element, nothing related to Custom Elements).

"Your" Captures

I hoped to also evaluate archival leakage (aka Zombies), but the service did not seem to provide a way for me to save my capture to my own system, i.e., your archives, remotely (and solely) hosted. In investigating a way to liberate my captures, I noticed that the default account is simply a trial of the service, which ends a month after creating the account and is followed by a relatively steep monthly pricing model. The "free" account is also listed as being limited to 1 GB/account and 3 pages/day, with access removed to their "page marker" feature, WebSatchel's system for a sort-of text highlighting form of annotation.


WebSatchel has browser extensions for Firefox, Chrome, MS Edge, and Opera, but the data liberation scheme leaves a bit to be desired, especially for personal preservation. As a quick final test, without holding my breath for too long, I used my browser's DevTools to observe the HTTP response headers for the URI of my Acid Test capture. As above, attempting to access the capture via curl would require circumventing the infinite redirect and manually going through an authentication procedure. As expected, nothing resembling Memento-Datetime was present in the response headers.
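
The same check can be scripted once the response headers are in hand. The Python sketch below assumes the headers have already been collected into a dictionary (for example, copied from DevTools); the function name and the sample header values are hypothetical.

```python
from email.utils import parsedate_to_datetime


def memento_datetime(headers):
    """Return the parsed Memento-Datetime header, or None if it is absent."""
    for name, value in headers.items():
        # HTTP header names are case-insensitive.
        if name.lower() == "memento-datetime":
            return parsedate_to_datetime(value)
    return None


# Hypothetical headers resembling a response with no Memento support.
plain_headers = {
    "Date": "Mon, 03 Dec 2018 19:44:59 GMT",
    "Server": "Apache/2.4.34 (Unix) LibreSSL/2.6.5",
    "Content-Type": "text/html",
}

# Hypothetical headers as a Memento-aware archive might return them.
memento_headers = dict(plain_headers)
memento_headers["Memento-Datetime"] = "Mon, 03 Dec 2018 19:44:59 GMT"

print(memento_datetime(plain_headers))
print(memento_datetime(memento_headers))
```

A `None` result corresponds to the observation above: without the header, a client has no standard way to learn when the capture was made.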

—Mat (@machawk1)