Thursday, November 16, 2017

2017-11-16: Paper Summary for Routing Memento Requests Using Binary Classifiers

While researching my dissertation topic, I re-encountered the paper "Routing Memento Requests Using Binary Classifiers" by Bornand, Balakireva, and Van de Sompel from JCDL 2016 (arXiv:1606.09136v1). The high-level gist of the paper is that, using two corpora of URI-Rs drawn from requests to their Memento aggregator (one for training, the other for evaluating the training), the authors were able to significantly reduce wasted requests to archives that contained no mementos for a requested URI-R.

A classifier was generated for each of the 17 Web archives included in the experiment, with the exception of the Internet Archive, which was assumed to always return a positive result. Each classifier informed the decision of whether, given a URI-R, the respective Web archive should be queried.

Optimization of this sort has been performed before. For example, AlSum et al. from TPDL 2013 (trip report, IJDL 2014, and arXiv) created profiles for 12 Web archives based on TLD and showed that it is possible to obtain a complete TimeMap for 84% of the requested URI-Rs using only the top 3 archives. In two separate papers, from TPDL 2015 (trip report) and TPDL 2016 (trip report), Alam et al. (2015, 2016) described making routing decisions to optimize queries when the archive's CDX information is available and, respectively, when the archive's query interface must be used to expose its holdings.

The training data set was based on the LANL Memento Aggregator cache from September 2015, containing over 1.2 million URI-Rs. The authors used Receiver Operating Characteristic (ROC) curves comparing the rate of false positives (a URI-R that should not have been included but was) to the rate of true positives (a URI-R rightfully included in the classification). When requesting a prediction from a trained classifier, a pair of these rates is chosen corresponding to the most acceptable compromise for the application.

A sample ROC curve (from the paper) to visualize memento requests to an archive.
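
As a minimal sketch of the mechanics (mine, not the authors' code), the snippet below trains a classifier for a single hypothetical archive with scikit-learn, computes its ROC curve, and picks the first threshold that reaches a target true positive rate; the random data and the 0.95 target are purely illustrative.

```python
# A minimal, illustrative sketch (not the authors' code): train a binary
# classifier for one archive, sweep its ROC curve, and choose an operating
# point that trades wasted requests (FPR) against missed mementos (TPR).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Hypothetical data: rows are URI-R feature vectors; y indicates whether
# the archive actually held a memento for that URI-R.
rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = (X[:, 0] + 0.3 * rng.random(1000)) > 0.6

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_train, y_train)

# roc_curve sweeps the decision threshold, yielding (FPR, TPR) pairs.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)

# Choose the first threshold whose true positive rate meets a target.
target_tpr = 0.95
i = np.argmax(tpr >= target_tpr)
print(f"threshold={thresholds[i]:.3f}, fpr={fpr[i]:.3f}, tpr={tpr[i]:.3f}")
```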

Classification of this sort required feature selection. The authors used the character length of the URI-R and the count of its special characters as features, as well as the Public Suffix List (PSL) domain (cf. AlSum et al.'s use of the TLD as a primary feature). The rationale for choosing the PSL domain over the TLD was that most archives cover the same popular TLDs. Additional token features were obtained by parsing the URI-R, splitting on delimiters to form tokens, and transforming the tokens to lowercase.
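
The sketch below shows how such features might be extracted from a URI-R; the tldextract library (which consults the Public Suffix List) is my own choice here, as the paper does not name an implementation.

```python
# Illustrative feature extraction for a URI-R: character length, special
# character count, PSL-based domain, and lowercased tokens. tldextract is
# an assumption on my part; the paper does not specify a library.
import re
import tldextract  # resolves domains against the Public Suffix List

def uri_features(uri_r):
    psl_domain = tldextract.extract(uri_r).registered_domain
    tokens = [t.lower() for t in re.split(r"[^0-9A-Za-z]+", uri_r) if t]
    return {
        "length": len(uri_r),
        "special_chars": sum(not c.isalnum() for c in uri_r),
        "psl_domain": psl_domain,  # e.g., "example.co.uk" rather than "uk"
        "tokens": tokens,
    }

print(uri_features("http://news.example.co.uk/2015/09/story.html"))
```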

The authors used four different methods for evaluating the ranking of the candidate features for the classifiers: frequency over the training set, the sum of the differences between a URI-R's feature frequencies and those training-set frequencies, entropy as defined by Hastie et al. (2009), and the Gini impurity (see Breiman et al. 1984). Each metric was evaluated to determine how it affected prediction by training a binary classifier using the logistic regression algorithm.
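
For reference, here are those last two measures in miniature for a binary class distribution; the formulas follow the cited definitions, while the example probabilities are mine.

```python
# Entropy (Hastie et al. 2009) and Gini impurity (Breiman et al. 1984)
# for a binary class distribution with positive-class probability p.
import math

def entropy(p):
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def gini(p):
    # 1 - p^2 - (1 - p)^2 simplifies to 2p(1 - p) in the binary case.
    return 2 * p * (1 - p)

# A feature that splits URI-Rs into nearly pure groups (p near 0 or 1)
# yields low impurity and therefore ranks as more informative.
for p in (0.5, 0.9, 0.99):
    print(f"p={p}: entropy={entropy(p):.3f}, gini={gini(p):.3f}")
```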

The paper includes plots of the above measures for each of the four feature selection strategies. Following training, the authors evaluated the performance of each algorithm for classification using the corresponding sets of selected features, with a preference toward low computational load and memory usage. The algorithms evaluated were logistic regression (as used above), Multinomial Naive Bayes, Random Forest, and SVM. Aside from Random Forest, the other three algorithms had similar prediction runtimes, so they were evaluated further.
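
A hedged sketch of such a runtime comparison follows; the hashed token features, toy data, and timing loop are my stand-ins rather than the authors' pipeline.

```python
# Rough sketch (not the authors' pipeline) comparing the three remaining
# algorithms on hashed URI-R token features, timing their predictions.
import time
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in data: alternating positive/negative URI-Rs.
uris = ["http://example.com/news/story", "http://other.example.org/page"] * 500
labels = np.array([1, 0] * 500)

# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
X = HashingVectorizer(alternate_sign=False).fit_transform(uris)

for model in (LogisticRegression(), MultinomialNB(), LinearSVC()):
    model.fit(X, labels)
    start = time.perf_counter()
    model.predict(X)
    ms = (time.perf_counter() - start) * 1000
    print(f"{type(model).__name__}: {ms:.1f} ms for {X.shape[0]} predictions")
```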

A classifier was trained for each combination of the three remaining algorithms and each archive. To determine the true positive threshold, the authors brought in a second data set consisting of 100,000 unrelated URI-Rs from the Internet Archive's log files from early 2012. Of the three algorithms, they found that logistic regression performed best for 10 archives and Multinomial Naive Bayes for 6 others (per above, IA was excluded).

The authors then evaluated the trained classifiers using yet another data set of URI-Rs: 200,000 randomly selected requests (cleaned down to just over 187,000) from oldweb.today. Because this data set was based on inter-archive requests, it was more representative of an aggregator's request stream than the IA data set. They computed recall, computational cost, and response time using a simulation to avoid issuing thousands of live requests. These results confirmed that the currently used heuristic of querying all archives has the best recall (results are comprehensive), but response time could be drastically reduced using a classifier. At a cost of 0.153 in recall, fewer than 4 requests on average instead of 17 would reduce the response time from just over 3.7 seconds to about 2.2 seconds. Additional details on the optimization obtained by tuning the true positive rate can be found in the paper.
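
To illustrate the shape of that simulation (with invented numbers throughout), the sketch below routes each URI-R to only the archives whose classifier votes yes, then computes recall, average request count, and response time, assuming parallel requests bounded by the slowest queried archive.

```python
# Invented-numbers sketch of the routing simulation: classifier decisions
# pick which archives to query; recall, request count, and response time
# follow from hypothetical holdings and per-archive latencies.
import numpy as np

rng = np.random.default_rng(1)
n_uris, n_archives = 10_000, 17

holdings = rng.random((n_uris, n_archives)) < 0.1  # archive holds a memento
queried = rng.random((n_uris, n_archives)) < 0.2   # classifier says "query"
latency = rng.uniform(0.2, 3.7, size=n_archives)   # seconds per archive

recall = (holdings & queried).sum() / holdings.sum()  # mementos retrieved
requests = queried.sum(axis=1).mean()                 # requests per lookup
# With parallel requests, a lookup takes as long as its slowest archive.
response = np.mean([latency[row].max() if row.any() else 0.0
                    for row in queried])
print(f"recall={recall:.3f}, avg requests={requests:.1f}, "
      f"avg response={response:.2f}s")
```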

Take Away

I found this paper to be an interesting and informative read on a very niche topic that is hyper-relevant to my dissertation topic. I foresee a potential opportunity to optimize archival querying from other Memento aggregators like MemGator and look forward to further studies in this realm on both optimization and caching.

Mat (@machawk1)

Nicolas J. Bornand, Lyudmila Balakireva, and Herbert Van de Sompel. "Routing Memento Requests Using Binary Classifiers." In Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), pp. 63-72, 2016. (Also at arXiv:1606.09136.)

Monday, November 6, 2017

2017-11-06: Association for Information Science and Technology (ASIS&T) Annual Meeting 2017

The crowds descended upon Arlington, Virginia for the 80th annual meeting of the Association for Information Science and Technology. I attended this meeting to learn more about ASIS&T, including its special interest groups. Also attending with me was former ODU Computer Science student and current Los Alamos National Laboratory librarian Valentina Neblitt-Jones.
The ASIS&T team had organized a wonderful collection of panels, papers, and other activities for us to engage in.

Plenary Speakers

Richard Marks: Head of the PlayStation Magic Lab at Sony Interactive Entertainment

Richard Marks talked about the importance of play to the human experience. He covered innovations at the PlayStation Magic Lab in an effort to highlight possible futures of human-computer interaction. The goal of the laboratory is "experience engineering", whereby the developers focus on improving the experience of game play rather than on more traditional software development. Play is about interaction, and the Magic Lab focuses on amplifying that interaction.

One of the new frontiers of gaming is virtual reality, whereby users are immersed in a virtual world. Marks talked about how using an avatar in a game initiates a "virtual transfer of identity". Consider the example of pouring water: seeing oneself pour water on a screen while using a controller provides one level of immersion, but seeing the virtual glass of water in your hands makes the action far more natural. He mentioned that VR players confronted with a virtual tightrope suspended above New York City had difficulty stepping onto the tightrope, even though they knew it was just a game.

He talked about thresholds of technology change, detailing the changes in calculating machines throughout the 20th Century and how "when you can get it into your shirt pocket, now everything changes". Though this calculator example seems an obvious direction for technology, the direction was not at all obvious when calculating machines were first being developed. The same parallel can be made for user interfaces. Marks also mentioned that games allow their researchers to explore many different techniques without having to worry about the potential for loss of life or other challenges that confront interface researchers in other industries.

William Powers: Laboratory for Social Machines at the MIT Media Lab

William Powers, author of "Hamlet's Blackberry" and a reporter at the Washington Post, gave a thoughtful talk on the effects of information overload on society. To him, tech revolutions are all about depth and meaning. Depth is about focus and reflection, and is when "the human mind takes its most valuable and most important journeys". Meaning is our ability to develop theories about "what existence is all about".

He talked about the current social changes people are experiencing in the online (and offline) world. He personally found that he was not able to give attention to things he cared about. The more time he spent online, the harder it became to read longer pieces of work, like books. A number of media stories exist about diminishing attention spans correlated with increased online use.

During a fellowship at Harvard's Shorenstein Center, Powers began studying what print on paper had done for civilization. He covered different "Philosophers of Screens" from history. Socrates believed that the alphabet would destroy our minds, fearing that people would not think beyond the words on the page; he felt that people needed distance to truly digest the world around them. Seneca lived in a world of many new technologies, such as postal systems and paved roads, but he feared the "restless energy" that haunted him, developing mental exercises to focus the mind. By inventing the printing press, Gutenberg helped mass-produce the written word, leading some of his era to fear the end of civilization as misinformation was being printed. In Shakespeare's time, people complained that the print revolution had given them too much to read and that they would not be able to keep up with it. Benjamin Franklin worked to overcome his own addictions through the use of ritual. Henry David Thoreau bemoaned the distracted nature of his compatriots in the 19th Century, noting that "when our life ceases to be inward and private, conversation degenerates to gossip." Marshall McLuhan also believed that we could rise above information overload by developing our own strategies.

The output of this journey became the paper "Hamlet's Blackberry: Why Paper Is Eternal", which then led to the book "Hamlet's Blackberry". The common thread was that each age has had new technical advances, along with concerns that people were becoming less focused and more out of touch. Each age also had visionaries who found that they could rise above this information fray by developing their own techniques for focus and introspection. Every technical revolution starts with the idea that the technology will consume everything, but this is hardly the case. Says Powers, "If all goes well with the digital revolution, then tech will allow us to have the depth that paper has given us." Powers even mentioned that he had been discussing with Tim Berners-Lee how to build a "better virtual society in the virtual world" that would in turn improve our real world.


Sample of Papers Presented

As usual, I cannot cover all papers presented, and, due to overlaps, was not present at all sessions. I will discuss a subset of the presentations that I attended.

Top Ranked Papers

Eric Forcier presented something near to one of my topics of interest in "Re(a)d Wedding: A Case Study Exploring Everyday Information Behaviors of the Transmedia Fan". In the paper he talks about the phenomenon of transmedia fandom: fans who explore a fictional world through many different media types. The paper specifically focuses on an event in the Game of Thrones media franchise: The Red Wedding. Game of Thrones is an HBO television show based on a series of books named A Song of Ice and Fire. This story event is of interest because book fans were aware of the events of the Red Wedding before television fans experienced them, leading to a variety of different experiences for both groups. Forcier details the different types of fans and how they interact. Forcier's work has some connection to my work on spoilers and using web archives to avoid them.


In "Before Information Literacy [Or, Who Am I, As a Subject-Of-(Information)-Need?]", Ronald Day of the Department of Information and Library Science at Indiana University discusses the current issue of fake news. In his paper he considers the current solutions to misinformation exposure to be incomplete. Even though we are focusing on developing better algorithms for detecting fake news and also attempting to improve information literacy, there is also the possibility of improving a person's ability to determine what they want out of an information source. Day's paper provides an interesting history of information retrieval from an information science perspective. Over the years, I have heard that "data becomes information, but not all data is information"; Day extends this further by stating that "knowledge may result in information, but information doesn't necessarily have to come from or result in knowledge".

In "Affordances and Constraints in the Online Identity Work of LGBTQ+ Individuals", Vanessa Kitzie discusses the concepts of online identity in the LGBTQ+ community. Using interviews with thirty LGBTQ+ individuals, she asks about the experiences of the LGBTQ+ community with both social media and search engines. She finds that search engines are often used by members of the community to find the language necessary to explore their identity, but this is problematic because the labels surfaced depend on clicks rather than on identity. Some members of the community create false social profiles so that they can "escape the norms confining" their "physical body" and choose the identity they want others to see. Many use social media to connect to other members of the community. The suggestions of further people to follow often introduce the user to more terms that help them with their identity. Her work is an important exploration of the concept of self, both on and offline.

Other Selected Papers
Sarah Bratt presented "Big Data, Big Metadata, and Quantitative Study of Science: A Workflow Model for Big Scientometrics". In this paper, she and her co-authors demonstrate a repeatable workflow used to process bibliometric data for the GenBank project. She maps the workflow that they developed for this project to the standard areas detailed in Jeffrey M. Stanton's Data Science. It is their hope that the framework can be applied to other areas of big data analytics, and they intend to pursue a workflow that will work in these areas. I wondered if their workflow would be applicable to projects like the Clickstream Map of Science. I was also happy to see that her group was trying to tackle disambiguation, something I've blogged about before.


Yu Chi presented "Understanding and Modeling Behavior Patterns in Cross-Device Web Search." She and her co-authors conducted a user study to explore the behaviors surrounding beginning a web search on one device and then continuing it on another compared with just searching on a single device. They make the point that "strategies found on the single device, single-session search might not be applicable to the cross-device search". Users switching devices have a new behavior, re-finding, that might be necessary due to the interruption. They discovered that there are differences in user behavior in the two instances and that Hidden Markov Models could be used to model and uncover some user behavior. This work has implications for search engines and information retrieval.


"Toward A Characterization of Digital Humanities Research Collections: A Contrastive Analysis of Technical Designs" is the work of Katrina Fenlon. She talks about thematic research collections, which are collected by scholars who are trying to "support research on a theme". She focuses on the technical designs of thematic research collections and explores how collections with different uses have different designs. In the paper, she reviews three very different collections and categorizes them based on need: providing advanced access to value-added sources, providing context and interrelationships to sources, and also providing a platform for "new kinds of analysis and interpretation". I was particularly interested in Dr. Felon's research because of my own work on collections.


I was glad to once again see Leslie Johnston from the United States National Archives and Records Administration. She presented her work on "ERA 2.0: The National Archives New Framework for Electronic Records Preservation." This paper discusses the issues of developing the second version of Electronic Records Archives (ERA), the system that receives and processes US government records from many agencies before permanently archiving them for posterity. It is complex because records consist not only of different file formats, but many have different regulations surrounding their handling. ERA 2.0 now uses an Agile software methodology for development as well as cloud computing in order to effectively adapt to changing needs and requirements.


Unique to my experience at the conference was Kolina Koltai's presentation of "Questioning Science with Science: The Evolution of the Vaccine Safety Movement." In this work, the authors interviewed those who sought more research on vaccine safety, often called "anti-vaxxers". Most participants cited concern for children, and not just their own, as one of their values. They often read scientific journals and are concerned about financial conflicts of interest between government agencies and the corporations they regulate, especially in light of prior issues involving research into the safety of tobacco and sugar. The Deficit Model, the idea that the group simply lacks sufficient information, does not apply to this group. The authors discovered that the Global Belief Model has not been effective in understanding members of this movement. It is the hope of the authors that this work will be helpful in developing campaigns and addressing concerns about vaccine safety. In a larger sense, it supports other work on "how people develop belief systems based upon their values", also providing information for those attempting to study fake news.


Manasa Rath presented "Identifying the Reasons Contributing to Question Deletion in Educational Q&A." She and her co-authors looked at "bad" questions asked on the Q&A site Brainly. I was particularly interested in this work because the authors identified the features of a question that caused moderators to delete it and then discovered that a J48 decision tree classifier best predicts whether a given question will be deleted.


"Tweets May Be Archived: Civic Engagement. Digital Preservation, and Obama White House Social Media Data" was presented by Adam Kriesberg. Using data from the Obama White House Social Media Archive stored at the Internet Archive the authors discussed the archiving -- not just web archiving -- of Barack Obama's social media content on Twitter, Vine, and Facebook. Problems exist on some platforms, such as Facebook, where data can be downloaded by users, but is not necessarily structured in a way useful to those outside of Facebook. Facebook data is only browseable by year and photographs included in the data store lack metadata. Obama changed Vine accounts during his presidency, making it difficult for archivists to determine if they have a complete collection from even a single social media platform. An archived Twitter account is temporal, meaning that counts for likes and retweets are only from a snapshot in time. On this note, Kriesberg says that values are likes and retweets are "incorrect", but I object to the terminology of "incorrect". Content drift is something I and others of WS-DL have studied and any observation from the web needs to be studied with the knowledge that it is a snapshot in time. He notes that even though we have Obama's content, we do not have the content of those he engaged with, making some conversations one-sided. He finally mentions that social media platforms provide a moving target for archivists and researchers, as APIs and HTML changes quickly, making tool development difficult. I recommend this work for anyone attempting to archive or work with social media archives.

Social

As with other conferences, ASIS&T provided multiple opportunities to connect with researchers in the community. I appreciated the interesting conversations with Christina Pikas, Hamid Alhoori, and others during breaks. I also liked the lively conversations with Elaine Toms and Timothy Bowman. I want to thank Lorri Mon for inviting me to the Florida State University alumni lunch with Kathleen Burnett, Adam Worrall, Gary Burnett, Lynette Hammond Gerido, and others where we discussed each others' work as well as developments at FSU.

I apologize to anyone else I have left off.

Summary

ASIS&T is a neat organization focusing on the intersections of information science and technology. As always, I am looking forward to possibly attending future conferences, like Vancouver in 2018.

-- Shawn M. Jones