Monday, December 3, 2018

2018-12-03: Using Wikipedia to build a corpus, classify text, and more

Wikipedia is an online encyclopedia, available in 301 languages and constantly updated by volunteers. It is not only an encyclopedia; it has also been used as an ontology to build corpora, classify entities, cluster documents, create annotations, recommend documents to users, and more. Below, I review some of the significant publications in these areas.
Using Wikipedia as a corpus:
Wikipedia has been used to create corpora for text classification or annotation. In “Named entity corpus construction using Wikipedia and DBpedia ontology” (LREC 2014), YoungGyum Hahm et al. created a method that uses Wikipedia, DBpedia, and SPARQL queries to generate a named entity corpus. The method used in this paper can be applied to any language.
Fabian Suchanek used Wikipedia, WordNet, and Geonames to create an ontology called YAGO, which contains over 1.7 million entities and 15 million facts. The paper “YAGO: A large ontology from Wikipedia and WordNet” (Web Semantics 2008) describes how this dataset was created.
Using Wikipedia to classify entities:
In the paper “Entity extraction, linking, classification, and tagging for social media: a Wikipedia-based approach” (VLDB Endowment 2013), Abhishek Gattani et al. created a method that accepts text from social media, such as Twitter, extracts important entities, matches each entity to a Wikipedia link, filters and classifies the text, and then creates tags for it. The data source used is called a knowledge base (KB); here Wikipedia serves as the KB, and its graph structure is converted into a taxonomy. For example, given the tweet “Obama just gave a speech in Hawaii”, entity extraction selects the two tokens “Obama” and “Hawaii”. Each token is then paired with a Wikipedia link (one for Obama and one for Hawaii); this step is called entity linking. Finally, the tweet is classified and tagged as “US politics, President Obama, travel, Hawaii, vacation”, which is referred to as social tagging. The actual process of going from tweet to tags takes ten steps. The overall architecture is shown in Figure 1.
  1. Preprocess: detect the language (English), and select nouns and noun phrases
  2. Extract pairs of (string, Wiki link): the text in the tweet is matched to Wikipedia links, and each resulting (string, Wikipedia link) pair is called a mention
  3. Filter and score mentions: remove certain pairs and score the rest
  4. Classify and tag tweet: use mentions to classify and tag the tweet
  5. Extract mention features
  6. Filter mentions
  7. Disambiguate: select between topics, e.g., does “apple” refer to the fruit or the technology company?
  8. Score mentions
  9. Classify and tag tweet: use mentions to classify and tag the tweet
  10. Apply editorial rules
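As a toy illustration of steps 2 and 3, here is a minimal Python sketch of pairing tweet tokens with Wikipedia links to form mentions; the anchor dictionary, links, and filtering rule are invented stand-ins for the paper's full Wikipedia-derived KB:

```python
# Toy anchor dictionary mapping strings to Wikipedia links (illustrative only).
anchor_dict = {
    "obama": "https://en.wikipedia.org/wiki/Barack_Obama",
    "hawaii": "https://en.wikipedia.org/wiki/Hawaii",
}

def extract_mentions(tokens):
    """Pair each token with a Wikipedia link when one exists (step 2);
    tokens with no match are filtered out (a crude stand-in for step 3)."""
    mentions = []
    for tok in tokens:
        link = anchor_dict.get(tok.lower())
        if link:
            mentions.append((tok, link))
    return mentions

tweet_tokens = ["Obama", "just", "gave", "a", "speech", "in", "Hawaii"]
print(extract_mentions(tweet_tokens))
```

The real pipeline scores and disambiguates mentions (steps 5 through 8) before tagging; this sketch only shows the linking idea.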
The dataset used in this paper was described in “Building, maintaining, and using knowledge bases: a report from the trenches” (SIGMOD 2013) by Omkar Deshpande et al. In addition to Wikipedia, web and social context were used to tag tweets more accurately. After collecting tweets, they gather each tweet's web context: if the tweet includes a link, its content, title, and other information are extracted. Entity extraction is then performed, followed by linking, classification, and tagging. Next, the tagged tweet is used to build a social context of the user, hashtags, and web domains. This information is saved and used for new tweets that need to be tagged. They also gathered web and social context for each node in the KB and saved it for future use.
Abhik Jana et al. added Wikipedia links to the keywords in scientific abstracts in “WikiM: Metapaths Based Wikification of Scientific Abstracts” (JCDL 2017). This method helps readers determine whether they are interested in reading the full article. The first step is to detect important keywords in the abstract, which they call mentions, using tf-idf. Then a list of candidate Wikipedia links, which they call candidate entries, is selected for each mention. The candidate entries are ranked by similarity, and finally the single candidate entry with the highest similarity score is selected for each mention.
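The mention-detection step can be sketched with a plain tf-idf computation; the tiny corpus below is invented for illustration and the paper's actual scoring may differ in detail:

```python
import math
from collections import Counter

# Invented three-document corpus standing in for a set of abstracts.
corpus = [
    "neural networks for image segmentation",
    "graph based ranking of entities",
    "image segmentation with graph cuts",
]
docs = [doc.split() for doc in corpus]
df = Counter(term for doc in docs for term in set(doc))  # document frequency
N = len(docs)

def tfidf(term, doc):
    tf = doc.count(term) / len(doc)        # term frequency in this abstract
    idf = math.log(N / df[term])           # rarer terms score higher
    return tf * idf

abstract = docs[2]
scores = {term: tfidf(term, abstract) for term in set(abstract)}
# Corpus-rare terms like "cuts" outscore common ones like "graph",
# so they surface as mention candidates.
```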
Using Wikipedia to cluster documents:
Xiaohua Hu et al. used Wikipedia to cluster documents in “Exploiting Wikipedia as External Knowledge for Document Clustering” (KDD 2009). In this work, documents are enriched with Wikipedia concepts and category information, including both exact concept matches and related concepts. Similar documents are then grouped based on the document content, the added Wikipedia content, and the category information. This method was evaluated on three datasets: TDT2, LA Times, and 20-newsgroups. Several clustering methods were compared:
  1. Clustering based on the word vector
  2. Clustering based on the concept vector
  3. Clustering based on the category vector
  4. Clustering based on the combination of word and concept vectors
  5. Clustering based on the combination of word and category vectors
  6. Clustering based on the combination of concept and category vectors
  7. Clustering based on the combination of word, concept, and category vectors
They found that across all three datasets, clustering based on word and category vectors (method #5) and clustering based on word, concept, and category vectors (method #7) consistently achieved the best results.
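Methods #4 through #7 combine vectors from different feature spaces. A minimal sketch of the combination idea (the feature names, weights, and prefixing scheme are illustrative assumptions, not the paper's exact representation):

```python
# Combine word, concept, and category vectors into one feature space by
# prefixing each feature with its source so the spaces do not collide.
def combine(word_vec, concept_vec, category_vec):
    combined = {}
    for prefix, vec in (("w:", word_vec), ("c:", concept_vec), ("k:", category_vec)):
        for feature, weight in vec.items():
            combined[prefix + feature] = weight
    return combined

# A document mentioning "apple", enriched with an invented Wikipedia
# concept and category:
doc = combine({"apple": 2.0}, {"Apple_Inc.": 1.5}, {"Technology_companies": 1.0})
print(sorted(doc))  # ['c:Apple_Inc.', 'k:Technology_companies', 'w:apple']
```

Any standard clustering algorithm can then operate on the combined vectors.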
Using Wikipedia to annotate documents:
Wikipedia has been used to annotate documents, as in the paper “Wikipedia as an ontology for describing documents” (ICWSM 2008) by Zareen Saba Syed et al. Wikipedia text and links were used to identify topics related to terms in a given document. Three methods were tested: using the article text alone, the article text and categories with spreading activation, and the article text and links with spreading activation. However, the accuracy of this approach depends on several factors, such as whether a Wikipedia page links to non-relevant articles, the presence of links between related concepts, and the extent to which a concept appears in Wikipedia.
Using Wikipedia to create recommendations:
Wiki-Rec uses Wikipedia to create semantically based recommendations. This technique is discussed in the paper “Wiki-rec: A semantic-based recommendation system using Wikipedia as an ontology” (ISDA 2010) by Ahmed Elgohary et al. They predicted terms common to a set of documents. In this work, the user reads a document and evaluates it. Then, using Wikipedia, all the concepts in the document are annotated and stored, and the user's profile is updated with the new information. By matching the user's profile with other users' profiles that contain similar interests, a list of recommended documents is presented to the user. The overall system model is shown in Figure 2.
Using Wikipedia to match ontologies:
Other work, such as “WikiMatch - Using Wikipedia for Ontology Matching” (OM 2012) by Sven Hertling and Heiko Paulheim, used Wikipedia to determine whether two ontologies are similar, even if they are in different languages. In this work, the Wikipedia search engine is used to retrieve articles related to a term, and then all language links for those articles are collected. Two concepts are compared by comparing their articles' titles. However, this approach is time-consuming because of the repeated querying of Wikipedia.
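The title comparison can be sketched as set overlap between the article titles retrieved for two concepts; the titles and the Jaccard measure below are illustrative assumptions, not necessarily the paper's exact similarity function:

```python
# Jaccard similarity between two sets of Wikipedia article titles
# (across languages) retrieved for two ontology concepts.
def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented title sets for two concepts from different ontologies:
titles_concept1 = {"Automobile", "Auto", "Voiture"}
titles_concept2 = {"Automobile", "Car", "Voiture"}
print(jaccard(titles_concept1, titles_concept2))  # 0.5
```

A high overlap suggests the two concepts describe the same thing, even when their ontology labels are in different languages.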
In conclusion, Wikipedia is not only an information source; it has also been used to build corpora, classify entities, cluster documents, annotate documents, create recommendations, and match ontologies.
-Lulwah M. Alkwai

2018-12-03: Acidic Regression of WebSatchel

Mat Kelly reviews WebSatchel, a browser-based personal preservation tool.

Shawn Jones (@shawnmjones) recently made me aware of a personal tool to save copies of a Web page using a browser extension called "WebSatchel". The service is somewhat akin to the offerings of browser-based tools like Pocket (now bundled with Firefox after a 2017 acquisition) among many other tools. Many of these types of tools use a browser extension that allows the user to send a URI to a service that creates a server-side snapshot of the page. This URI delegation procedure aligns with Internet Archive's "Save Page Now", which we have discussed numerous times on this blog. In comparison, our own tool, WARCreate, saves "by-value".

With my interest in any sort of personal archiving tool, I downloaded the WebSatchel Chrome extension, created a free account, signed in, and tried to save the test page from the Archival Acid Test (which we created in 2014). My intention in doing this was to evaluate the preservation capabilities of the tool-behind-the-tool, i.e., that which is invoked when I click "Save Page" in WebSatchel. I was shown this interface:

Note the thumbnail of the screenshot captured. The red square in the 2014 iteration of the Archival Acid Test (retained at the same URI-R for posterity) indicates that a user must interact with the page for the content to load and thus be accessible for preservation. With respect to evaluating only the tool's capture ability, the red in the thumbnail may not reflect the capture itself. Repeating the procedure after I "surfaced" the red square on the live web (i.e., interacted with the page before telling WebSatchel to grab it) resulted in a thumbnail where all squares were blue. This suggests that WebSatchel uses the browser's screenshot extension API at the time of URI submission rather than creating a screenshot of its own capture. The limitation of the screenshot to the viewport (rather than the whole page) also points to this.


I then clicked the "Open Save Page" button and was greeted with a slightly different result. This capture resided at

curling that URI results in an inappropriately used HTTP 302 status code that appears to indicate a redirect to a login page.

$ curl -I
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 19:44:59 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Content-Type: text/html

Note the lack of a scheme in the Location header. RFC 2616 (HTTP/1.1) Section 14.30 requires the Location value to be an absolute URI (per RFC 3986 Section 4.3). To see whether their hostname-leading redirect pattern could be legitimate, I also checked the more current RFC 7231 Section 7.1.2, which revises the Location response value to be a URI reference in the spirit of RFC 3986. This updated HTTP/1.1 RFC allows relative references, as was already common practice prior to RFC 7231. WebSatchel's Location pattern causes browsers to interpret the hostname as a relative path per the standards, causing a redirect to

$ curl -I
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 20:13:04 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5

...and repeated recursively until the browser reports "Too Many Redirects".
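The loop can be reproduced with Python's standard URL resolution; the hostname below is a hypothetical stand-in for WebSatchel's, since the actual capture URL is omitted above:

```python
from urllib.parse import urljoin

# A scheme-less Location value like "host/path" has no "//" prefix, so per
# RFC 3986 it is resolved as a relative path against the current URL,
# appending the hostname as a path segment on every hop.
base = "https://websatchel.example/login"
location = "websatchel.example/login"  # scheme-less Location header value

first = urljoin(base, location)
second = urljoin(first, location)
print(first)   # https://websatchel.example/websatchel.example/login
print(second)  # https://websatchel.example/websatchel.example/websatchel.example/login
```

Each response repeats the same Location value, so the path grows until the browser gives up with "Too Many Redirects".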

Interacting with the Capture

Despite the redirect issue, interacting with the capture retains the red square. In the case where all squares were blue on the live Web, the aforementioned square was red when viewing the capture. In addition, two of the "Advanced" tests (advanced relative to 2014 crawler capability, not particularly new to the Web at the time) were missing: an iframe (without anything CORS-related behind the scenes) and an embedded HTML5 object (using the standard video element, nothing related to Custom Elements).

"Your" Captures

I had hoped to also evaluate archival leakage (aka zombies), but the service did not seem to provide a way for me to save my captures to my own system; i.e., "your" archives are remotely (and solely) hosted. While investigating a way to liberate my captures, I noticed that the default account is simply a trial of the service, which ends a month after account creation, and that the monthly pricing model is relatively steep. The "free" account is also limited to 1 GB per account and 3 pages per day, with no access to their "page marker" feature, WebSatchel's system for a sort of text-highlighting annotation.


WebSatchel has browser extensions for Firefox, Chrome, MS Edge, and Opera, but the data liberation scheme leaves a bit to be desired, especially for personal preservation. As a quick final test, without holding my breath for too long, I used my browser's DevTools to observe the HTTP response headers for the URI of my Acid Test capture. As above, attempting to access the capture via curl would require circumventing the infinite redirect and manually going through an authentication procedure. As expected, nothing resembling Memento-Datetime was present in the response headers.
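A capture that self-identifies as a memento would expose the Memento-Datetime header (RFC 7089). A small sketch of the check, using invented header dictionaries rather than live responses:

```python
from email.utils import parsedate_to_datetime

def memento_datetime(headers):
    """Return the archival datetime if the response self-identifies as a
    memento via the Memento-Datetime header (RFC 7089), else None."""
    value = headers.get("Memento-Datetime")
    return parsedate_to_datetime(value) if value else None

# Headers like the capture's response above lack the header entirely:
print(memento_datetime({"Content-Type": "text/html"}))  # None

# A conforming web archive's response, by contrast, would include it:
print(memento_datetime({"Memento-Datetime": "Mon, 03 Dec 2018 19:44:59 GMT"}))
```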

—Mat (@machawk1)

Friday, November 30, 2018

2018-11-30: Archives Unleashed: Vancouver Datathon Trip Report

The Archives Unleashed Datathon #Vancouver was a two-day event held November 1-2, 2018, hosted by the Archives Unleashed team in collaboration with Simon Fraser University Library and Key, SFU's big data initiative. This was the second in a series of Archives Unleashed datathons funded by The Andrew W. Mellon Foundation, and the first time that I, Mohammed Nauman Siddique of the Web Science and Digital Libraries research group (WS-DL) at Old Dominion University, traveled to a datathon.

Day 1

The event kicked off with Ian Milligan welcoming all the participants to the Archives Unleashed Datathon #Vancouver, followed by a welcome speech from Gwen Bird, University Librarian at SFU, and Peter Chow-White, Director and Professor at the GeNA Lab. After the welcome, Ian talked about the Archives Unleashed Project, why we care about web archives, the purpose of organizing the datathons, and the roadmap for future datathons.
Ian's talk was followed by Nick Ruest walking us through the details of the Archives Unleashed Toolkit and the Archives Unleashed Cloud. For more information about the Archives Unleashed Toolkit and Cloud services, you can follow them on Twitter or check their website.
For the datathon, Nick had already loaded all the datasets onto six virtual machines provided by Compute Canada. We were given twelve dataset options, courtesy of the University of Victoria, the University of British Columbia, Simon Fraser University, and the British Columbia Institute of Technology.
Next, the floor was open for us to decide on our projects and form teams. We arranged our individual choices on the whiteboard, with information about the dataset we wanted to use in blue, the tools we intended to use in pink, and the research questions we cared about in yellow. The teams formed quickly based on the datasets and the purpose of each project. The first team, led by Umar Qasim, wanted to work on the ubc-bc-wildfires dataset, a collection of webpages related to wildfires in British Columbia, to understand and find relationships between the events and media articles about the wildfires. The second team, led by Brenda Reyes Ayala, wanted to work on improving the quality of archived pages using the uvic-anarchist-archives dataset. The third team, led by Matt Huculak, wanted to investigate the politics of British Columbia using the uvic-bc-2017-candidates dataset. The fourth team, led by Kathleen Reed, wanted to work on ubc-first-nations-indigenous-communities to investigate the history of First Nations indigenous communities and its discourse in the media.

I worked with Matt Huculak, Luis Meneses, Emily Maemura, and Shahira Khair on the British Columbia candidates dataset. Thanks to Nick, we had already been provided with the derivative files for our datasets, which included a list of all the captured domain names with their archival counts, the text extracted from all the WARC files with basic file metadata, and a Gephi file with a network graph. It was the first time the Archives Unleashed team had provided participating teams with derivative files, which saved us hours of wait time that would otherwise have been spent extracting all this information from the dataset's WARC files. We continued to work on our projects through the day, with a break for lunch. Ian moved around the room to check on all the teams, motivate us with his light humor, and provide any help we needed to get going on our projects.

Around 4 pm, the floor was open for the Day 1 talk session. The session started with Emily Maemura (PhD student at the University of Toronto) presenting her research on understanding the use and impact of web archives. Emily's talk was followed by Matt Huculak (Digital Scholarship Librarian at the University of Victoria), who talked about the challenges libraries face in creating web collections using Archive-It. He emphasized the use of regular expressions in Archive-It and the problems they pose to non-technical librarians and web archivists. Nick Ruest presented Warclight and its framework, the latest service released by the Archives Unleashed team, followed by a working demo. Last but not least, I presented my research on Congressional deleted tweets: why we care about deleted tweets, the difficulties involved in curating the dataset for Members of Congress, and results on the distribution of deleted tweets across the multiple services that can be used to track them.

We called it a day at 4:30 pm, only to meet again for dinner at 5 pm at the Irish Heather in downtown Vancouver, where Nick, Carl Cooper, Ian, and I had a long conversation ranging from politics to archiving to libraries. After dinner, we parted ways to meet again fresh the next day.

Day 2

The morning of Day 2 greeted us with a clear view of the mountains across Vancouver harbor, a perfect start to the day. We continued our project with the occasional distraction of taking pictures of the beautiful view in front of us. We brainstormed over our network graph and bubble chart visualizations from Gephi to understand the relationships between all the URLs in our dataset, and categorized the captured URLs into political party URLs, social media URLs, and the rest. While reading the list of crawled domains in the dataset, we discovered a bias towards a particular domain, which accounted for approximately 510k of the approximately 540k mementos. The standout domain was owned by Brian Taylor, who ran as an independent candidate. We set out to investigate the reason behind that bias by parsing out and analyzing the status codes from the response headers in each WARC file. We found that of the approximately 540k mementos, only 10k had status code 200 OK; the rest were 301s, 302s, or 404s. Our investigation of all the URLs crawled for that domain led us to the conclusion that it was a calendar trap for crawlers.
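The status-code tally we performed can be sketched as follows; the records below are invented stand-ins for the (URL, status) pairs we parsed out of the WARC response headers:

```python
from collections import Counter

# Invented sample of (URL, status) pairs extracted from WARC response
# records; a calendar trap yields endless parameterized 404s like these.
records = [
    ("http://candidate.example/calendar?y=2019&m=1", 404),
    ("http://candidate.example/calendar?y=2019&m=2", 404),
    ("http://candidate.example/calendar?y=2019&m=3", 404),
    ("http://candidate.example/", 200),
    ("http://candidate.example/old", 301),
]

counts = Counter(status for _, status in records)
print(counts.most_common())  # [(404, 3), (200, 1), (301, 1)]
```

Sorting the tally by count immediately surfaces how little of the crawl returned 200 OK.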

Most relevant topics word frequency count in BC Candidates dataset 

During lunch, we had three talks scheduled for Day 2. The first speaker was Umar Qasim from the University of Alberta, who talked about the current status of web archiving in their university library and discussed some of their future plans. The second presenter, Brenda Reyes Ayala, Assistant Professor at the University of Alberta, talked about measuring archival damage and the metrics to evaluate it, which she had discussed in her PhD dissertation. Lastly, Samantha Fritz talked about the future of the Archives Unleashed toolkit and cloud service, mentioning that starting in 2019, computations using the Archives Unleashed toolkit will be a paid service.

Team BC 2017 Politics

We were the first to present, starting with a talk about the BC Candidates dataset at our disposal and the different visualizations we had used to understand it. We talked about the relationships between different URLs and their connections, and highlighted the crawler trap issue: our dataset comprised 510k of 540k mementos from a single domain. The reason for the large memento count from that one domain was a calendar crawler trap, which became evident on analyzing all the URLs crawled for it. Of the 510k mementos crawled, only six were 302s and seven were 200s; the rest returned 404. In a nutshell, we had a meager seven mementos with useful information out of approximately 510k crawled for this domain, and the full 540k-memento dataset contained only about 10k mementos with relevant information. Based on our brainstorming over the two days, we summarized lessons learned and advice for future historians curating seeds for Archive-It collections.

Team IDG

Team IDG started off by talking about the difficulties they faced in settling on their final dataset, walking us through the different datasets they tried before choosing the one (ubc-hydro-site-c) used in their project. They presented a visualization of top keywords by frequency count and the relationships between keywords, and highlighted the difficulty of extracting text from tables, along with their solution. They then walked us through the steps involved in plotting their events on a map: using the table of processed text to geocode their dataset, plotting it onto a map showing the occurrences of the events, and presenting a timeline of how the events evolved over time.

Team Wildfyre

Team Wildfyre opened their talk with a description of their dataset and the other datasets used in their project. They talked about their research questions and tools, and presented multiple visualizations showing top keywords, top named entities, and a geocoded map of the events, as well as a heat map of the dataset's distribution across domain names. They pointed out that even when analyzing named entities in the wildfire dataset, the most talked-about entity during these events was Justin Trudeau.

Team Anarchy


Team Anarchy split their project into two smaller projects. The first, undertaken by Ryan Deschamps, was about finding linkages between all the URLs in the dataset. He presented a concentric-circles graph of the linkage between pages from depth 0 to 5, finding that following links from the base URL to depth 5 led to a spam or government website in most cases. He also talked about the challenges of extracting images from the WARC files and comparing them to their live-web counterparts. The second project, undertaken by Brenda, was about capturing archived pages and measuring how much they differ from their live versions; she showed multiple examples with varying degrees of difference between archived and live pages.

Once the presentations were done, Ian asked us all to write out our votes, with the winner decided by popular vote. Congratulations to Team IDG for winning the Archives Unleashed Datathon #Vancouver. For closing comments, Nick talked about what to take away from these events and how to build a better web archiving research community. After all the suspense, the next edition of the Archives Unleashed Datathon was announced.

More information about the Archives Unleashed Datathon #WashingtonDC can be found on their website or by following the Archives Unleashed team on Twitter.

This was my first time at an Archives Unleashed Datathon. I went with the idea of meeting, all under one roof, the researchers, librarians, and historians who propel the web archiving research domain. The organizers strike a fine balance by bringing research communities with diverse backgrounds and experience together with the web archiving community. It was an eye-opening trip: I learned from my fellow participants about their work, how libraries build collections for web archives, and the difficulties and challenges they face. Thanks to Carl Cooper, Graduate Trainee at the Bodleian Libraries, Oxford University, for strolling through downtown Vancouver with me. I am really excited and looking forward to the next edition of the Archives Unleashed Datathon in Washington, DC.

View of Downtown Vancouver
Thanks again to the organizers (Ian Milligan, Rebecca Dowson, Nick Ruest, Jimmy Lin, and Samantha Fritz), their partners, and the SFU library for hosting us. Looking forward to seeing you all at future Archives Unleashed datathons.

Mohammed Nauman Siddique

2018-11-30: The Illusion of Multitasking Boosts Performance

Illustration showing hands completing many tasks at once
Today, I read an article titled "The Illusion of Multitasking Boosts Performance". At first, I thought it argued for doing a single task at a time, but after reading it, I found that it does not. It actually supports multitasking, but in the sense that workers "believe" the work they are doing is a combination of multiple tasks.

The original paper published in Psychological Science has a title "The Illusion of Multitasking and Its Positive Effect on Performance". 

In my opinion, the original article's title is accurate, but the press release reveals only part of the story and actually distorts the original meaning of the article. Readers may come away with the illusion that multitasking produces a negative effect.

Jian Wu

Thursday, November 15, 2018

2018-11-15: LANL Internship Report

Los Alamos National Laboratory
On May 27, I landed in sunny Santa Fe, New Mexico to start my six-month internship at Los Alamos National Laboratory (LANL) with the Digital Library Research and Prototyping Team, under the guidance of Herbert Van de Sompel and WSDL alumnus Martin Klein.

Work Accomplished

A majority of my time went to the Scholarly Orphans project, a joint project between LANL and ODU sponsored by the Andrew W. Mellon Foundation. The project explores, from an institutional perspective, how an institution can discover, capture, and archive the scholarly artifacts its researchers deposit in various productivity portals. After months of work on the project, Martin Klein showcased the Scholarly Orphans pipeline at TPDL 2018.

Scholarly Orphans pipeline diagram

My main task for this pipeline was to create and manage two components: the artifact tracker and the pipeline orchestrator. Communication between components was accomplished using ActivityStreams 2.0 (AS2) messages and Linked Data Notification (LDN) inboxes for sending and receiving messages. AS2 messages describe events users have accomplished in a "human friendly but machine-processable" JSON format. LDN inboxes provide endpoints where messages can be received, advertising these endpoints via link headers; applications (senders) can discover the endpoints and send messages to them (receivers). In this case, each component was both a sender and a receiver. For example, the orchestrator sends an AS2 message to the tracker component's inbox to start tracking a user across a list of portals; the tracker responds by sending an AS2 message with its results to the orchestrator's inbox, which is then saved in a database.

This pipeline was designed as a distributed network in which the orchestrator knows where each component's inbox is before sending messages. The tracker, capture, and archiver components are told by the orchestrator where to send their AS2 messages and where their generated AS2 event messages will be accessible. An example AS2 message from the orchestrator to the tracker shows an event object with a "to" endpoint telling the tracker where to send its message and a "tracker:eventBaseUrl" to which a UUID is appended to form the URL where the event generated by the tracker will be accessible. After the tracker has found events for the user, it generates a new AS2 message and sends it to the orchestrator's "to" endpoint.
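A minimal sketch of such an orchestrator-to-tracker AS2 message; every field name and URL below is an illustrative assumption rather than the project's actual schema:

```python
import json
import uuid

# Hypothetical endpoints; the real Scholarly Orphans URLs are not shown here.
event_base = "https://tracker.example.org/events/"
message = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Offer",
    "actor": "https://orchestrator.example.org/",
    "to": "https://tracker.example.org/inbox/",  # the tracker's LDN inbox
    "tracker:eventBaseUrl": event_base,          # base for the tracker's event URL
    "object": {"type": "Person", "id": "https://orcid.org/0000-0000-0000-0000"},
}

# The tracker appends a UUID to eventBaseUrl to mint the URL where its
# generated AS2 event message will be accessible:
event_url = event_base + str(uuid.uuid4())
print(json.dumps(message, indent=2))
```

The tracker would POST its own AS2 result message back to the orchestrator's inbox in the same fashion.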

Building the tracker and orchestrator components allowed me to learn a great deal about W3C Web standards, mostly relating to the Semantic Web. I also had to learn various technologies along the way, including Elasticsearch as a database, Celery task scheduling, Docker Compose in a production environment, Flask and uWSGI as a Python web server, and working with OAI-PMH interfaces.

I was also exposed to the various technologies the Prototyping Team had developed previously and included these technologies in various components of the Scholarly Orphans pipeline. These included: Memento, Memento Tracer, Robust Links, and Signposting.

The prototype interface of the Scholarly Orphans project is hosted at for a limited time. On the website you can see the various steps of the pipeline, the AS2 event messages, the WARCs generated by the capture process, and the replay of the WARCs via the archiver process for each of the researcher's productivity portal events. The tracker component of the Scholarly Orphans pipeline is available on GitHub:

New Mexico Lifestyle


Over the course of my stay I stayed in a house located in Los Alamos shared by multiple Ph.D. students studying in diverse fields such as Computer Vision, Nuclear Engineering, Computer Science, and Biology. The views of the mountains were always amazing and only ever accompanied by rain during the monsoon season. A surprising discovery during the summer was that there always seemed to be a forest fire somewhere in New Mexico. 
Los Alamos, NM


During my stay and adventures I found out the level of spiciness that apparently every New Mexican had become accustomed to by adding the local Green Chile to practically any and/or every meal. 


Within the first two weeks of landing I had already planned a trip to Southern NM. Visiting Roswell, NM I discovered aliens were very real.
Roswell, NM International UFO Museum
Going further south I got to visit Carlsbad, NM the home of the Carlsbad Caverns which were truly incredible.
Carlsbad, NM Carlsbad Caverns
I was able to visit Colorado for a few days and went on a few great hikes. On August 11, I got to catch the Rockies vs. Dodgers MLB game, where I saw a walk-off home run by the Rockies for the first time.

I also managed a weekend road trip to Zion Canyon, Utah allowing me to hike some great trails like Observation Point Trail, The Narrows, and Emerald Pools.
Zion Canyon, Utah - Observation Point Trail


If you're a visiting researcher not hired by the lab, consider living in a shared home with other students. This can help alleviate boredom and help you find people to plan trips with; otherwise you will usually be excluded from the events the lab plans for its students.

If you're staying in Los Alamos, plan to make weekend trips out to Santa Fe. Los Alamos is beautiful and has some great hikes, but can frequently be short on entertainment.

Final Impressions

I feel very blessed to have been offered this six-month internship. At first I was reluctant to move out West; however, it allowed me to travel to many great locations with new friends. My internship exposed me to various subjects relating to WS-DL research, which will surely improve, expand, and influence my own research in the future.

A special thanks to Herbert Van de Sompel, Martin Klein, Harihar Shankar, and Lyudmila Balakireva for allowing me to collaborate, contribute, and learn from this fantastic team during my stay at LANL.

--Grant Atkins (@grantcatkins)

Monday, November 12, 2018

2018-11-12: Google Scholar May Need To Look Into Its Citation Rate

Google Scholar has long been regarded as a digital library containing the most complete collection of scholarly papers and patents. For a digital library, completeness is very important because otherwise you cannot guarantee the accuracy of a paper's citation rate, or equivalently the in-links of a node in the citation graph. That is probably why Google Scholar is still more widely used and trusted than other digital libraries with fancier features.

Today, I found two very interesting aspects of Google Scholar: one clever and one silly. The clever side is that Google Scholar distinguishes papers, preprints, and slides, and counts their citations separately.

If you search for "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", you may see the same view I saw. Note that there are three results. The first is a paper on IEEE. The second actually lists a completely different set of authors; these people probably gave a presentation of that paper. The third is a preprint on arXiv. These three have different numbers of citations, as they should.

The silly side is also reflected in the search results. How does a paper published less than a year ago receive more than 1,900 citations? You may say it is simply a super popular paper, but if you look into the citations, some do not make sense. For example, the first paper that "cites" the DeepLab paper was published in 2015! How could it cite a paper published in 2018?

Actually, the first paper's citation rate is also problematic. A paper published in 2015 was cited more than 6,500 times! And another paper published in 2014 was cited more than 16,660 times!

Something must be wrong with Google Scholar! The good news is that the numbers look higher, which makes everyone happy! :)

Jian Wu

Saturday, November 10, 2018

2018-11-11: More than 7000 retracted abstracts from IEEE. Can we find them from IA?

Science magazine:

More than 7,000 abstracts were quietly retracted from the IEEE database. Most of these abstracts are from IEEE conferences that took place between 2009 and 2011. The plot below clearly shows when the retractions happened. The reason given was weird:
"After careful and considered review of the content of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE’s Publication Principles. "
Similar things happened in Nature subsidiary journal (link) and other journals (link).

The question is: can we find them in the Internet Archive? Can they still be legally posted on a digital library like CiteSeerX? If so, they could provide a very unique training dataset for fraud and/or plagiarism detection, assuming that the reason under the hood is one of those.

Jian Wu

2018-11-10: Scientific news and reports should cite original papers


I highly encourage all scientific news stories and reports to cite the corresponding articles. ScienceAlert usually does a good job on this. This piece of scientific news from ScienceAlert reports the discovery of two rogue planets. Most planets we have discovered orbit a star. A rogue planet does not orbit a star, but rather the center of the galaxy. Because planets do not emit light, rogue planets are extremely hard to detect. This piece of news cites a recently published paper on arXiv. Although anybody can publish papers on arXiv, papers published by reputable organizations should be reliable.

A reliable citation is beneficial for all parties. It makes the scientific news more trustworthy. It gives credit to the original authors. It can also connect readers to a place to explore other interesting science.

Jian Wu

Friday, November 9, 2018

2018-11-09: Grok Pattern

Grok is a way to match a text line against a regular expression, map specific parts of the line into dedicated fields, and perform actions based on this mapping. Grok patterns are (usually long) regular expressions that are widely used in log parsing. With tons of search engine logs, effectively parsing them and extracting useful metadata for analytics, training, and prediction has become a key problem in mining big text data.
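To make that mapping concrete, here is a minimal sketch in Python of how a Grok placeholder like %{IP:client} can expand into a named regular-expression group. The pattern definitions below are simplified stand-ins I made up for illustration; the real library that ships with Logstash defines hundreds of them.

```python
import re

# Simplified stand-ins for a few Grok pattern definitions
# (the real Logstash library ships with hundreds of these).
GROK_TYPES = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
}

def grok_to_regex(pattern):
    """Expand each %{TYPE:name} placeholder into a named regex group."""
    def expand(match):
        grok_type, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, GROK_TYPES[grok_type])
    return re.sub(r"%\{(\w+):(\w+)\}", expand, pattern)

# Map specific parts of a log line into dedicated fields.
log_line = "127.0.0.1 GET 200"
regex = grok_to_regex("%{IP:client} %{WORD:method} %{NUMBER:status}")
fields = re.match(regex, log_line).groupdict()
# fields == {'client': '127.0.0.1', 'method': 'GET', 'status': '200'}
```

Once the line is mapped into fields like this, downstream actions (filtering, aggregation, alerting) can operate on named fields instead of raw text, which is what makes Grok so convenient for log parsing.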

In this article, Ran Ramati gives a beginner's guide to Grok patterns used in Logstash, one of the powerful tools in the Elastic Stack (the other two are Kibana and Elasticsearch).

The StreamSets webpage gives a list of Grok pattern examples.

A recent paper by a Huawei research lab in China summarizes and compares a number of log parsing tools.

I am kind of surprised that although they cited the Logstash website, they did not compare Logstash with its peers.

 Jian Wu

Thursday, November 8, 2018

2018-11-08: Decentralized Web Summit: Shaping the Next Web

In my wallet I have a few ₹500 Indian currency notes that say, "I PROMISE TO PAY THE BEARER THE SUM OF FIVE HUNDRED RUPEES", followed by the signature of the Governor of the Reserve Bank of India. However, this promise was broken two years ago today; since then, these bills in my pocket have been nothing more than rectangular pieces of printed paper. So, I decided to utilize my origami skills and turn them into butterflies.

On November 8, 2016, at 8:00 PM (Indian Standard Time), Indian Prime Minister Narendra Modi announced the demonetization (effective four hours later, at midnight) of the two biggest currency notes (₹1,000 and ₹500) in circulation at that time. Together these two notes represented about 86% of India's total cash economy at the time. More than 65% of the Indian population still lives in rural and remote areas where the availability of electricity, the Internet, and other utilities is not yet reliable. Hence, cash is a very common means of doing business in daily life there. It was morning here in Norfolk (USA) and I was going through the news headlines when I saw this announcement. For a while I could not believe that the news was real and not a hoax. I did not even know that there was a concept called demonetization that governments could practice. Irrespective of my political views, and irrespective of the intents and goals behind the decision (however good or bad they might have been), I was shocked to realize that the system has so much centralization of power in place that a single person can decide, overnight, the suffering of about 18% of the global population and cause chaos in the system. I wished for a better and more resilient system; I wanted a system with decentralization of power by design, where no one entity has a significant share of power and influence. I wanted a DECENTRALIZED SYSTEM!

When the Internet Archive (IA) announced plans for the Decentralized Web (DWeb) Summit, I was on board to explore what we can do to eliminate the centralization of control and power in systems on the Web. With generous support from the Protocol Labs, AMF, and NSF IIS-1526700 grants, I was able to travel to the West Coast to experience four days full of fun and many exciting events. I got the opportunity to meet many big names who brought us the Web we experience today, and many of those who are working towards shaping the future of the Web with their vision, ideas, experience, code, art, legal understanding, education, or social values. They all had different perspectives to share, but all seemed to agree on one goal: fixing the current Web, where freedom of expression is under an ever-growing threat, governments control the voice of dissent, big corporations use Internet users' personal data for monetary benefit and political influence, and those in power try to suppress the history they might be uncomfortable with.

There was so much going on in parallel that perhaps no two people experienced the same sequence of events, and I will not even pretend to recount everything I observed there. In this post I will briefly describe my experience of the following four related events that happened between July 31 and August 3, 2018:

  • IndieWebCamp SF
  • Science Fair
  • Decentralized Web Summit
  • IPFS Lab Day

IndieWebCamp SF

The IndieWeb is a people-focused alternative to the "corporate web". Its objectives include: 1) your content is yours, 2) you are better connected, and 3) you are in control. Some IndieWeb people at Mozilla decided to host IndieWebCamp SF, a bootcamp the day before the DWeb Summit started, and shared an open invitation to all participants. I was quick to RSVP for what was going to be my first interaction with the IndieWeb.

On my way from the hotel to Mozilla's SF office, the Uber driver asked me why I came to SF. I replied, "to participate in an effort to decentralize the Web". She seemed puzzled and said, "my son was mentioning something about it, but I don't know much". "Have you heard about Bitcoin?", I asked her, to get an idea of how to explain. "I have heard the term in the news, but don't really know much about it", she said. So, I started the elevator pitch, and in the next eight or so minutes (about four round trips of Burj Khalifa's elevator from the ground to the observation deck) I was able to explain some of the potential dangers of centralization in different aspects of our social lives and what some of the alternatives are.

The bootcamp had both on-site and remote participants and was well organized. We started with keynotes from Miriam Avery, Dietrich Ayala, and Ryan Barrett; then some people introduced themselves, explained why they were attending the DWeb Summit, and shared what ideas they had for the IndieWeb bootcamp. Some people gave lightning demos. I demonstrated InterPlanetary Wayback (IPWB) briefly. I got to meet some of the people behind projects I was well aware of (such as Universal Viewer and the Dat Project) and also got to know about some projects I didn't know before (such as Webmention and Scuttlebutt). We then scheduled BarCamp breakout sessions and had lunch.

During and after the lunch I had an interesting discussion and exchanged ideas with Edward Silverton from the British Library and a couple of people from Mozilla's Mixed Reality team about the Universal Viewer, IIIF, Memento, and multi-dimensional XR on the Web.

Later I participated in two sessions "Decentralized Web Archiving" and "Free Software + Indieweb" (see the schedule for notes on various sessions). The first one was proposed by me in which I explained the state of Web archiving, current limitations and threats, and the need to move it to a more persistent and decentralized infrastructure. I have also talked about IPWB and how it can help in distributed web archiving (see notes for details and references). In the latter session we talked about different means to support Free Software and open-source developers (for example bug bounty, crowdfunding, and recurring funding), compared and contrasted different models and their sustainability as compared with closed-source software backed by for-profit organizations. We also touched on some licensing complications briefly.

I had to participate in the Science Fair at IA, so I had to get there a little earlier than the start time of the session. With that in mind, Dietrich (from the Firefox team) and I left the session a little before it was formally wrapped up as the SF traffic in the afternoon was going to make it a rather long commute.

Science Fair

The taxi driver was an interesting person with whom Dietrich and I shared the ride from the Mozilla SF office to the Internet Archive, talking about national and international politics, history, languages, music, and whatnot until we reached our destination, where food trucks and stalls were serving dinner. It was windier and chillier out there than I had anticipated in my rather thin jacket. Brewster Kahle, the founder of the IA, who had just come out of the IA building, welcomed us and navigated us to the registration desk, where a very helpful team of volunteers gave us our name badges and project sign holders. I acquired a table right outside the entrance of the IA's building, placed the InterPlanetary Wayback sign on it, and went to the food truck to grab my dinner. When I came back I found that the wind had blown my project sign off the table, so I moved it inside the building, where it was a lot cozier and more crowded.

The Science Fair event was full of interesting projects. You may explore the list of all the Science Fair projects along with their description and other details. Alternatively, flip through the pages of the following photo albums of the day.

Many familiar and new faces visited my table, discussed the project, and asked about its functionality, architecture, and technologies. On the one hand I met people who were already familiar with our work and on the other hand some needed a more detailed explanation from scratch. I even met people who asked with a surprise, "why would you make your software available to everyone for free?" This needed a brief overview of how the Open Source Software ecosystem works and why one would participate in it.

This is not a random video. This clip was played to invite Mike Judge, co-creator of HBO's Silicon Valley, on stage for a conversation with Cory Doctorow at the Opening Night Party after Brewster's welcome note (due to a streaming rights issue the clip is missing from IA's full session recording). I can't think of a better way to begin the DWeb Summit. This was my first introduction to Mike (yes, I had not watched the Silicon Valley show before). After an interesting Q&A session on the stage, I got the opportunity to talk to him in person, took a low-light blurred selfie with him, mentioned the Indian demonetization story (which, apparently, he was unaware of), and asked him to make a future show about potential threats on the DWeb. Web 1.0 emerged with a few entities having control over publishing and the rest of the people being consumers of that content. Web 2.0 enabled everyone to participate in the web both as creators and consumers, but privacy and censorship controls went into the hands of governments and a few Internet giants. If Web 3.0 (or the DWeb) could fix this issue too, what would potentially be the next threat? There should be something that we may or may not be able to think of just yet, right?

Mike Judge and Sawood Alam

Decentralized Web Summit

For the next two days (August 1–2) the main DWeb Summit was held in the historic San Francisco Mint building. There were numerous parallel sessions going on all day long. At any given moment there was perhaps a session suitable for everyone's taste, and no one could attend everything they wished to; a quick look at the full event schedule confirms this. Luckily, the event was recorded and the recordings have been made available, so one can watch the various talks asynchronously. However, being there in person to participate in various fun activities, observe artistic creations, experience AR/VR setups, and interact with many enthusiastic people full of hardware, software, and social ideas is not something that can be experienced in recorded videos.

If the father of the Internet, with his eyes closed, trying to create a network with many other participants with the help of a yellow yarn, some people trying to figure out what to do with colored cardboard shapes, and some trying to focus their energy with the help of a specific posture are not enough, then flip through these photo albums of the event to get a glimpse of the many other fun activities we had there.

Initially, I tried to plan my agenda, but I soon realized it was not going to work. So, I randomly picked one of the many parallel sessions of interest, spent an hour or two there, and moved to another room. In the process I interacted with many people from different backgrounds, participating in both their individual and organizational capacities. Apart from the usual talk sessions, we discussed various decentralization challenges and their potential technical and social solutions in one-on-one or small-group conversations. An interesting mention of an additive economy (a non-zero-sum economy where transactions are never negative) reminded me of the gamification idea we explored when working on the Preserve Me! project, and I ended up having a long conversation with a couple of people about it during a breakout session.

If Google Glass was not cool enough then meet Abhik Chowdhury, a graduate student, working on a smart hat prototype with a handful of sensors, batteries, and low-power computer boards placed in a 3D printed frame. He is trying to find a balance in on-board data processing, battery usage, and periodic data transfer to an off-the-hat server in an efficient manner, while also struggling with the privacy implications of the product.

It was a conference where "Crypto" meant "Cryptocurrency", not "Cryptography", and every other participant was talking about Blockchain, Distributed/Decentralized Systems, Content-addressable Filesystems, IPFS, Protocols, Browsers, and a handful of other buzzwords. Many demos there were about "XXX, but decentralized". Participants included pioneers and veterans of the Web and the Internet, browser vendors, blockchain and cryptocurrency leaders, developers, researchers, librarians, students, artists, educators, activists, and whatnot.

I gave a lightning talk entitled "InterPlanetary Wayback: A Distributed and Persistent Archival Replay System Using IPFS" in the "New Discoveries" session. Apart from that, I spent a fair amount of my time there talking about Memento and its potential role in making decentralized and content-addressable filesystems history-aware. During a protocol-related panel discussion, I worked with a team of four people (including members from the Internet Archive and MuleSoft) to pitch the need for a decentralized naming system that is time-aware (along the lines of IPNS-Blockchain) and can resolve a version of a resource at a given time in the past. I also talked to many people from Google Chrome, Mozilla Firefox, and other browser vendors and tried to emphasize the need for native Memento support in web browsers.

Cory Doctorow's closing keynote, "Big Tech's problem is Big, not Tech", was perhaps one of the most talked-about talks of the event and received many reactions and commentary. The recorded video of his talk is worth watching. Among many other things in his talk, he encouraged people to learn programming and to understand the functions of each piece of software we use. After his talk, an artist asked me how she, or anyone else, could learn programming. I told her that if one can learn a natural language, then programming languages are far more systematic, less ambiguous, and easier to learn. There are really only three basic constructs in a programming language: variable assignment, conditionals, and loops. Then I verbally gave her a very brief example of a mail merge using all three constructs, one that yields gender-aware invitations from a message template for a list of friends to be invited to a party. She seemed enlightened and delighted (while enthusiastically sharing her freshly learned knowledge with other members of her team) and exchanged contacts with me to learn about more learning resources.
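The mail-merge example I described verbally can be sketched in a few lines of Python. The names and the message template here are hypothetical, but all three constructs appear: a loop over the friends, a conditional on gender, and variable assignments building each message.

```python
# Hypothetical guest list: (name, gender) pairs.
friends = [("Alice", "F"), ("Bob", "M"), ("Carol", "F")]

invitations = []
for name, gender in friends:        # loop over the list of friends
    if gender == "F":               # conditional: pick a gender-aware title
        title = "Ms."
    else:
        title = "Mr."
    # variable assignment: fill the message template for this friend
    message = "Dear " + title + " " + name + ", you are invited to the party!"
    invitations.append(message)

for invitation in invitations:
    print(invitation)
```

Nothing more than these three building blocks is needed, which is exactly the point I was trying to make: the gap between reading this snippet and writing one like it is much smaller than learning a natural language.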

IPFS Lab Day

It looks like people were too energetic to get tired of such jam-packed and eventful days, as some of them had planned post-DWeb events for special interest groups. I was invited by Protocol Labs to give an extended talk in one such IPFS-centric post-DWeb event, called Lab Day 2018, on August 3. The invitation arrived the day after I had booked my tickets and reserved my hotel room, so I ended up updating my reservations. This event was in a different location, and the venue was decorated with a more casual touch: bean bags, couches, chairs, and benches near the stage, and some containers for group discussions. You can take a glimpse of the venue in these pictures.

They welcomed us with new badges, some T-shirts, and some best-selling books to take home. The event had a good lineup of lightning talks and some relatively longer presentations, mostly extended forms of similar presentations at the main DWeb event. Many of the projects and ideas presented there were in their early stages. These sessions were recorded and published later, after the necessary editing.

I presented my extended talk, entitled "InterPlanetary Wayback: The Next Step Towards Decentralized Archiving". Along with the work already done and published about IPWB, I also talked about what is yet to be done. I explored the possibility of an index-free, fully decentralized, collaborative web archiving system as the next step. I proposed some solutions that would require changes in IPFS, IPNS, IPLD, and the surrounding technologies to accommodate this use case. I encouraged people to discuss with me any better ideas they might have to help solve these challenges. The purpose was to spread the word so that people keep web archiving use cases in mind while shaping the next web. Some people from the core IPFS/IPNS/IPLD developer community approached me, and we had an extended discussion after my talk. The recording of my talk and the slides have been made available online.

It was a fantastic event to be part of, and I am looking forward to more such events in the future. The IPFS community and the people at Protocol Labs are full of fresh ideas and enthusiasm, and they are a pleasure to work with.


The Decentralized Web has a long way to go, and the DWeb Summit is a good place to bring people from various disciplines, with different perspectives, together every once in a while to synchronize all the distributed efforts and to identify the next set of challenges. While I could not attend the first summit (in 2016), I really enjoyed the second one and would love to participate in future events. Those two short days of the main event had more material than I can perhaps digest in two weeks, so my only advice would be to extend the duration of the event instead of having multiple parallel sessions with overlapping interests.

I extend my heartiest thanks to the organizers, volunteers, funders, and everyone involved in making this event happen and making it a successful one. I hope that, going forward, not just the Web but many other organizations, including governments, become more decentralized, so that I never again open my wallet to realize it holds worthless currency bills that were demonetized overnight.


Sawood Alam