2018-11-30: Archives Unleashed: Vancouver Datathon Trip Report

The Archives Unleashed Datathon #Vancouver was a two-day event from November 1 to November 2, 2018, hosted by the Archives Unleashed team in collaboration with Simon Fraser University Library and Key, SFU's big data initiative. This was the second in a series of Archives Unleashed datathons to be funded by The Andrew W. Mellon Foundation. This was the first time that I, Mohammed Nauman Siddique of the Web Science and Digital Libraries research group (WS-DL) at Old Dominion University, traveled to an Archives Unleashed datathon.
Day 1
An appropriately rainy Vancouver mornings to set up #hackarchives SFU. Badges ready, WiFi waiting, coffee’s on... — Ian Milligan (@ianmilligan1) November 1, 2018
The event kicked off with Ian Milligan welcoming all the participants to the Archives Unleashed Datathon #Vancouver. It was followed by a welcome speech from Gwen Bird, University Librarian at SFU, and Peter Chow-White, Director and Professor at GeNA lab. After the welcome, Ian…

2018-11-30: The Illusion of Multitasking Boosts Performance

Today, I read an article titled "The Illusion of Multitasking Boosts Performance". At first, I thought it argued for doing a single task at a time, but after reading it, I found that it does not. It actually supports multitasking, but in the sense that the worker "believes" the work he is doing is a combination of multiple tasks.

The original paper, published in Psychological Science, is titled "The Illusion of Multitasking and Its Positive Effect on Performance".

In my opinion, the original article's title is accurate, but the press release reveals only part of the story and actually distorts the original meaning of the article. The reader actually gets the illusion that multitasking produces a negative effect.

Jian Wu

2018-11-15: LANL Internship Report

On May 27 I landed in sunny Santa Fe, New Mexico to start my 6-month internship at Los Alamos National Laboratory (LANL) with the Digital Library Research and Prototyping Team under the guidance of Herbert Van de Sompel and WS-DL alumnus Martin Klein.

Work Accomplished
A majority of my time was spent working on the Scholarly Orphans project, which is a joint project between LANL and ODU, sponsored by the Andrew W. Mellon Foundation. This project explores, from an institutional perspective, how an institution can discover, capture, and archive the scholarly artifacts that its researchers deposit in various productivity portals. After months of work on the project, Martin Klein showcased the Scholarly Orphans pipeline at TPDL 2018.

A Web-Centric Pipeline for Archiving Scholarly Artifacts from Martin Klein

My main task for this pipeline was to create and manage two components: the artifact tracker and pipeline orchestrator. Communication between different components was completed using Activit…

2018-11-12: Google Scholar May Need To Look Into Its Citation Rate

Google Scholar has long been regarded as the digital library containing the most complete collection of scholarly papers and patents. For a digital library, completeness is very important because otherwise you cannot guarantee the citation rate of a paper, or equivalently the number of in-links of a node in the citation graph. That is probably why Google Scholar is still more widely used and trusted than any other digital library with fancier functions.

Today, I found two very interesting aspects of Google Scholar: one is clever and one is silly. The clever side is that Google Scholar distinguishes papers, preprints, and slides and counts their citations separately.

If you search "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", you may see the same view as I attached. Note that there are three results. The first is a paper on IEEE. The second actually contains a list of completely different authors. These people are pr…

2018-11-11: More than 7000 retracted abstracts from IEEE. Can we find them from IA?

One publisher, more than 7000 retractions
Science magazine:

More than 7000 abstracts were quietly retracted from the IEEE database. Most of these abstracts are from IEEE conferences that took place between 2009 and 2011. The plot below clearly shows when the retractions happened. The reason was weird:
"After careful and considered review of the content of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE’s Publication Principles."
Similar things happened in a Nature subsidiary journal (link) and other journals (link).

The question is: can we find them in the Internet Archive? Can they still be legally posted on a digital library like CiteSeerX? If they can, they could provide a unique training dataset for fraud and/or plagiarism detection, assuming that the reason under the hood is one of the two.
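One way to start answering the first question is to ask the Wayback Machine whether it holds a capture of a given abstract page. The following is only a minimal sketch, assuming the Internet Archive's public availability endpoint (`https://archive.org/wayback/available`) and its usual JSON shape with an `archived_snapshots.closest` entry. The function names and the example URL are hypothetical and chosen for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint: the Wayback Machine availability API.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(url, timestamp=None):
    """Build the availability query for a page URL.

    timestamp is an optional YYYYMMDD-style string; the API returns the
    capture closest to it.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response):
    """Return the URL of the closest capture, or None if nothing is archived."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


def lookup(url, timestamp=None):
    """Query the API over the network and return the closest capture URL."""
    with urlopen(build_query(url, timestamp)) as resp:
        return closest_snapshot(json.load(resp))


# Offline illustration with a hand-written response (not real data).
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20100101000000/http://example.org/abstract/1",
            "timestamp": "20100101000000",
            "status": "200",
        }
    }
}
print(closest_snapshot(sample))
print(closest_snapshot({"archived_snapshots": {}}))
```

Calling `lookup` on each retracted abstract's landing page, with a timestamp from before the retraction, would give a rough count of how many are recoverable. Whether the captured page still contains the abstract text would need a separate check.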

Jian Wu

2018-11-10: Scientific news and reports should cite original papers

I highly encourage all scientific news and reports to cite the corresponding articles. ScienceAlert usually does a good job of this. This piece of scientific news from ScienceAlert reports the discovery of two rogue planets. Most planets we have discovered orbit a star. A rogue planet does not orbit a star, but instead orbits the center of the galaxy. Because planets do not emit light, rogue planets are extremely hard to detect. This piece of news cites a recently published paper on arXiv. Although anybody can publish papers on arXiv, papers published by reputable organizations should be reliable.

A reliable citation is beneficial for all parties. It makes the scientific news more trustworthy. It gives credit to the original authors. It can also connect readers to a place to explore other interesting science.

Jian Wu

2018-11-09: Grok Pattern

Grok is a way to match a text line against a regular expression, map specific parts of the line into dedicated fields, and perform actions based on this mapping. Grok patterns are (usually long) regular expressions that are widely used in log parsing. With tons of search engine logs, how to parse them effectively and extract useful metadata for analytics, training, and prediction has become a key problem in mining big text data.
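To make the idea concrete, here is a minimal sketch of how a Grok-style pattern can be expanded into an ordinary regular expression with named groups. It is not how Logstash implements Grok. The four-entry pattern library and the helper names `grok_to_regex` and `grok_match` are assumptions made for illustration, and the regexes are deliberately simplified.

```python
import re

# A tiny, hypothetical pattern library. Real Grok ships far more patterns,
# and its definitions are stricter than these simplified ones.
GROK_PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "URIPATH": r"/\S*",
}

# Matches tokens of the form %{PATTERN_NAME:field_name}.
GROK_TOKEN = re.compile(r"%\{(\w+):(\w+)\}")


def grok_to_regex(grok_pattern):
    """Expand each %{NAME:field} token into a named regex group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<{}>{})".format(field, GROK_PATTERNS[name])
    return re.compile(GROK_TOKEN.sub(expand, grok_pattern))


def grok_match(grok_pattern, line):
    """Return a dict of extracted fields, or None if the line does not match."""
    m = grok_to_regex(grok_pattern).match(line)
    return m.groupdict() if m else None


fields = grok_match(
    "%{IP:client} %{WORD:method} %{URIPATH:request} %{NUMBER:bytes} %{NUMBER:duration}",
    "55.3.244.1 GET /index.html 15824 0.043",
)
print(fields)
```

The returned dictionary maps each field name (`client`, `method`, and so on) to the matched substring, which is the "map specific parts of the line into dedicated fields" step described above. All values come back as strings, so any type conversion would be a separate step.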

In this article, Ran Ramati gives a beginner's guide to Grok patterns as used in Logstash, one of the powerful tools in the Elastic Stack (the other two are Kibana and Elasticsearch).

The StreamSets webpage gives a list of Grok pattern examples:

A recent paper by a Huawei research lab in China summarizes and compares a number of log parsing tools:

I am kind of surprised that …

2018-11-08: Decentralized Web Summit: Shaping the Next Web

In my wallet I have a few ₹500 Indian currency notes that say, "I PROMISE TO PAY THE BEARER THE SUM OF FIVE HUNDRED RUPEES" followed by the signature of the Governor of the Reserve Bank of India. However, this promise was broken two years ago today, and since then these bills in my pocket have been nothing more than rectangular pieces of printed paper. So, I decided to use my origami skills and turn them into butterflies.

On November 8, 2016, at 8:00 PM (Indian Standard Time), Indian Prime Minister Narendra Modi announced the demonetization (effective four hours later, at midnight) of the two biggest currency notes (₹1,000 and ₹500) in circulation at that time. Together these two notes represented about 86% of the total cash economy of India at that time. More than 65% of the Indian population still lives in rural and remote areas where the availability of electricity, the Internet, and other utilities is not yet reliable. Hence, cash is a very common means of business in daily l…