Monday, December 3, 2018

2018-12-03: Using Wikipedia to build a corpus, classify text, and more

Wikipedia is an online encyclopedia, available in 301 different languages, and constantly updated by volunteers. Wikipedia is not only an encyclopedia, but it also has been used as an ontology to build a corpus, classify entities, cluster documents, create an annotation, recommend documents to a user, etc. Below, I review some of the significant publications in these areas.
Using Wikipedia as a corpus:
Wikipedia has been used to create corpora that can be used for text classification or annotation. In “Named entity corpus construction using Wikipedia and DBpedia ontology” (LREC 2014), YoungGyum Hahm et al. created a method to use Wikipedia, DBpedia, and SPARQL queries to generate a named entity corpus. The method used in this paper can be accomplished in any language.
Fabian Suchanek used Wikipedia, WordNet, and Geonames to create an ontology called YAGO, which contains over 1.7 million entities and 15 million facts. The paper “YAGO: A large ontology from Wikipedia and Wordnet” (Web Semantics 2008), describes how this dataset was created.
Using Wikipedia to classify entities:
In the paper, Entity extraction, linking, classification, and tagging for social media: a Wikipedia-based approach” (VLDB Endowment 2013), Abhishek Gattani et al. created a method that accepts text from social media, such as Twitter, and then extracts important entities, matches the entity to Wikipedia links, filters, classifies the text, and then creates tags for the text. The data used is called a knowledge base (KB). Wikipedia was used as a KB and its graph structure is converted into a taxonomy. For example, if we have the following tweet “Obama just gave a speech in Hawaii”, then the entity extraction selects the two tokens “Obama” and “Hawaii”. Then the resulting tokens are paired with a Wikipedia link (Obama, and (Hawaii, This step is called entity linking. Finally, the classification and tagging of the tweet are set to “US politics, President Obama, travel, Hawaii, vacation”, which is referred to social tagging. The actual process to go from tweet to tag takes ten steps. The overall architecture is shown in Figure 1.
  1. Preprocess: detect the language (English), and select nouns and noun phrases
  2. Extract pair of (string, Wiki link): using the text in the tweet, the text is matched to Wikipedia links and is paired, where the pair of (string, Wikipedia) is called a mention
  3. Filter and score mentions: remove certain pairs and score the rest
  4. Classify and tag tweet: use mentions to classify and tag the tweet
  5. Extract mention features
  6. Filter mentions
  7. Disambiguate: select between topics, e.g. is apple categorized to a fruit or a technology?
  8. Score mentions
  9. Classify and tag tweet: use mentions to classify and tag the tweet
  10. Apply editorial rules
This dataset used in this paper was described in “Building, maintaining, and using knowledge bases: a report from the trenches” (SIGMOD 2013) by Omkar Deshpande et al. In addition to using Wikipedia, the Web and social context were used for the process of tagging the tweet more correctly. After collecting tweets, they gather web context for tweets, which is getting the link included in the tweet if exists and extracting its content, title, and other information. Then entity extraction is performed, followed by link, classify, and tag. Next, the tweet with the tag is used to create a social context of the user, hashtag, and web domains. This information is saved and used for new tweets that need to be tagged. They also used the web and social context for each node in the KB, and this is saved for future usage.
Abhik Jana et al. added Wikipedia links on the keywords in scientific abstracts in WikiM: Metapaths Based Wikification of Scientific Abstracts” (JCDL 2017). This method helped the reader determine if they are interested in reading the full article. They first step was to detect important keywords in the abstract, which they call mentions, using tf-idf. Then a list of candidate Wikipedia links, which they call candidate entries, were selected for each mention. The candidate entries are ranked based on similarity. Finally, a single candidate entry with the highest similarity score is selected for each mention.
Using Wikipedia to cluster documents:
Xiaohuo Hu et al. used Wikipedia in clustering documents in “Exploiting Wikipedia as External Knowledge for Document Clustering” (KDD 2009). In this work, documents are enriched with Wikipedia concepts and category information. Both exact concept match and related concepts are included. Then similar documents are combined based on document content, content from Wikipedia is added, and category information is added. This method was used on three datasets: TDT2, LA Times, and 20-newsgroups. Different methods were used to cluster the documents:
  1. Cluster-based on word vector
  2. Cluster-based on concept vector
  3. Cluster-based on category vector
  4. Cluster-based on the combination of word vector and concept vector
  5. Cluster-based on the combination of word vector and category vector
  6. Cluster-based on the combination of concept vector and category vector
  7. Cluster-based on the combination of word vector, concept vector, and category vector
They found that with all three datasets, clustering based on word and category vector (method #5) and clustering based on word, concept, and category vector (method #7) always had the best results.
Using Wikipedia to annotate documents:
Wikipedia was used to annotate documents, such as in the paper “Wikipedia as an ontology for describing Documents” (ICWSM 2008) by Zareen Sab Sayed et al. Wikipedia text and links were used to identify topics related to some terms in a given document. In this work, three methods were tested using the article text, the article text and categories with spreading activation, and the article text and links with spreading activation. However, the accuracy of the work depends on some factors such as that a Wikipedia page might link to a non-relevant article, the presence of links between related concepts, and the extent of having a concept appear in Wikipedia.
Using Wikipedia to create recommendations:
Wiki-Rec uses Wikipedia to create semantically based recommendations. This technique is discussed in the paper “Wiki-rec: A semantic-based recommendation system using Wikipedia as an ontology” (ISDA 2010) by Ahmed Elgohary et al. They predicted terms common to a set of documents. In this work, the user reads a document and evaluates it. Then using Wikipedia, all the concepts in the document are annotated and stored. After that, the user's profile is updated based on the new information. By matching the user's profile with other user's profiles that contain similar interests, a list of recommended documents is presented to the user. The overall system model is shown in Figure 2.
Using Wikipedia to match ontologies:
Other work, such as “WikiMatch -Using Wikipedia for ontology match” (OM 2012) by Sven Hurtling and Heiko Paulheim, used Wikipedia information to determine if two ontologies are similar, even if they are in different languages. In this work, the Wikipedia search engine is used to get articles related to a term. Then for the articles, all language links are retrieved. Two concepts are compared by comparing the articles' titles. However, this approach is time-consuming because of querying Wikipedia.
In conclusion, Wikipedia is not only an information source, it has also been used as a corpus to classify entities, cluster documents, annotate documents, create recommendations, and match ontologies.
-Lulwah M. Alkwai

2018-12-03: Acidic Regression of WebSatchel

Mat Kelly reviews WebSatchel, a browser based personal preservation tool.                                                                                                                                                                                                                                                                                                                                                                            ⓖⓞⓖⓐⓣⓞⓡⓢ

Shawn Jones (@shawnmjones) recently made me aware of a personal tool to save copies of a Web page using a browser extension called "WebSatchel". The service is somewhat akin to the offerings of browser-based tools like Pocket (now bundled with Firefox after a 2017 acquisition) among many other tools. Many of these types of tools use a browser extension that allows the user to send a URI to a service that creates a server-side snapshot of the page. This URI delegation procedure aligns with Internet Archive's "Save Page Now", which we have discussed numerous times on this blog. In comparison, our own tool, WARCreate, saves "by-value".

With my interest in any sort of personal archiving tool, I downloaded the WebSatchel Chrome extension, created a free account, signed in, and tried to save the test page from the Archival Acid Test (which we created in 2014). My intention in doing this was to evaluate the preservation capabilities of the tool-behind-the-tool, i.e., that which is invoked when I click "Save Page" in WebSatchel. I was shown this interface:

Note the thumbnail of the screenshot captured. The red square in the 2014 iteration of the Archival Acid Test (retained at the same URI-R for posterity) is indicative of a user interacting with the page for the content to load and thus be accessible for preservation. With respect to only evaluating the tool's capture ability, the red in the thumbnail may not be indicative of the capture. A repeat of this procedure to ensure that I "surfaced" the red square on the live web (i.e., interacted with the page before telling WebSatchel to grab it) resulted in a thumbnail where all squares were blue. As expected, this may be indicative that WebSatchel is using the browser's screenshot extension API at the time of URI submission rather than creating a screenshot of their own capture. The limitation of the screenshot to the viewport (rather than the whole page) also indicates this.


I then clicked the "Open Save Page" button and was greeted with a slightly different result. This captured resided at

curling that URI results in an inappropriately used HTTP 302 status code that appears to indicate a redirect to a login page.

$ curl -I
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 19:44:59 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5
Content-Type: text/html

Note the lack of scheme in the Location header. RFC2616 (HTTP/1.1) Section 14.30 requires the location to be an absolute URI (per RFC3896 Section 4.3). In an investigation to legitimize their hostname leading redirect pattern, I also checked the more current RFC7231 Section 7.1.2, which revises the value of Location response to be a URI reference in the spirit of RFC3986. This updated HTTP/1.1 RFC allows for relative references, as already done in practice prior to RFC7231. WebSatchel's Location pattern causes browsers to interpret their hostname as a relative redirect per the standards, causing a redirect to

$ curl -I
HTTP/1.1 302 302
Date: Mon, 03 Dec 2018 20:13:04 GMT
Server: Apache/2.4.34 (Unix) LibreSSL/2.6.5

...and repeated recursively until the browser reports "Too Many Redirects".

Interacting with the Capture

Despite the redirect issue, interacting with the capture retains the red square. In the case where all squares were blue on the live Web, the aforementioned square was red when viewing the capture. In addition to this, two of the "Advanced" tests (advanced relative to 2014 crawler capability, not particularly new to the Web at the time) were missing, representative of an iframe (without anything CORS-related behind the scenes) and an embedded HTML5 object (using the standard video element, nothing related to Custom Elements).

"Your" Captures

I hoped to also evaluate archival leakage (aka Zombies) but the service did not seem to provide a way for me to save my capture to my own system, i.e., your archives, remotely (and solely) hosted. In investigating a way to liberate my captures, I noticed that the default account is simply a trial of a service, which ends a month after creating the account and a relatively steep monthly pricing model. The "free" account is also listed as being limited to 1 GB/account, 3 pages/day and access removed to their "page marker" feature, WebSatchel's system for a sort-of text highlighting form of annotation.


WebSatchel has browser extensions for Firefox, Chrome, MS Edge, and Opera but the data liberation scheme leaves a bit to be desired, especially for personal preservation. As a quick final test, without holding my breadth for too long, I use my browser's DevTools to observe the HTTP response headers for the URI of my Acid Test capture. As above, attempting to access the capture via curl would require circumventing the infinite redirect and manually going through an authentication procedure. As expected, nothing resembling Memento-Datetime was present in the response headers.

—Mat (@machawk1)

Friday, November 30, 2018

2018-11-30: Archives Unleashed: Vancouver Datathon Trip Report

The Archives Unleashed Datathon #Vancouver was a two day event from November 1 to November 2, 2018 hosted by the Archives Unleashed team in collaboration with Simon Fraser University Library and Key, SFU's big data initiative. This was second in a series of Archives Unleashed datathons to be funded by The Andrew W. Mellon Foundation. This is the first time for me, Mohammed Nauman Siddique of the Web Science and Digital Libraries research group (WS-DL) at Old Dominion University to travel to the datathon at Vancouver.

 Day 1

The event kicked off with Ian Milligan welcoming all the participants at the Archives Unleashed Datathon #Vancouver. It was followed by welcome speech from Gwen Bird, University Librarian at SFU and Peter Chow-White, Director and Professor at GeNA lab. After the welcome, Ian talked about the Archives Unleashed Project, why we care about the web archives, purpose of organizing the datathons, and the roadmap for future datathons.
Ian's talk was followed by Nick Ruest walking us through the details of  the Archives Unleashed Toolkit and the Archives Unleashed Cloud. For more information about the Archives Unleashed Toolkit and the Archives Unleashed Cloud Services you can follow them on Twitter or check their website.
For the purpose of the datathon, Nick had already loaded all the datasets onto six virtual machines provided by Compute Canada. We were provided with twelve options for our datasets courtesy of  University of Victoria, University of British Columbia, Simon Fraser University, and British Columbia Institute of Technology. 
Next, the floor was open for us to decide our projects and form teams. We had to arrange our individual choices on the white board with information about the dataset we wanted to use in blue, tools we intended to use in pink, and research questions we cared about in yellow. The teams started to form quickly based on the datasets and purpose of the project. The first team, led by Umar Quasim, wanted to work on ubc-bc-wildfires dataset which was a collection of webpages related to wildfires in British Columbia. They wanted to understand and find relationships between the events and media articles related to wildfires. The second team, led by Brenda Reyes Ayala, wanted to work on improving the quality of archives pages using the uvic-anarchist-archives dataset. The third team, led by Matt Huculak, wanted to investigate on the politics of British Columbia using the uvic-bc-2017-candidates dataset. The fourth team, led by Kathleen Reed, wanted to work on ubc-first-nations-indigenous-communities to investigate about the history and its discourse in media about first nations indigenous communities.

I worked with Matt Huculak, Luis Menese, Emily Memura, and Shahira Khair on the British Columbia candidates dataset. Thanks to Nick, we had already been provided with the derivative files for our datasets which included list of all the captured domain names with their archival count, extracted text from all the WARC files with basic file metadata, and a Gephi file with network graph. It was the first time that the Archives Unleashed Team had provided the participating teams with derivative files, which saved us hours of wait time which would have been wasted in extracting all the information from the dataset WARC files. We continued to work on our projects through the day with a break for lunch. Ian moved around the room to check on all the teams, motivate us with his light humor, and providing us any help needed to get going on our projects.

Around 4 pm, the floor was open for Day 1 talk session. The talk started with Emily Memura (PhD student at University of Toronto) presenting her research on understanding the use and impact of web archives. Emily's talk was followed by Matt Huculak (Digital Scholarship Librarian at University of Victoria) who talked about the challenges faced by libraries in creating web collections using Archive-It. He emphasized on the use of regular expressions in Archive-It and problems it poses to non-technical librarians and web archivists. Nick Ruest presented Warclight and its framework, the latest service released by the Archives Unleashed Team which was followed by a working demo of the service. Last but not the least, I presented my research work on Congressional Deleted Tweets talking about why we care about the deleted tweets, difficulties involved in curating the dataset for Members of Congress, and results about the distribution of deleted tweets in multiple services which can be used to track deleted tweets.

We called it a day at 4:30 pm only to meet again for dinner at 5 pm at Irish Heather in Downtown Vancouver. At dinner Nick, Carl Cooper, Ian, and I had a long conversation ranging from politics to archiving to libraries. After dinner, we called it a day only to meet again fresh the next day.

Day 2

The morning of Day 2 at Vancouver greeted us with a clear view of mountains across the Vancouver harbor which called for a perfect start to our morning. We continued on our project with the occasional distraction of clicking pictures of the beautiful view that lay in front of us. We did some brainstorming on our network graph and bubble chart visualizations from Gephi to understand the relationship between all the URLs in our dataset. We also categorized all the captured URLs into political party URLs, social media URLs and rest. While reading the list of crawled domains present in the dataset, we discovered a bias towards a particular domain which made up approximately 510k mementos out of approximately 540k mementos. The standout domain we refer to was, which was owned by Brian Taylor who ran as an independent candidate. We set out to investigate  the reason behind that bias in our dataset, only to parse out and analyze the status codes from response headers of each WARC file. We realized that out of approximately 540k mementos only 10k mementos were of status code 200 OK and the rest were either 301s, 302s or 404s. Our investigation of all the URLs that showed up for led us to the conclusion that it was a calendar trap for crawlers.    

Most relevant topics word frequency count in BC Candidates dataset 

During lunch, we had three talks scheduled for Day 2. The first speaker was Umar Quasim from the University of Alberta who talked about the current status of web archiving in their university library and discussed some their future plans. The second presenter, Brenda Reyes Ayala, Assistant Professor at University of Alberta, talked about measuring archival damage and the metrics to evaluate them which had been discussed in her PhD dissertation. Lastly, Samantha Fritz talked about the future of the Archives Unleashed toolkit and cloud service. She mentioned in her talk that starting from 2019 computations using the Archives Unleashed toolkit will be a paid service.

Team BC 2017 Politics

We were first to start our presentation on the BC Candidates dataset. with a talk about the dataset we had at our disposal and different visualizations we had used to understand our dataset. We talked about relationships between different URLs and their connections. We also highlighted the issue of and the crawler trap issue. Our dataset comprised of 510k mementos of 540k mementos from a single domain The reason for the large memento count from a single domain was due to a calendar crawler trap which was evident on analyzing all the URLs which had been crawled for this domain. Of the 510k mementos crawled, only six of them were 302s and seven of them were 200s, while the rest of the URLs returned a status code of 404. In a nutshell, we had a meager seven mementos with useful information from approximately 510k mementos crawled for this domain. We highlighted the fact that the dataset with approximately 540k mementos had only approximately 10k mementos with relevant information. Based on our brainstorming over the last two days, we summarized lessons learned and an advice for future historians who are curating seeds for creating collections on Archive-It.

Team IDG

Team IDG started off by talking about difficulties faced in settling for their final dataset by waling us through the different datasets they tried before settling for the final dataset (ubc-hydro-cite-c) used in their project. They presented visualization on top keywords based on frequency count and relationship between different keywords. They also highlighted the issue of extracting text from the tables and talked about their solution. They walked us through all the steps involved in plotting their events on a map. It started it using the table of processed text to create a geo-coding for their dataset and plot in onto a map showing the occurrences of the events. They also showed a timeline of how the events evolved over time by plotting it onto the map.

Team Wildfyre

Team Wildfyre opened their talk with description of their dataset and other datasets they used in their project. They talked about research questions and tools used in their project. They presented multiple visualizations showing top keywords, top named entity and geo-coded map of the events. They also had a heat map for distribution of datasets based on the domain names available in their dataset. They pointed out that even when analyzing named entities in the wild fire dataset, the most talked about entity during these events was Justin Trudeau.

Team Anarchy


Team Anarchy had split their project into two smaller projects. The first project undertaken by Ryan Deschamps was about finding linkages between all the URLs in the dataset. He presented a concentric circles graph talking about the linkage between pages from depth 0 to 5. They found that starting from the base URL to depth level 5 took them to a spam or a government website in most cases. He also talked about the challenges faced in extracting images from the WARC files and comparing them their live version counterparts. The second project undertaken by Brenda was about capturing archived pages and measuring the degree of difference from the live version of these pages. She showed multiple examples with varying degree of difference between the archived and their live pages.

Once the presentations were done, Ian asked us all to write out our votes and the winner would be decided based on popular vote. Congratulations to Team IDG for winning the Archives Unleashed Datathon #Vancouver. For closing comments Nick talked about what to take away from these events and how to build a better web archiving research community. After all the suspense, the next edition of Archives Unleashed Datathon was announced.

More information about Archives Unleashed Datathon #WashingtonDC can be found using their website or following the Archives Unleashed team on Twitter.

This was my first time at a Archives Unleashed Datathon. I went with the idea of meeting researchers, librarians, and historians all under one roof who propel the web archiving research domain. The organizers try to strike a perfect balance by inviting different research communities with the web archiving community with diverse background, and experience. It was an eye-opening trip for me, where I learned from my fellow participants about their work, how libraries build collections for web archives and the difficulties and challenges faced by them. Thanks, to Carl Cooper, Graduate Trainee at Bodlian Libraries- Oxford University for strolling down Vancouver Downtown with me. I am really excited and look forward to attending the next edition of Archives Unleashed Datathon at Washington DC.

View of Downtown Vancouver
Thanks again to the organizers (Ian Milligan, Rebecca Dowson, Nick Ruest, Jimmy Lin, and Samantha Fritz), their partners and SFU library for hosting us. Looking forward to see you all at future Archive Unleashed datathons.

Mohammed Nauman Siddique

2018-11-30: The Illusion of Multitasking Boosts Performance

Illustration showing hands completing many tasks at once
Today, I read the article on The title is "The Illusion of Multitasking Boosts Performance". At first, I thought it argues for single-task at once, but after reading it, I found that it is not. It actually supports multi-tasking, but in the sense that the worker "believes" the work he is working on is a combination of multi-tasks.

The original paper published in Psychological Science has a title "The Illusion of Multitasking and Its Positive Effect on Performance". 

In my opinion, the original article's title is accurate, but the press release reveals part of the story and actually distorted the original meaning of the article. The reader actually got an illusion that multi-tasking is producing a negative effect.

Jian Wu

Thursday, November 15, 2018

2018-11-15: LANL Internship Report

Los Alamos National Laboratory
On May 27 I landed in sunny Sante Fe, New Mexico to start my 6 month internship at Los Alamos National Laboratory (LANL) for the Digital Library Research and Prototyping Team under the guidance of Herbert Van de Sompel and WSDL alumnus Martin Klein.

Work Accomplished

A majority of my time was used to work on the Scholarly Orphans project, which is a joint project between LANL and ODU, sponsored by the Andrew Mellon Foundation. This project explores from an institution perspective how it can discover, capture, and archive scholarly artifacts that an institution's researcher deposits in various productivity portals. After months of working on the project, Martin Klein showcased the Scholarly Orphans pipeline at TPDL 2018.

Scholarly Orphans pipeline diagram

My main task for this pipeline was to create and manage two components: the artifact tracker and pipeline orchestrator. Communication between different components was completed using ActivityStream2 (AS2) messages and Linked Data Notification (LDN) inboxes for sending and receiving messages. AS2 messages describe events users have accomplished providing a "human friendly but machine-processable" JSON format. LDN inboxes provide endpoints for messages to be received, advertising these endpoints via link headers. Applications (senders) can discover these endpoints and send messages to these endpoints (receivers). In this case each component was a sender and a receiver. For example, the orchestrator sent an AS2 message to the tracker component's inbox to start a process to track a user for a list of portals, the tracker responds and sends an AS2 message with results to the orchestrator's inbox which is then saved in a database.

This pipeline was designed to be a distributed network, where the orchestrator knows where each component inbox is before sending messages. The tracker, capture, and archiver components are told by the orchestrator where to send their AS2 messages and also where their generated AS2 event message will be accessible. An example of an AS2 message from the orchestrator component to the tracker component shows an event object with an endpoint "to" telling the tracker where to send the message and a "tracker:eventBaseUrl" to append a uuid for where the event generated by the tracker will be accessible. After the tracker has found events for the user it will generate a new AS2 message and send it to the orchestrator "to" endpoint.

Building the tracker and orchestrator components allowed me to learn a great deal about W3C Web Standards mostly dealing with the Semantic Web. I was required to learn about various programmatic technologies during my work which included: Elasticsearch as a database, Celery task scheduling, using Docker-Compose in a production environment, Flask and uWSGI as a python web server, and working with OAI-PMH interfaces.

I was also exposed to the various technologies the Prototyping Team had developed previously and included these technologies in various components of the Scholarly Orphans pipeline. These included: Memento, Memento Tracer, Robust Links, and Signposting.

The prototype interface of the Scholarly Orphans project is hosted at for a limited time. On the website you can see the various steps of the pipeline, the AS2 event messages, the WARCs generated from the capture process, and the replay of the WARCs via the archiver process for each of the researcher's productivity portal events. The tracker component of the Scholarly Orphans pipeline was made available via Github found here:

New Mexico Lifestyle


Over the course of my stay I stayed in a house located in Los Alamos shared by multiple Ph.D. students studying in diverse fields such as Computer Vision, Nuclear Engineering, Computer Science, and Biology. The views of the mountains were always amazing and only ever accompanied by rain during the monsoon season. A surprising discovery during the summer was that there always seemed to be a forest fire somewhere in New Mexico. 
Los Alamos, NM


During my stay and adventures I found out the level of spiciness that apparently every New Mexican had become accustomed to by adding the local Green Chile to practically any and/or every meal. 


Within the first two weeks of landing I had already planned a trip to Southern NM. Visiting Roswell, NM I discovered aliens were very real.
Roswell, NM International UFO Museum
Going further south I got to visit Carlsbad, NM the home of the Carlsbad Caverns which were truly incredible.
Carlsbad, NM Carlsbad Caverns
I was able to visit Colorado for a few days and went on a few great hikes. On August 11, I got to catch the Rockies vs. Dodgers MLB game where I got to see for the first time a walk-off home run by the Rockies

I also managed a weekend road trip to Zion Canyon, Utah allowing me to hike some great trails like Observation Point Trail, The Narrows, and Emerald Pools.
Zion Canyon, Utah - Observation Point Trail


If you're a visiting researcher not hired by the lab consider living in a shared home with multiple other students. This can help alleviate you of boredom and also help you to find people to plan trips with. Otherwise you will usually be excluded from the events planned by the lab for other students.

If you're staying in Los Alamos, plan to make weekend trips out to Santa Fe. Los Alamos is beautiful and has some great hikes, but can be short on entertainment frequently.

Final Impressions

I feel very blessed to have been offered this 6 month internship. At first I was reluctant to move out to the West, however it allowed me to travel to many great locations with new friends. My internship has allowed me to be exposed to various subjects relating to WS-DL research which will surely improve, expand, and influence my own research in the future.

A special thanks to Herbert Van de Sompel, Martin Klein, Harihar Shankar, and Lyudmila Balakireva for allowing me to collaborate, contribute, and learn from this fantastic team during my stay at LANL.

--Grant Atkins (@grantcatkins)

Monday, November 12, 2018

2018-11-12: Google Scholar May Need To Look Into Its Citation Rate

Google Scholar has long been regarded as a digital library containing the most complete collection of scholarly papers and patterns. For a digital library, completeness is very important because otherwise, you cannot guarantee the citation rate of a paper, or equivalently the in-link of a node in the citation graph. That is probably why Google Scholar is still more widely used and trusted than any other digital libraries with fancy functions.

Today, I found two very interesting aspects of Google Scholar, one is clever and one is silly. The clever side is that Google Scholar distinguishes papers, preprints, and slides and count citations of them separately.

If you search "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", you may see the same view as I attached. Note that there are three results. The first is a paper on IEEE. The second actually contains a list of completely different authors. These people are probably doing a presentation of that paper. The third is actually a pre-print on arXiv. These three have different numbers of citations, which they should do.

The silly side is also reflected in the search result. How does a paper published in less than a year receive more than 1900 citations? You may say that may be a super popular paper. But if you look into the citations. Some do not make sense. For example, the first paper that "cites" the DeepLab paper was published in 2015! How could it cite a paper published in 2018?

Actually, the first paper's citation rate is also problematic. A paper published in 2015 was cited more than 6500 times! And another paper published in 2014 was cited more than 16660 times!

Something must be wrong about Google Scholar! The good news that the number looks higher, which makes everyone happy! :)

Jian Wu

Saturday, November 10, 2018

2018-11-11: More than 7000 retracted abstracts from IEEE. Can we find them from IA?

Science magazine:

More than 7000 abstracts are quietly retracted from the IEEE database. Most of these abstracts are from IEEE conferences that took place between 2009 and 2011.  The plot below clearly shows when the retraction happened. The reason was weird: 
"After careful and considered review of the content of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE’s Publication Principles. "
Similar things happened in Nature subsidiary journal (link) and other journals (link).

The question is can we find them from internet archive? Can they still be legally posted on a digital library like CiteSeerX? If they do, they can provide a very unique training dataset to be used for fraud and/or plagiarism detection, assuming that the reason under the hood is one of them. 

Jian Wu