Thursday, November 15, 2018

2018-11-15: LANL Internship Report

Los Alamos National Laboratory
On May 27 I landed in sunny Santa Fe, New Mexico, to start my six-month internship at Los Alamos National Laboratory (LANL) with the Digital Library Research and Prototyping Team under the guidance of Herbert Van de Sompel and WSDL alumnus Martin Klein.

Work Accomplished

Most of my time went to the Scholarly Orphans project, a joint project between LANL and ODU sponsored by the Andrew W. Mellon Foundation. The project explores, from an institutional perspective, how an institution can discover, capture, and archive the scholarly artifacts that its researchers deposit in various productivity portals. After months of work on the project, Martin Klein showcased the Scholarly Orphans pipeline at TPDL 2018.

Scholarly Orphans pipeline diagram

My main task for this pipeline was to create and manage two components: the artifact tracker and the pipeline orchestrator. Components communicated using ActivityStreams 2.0 (AS2) messages, which were sent to and received at Linked Data Notifications (LDN) inboxes. AS2 messages describe events that users have performed, in a "human friendly but machine-processable" JSON format. LDN inboxes are endpoints where messages are received, and they are advertised via Link headers. Applications (senders) can discover these endpoints and send messages to them (receivers). In this pipeline each component was both a sender and a receiver. For example, the orchestrator sent an AS2 message to the tracker component's inbox to start tracking a user across a list of portals; the tracker then sent an AS2 message with its results to the orchestrator's inbox, where it was saved in a database.
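The discovery and delivery steps described above can be sketched with the Python standard library. This is only an illustration, not the pipeline's code: the inbox URL and the message body are made up, the Link header parsing is deliberately simplified, and the request is built but never sent. The link relation and the JSON-LD content type are the ones LDN uses.

```python
import json
import re
import urllib.request

# Link relation that LDN uses to advertise an inbox.
LDN_INBOX_REL = "http://www.w3.org/ns/ldp#inbox"


def find_inbox(link_header):
    """Return the inbox URL advertised in a Link header value, or None."""
    # Simplified parsing: each link looks like <URL>; rel="relation"
    for url, params in re.findall(r'<([^>]*)>([^,]*)', link_header):
        match = re.search(r'rel="?([^";]+)"?', params)
        if match and LDN_INBOX_REL in match.group(1).split():
            return url
    return None


def build_notification(inbox_url, message):
    """Build (but do not send) the POST request that delivers a message."""
    return urllib.request.Request(
        inbox_url,
        data=json.dumps(message).encode("utf-8"),
        headers={"Content-Type": "application/ld+json"},
        method="POST",
    )


# Hypothetical Link header, as a tracker component might return it.
header = '<https://tracker.example.org/inbox/>; rel="http://www.w3.org/ns/ldp#inbox"'
inbox = find_inbox(header)
request = build_notification(inbox, {"type": "Activity"})
print(inbox, request.get_method())
```

Passing the request to urllib.request.urlopen() would actually deliver the message to a real receiver.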

This pipeline was designed to be a distributed network, where the orchestrator knows the location of each component's inbox before sending messages. The orchestrator tells the tracker, capture, and archiver components where to send their AS2 messages and where the AS2 event messages they generate will be accessible. For example, an AS2 message from the orchestrator to the tracker contains an event object with a "to" endpoint, which tells the tracker where to send its message, and a "tracker:eventBaseUrl", to which the tracker appends a UUID to form the URL where its generated event will be accessible. After the tracker has found events for the user, it generates a new AS2 message and sends it to the orchestrator's "to" endpoint.
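A rough sketch of that exchange, written as plain Python dictionaries, might look like the following. Only the "to" and "tracker:eventBaseUrl" keys come from the description above. Every other key, the "Activity" type, the namespace, and the example URLs are assumptions made for illustration, not the project's actual message format.

```python
import uuid

AS2_CONTEXT = "https://www.w3.org/ns/activitystreams"


def build_tracker_request(orchestrator_inbox, event_base_url, user, portals):
    """Message the orchestrator sends to the tracker's inbox (illustrative layout)."""
    return {
        # The "tracker" namespace URL is hypothetical.
        "@context": [AS2_CONTEXT, {"tracker": "https://example.org/ns/tracker#"}],
        "type": "Activity",
        "actor": user,
        "event": {
            "to": orchestrator_inbox,                # where the tracker should reply
            "tracker:eventBaseUrl": event_base_url,  # a UUID is appended to this
            "portals": portals,                      # hypothetical key
        },
    }


def build_tracker_response(request_message, results):
    """Message the tracker sends back once it has found events for the user."""
    event = request_message["event"]
    return {
        "@context": AS2_CONTEXT,
        "id": event["tracker:eventBaseUrl"] + str(uuid.uuid4()),
        "type": "Activity",
        "to": event["to"],
        "items": results,  # hypothetical key for the tracked events
    }


request_message = build_tracker_request(
    "https://orchestrator.example.org/inbox/",
    "https://tracker.example.org/event/",
    "https://example.org/researcher/1",
    ["github", "slideshare"],
)
response_message = build_tracker_response(request_message, [{"portal": "github"}])
print(response_message["id"], response_message["to"])
```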

Building the tracker and orchestrator components taught me a great deal about W3C Web standards, mostly those dealing with the Semantic Web. I also had to learn a range of technologies during my work, including Elasticsearch as a database, Celery for task scheduling, Docker Compose in a production environment, Flask and uWSGI as a Python web server, and OAI-PMH interfaces.

I was also exposed to the various technologies the Prototyping Team had developed previously and included these technologies in various components of the Scholarly Orphans pipeline. These included: Memento, Memento Tracer, Robust Links, and Signposting.

The prototype interface of the Scholarly Orphans project is hosted online for a limited time. On the website you can see the various steps of the pipeline, the AS2 event messages, the WARCs generated by the capture process, and the replay of the WARCs via the archiver process for each of the researcher's productivity portal events. The tracker component of the Scholarly Orphans pipeline was made available on GitHub.

New Mexico Lifestyle


During my internship I lived in a house in Los Alamos shared by several Ph.D. students studying in diverse fields such as Computer Vision, Nuclear Engineering, Computer Science, and Biology. The views of the mountains were always amazing and only ever accompanied by rain during the monsoon season. A surprising discovery during the summer was that there always seemed to be a forest fire somewhere in New Mexico.
Los Alamos, NM


During my stay and adventures I discovered the level of spiciness that apparently every New Mexican has become accustomed to by adding the local Green Chile to practically every meal.


Within the first two weeks of landing I had already planned a trip to southern NM. Visiting Roswell, NM, I discovered aliens were very real.
Roswell, NM International UFO Museum
Going further south I got to visit Carlsbad, NM, the home of the Carlsbad Caverns, which were truly incredible.
Carlsbad, NM Carlsbad Caverns
I was able to visit Colorado for a few days and went on a few great hikes. On August 11, I got to catch the Rockies vs. Dodgers MLB game, where I saw a walk-off home run for the first time, hit by the Rockies.

I also managed a weekend road trip to Zion Canyon, Utah, where I hiked some great trails like Observation Point Trail, The Narrows, and Emerald Pools.
Zion Canyon, Utah - Observation Point Trail


If you're a visiting researcher not hired by the lab, consider living in a shared home with other students. This can help relieve boredom and help you find people to plan trips with. Otherwise you will usually be excluded from the events the lab plans for other students.

If you're staying in Los Alamos, plan to make weekend trips out to Santa Fe. Los Alamos is beautiful and has some great hikes, but it is often short on entertainment.

Final Impressions

I feel very blessed to have been offered this six-month internship. At first I was reluctant to move out West, but it allowed me to travel to many great locations with new friends. My internship exposed me to various subjects relating to WS-DL research, which will surely improve, expand, and influence my own research in the future.

A special thanks to Herbert Van de Sompel, Martin Klein, Harihar Shankar, and Lyudmila Balakireva for allowing me to collaborate, contribute, and learn from this fantastic team during my stay at LANL.

--Grant Atkins (@grantcatkins)

Monday, November 12, 2018

2018-11-12: Google Scholar May Need To Look Into Its Citation Rate

Google Scholar has long been regarded as the digital library containing the most complete collection of scholarly papers and patents. For a digital library, completeness is very important because otherwise you cannot guarantee the citation count of a paper, or equivalently the number of in-links of a node in the citation graph. That is probably why Google Scholar is still more widely used and trusted than any other digital library with fancy functions.

Today, I found two very interesting aspects of Google Scholar: one is clever and one is silly. The clever side is that Google Scholar distinguishes papers, preprints, and slides and counts their citations separately.

If you search "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", you may see the same view as I attached. Note that there are three results. The first is a paper on IEEE. The second actually has a completely different list of authors, who are probably giving a presentation of that paper. The third is a preprint on arXiv. These three have different numbers of citations, as they should.

The silly side is also reflected in the search result. How does a paper published less than a year ago receive more than 1900 citations? You may say that it may be a super popular paper. But if you look into the citations, some do not make sense. For example, the first paper that "cites" the DeepLab paper was published in 2015! How could it cite a paper published in 2018?

Actually, the first paper's citation rate is also problematic. A paper published in 2015 was cited more than 6500 times! And another paper published in 2014 was cited more than 16660 times!

Something must be wrong with Google Scholar! The good news is that the numbers look higher, which makes everyone happy! :)

Jian Wu

Saturday, November 10, 2018

2018-11-11: More than 7000 retracted abstracts from IEEE. Can we find them from IA?

Science magazine:

More than 7000 abstracts were quietly retracted from the IEEE database. Most of these abstracts are from IEEE conferences that took place between 2009 and 2011. The plot below clearly shows when the retractions happened. The reason given was weird:
"After careful and considered review of the content of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE’s Publication Principles. "
Similar things happened in a Nature subsidiary journal (link) and other journals (link).

The question is whether we can find them in the Internet Archive. Can they still be legally posted on a digital library like CiteSeerX? If so, they could provide a unique training dataset for fraud and/or plagiarism detection, assuming that the reason under the hood is one of these.

Jian Wu

2018-11-10: Scientific news and reports should cite original papers


I strongly encourage all scientific news and reports to cite the corresponding articles. ScienceAlert usually does a good job on this. This piece of scientific news from ScienceAlert reports the discovery of two rogue planets. Most planets we have discovered orbit a star. A rogue planet does not orbit a star but the center of the Galaxy. Because planets do not emit light, rogue planets are extremely hard to detect. This piece of news cites a recently published paper on arXiv. Although anybody can publish papers on arXiv, papers published by reputable organizations should be reliable.

A reliable citation is beneficial for all parties. It makes the scientific news more trustworthy. It gives credit to the original authors. It can also connect readers to a place to explore other interesting science.

Jian Wu

Friday, November 9, 2018

2018-11-09: Grok Pattern

Grok is a way to match a text line against a regular expression, map specific parts of the line into dedicated fields, and perform actions based on this mapping. Grok patterns are (usually long) regular expressions that are widely used in log parsing. With tons of search engine logs available, parsing them effectively and extracting useful metadata for analytics, training, and prediction has become a key problem in mining big text data.
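To make the idea concrete, here is a small Python sketch that imitates Grok's %{SYNTAX:SEMANTIC} notation with named regular expression groups. It is not how Logstash is implemented, and the four pattern definitions are simplified stand-ins I wrote for this example rather than the definitions Logstash ships with.

```python
import re

# Simplified stand-ins for a few named patterns (not Logstash's own definitions).
PATTERNS = {
    "IP": r"\d{1,3}(?:\.\d{1,3}){3}",
    "WORD": r"\w+",
    "NUMBER": r"\d+(?:\.\d+)?",
    "URIPATH": r"/\S*",
}


def grok_to_regex(expression):
    """Expand every %{NAME:field} reference into a named regex group."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return "(?P<%s>%s)" % (field, PATTERNS[name])
    return re.compile(re.sub(r"%\{(\w+):(\w+)\}", expand, expression))


def parse_line(expression, line):
    """Return a dict mapping field names to matched text, or None if no match."""
    match = grok_to_regex(expression).match(line)
    return match.groupdict() if match else None


fields = parse_line(
    "%{IP:client} %{WORD:method} %{URIPATH:request} %{NUMBER:bytes}",
    "55.3.244.1 GET /index.html 15824",
)
print(fields)
```

Each log line becomes a dictionary of named fields, which is the mapping step that makes later analytics possible.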

In this article, Ran Ramati gives a beginner's guide to the Grok patterns used in Logstash, one of the powerful tools in the Elastic Stack (the other two are Kibana and Elasticsearch).

The StreamSets webpage gives a list of Grok pattern examples:

A recent paper by a Huawei research lab in China summarizes and compares a number of log parsing tools:

I am kind of surprised that although they cited the Logstash website, they did not compare Logstash with its peers.

Jian Wu

Thursday, November 8, 2018

2018-11-08: Decentralized Web Summit: Shaping the Next Web

In my wallet I have a few ₹500 Indian currency notes that say, "I PROMISE TO PAY THE BEARER THE SUM OF FIVE HUNDRED RUPEES" followed by the signature of the Governor of the Reserve Bank of India. However, this promise was broken two years ago today, and since then these bills in my pocket have been nothing more than rectangular pieces of printed paper. So, I decided to utilize my origami skills and turn them into butterflies.

On November 8, 2016, at 8:00 PM (Indian Standard Time), Indian Prime Minister Narendra Modi announced the demonetization (effective four hours later, at midnight) of the two biggest currency notes (₹1,000 and ₹500) in circulation at that time. Together these two notes represented about 86% of the total cash economy of India at that time. More than 65% of the Indian population still lives in rural and remote areas where availability of electricity, the Internet, and other utilities is not reliable yet. Hence, cash is a very common means of business in daily life there. It was morning here in Norfolk (USA) and I was going through the news headlines when I saw this announcement. For a while I could not believe that the news was real and not a hoax. I did not even know that there is a concept called demonetization that governments can practice. Irrespective of my political views and irrespective of the intents and goals behind the decision (whatever good or bad they might have been), I was shocked to realize that the system has so much centralization of power in place that a single person can decide the suffering of about 18% of the global population overnight and cause chaos in the system. I wished for a better and more resilient system. I wanted a system with decentralization of power by design, where no one entity has a significant share of power and influence. I wanted a DECENTRALIZED SYSTEM!

When the Internet Archive (IA) announced plans for the Decentralized Web (DWeb) Summit, I was on board to explore what we can do to eliminate centralization of control and power in systems on the Web. With generous support from Protocol Labs, AMF, and NSF IIS-1526700 grants I was able to travel to the West Coast to experience four days full of fun and many exciting events. I got the opportunity to meet many big names who brought us the Web we experience today and many of those who are working towards shaping the future of the Web with their vision, ideas, experience, code, art, legal understanding, education, or social values. They each had a different perspective to share, but all seemed to agree on one goal: fixing the current Web, where freedom of expression is under an ever-growing threat, governments control the voice of dissent, big corporations use the personal data of Internet users for monetary benefit and political influence, and those in power try to suppress the history they might be uncomfortable with.

There was so much going on in parallel that perhaps no two people experienced the same sequence of events. I am not even pretending to tell everything I observed there. In this post I will briefly describe my experience of the following four related events that happened between July 31 and August 3, 2018.

  • IndieWebCamp SF
  • Science Fair
  • Decentralized Web Summit
  • IPFS Lab Day

IndieWebCamp SF

The IndieWeb is a people-focused alternative to the "corporate web". Their objectives include: 1) Your content is yours, 2) You are better connected, and 3) You are in control. Some IndieWeb people at Mozilla decided to host IndieWebCamp SF, a bootcamp the day before #DWebSummit started, and shared an open invitation with all participants. I was quick to RSVP, as this was going to be my first interaction with the IndieWeb.

On my way from the hotel to Mozilla's SF office the Uber driver asked me why I came to SF. I replied to her, "to participate in an effort to decentralize the Web". She seemed puzzled and said, "my son was mentioning something about it, but I don't know much". "Have you heard about Bitcoin?", I asked her to get an idea how to explain. "I have heard this term in the news, but don't really know much about it", she said. So, I started the elevator pitch and in the next eight or so minutes (about four round trips of Burj Khalifa's elevator from the ground to the observation deck) I was able to explain some of the potential dangers of centralization in different aspects of our social life and what some of the alternatives are.

The bootcamp had both on-site and remote participants and was well organized. We started with keynotes from Miriam Avery, Dietrich Ayala, and Ryan Barrett. Then some people introduced themselves and explained why they were attending the DWeb Summit and what ideas they had for the IndieWeb bootcamp. Some people had lightning demos. I demonstrated InterPlanetary Wayback (IPWB) briefly. I got to meet some people behind projects I was well aware of (such as Universal Viewer and Dat Project) and also got to know about some projects I didn't know before (such as Webmention and Scuttlebutt). We then scheduled BarCamp breakout sessions and had lunch.

During and after the lunch I had an interesting discussion and exchanged ideas with Edward Silverton from the British Library and a couple of people from Mozilla's Mixed Reality team about the Universal Viewer, IIIF, Memento, and multi-dimensional XR on the Web.

Later I participated in two sessions, "Decentralized Web Archiving" and "Free Software + Indieweb" (see the schedule for notes on various sessions). I proposed the first one, in which I explained the state of Web archiving, current limitations and threats, and the need to move it to a more persistent and decentralized infrastructure. I also talked about IPWB and how it can help in distributed web archiving (see notes for details and references). In the latter session we talked about different means to support Free Software and open-source developers (for example bug bounties, crowdfunding, and recurring funding), and compared and contrasted different models and their sustainability with closed-source software backed by for-profit organizations. We also touched briefly on some licensing complications.

I had to participate in the Science Fair at IA, so I had to get there a little earlier than the start time of the session. With that in mind, Dietrich (from the Firefox team) and I left the session a little before it was formally wrapped up as the SF traffic in the afternoon was going to make it a rather long commute.

Science Fair

The taxi driver was an interesting person with whom Dietrich and I shared the ride from the Mozilla SF office to the Internet Archive, talking about national and international politics, history, languages, music, and whatnot until we reached our destination, where food trucks and stalls were serving dinner. It was more windy and chilly out there than I had anticipated in my rather thin jacket. Brewster Kahle, the founder of the IA, who had just come out of the IA building, welcomed us and guided us to the registration desk, where a very helpful team of volunteers gave us our name badges and project sign holders. I acquired a table right outside the entrance of the IA's building, placed the InterPlanetary Wayback sign on it, and went to the food truck to grab my dinner. When I came back I found that the wind had blown my project sign off the table, so I moved it inside the building, where it was a lot cozier and more crowded.

The Science Fair event was full of interesting projects. You may explore the list of all the Science Fair projects along with their description and other details. Alternatively, flip through the pages of the following photo albums of the day.

Many familiar and new faces visited my table, discussed the project, and asked about its functionality, architecture, and technologies. On the one hand I met people who were already familiar with our work, and on the other hand some needed a more detailed explanation from scratch. I even met people who asked with surprise, "why would you make your software available to everyone for free?" This called for a brief overview of how the Open Source Software ecosystem works and why one would participate in it.

This is not a random video. This clip was played to invite Mike Judge, co-creator of HBO's Silicon Valley, onto the stage for a conversation with Cory Doctorow at the Opening Night Party after Brewster's welcome note (due to a streaming rights issue the clip is missing from IA's full session recording). I can't think of a better way to begin the DWeb Summit. This was my first introduction to Mike (yes, I had not watched the Silicon Valley show before). After an interesting Q&A session on the stage, I got the opportunity to talk to him in person, took a low-light blurred selfie with him, mentioned the Indian demonetization story (which, apparently, he was unaware of), and asked him to make a show in the future about potential threats to the DWeb. Web 1.0 emerged with a few entities controlling publishing and the rest of the people consuming that content. Web 2.0 enabled everyone to participate in the web as both creators and consumers, but control over privacy and censorship went into the hands of governments and a few Internet giants. If Web 3.0 (or DWeb) could fix this issue too, what would potentially be the next threat? There should be something which we may or may not be able to think of just yet, right?

Mike Judge and Sawood Alam

Decentralized Web Summit

For the next two days (August 1–2) the main DWeb Summit was held in the historic San Francisco Mint building. There were numerous parallel sessions going on all day long. At any given moment there was perhaps a session suitable for everyone's taste, and no one could attend everything they wished to attend. A quick look at the full event schedule would confirm this. Luckily, the event was recorded and those recordings have been made available, so one can watch various talks asynchronously. However, being there in person to participate in various fun activities, observe artistic creations, experience AR/VR setups, and interact with many enthusiastic people with many hardware, software, and social ideas is not something that can be experienced in recorded videos.

If the father of the Internet with his eyes closed trying to create a network with many other participants with the help of a yellow yarn, some people trying to figure out what to do with colored cardboard shapes, and some trying to focus their energy with the help of specific posture are not enough then flip through these photo albums of the event to have a glimpse into many other fun activities we had there.

Initially, I tried to plan my agenda, but I soon realized it was not going to work. So, I randomly picked one of the many parallel sessions of interest to me, spent an hour or two there, and moved to another room. In the process I interacted with many people from different backgrounds participating in either their individual or organizational capacity. Apart from the usual talk sessions, we discussed various decentralization challenges and their potential technical and social solutions in one-to-one or small group conversations. An interesting mention of additive economy (a non-zero-sum economy where transactions are never negative) reminded me of the gamification idea we explored when working on the Preserve Me! project, and I ended up having a long conversation with a couple of people about it during a breakout session.

If Google Glass was not cool enough then meet Abhik Chowdhury, a graduate student, working on a smart hat prototype with a handful of sensors, batteries, and low-power computer boards placed in a 3D printed frame. He is trying to find a balance in on-board data processing, battery usage, and periodic data transfer to an off-the-hat server in an efficient manner, while also struggling with the privacy implications of the product.

It was a conference where "Crypto" meant "Cryptocurrency", not "Cryptography", and every other participant was talking about Blockchain, Distributed/Decentralized Systems, Content-addressable Filesystems, IPFS, Protocols, Browsers, and a handful of other buzzwords. Many demos there were about "XXX but decentralized". Participants included the pioneers and veterans of the Web and the Internet, browser vendors, blockchain and cryptocurrency leaders, developers, researchers, librarians, students, artists, educators, activists, and whatnot.

I had a lightning talk entitled, "InterPlanetary Wayback: A Distributed and Persistent Archival Replay System Using IPFS", in the "New Discoveries" session. Apart from that I spent a fair amount of my time there talking about Memento and its potential role in making decentralized and content-addressable filesystems history-aware. During a protocol-related panel discussion, I worked with a team of four people (including members from the Internet Archive and MuleSoft) to pitch the need for a decentralized naming system that is time-aware (along the lines of IPNS-Blockchain) and can resolve a version of a resource at a given time in the past. I also talked to many people from Google Chrome, Mozilla Firefox, and other browser vendors and tried to emphasize the need for native support of Memento in web browsers.
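To illustrate what "time-aware" resolution means, here is a toy, in-memory Python sketch. It is not IPNS, a blockchain, or the Memento protocol, and the class, method names, and content identifiers are all hypothetical. It only shows the core lookup: given a name and a datetime, return the version that was current at that time.

```python
import bisect
from datetime import datetime


class TimeAwareNames:
    """Toy name registry that remembers every version published under a name."""

    def __init__(self):
        # name -> (sorted publication times, content identifiers in the same order)
        self._history = {}

    def publish(self, name, when, content_id):
        times, ids = self._history.setdefault(name, ([], []))
        index = bisect.bisect_right(times, when)
        times.insert(index, when)
        ids.insert(index, content_id)

    def resolve(self, name, when):
        """Return the content identifier that was current at `when`, or None."""
        times, ids = self._history.get(name, ([], []))
        index = bisect.bisect_right(times, when)  # versions published up to `when`
        return ids[index - 1] if index else None


names = TimeAwareNames()
names.publish("example.org", datetime(2016, 1, 1), "content-id-a")
names.publish("example.org", datetime(2018, 1, 1), "content-id-b")
print(names.resolve("example.org", datetime(2017, 6, 1)))
```

A resolver that keeps only the latest mapping would lose "content-id-a" as soon as the second version was published, which is the gap the pitch was about.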

Cory Doctorow's closing keynote on "Big Tech's problem is Big, not Tech" was perhaps one of the most talked-about talks of the event, and it received many reactions and much commentary. The recorded video of his talk is worth watching. Among many other things in his talk, he encouraged people to learn programming and to understand the functions of each piece of software we use. After his talk, an artist asked me how she or anyone else can learn programming. I told her that if one can learn a natural language, then programming languages are way more systematic, less ambiguous, and easier to learn. There are really only three basic constructs in a programming language: variable assignments, conditionals, and loops. Then I verbally gave her a very brief example of a mail merge using all three constructs that yields gender-aware invitations from a message template for a list of friends to be invited to a party. She seemed enlightened and delighted (while enthusiastically sharing her freshly learned knowledge with other members of her team) and exchanged contacts with me to learn more about some learning resources.
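Written down in Python, that mail merge might look something like this. The names, titles, and template wording are made up for the illustration, since the original example was only spoken.

```python
# Variable assignments: the guest list and the message template.
friends = [
    {"name": "Alice", "gender": "female"},
    {"name": "Bob", "gender": "male"},
    {"name": "Sam", "gender": "unspecified"},
]
template = "Dear {title} {name}, you are invited to my party!"

invitations = []
for friend in friends:                  # loop over the list of friends
    if friend["gender"] == "female":    # conditional picks the title
        title = "Ms."
    elif friend["gender"] == "male":
        title = "Mr."
    else:
        title = "Mx."
    invitations.append(template.format(title=title, name=friend["name"]))

for invitation in invitations:
    print(invitation)
```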

IPFS Lab Day

It looks like people were too energetic to get tired of such jam-packed and eventful days, as some of them had planned post-DWeb events for special interest groups. I was invited by Protocol Labs to give an extended talk at one such IPFS-centric post-DWeb event, called Lab Day 2018, on August 3. Their invitation arrived the day after I had booked my tickets and reserved the hotel room, so I ended up updating my reservations. This event was in a different location, and the venue was decorated with a more casual touch, with bean bags, couches, chairs, and benches near the stage and some containers for group discussions. You may take a glimpse of the venue in these pictures.

They welcomed us with new badges, some T-shirts, and some best-seller books to take home. The event had a good lineup of lightning talks and some relatively longer presentations, mostly extended forms of similar presentations in the main DWeb event. Many projects and ideas presented there were in their early stages. These sessions were recorded and published later after necessary editing.

I presented my extended talk entitled, "InterPlanetary Wayback: The Next Step Towards Decentralized Archiving". Along with the work already done and published about IPWB, I also talked about what is yet to be done. I explored the possibility of an index-free, fully decentralized collaborative web archiving system as the next step. I proposed some solutions that would require some changes in IPFS, IPNS, IPLD, and other technologies around to accommodate the use case. I encouraged people to discuss with me if they have any better ideas to help solve these challenges. The purpose was to spread the word out so that people keep web archiving related use cases in mind while shaping the next web. Some people from the core IPFS/IPNS/IPLD developer community approached me and we had an extended discussion after my talk. The recording of my talk and slides are made available online.

It was a fantastic event to be part of and I am looking forward to more such events in the future. The IPFS community and the people at Protocol Labs are full of fresh ideas and enthusiasm, and they are a pleasure to work with.


The Decentralized Web has a long way to go, and the DWeb Summit is a good place to bring people from various disciplines with different perspectives together every once in a while to synchronize all the distributed efforts and to identify the next set of challenges. While I could not attend the first summit (in 2016), I really enjoyed the second one and would love to participate in future events. Those two short days of the main event had more material than I can perhaps digest in two weeks, so my only advice would be to extend the duration of the event instead of having multiple parallel sessions with overlapping interests.

I extend my heartiest thanks to the organizers, volunteers, fund providers, and everyone involved in making this event happen and making it a successful one. I wish that going forward not just the Web but many other organizations, including governments, become more decentralized so that I do not open my wallet once again to realize it has some worthless pieces of currency bills that were demonetized overnight.


Sawood Alam

Friday, October 19, 2018

2018-10-19: Some tricks to parse XML files

Recently I was parsing the ACM DL metadata in XML files. I thought parsing XML would be a very straightforward job, given that Python has long had sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here, I share some tricks I learned. They are not meant to be a complete list, but the solutions are general, so they can be used as starting points for future XML parsing jobs.

  • CDATA. CDATA is seen in the values of many XML fields. CDATA means Character Data. Strings inside a CDATA section are not parsed. In other words, they are kept as they are, including markup. One example is
    <script>
    <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]>
    </script>
  • Encoding. Encoding is a pain in text processing. The problem is that there is no way to know the encoding of a text before opening and reading it (at least in Python). So we must sniff it by trying to open and read the file using a candidate encoding. If the encoding is wrong, the program will usually throw an error. In this case, we try another possible encoding. The "file" command in Linux gives the encoding information, so I know there are 2 encodings in the ACM DL XML file: ASCII and ISO-8859.
  • HTML entities, such as &auml;. The only 5 built-in entities in XML are quot, amp, apos, lt, and gt. Any other entity must be defined in a DTD file to show what it means. For example, the DBLP.xml file comes with a DTD file. The ACM DL XML should have associated DTD files, proceedings.dtd and periodicals.dtd, but they are not in my dataset.
The following snippet of Python code solves all three problems above and gives me the correct parsing results.

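import codecs
import logging
from bs4 import BeautifulSoup
config = {'xmlfile': 'acm_dl.xml'}  # placeholder: wherever the XML path is stored
encodings = ['ISO-8859-1','ascii']
for e in encodings:
    try:
        fh = codecs.open(config['xmlfile'],'r',encoding=e)
        fh.read()  # first pass: read the file only to check the encoding
    except UnicodeDecodeError:
        logging.debug('got unicode error with %s, trying a different encoding' % e)
    else:
        logging.debug('opening the file with encoding: %s' % e)
        break

# second pass: reopen with the encoding that worked and parse the content
f = codecs.open(config['xmlfile'],encoding=e)
soup = BeautifulSoup(f.read(),'html.parser')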

Note that we use codecs.open() instead of the Python built-in open(). We open the file twice: the first time only to check the encoding, and the second time to pass the whole file to a handle before it is parsed by BeautifulSoup. I found BeautifulSoup better at XML parsing than lxml, not just because it is easier to use but also because it lets you pick the parser. Note that I chose html.parser instead of the lxml parser, because the lxml parser is not able to parse all entries (for some unknown reason). Other users have reported this on Stack Overflow.
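The CDATA and entity behavior described in the list above can be checked with the Python standard library alone. This is a separate small sketch, not part of the ACM DL parsing code: it uses xml.etree.ElementTree and html.unescape on made-up strings.

```python
import html
import xml.etree.ElementTree as ET

# CDATA: the markup inside the section comes back as plain, unparsed text.
element = ET.fromstring("<script><![CDATA[ <message> Welcome </message> ]]></script>")
cdata_text = element.text

# An entity outside XML's five built-ins is rejected when no DTD defines it.
try:
    ET.fromstring("<name>J&auml;ger</name>")
    undefined_entity_rejected = False
except ET.ParseError:
    undefined_entity_rejected = True

# The HTML entity table does know &auml; and resolves it to the character.
resolved = html.unescape("J&auml;ger")

print(repr(cdata_text), undefined_entity_rejected, resolved)
```

That a strict XML parser rejects the undefined entity while HTML tooling knows it may be part of why an HTML-oriented parser copes better with files whose DTD is missing.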

Jian Wu