Thursday, July 21, 2016

2016-07-21: Dockerizing ArchiveSpark - A Tale of Pair Hacking

"Some doctors prescribe application of sandalwood paste to remedy headache, but making the paste and applying it is no less of a headache." -- an Urdu proverb
This is the translation of a couplet from an Urdu poem that is often used as a proverb. It nicely reflects how I felt when Vinay Goel from the Internet Archive was demonstrating how suitable ArchiveSpark was for our IMLS Museums data analysis during the Archives Unleashed 2.0 Datathon at the Library of Congress, Washington DC, on June 14, 2016. ArchiveSpark allows easy data extraction, derivation, and analysis from standard web archive files (such as CDX and WARC). In the back of my mind I was thinking that while ArchiveSpark (or Warcbase) seemed nice, cool, and awesome for the task, and certainly a good idea for serious archive data analysis, it was perhaps overkill for a two-day hackathon event. Installing and configuring these tools would have required us to set up a Hadoop cluster, a Jupyter notebook, Spark, and a bunch of configurations for ArchiveSpark itself. After doing all that, we would have had to set up HDFS storage and import a few terabytes of archived data (CDX and WARC files) into it. It could easily have taken up a whole day for someone new to these tools, leaving almost no time for the real data analysis. That is why we decided to use standard Unix text processing tools for CDX analysis.

Pair Hacking

Fast-forward to the next week: we were attending JCDL 2016 at Rutgers University, New Jersey. On June 22, during the half-hour coffee break, I asked Helge Holzmann, the developer of ArchiveSpark, to help me understand the requirements and steps involved in a basic ArchiveSpark setup on a Linux machine so that I could create a Docker image to eliminate some friction for new users. We sat down together and discussed the minimal configuration that would make the tool work on a regular file system on a single machine without the complexities of a Hadoop cluster and HDFS. Based on his instructions, I wrote a Dockerfile that can be used to build a self-contained, pre-configured, and ready-to-spin Docker image. After some tests and polish, I published the ArchiveSpark Docker image publicly. This means running an ArchiveSpark instance is now as simple as running the following command (assuming Docker is installed on the machine):

$ docker run -p 8888:8888 ibnesayeed/archivespark

This command essentially means: run a Docker container from the ibnesayeed/archivespark image and map the internal container port 8888 to the host port 8888 (to make it accessible from outside the container). This will automatically download the image from Docker Hub if it is not in the local cache (which will be the case for the first run). Once the service is up and running (which will take a few minutes the first time, depending on the download speed, but subsequent runs will take a couple of seconds), the notebook will be accessible from a web browser at http://localhost:8888/. The default image is pre-loaded with some example files, including a CDX file, a corresponding WARC file, and a notebook file to get started with the system. To work on your own data set, please follow the instructions to mount host directories of CDX, WARC, and notebook files inside the container.
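For example, mounting host directories over the image's data and notebook locations might look like the following (the container-side paths here are illustrative assumptions; consult the image's documentation for the actual volume locations):

```shell
$ docker run -p 8888:8888 \
    -v /path/to/my/cdx:/archivespark/cdx \
    -v /path/to/my/warc:/archivespark/warc \
    -v /path/to/my/notebooks:/archivespark/notebook \
    ibnesayeed/archivespark
```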

As I tweeted about this new development, I got immediate encouraging responses from different people using Linux, Mac, and Windows machines.

Under the hood

For those who are interested in knowing what is happening under the hood of this Docker image, I will walk through the Dockerfile itself to explain how it is built.

We have used the official jupyter/notebook image as the base image. This means we are starting with a Docker image that includes all the necessary libraries and binaries to run the Jupyter notebook. Next, I added my name and email address as the maintainer of the image. Then we installed the JRE using the standard apt-get command. Next, we downloaded the Spark binary with Hadoop from a mirror and extracted it in a specific directory (this location is later used in a configuration file). Then we downloaded the ArchiveSpark kernel and extracted it in the location where Jupyter expects kernels to reside. Next, we overwrote the configuration file of the ArchiveSpark kernel with a customized kernel.json file. This custom configuration file overwrites some placeholders of the default config file, specifies the Spark directory (where Spark was extracted), and modifies it to run in non-cluster mode on a single machine. The next three lines add sample files/folders (the example.ipynb file, the cdx folder, and the warc folder, respectively) to the container and create volumes where host files/folders can be mounted at run time to work on real data. Finally, the default command "jupyter notebook --no-browser" is added, which will run by default when a container instance is spun up without a custom command.
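The steps above can be sketched as a Dockerfile. The base image, kernel.json overwrite, sample files, volumes, and default command follow the description above, but the exact Spark version, mirror URL, and directory paths below are assumptions for illustration, not the published Dockerfile itself:

```dockerfile
# Base image with Jupyter notebook and its dependencies
FROM jupyter/notebook
MAINTAINER Sawood Alam

# Java runtime needed by Spark
RUN apt-get update && apt-get install -y default-jre

# Spark binary (with bundled Hadoop) extracted to a known location,
# referenced later by the kernel configuration
RUN curl -sL http://mirror.example.org/spark-1.6.2-bin-hadoop2.6.tgz \
    | tar -xz -C /opt && ln -s /opt/spark-1.6.2-bin-hadoop2.6 /opt/spark

# ArchiveSpark kernel where Jupyter looks for kernels, then a custom
# kernel.json that points at /opt/spark and forces local (non-cluster) mode
RUN mkdir -p /usr/local/share/jupyter/kernels/archivespark
ADD kernel.json /usr/local/share/jupyter/kernels/archivespark/

# Sample files/folders, and volumes for mounting real data at run time
ADD example.ipynb /archivespark/notebook/
ADD cdx /archivespark/cdx
ADD warc /archivespark/warc
VOLUME ["/archivespark/cdx", "/archivespark/warc", "/archivespark/notebook"]

# Default command when a container is spun up without a custom command
CMD ["jupyter", "notebook", "--no-browser"]
```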


In conclusion, we see this dockerization of ArchiveSpark as a contribution to the web archiving community that eliminates the setup and getting-started friction from a very useful archive analysis tool. We believe that this simplification will encourage increased usage of the tool in web archive related hackathons, quick personal archive explorations, research projects, demonstrations, and classrooms. We also believe there is a need for dockerizing and simplifying other web archiving related tools (such as Warcbase) to give new users a friction-free way to get started. Going forward, some improvements that can be made to the ArchiveSpark Docker image include (but are not limited to) running the notebook inside the container under a non-root user, adding a handful of ready-to-run sample notebook files for common tasks to the image, and making the image configurable at run time (for example, to allow local or Hadoop cluster mode and HDFS or plain file system storage) while keeping defaults that work well for simple usage.


Sawood Alam

Monday, July 18, 2016

2016-07-18: Tweet Visibility Dynamics in a Tweet Conversation Graph

We conducted another study in the same spirit as the first, as part of our research (funded by IMLS) to build collections for stories or events. This time we sought to understand how to extract not just a single tweet, but the conversation to which the tweet belongs. We explored how the visibility of tweets in a conversation graph changes based on the tweet selected.
Fig 1: A Hypothetical Tweet Conversation Graph consisting of 8 tweets. An arrowhead points in the direction of a reply. For example, t8 replied to t5.
It all began when we started collecting tweets about the Ebola virus. After collecting the tweets, Dr. Nelson expressed an interest in seeing not just the collected tweets, but the collected tweets in the context of the tweet conversations they belong to. For example, if through our tweet collection process we collected tweet t8 (Fig. 1), we were interested, at the very least, in discovering t5 (replied to by t8), t2 (replied to by t5), and t1 (replied to by t2). A more ambitious goal was to discover the entire graph which contains t8 (Fig. 1: t1 - t8). In order to achieve this, I began by attempting to understand the nature of the tweet graph from two perspectives - the browser's view of the tweets and the Twitter API's view.

Fig 2: Root, Parent and Child tweets.
  1. Root tweet: a tweet which is not a reply to another tweet, but may be replied to by other tweets. For example, t1 (Fig. 1).
  2. Parent tweet: a tweet with replies, called children. A parent tweet can also be a child tweet of the tweet it replied to. For example, t2 (Fig. 1) is a parent to t4 - t6, but a child of t1.
  3. Child tweet: a tweet which is a reply to another tweet. The tweet it replied to is called its parent. For example, t8 (Fig. 1) is the child of t5.
  4. Ancestor tweets: all parent tweets which precede a given tweet. For example, the ancestor tweets of t8 are t1, t2, and t5.
  5. Descendant tweets: all child tweets which follow a given tweet. For example, the descendants of t2 are t4, t5, t6, and t8.
Tweet Visibility Dynamics in a Tweet Conversation Graph - Twitter API's Perspective:
The API provides a field called in_reply_to_status_id in a tweet's JSON. With this field, every tweet in the chain of replies before a given tweet can be retrieved. However, it does not let you get tweets which are replies to the current tweet. For example, if we selected tweet t1 (a root tweet), then, since t1 did not reply to another tweet (it has no parent), we would not be able to retrieve any other tweet, because we can only retrieve tweets in one direction (Fig. 3 left). If we selected tweet t2, the in_reply_to_status_id of t2 points to t1, so we can retrieve t1 (Fig. 3 right).

Fig 3: Through the API, from t1, no tweets can be retrieved, from t2, we can retrieve its parent reply tweet, t1
t5's in_reply_to_status_id points to t2, so we retrieve t2 and then t1 (Fig. 4 left). From t8, we retrieve t5, which retrieves t2, which retrieves t1 (Fig. 4 right). So with the last tweet in a conversation's reply chain, we can get all of the tweet's ancestors (its parent, its parent's parent, and so on).

Fig 4: Through the API, from t8 we can retrieve t5, and from t5 we can retrieve t2, and from t2 we can retrieve t1

To summarize the API's view of tweets in a conversation: given a selected tweet, we can see the parent tweets (plus the parents' ancestors - above), but NOT the child tweets (plus the children's descendants - below), and NOT sibling tweets (sideways).
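The upward walk described above can be sketched in a few lines of Python. The dict below simulates the hypothetical conversation graph of Fig. 1 (the parents of t3 and t7 are assumptions, since the figure text only specifies the edges along t8's chain and t2's children); a real implementation would look each tweet up via the Twitter API instead:

```python
# Simulated tweet store: tweet ID -> the ID it replied to (None for the root).
# Edges for t3 and t7 are assumptions for illustration.
PARENT = {"t1": None, "t2": "t1", "t3": "t1", "t4": "t2",
          "t5": "t2", "t6": "t2", "t7": "t3", "t8": "t5"}

def ancestors(tweet_id):
    """Walk in_reply_to_status_id upward; return the chain of parents, nearest first."""
    chain = []
    parent = PARENT[tweet_id]
    while parent is not None:
        chain.append(parent)
        parent = PARENT[parent]
    return chain

print(ancestors("t8"))  # ['t5', 't2', 't1']
print(ancestors("t1"))  # [] -- a root tweet exposes nothing via the API
```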
Tweet Visibility Dynamics in a Tweet Conversation Graph - Browser's Perspective:
By browsing Twitter, we observed that given a selected tweet in a conversation chain, we can see the tweet it replied to (parents and parents' ancestors), as well as the tweet's replies (children and children's descendants). For example, given t8, we will be able to retrieve t5, t2, and t1, just like the API (Fig. 5). 
Fig 5: From t8 we can access t5, t2 and t1

However, unlike the API, if we had t1, we would be able to retrieve t1 - t8, since t1 is the root tweet (Fig. 6).

Fig 6: From t1 we can access t2 - t8
To summarize the browser's view of tweets in a conversation: given a selected tweet, we can see the parent tweets (plus the parents' ancestors - above) and the child tweets (plus the children's descendants - below), but NOT sibling tweets (sideways).
Our findings are summarized in the following slides:

Informal Time Analysis of Extracting Tweets
We also considered a simple informal analysis (as opposed to asymptotic analysis based on Big-O) to estimate how long (in seconds) it might take to extract tweets by using the Twitter API vs. the browser (by responsibly scraping Twitter). This analysis only counts the number of requests issued in order to access tweets.

Informal Time Analysis for extracting tweets with the API:

The statuses API endpoint (used to get tweets by ID) imposes a rate limit of 180 requests per 15 minutes (1 request every 5 seconds). Given a tweet t(i), the i-th tweet in a chain of tweets, the amount of time (in seconds) to get the previous tweets in the conversation chain is:
5(i-1) seconds.
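As a worked example under the 5-seconds-per-request rate limit above: in Fig. 1, t8 is the 4th tweet in its reply chain (t1, t2, t5, t8), so reaching all of its ancestors takes 3 requests:

```python
def api_seconds(i, seconds_per_request=5):
    """Time to fetch the i-1 ancestors of the i-th tweet in a reply chain,
    at one rate-limited request per ancestor."""
    return seconds_per_request * (i - 1)

print(api_seconds(4))  # 15 seconds to walk t8 back to the root t1
print(api_seconds(1))  # 0 seconds -- a root tweet needs no requests
```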
Informal Time Analysis for extracting tweets with the Browser:
Consider a scraping implementation in which we retrieve tweets as follows:
  1. Load Twitter webpage for a tweet
  2. Sleep for a random time drawn from [1, 𝛿] seconds, where 𝛿 > 1
  3. Scroll to load new tweet content until we reach maxScrollForSingleRequest scrolls (maxScrollForSingleRequest > 0). Exit when no new content loads.
  4. Repeat 3.
Based on the implementation described above, given a tweet t(i), a maximum sleep time 𝛿 (with sleep times drawn from [1, 𝛿] seconds), and a constant maxScrollForSingleRequest, which represents the maximum number of scrolls we make per request, the estimated amount of time to get the conversation is at most:
E[𝛿] + (E[𝛿] × maxScrollForSingleRequest) seconds; where E[𝛿] = (1+𝛿)/2,
since the sleep time is drawn from the discrete uniform distribution over {1, ..., 𝛿} and E[𝛿] is its expected value.
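The estimate above can be computed directly (the parameter values in the example call are arbitrary assumptions, not from the study):

```python
def expected_sleep(delta_max):
    """Expected value of a sleep time drawn uniformly from {1, ..., delta_max}."""
    return (1 + delta_max) / 2

def browser_seconds(delta_max, max_scrolls):
    """Upper-bound estimate: one expected sleep for the initial page load,
    plus one expected sleep per scroll, up to max_scrolls scrolls."""
    e = expected_sleep(delta_max)
    return e + e * max_scrolls

print(browser_seconds(delta_max=5, max_scrolls=10))  # 33.0 seconds
```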

Our findings are of consequence particularly to tweet archivists, who should understand the visibility dynamics of the tweet conversation graph.

Thursday, July 7, 2016

2016-07-07: Signposting the Scholarly Web

The web site for "Signposting the Scholarly Web" recently went online.  There is a ton of great content available and since it takes some time to process it all, I'll give some of the highlights here.

First, this is the culmination of ideas that have been brewing for some time (see this early 2015 short video, although some of the ideas can arguably be traced to this 2014 presentation).  Most recently, our presentation at CNI Fall 2015, our 2015 D-Lib Magazine article, and our 2016 tech report advanced the concepts.

Here's the short version: the purpose is to make a standard, machine-readable method for web robots and other clients to "follow their nose" as they encounter scholarly material on the web.  Think of it as similar (in purpose if not technique) to Facebook's Open Graph or FOAF, but for publications, slides, data sets, etc. 

Currently there are three basic functions in Signposting:
  1. Discovering rich, structured, bibliographic metadata from web pages.  For example, if my user agent is at a landing page, publication page, PDF, etc., then Signposting allows me to discover the BibTeX, MARC, DC, or whatever metadata format the publisher makes available.  Lots of DC records "point to" scholarly web pages, but this defines how the pages can "point back" to their metadata.
  2. Providing bi-directional linkage between a web page and its DOI.  OK, technically it doesn't have to be a DOI, but that's the most common case.  One can dereference a DOI and be redirected to the URI at the publisher's site, but there isn't a standardized, machine-readable method for discovering the DOI from the landing page, PDF, data set, etc. at the publisher's site (note: rel="canonical" serves a different purpose).  The problem is that few people actually link to DOIs; instead they link to the final (and not stable) URL.  For example, a popular news story about cholesterol research links to the article at the publisher's site, but not the DOI.  For this purpose, we introduce rel="identifier", which allows items in a scholarly object to point back to their DOI (or PURLs, Handles, ARKs, etc.). 
  3. Delineating what's part of the scholarly object and what is not.  Some links are clearly intended to be "part" of the scholarly object: the PDF, the slides, the data set, the code, etc.  Some links are useful, but not part of the scholarly object: navigational links, citation services, bookmarking services, etc.  You can think of this as a greatly simplified version of OAI-ORE (and if you're not familiar with ORE, don't worry about it; it's powerful but complex).  Knowing what is part of the scholarly object will, among other things, allow us to assess how well it has been indexed, archived, etc.
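For illustration, a landing page's HTTP response might expose such links in a Link header. The rel="identifier" relation is the one proposed above; the URLs and the other relation types in this sketch are hypothetical placeholders, not prescribed by the Signposting site:

```http
HTTP/1.1 200 OK
Content-Type: text/html
Link: <https://doi.org/10.1234/example>; rel="identifier",
      <https://publisher.example/article/123.bib>; rel="describedby"; type="application/x-bibtex",
      <https://publisher.example/article/123.pdf>; rel="item"; type="application/pdf"
```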
Again, there's a ton of material at the site, both in terms of modeling common patterns as well as proposed HTTP responses for different purposes.  But right now it all comes down to providing links for three simple things: 1) the metadata, 2) the DOI (or other favorite identifier), 3) items "in the object". 

Please take a look at the site, join the Signposting list and provide feedback there about the current three patterns, additional patterns, possible use cases, or anything else. 


Friday, July 1, 2016

2016-07-01: Fulbright Enrichment Seminar - Lab to Market: Entrepreneurship and Technological Innovation Enrichment (May 24 - 28, 2016)

One of the most valuable experiences that I have had in my life is being a Fulbright scholar. Before winning this scholarship, I was an employee at BPS-Statistics Indonesia. I started working there right after I received my B.S. in Computational Statistics from the Institute of Statistics in Jakarta, Indonesia. I had worked for 3 years when suddenly I felt like I was stuck in a comfort zone. Science and knowledge, especially those related to technology, are growing very rapidly. There is a lot of new information out there, which I could not get if I did not get out of my office building. I needed to upgrade my education for a better career in the future. I started applying for many scholarships to study abroad, and got many rejections before finally I was invited to an interview for Fulbright. After a very long selection process (it took almost a year from when I submitted my application), I was fortunate enough to receive a Fulbright scholarship. Now, I am pursuing an M.S. in Computer Science at Old Dominion University in Norfolk, Virginia. 
I feel so blessed because Fulbright not only gives me the opportunity to travel abroad and continue my education for free, it also comes with many other benefits. One of them is the opportunity to participate in an enrichment seminar hosted by the U.S. Department of State’s Bureau of Educational and Cultural Affairs (ECA) to increase mutual understanding between the people of the U.S. and the people of other countries. There were 11 enrichment seminars conducted between December 2015 and May 2016, which covered various topics: U.S. Politics and Elections, Global Health Innovations, Democratization of Education, and Entrepreneurship and Technological Innovation. I had the honor of attending the last seminar, titled "The 2016 Lab to Market: Entrepreneurship and Technological Innovation Enrichment", which was hosted in Pittsburgh, Pennsylvania from May 24 - 28, 2016. This seminar focused on how to use technological advances to support scientific and business disciplines. It included discussions with entrepreneurs who successfully bring technological products and services to the marketplace, and getting involved in an innovation and ideation project.

Day 1: Arrival, Registration, and Opening Dinner

Welcome Announcement
      I arrived at The Omni William Penn Hotel at 3.00 pm and was directly ushered to the Three Rivers Room to register and receive the "welcome package", which contained a t-shirt, a name tag, and a guidebook. Seven U.S. Fulbright scholars volunteered to assist the seminar attendees with registration. One of them asked me which part of Indonesia I came from, then marked my hometown on a Google map to create the distribution map of Fulbright scholars attending the seminar. 
Distribution map of the seminar attendees. 
      At 5.30 pm we departed for the opening dinner which was hosted at Senator John Heinz History Center. While having dinner, we listened to the keynote remarks from Thomas Petzinger Jr., the co-founder of LaunchCyte LLC about his experience in creating a startup and lessons that he learned (the triumphs and struggles) related to entrepreneurship. 
Keynote remarks from Thomas Petzinger, Jr.
Opening Dinner

Day 2: Panel Discussion, Site Visits, and Small Group Dinner

Discussion panel about social innovations
  We started the day by participating in a panel discussion about "Social Innovations: Tech Solutions for Global Good." The panelists were Andrew Butcher from GTECH Strategies, Corrine Clinch from Rorus Inc., and Lee Kimball from Thread International. Acting as the moderator was Tim Zak, Director of the Institute for Social Innovation at Carnegie Mellon University. The panelists shared their stories and experiences of transferring their academic knowledge into something that gives real benefit to people. It is about how technologies can integrate into people's lives. Here is a very nice remark that I heard in this session: "The key to success in the entrepreneurial field is the ability to adapt and recover quickly, every time we fail. The real failure only happens when we give up." Moreover, Tim Zak also said, "If you want to change the world, get out of your lab and go meet customers, influencers, and investors."

After lunch, we departed for site visits. There were three sites that we could choose to visit: TechShop, Human Engineering Research Laboratories, and CREATE Lab. I joined the group that went to TechShop. We got the opportunity to see real technological innovations and interact with people who spend their daily lives dealing with innovations and creations.
TechShop Visit

Here, I also would like to share the pictures taken by my colleagues who went to CREATE Lab and Human Engineering Research Laboratories.       


In the evening, the Fulbrighters were divided into six groups that would depart for six different restaurants for the group dinner. I chose to go to Alihan's Mediterranean Cuisine, which is located close to the hotel where I stayed. I used this opportunity to mingle with other Fulbright scholars, and we exchanged stories and experiences about living and studying in the U.S. We also talked about current issues in our respective countries. It is always interesting to hear about what happens in other countries from their citizens' perspectives. 
Dinner at Alihan's Mediterranean Cuisine
      We had a free evening after dinner. My Indonesian fellows and I used this opportunity to explore Pittsburgh. Initially, we only wanted to hang out at Starbucks or another coffee shop. Unfortunately, in Pittsburgh, everything closes after 8.00 pm, except the bars. Since none of us drink alcohol, we just wandered the streets of Pittsburgh before we met another Fulbright group, who invited us to go with them to Mount Washington. We took the skylift to the top of Mount Washington and were rewarded with a very beautiful view of Pittsburgh at night. 
Pittsburgh at night
Pittsburgh views from the top of Mount Washington

Day 3: Panel Discussion, Fulbright Talks, and Cultural Activities

Panel discussion: from concept to commercialization
Again, we started the day with a panel discussion. This time, the topic was how to bring academic research to the commercial market. We were lucky to hear first-hand experience from great panelists who have taken their research and inventions and built commercial businesses (startups): Eric J. Beckman from CoHera Medical, Kathryn Farraro from BioStratica, and Noah Snyder from Interphase Materials. We heard a mind-blowing story about how an invention like a glue can be used by a surgeon to join human tissues and repair knee ligaments. The panelists were asked what the biggest challenge they have faced so far is, and whether leaving academia to become entrepreneurs still leaves them intellectually challenged. They answered that the biggest challenge is being entrepreneurs itself, and that they can still challenge themselves intellectually by doing hands-on research on the things that they want, without being restricted by grants. 
I summarized the discussion into these points:
  1. Think of another application for your technology that appeals to a wider market with easier accessibility.
  2. A startup team is like a family. You do not really need a rockstar; chemistry and good people are what you need. If you have these, you can do anything.
  3. Avoid excessive use of email for communication between startup team members. Talking in person is always best.
  4. Fight against your imposter syndrome.
  5. Persistence is always the key.

Before the lunch break, we had a brief session with Kristen Van Vleck, a staff member from the Institute of International Education (IIE). IIE is a non-profit organization that is responsible for organizing and administering the Fulbright scholarship. Ms. Van Vleck reminded us of our responsibilities and benefits as Fulbright scholars, our visa status, and the procedures that we have to undertake after finishing our degrees in the U.S. She also encouraged us to get involved in alumni networks and participate in volunteering activities to gain more experience related to cultural exchange.
Briefing from an IIE staff  
At 1.30 pm, we departed for cultural activities at the Carnegie Museum of Art and Natural History. The museum tour was divided into two sessions. The first session was exploring the Museum of Art. In this part of the museum, we saw art collections on various themes, from beautiful landscape paintings to some elusive structures whose meanings are very hard to comprehend. The second session was exploring the Museum of Natural History, which offers several exhibitions to visit, such as "Dinosaurs in Their Time" and "Minerals and Gems."
Museum visits
Inside of Carnegie Museum of Art and Natural History

Day 4: Academic Entrepreneurship Overview (Science vs. Business), Ideation and Innovation Projects.

Startup Lifecycle
The morning session started with a talk from Babs Carryer, Director of Education and Outreach for the University of Pittsburgh's Innovation Institute. She gave a wonderful speech about how to bridge the gap between science and business, such as how to conduct a SWOT (strength, weakness, opportunity, threat) analysis, how to analyze the customers, and what they need from us. She explained the startup lifecycle, which covers the steps that we undertake to develop an idea into a startup that has financial worth. She said that more startups fail from lack of customers than from a failure of product development. It happens because scientists usually think in reverse: develop a solution first, then find a problem that could be solved by the solution.

Problem after solution
There was one funny experiment that we did during her talk. Since the seminar participants consisted of STEM students and business/economics students, she asked the STEM people to write what they think about business people, and vice versa. It turns out that both STEM and business people share some similar characteristics.
Business people vs Scientist
After the coffee break, we were divided into 10 groups and given a task to build a startup project. We had to pretend that we were researchers who wanted to propose something as a solution to global problems. In my group, we had a fierce discussion to identify what problem we wanted to address, how we would address it, and who our target customers were. We spent 3 hours generating the idea and making the presentation slides for our project. The presentation went pretty well, and I was amazed by the novel ideas presented by all groups, such as converting garbage into energy, turning pineapple leaves into fine clothing material, conducting a self-test to detect the Zika virus, and creating an edible plastic bottle to solve sea and land pollution problems.
Project Ideation and Innovation
We ended the day by having a closing dinner and taking some individual and group pictures at Grand Concourse Restaurant, Pittsburgh.
      I'm so thankful and grateful to have this wonderful opportunity. I learned so much about technology and innovation and gained new insight into the science-business combination: "Research is not an invention. An invention is not a product. A product is not a business" - Dr. Bud Peterson. It was an unforgettable experience. I met 131 other Fulbrighters from 64 countries, a very valuable networking resource that will benefit my future career. 
      As a closing remark, I just want to quote a famous saying that we always hear in every Fulbright seminar or conference: "Once a Fulbrighter, always a Fulbrighter."

- Erika -

Thursday, June 30, 2016

2016-06-30: JCDL 2016 Doctoral Consortium Trip Report

Traditionally, the Joint Conference on Digital Libraries (JCDL) has hosted a workshop session called the Doctoral Consortium (DC), specific to PhD students in the digital library research field, and this year (JCDL 2016) was no exception. The workshop was intended for students who are in the early stages of their dissertation work. Several WS-DL group members have attended and reported on past DC workshops, and this time it was my turn.

This year's doctoral consortium (June 19, 2016) was chaired by George Buchanan, J. Stephen Downie, and Uma Murthy. Committee members included Sally Jo Cunningham, Michael Nelson, Martin Klein, and Edie Rasmussen. A total of six PhD students participated in the workshop and presented their work. The doctoral consortium session is generally not open for public participation; however, Michele C. Weigle (JCDL 2016 program chair), Mat Kelly (last year's DC participant), and Alexander Nwala (potential future participant) also attended the session. Each presenter was given about 20 minutes for the talk and 10 minutes for questions and comments.

I, Sawood Alam from Old Dominion University, was the first presenter of the session, with my research topic "Web Archive Profiling for Efficient Memento Aggregation". I elaborated on the problem space of my research work by giving an example of collection building, indexing, updating indexes, and profiling or summarization of the collection for better collection understanding and efficient lookup routing. With the help of real-life examples and events, I established the importance of small web archives and the need for efficient means of aggregating them. I further explained the methodology and various approaches to web archive profiling depending on available resources and the desired level of detail. I briefly described the evaluation plan and preliminary results published/accepted in TPDL15, IJDL16, and TPDL16. Finally, I presented my tentative timeline of work and publication plans. My work is supported in part by the IIPC.

Adeleke Olateju Abayomi from the University of KwaZulu-Natal, South Africa presented her work entitled "An Investigation of the Extent of Automation of Public Libraries in South-West Nigeria". It was survey-based research, for which she conducted interviews and questionnaires with randomly selected librarians. Attendees of the workshop asked about the scope of the automation she was studying - whether it was limited to background library management processes or included public-facing services as well, and whether the library patrons were also interviewed. Committee members suggested that she also include case studies from nearby countries that have invested in library automation to strengthen her arguments.

Bakare Oluwabunmi Dorcas from the University of KwaZulu-Natal, South Africa presented her research work on "The Usage of Social Media Technologies among Academic Librarians in South Western Nigeria". She used a survey-based research approach, conducting interviews and questionnaires with academic librarians at six universities. She mentioned how librarians are using social media, such as Facebook groups, to provide library and information services to library clientele. The outcome of the study is expected to improve practice, inform policy, and extend theory in the field of Social Media Technologies (SMT) use in academic libraries in a developing country context. It was suggested that she examine whether the material posted on social media is periodically archived elsewhere.

Prashant Chandrasekar from Virginia Tech presented his work on "A DL System to Facilitate Behavioral Studies of Social Networks". He is working on designing a framework that would enable researchers to conduct hypothesis testing on information related to the study of human behavior in a clinical trial that involves social networks. The system is being built with the immediate aim of serving the needs of a team of researchers that is part of the Social Interactome project. However, the design of the framework and the scenarios of use will be generalized to all psychologists/sociologists.

Lei Li from the University of Pittsburgh and the China Scholarship Council presented "A Judgement Model for Academic Content Quality on Social Media". Initially, there was some lack of clarity about what she meant by "academic content" on social media; she clarified that her study is centered on ResearchGate posts and comments. The general comment from the committee on her quality assessment approach can be summarized as: it would be great to establish something of this sort, but she should limit the scope for her PhD work. Edie Rasmussen nicely put the challenge of creating a data set for quality measurement as, "Life's too short to generate your own data set," which was in line with Yasmin AlNoamany saying, "I will never do it again!" while describing a manually labeled data set during her recent PhD defense. Dr. Michael Nelson suggested that Lei Li pick a good example and walk through it to elaborate on the process.

The final presentation of the day was from Jessica Ogden from the University of Southampton. Jessica presented her ongoing PhD research entitled "Interrogating the Politics and Performativity of Web Archives", which is centered on web archival practice, specifically looking at selection and collection practices across different web archiving communities in the field. Jessica prefaced the presentation with some information regarding her academic background to provide context for how the interdisciplinary project is being approached. Some philosophical questions were raised regarding the nature of web archives (and assumptions about the Web itself), as well as the importance of documenting the assumptions made during the selection and collection of web archives (which are often left undocumented). For more details on Jessica's research and the presentation at JCDL 2016 DC, see her blog post.

Once all six participants had presented their work, committee members offered some general comments, such as that every presenter did a good job of wrapping up their talk in the allotted time and leaving enough time for questions and comments. They also noted that slides should not have so many words that the presenter ends up reading them verbatim; on the other hand, they should not be at the other extreme either, where every slide has nothing but pictures. After these comments, the session was opened for everyone to provide feedback or ask general questions of the presenters or the committee members. I noted that this year's doctoral consortium was dominated by "social media" based studies.

On our way back to the hotel, Alexander said, "the committee members had such a deep understanding of the subject and provided very useful comments." Mat and I replied, "yes, they did indeed." I strongly recommend that, if possible, every PhD candidate participate in a Doctoral Consortium workshop in their respective field at least once, to gain some insights and perspective from people outside their thesis committee.

You may also like to read the main JCDL 2016 conference coverage by Alexander and the WADL 2016 workshop coverage by Mat.

Update (July 2, 2016): Added Bakare's slides and updated the description of her talk.

Sawood Alam