2015-07-02: JCDL2015 Main Conference

Large, Dynamic and Ubiquitous –The Era of the Digital Library

JCDL 2015 (#JCDL2015) took place at the University of Tennessee Conference Center in Knoxville, Tennessee. The conference was four days long; June 21-25, 2015. This year three students from our WS-DL CS group at ODU had their papers accepted as well as one poster (see trip reports for 2014, 2013, 2012, 2011). Dr. Weigle (@weiglemc), Dr. Nelson (@phonedude_mln), Sawood Alam (@ibnesayeed), Mat Kelly (@machawk1) and I (@LulwahMA) went to the conference. We drove from Norfolk, VA. Four of our previous members of our group, Martin Klein (UCLA, CA) (@mart1nkle1n), Joan Smith (Linear B Systems, inc., VA) (@joansm1th), Ahmed Alsum (Stanford University, CA) (@aalsum) and Hany SalahEldeen (Microsoft, Seattle)(@hanysalaheldeen), also met us there. The trip was around 8 hours. We enjoyed the mountain views and the beautiful farms. We also caught parts of a storm on our way, but it only lasted for two hours or so.

The first day of the conference (Sunday June 21, 2015) consisted of four tutorials and the Doctoral Consortium. The four tutorials were: Topic Exploration with the HTRC Data Capsule for Non-Consumptive Research, Introduction to Digital Libraries, Digital Data Curation Essentials for Data Scientists, Data Curators and Librarians and Automatic Methods for Disambiguating Author Names in Bibliographic Data Repositories.

Mat Kelly (ODU, VA)(@machawk1) covered the Doctoral Consortium.

The main conference started on Monday June 22, 2015 and opened with Paul Logasa Bogen II (Google, USA) (@plbogen). He started by welcoming the attendees and then he mentioned that this year had 130 registered attendees from 87 different organizations, and 22 states and 19 different countries.

Then the program chairs: Geneva Henry (University Libraries, GWU, DC), Dion Goh (Wee Kim Wee School of Communication and Information, Nanyang Technical University, Singapore) and Sally Jo Cunningham (Waikato University, New Zealand) added on the announcements and number of accepted papers in JCDL2015. Of the conference submissions, 18 (30%) of full research papers are accepted, and 30 (50%) of short research papers are accepted, and 18 (85.7%) of posters and demos are accepted. Finally, the speaker announced the nominees for best student paper and best overall paper.

The best paper nominees were:
The best student paper nominees were:
Then Unmil Karadkar (The University of Texas at Austin, TX) presented the keynote speaker Piotr Adamczyk (Google Inc, London, UK). Piotr's talk was titled “The Google Cultural Institute: Tools for Libraries, Archive, and Museums”. He presented some of Google attempts to add to the online cultural heritage. He introduced Google Culture Institute website that consisted of three main projects: the Art Project, Archive (Historic Moments) and World Wonders. He showed us the Google Art Project (Link from YouTube: Google Art Project) and then introduced an application to search museums and navigate and look at art. Next, he introduced the Google Cardboard (Link from YouTube:”Google Cardboard Tour Guide”) (Link from YouTube: “Expeditions: Take your students to places a school bus can’t”) where you can explore different museums by looking into a cardboard container that can house a user's electronic device. He mentioned that more museums are allowing Google to capture images of their museums and allowing others to explore it using Google Cardboard and that Google would like to further engage with cultural partners. His talk was similar to a talk he gave in 2014 titled "Google Digitalizing Culture?".

Then we started off with the two simultaneous sessions "People and Their Books" and "Information Extraction". I attended the second session. The first paper was “Online Person Name Disambiguation with Constraints” presented by Madian Khabsa (PSU, PA). The goal of his work is to map the name mentioned to the real world people. They found that 11%-17% of the queries in search engines are personal names. He mentioned that two issues are not addressed: adding constraints to the clustering process and adding the data incrementally without clustering the entire database. The challenge they faced was redundant names. When constraints are added they can be useful in digital library where user can make corrections. Madian described constraint-based clustering algorithm for person name disambiguation.

Sawood Alam (ODU, VA) (@ibnesayeed) followed Madian with his paper “Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages”. He mentioned that general online book readers are not suitable for scanned dictionary. They proposed an approach of indexing scanned pages of a dictionary that enables direct access to appropriate pages on lookup. He implemented an application called Dictionary Explorer where he indexed monolingual and multilingual dictionaries, with speed of over 20 pages per minute per person.

Next, Sarah Weissman (UMD, Maryland) presented “Identifying Duplicate and Contradictory Information in Wikipedia”. Sara identified sentences in wiki articles that are identical. She randomly selected 2k articles and manually identified them. She found that 45% are identical, 30% are templates, 13.15% are copy editing, 5.8% are factual drift, 0.3% are references and 4.9% are other pages.

The last presenter in this session is Min-Yen Kan (National University of Singapore, Singapore) (@knmnyn) presenting “Scholarly Document Information Extraction using Extensible Features for Efficient Higher Order Semi-CRFs”. He introduced the notion of extensible features for higher order semi-CRFs that allow memorization to speed up inference algorithms.

The papers in the other concurrent session that I was unable to attend were:

After the Research at Google-sponsored Banquet Lunch, Sally Jo Cunningham (University of Waikato, NZ) introduced the first panel "Lifelong Digital Libraries" and then the first speaker Cathal Gurrin (Dublin City University, Ireland)(@cathal). His presentation was titled "Rich Lifelong Libraries". He started off with using wearable devices and information loggers to automatically record your life in details. He gave examples of devices such as Google Glass or Apple’s iWatch that are currently in the market that record every moment. He has gathered a digital memory of himself since 2006 by using a wearable camera. The talk he gave was similar to a talk he gave at 2012 titled "The Era of Personal Life Archives".

The second speaker was Taro Tezuka (University of Tsukuba, Japan). His presentation was titled "Indexing and Searching of Conversation Lifelogs". He focused on search capability and that it is as important as storage capability in lifelong applications. He mentioned that cleaver indexing of recorded content is necessary for implementing a useful lifelong search systems. He also showed the LifeRecycle which is a system for recording and retrieving conversation lifelogs, by first recording the conversation, then providing speech recognition, after that store the result and finally search and show the result. He mentioned that the challenges that faces a person to allow being recorded is security issues and privacy.

The last speaker of the panel was Håvard Johansen (University of Tromso, Norway). First they started with definitions of lifelogs. He also discussed the use of personal data for sport analytic, by understanding how to construct privacy preserving lifelogging. After the third speaker the audience asked/discussed some privacy issues that may concern lifelogging.

The third and fourth session were simultaneous as well. The third session was "Big Data, Big Resources". The first presenter was Zhiwu Xie (Virginia Tech, VA) (@zxie) with his paper “Towards Use And Reuse Driven Big Data Management”. This work focused on integrating digital libraries and big data analytics in the cloud. Then they described its system model and evaluation.

Next, “iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling” was presented by Gerhard Gossen (L3S Research Center, Germany) (@GerhardGossen). iCrawl combines crawling of the Web and Social Media for a specified topic. The crawler works by collecting web and social content in a single system and exploits the stream of new social media content for guideness. The target users for this web crawling toolbox is web-science, qualitative humanities researchers. The approach was to start with a topic and follow its outgoing links that are relevant.

G. Craig Murray (Verisign Labs) presented instead of Jimmy Lin (University of Maryland, College Park) (@lintool). The paper was titled “The Sum of All Human Knowledge in Your Pocket: Full-Text Searchable Wikipedia on a Raspberry Pi”. Craig discussed how it is useful to have Wikipedia that you can access without Internet by connecting to Raspberry Pi device via bluetooth or wifi. He passed along the Raspberry Pi device to the audience, and allowed them to connect to it wirelessly. The device is considered better than Google since it offers offline search and full text access. It also offers full control over search algorithms and is considered a private search. The advantage of the data being on a separate device instead of on the phone is that it is cheaper per unit storage and offers full Linux stack and hardware customizability.

The last presenter in this session was Tarek Kanan (Virginia Tech, VA), presenting “Big Data Text Summarization for Events: a Problem Based Learning Course”. Problem/project Based Learning (PBL) is a student-centered teaching method, where student teams learn by solving problems. In this work 7 teams of student each with 30 students apply big data methods to produce corpus summaries. They found that PBL helped students in a computational linguistics class automatically build good text summaries for big data collections. The student also learned many of the key concepts of NLP.

The fourth session I missed was "Working the Crowd", Mat Kelly (ODU, VA) (@machawk1) recorded the session.

After that, Conference Banquet was served at the Foundry on the Fair Site.

On Tuesday June 23, 2015 after breakfast the Keynote speaker Katherine Skinner (Educopia Institute, GA). Her talk was titled “Moving the needle: from innovation to impact”. She discussed how to engage others to make use of digital libraries and archiving, getting out there and being an important factor to the community as we should be. She asked what digital libraries could accomplish as a field if we shifted our focus from "innovation" to "impact".

After that, there were two other simultaneous sessions "Non-Text Collection" and "Ontologies and Semantics". I attended the first session where there was one long paper presented and four short papers. The first speaker in this session was Yuehan Wang (Peking University, China). His paper was “WikiMirs 3.0: A Hybrid MIR System Based on the Context, Structure and Importance of Formulae in a Document”. The speaker discussed the challenges of extracting mathematical formula from the different representations. They propose an upgraded Mathematical Information Retrieval system named WikiMirs3.0. The system can extract mathematical formulas from PDF and can type in queries. The system is publicly available at: www.icst.pku.edu.cn/cpdp/wikimirs3/.

Next, Kahyun Choi (University of Illinois at Urbana-Champaign, IL) presented “Topic Modeling Users’ Interpretations of Songs to Inform Subject Access in Music Digital Libraries”. Her paper focused on addressing if topic modeling can discover subject from interpretations, and the way to improve the quality of topics automatically. Their data set was extracted from songmeanings.com which contained almost went four thousand songs with at least five interpretation per song. Topic models are generated using Latent Dirichlet Allocation (LDA) and the normalization of the top ten words in each topic was calculated. For evaluating a sample as manually assigned to six subjects and found 71% accuracy.

Frank Shipman (Texas A&M University, TX) presented “Towards a Distributed Digital Library for Sign Language Content”. In this work they try to locate content relating to sign language over the Internet. They propose a description of a software components of a distributed digital library of sign language content, called SLaDL. This software detects sign language content in a video.

The final speaker of this session was Martin Klein (UCLA, CA) (@mart1nkle1n), presenting “Analyzing News Events in Non-Traditional Digital Library Collections”. In his work they found indicators relevant for building non-traditional collection. From the two collection, an online archive of TV news broadcasts and an archive of social media captures, they found that there is an 8 hour delay between social media and TV coverages that continues at a high frequency level for a few days after a major event. In addition, they found that news items have potential to influence other collections.

The session I missed was "Ontologies and Semantics", the papers presented were:

After lunch, there were two other simultaneous sessions "User Issues" and "Temporality". I attended "Temporality" session where there were two long papers. The first paper was presented by Thomas Bogel (Heidelberg University, Germany) titled “Time Well Tell: Temporal Linking of News Stories”. Thomas presented a framework to link news articles based on temporal expressions that occur in the articles. In this work they recover the arrangement of events covered in an article, in the big picture a network of article will be timely ordered.

The second paper was “Predicting Temporal Intention in resource Sharing” presented by Hany SalahEldeen (ODU, VA) (@hanysalaheldeen). Links on web pages on Twitter could change over time and might not match users intention. In this work they enhance prior temporal intention model by adding linguistic feature analysis, semantic similarity and balancing the training dataset. In this current module they had a 77% accuracy on predicting the intention of the user.

The session I missed "User Issues" had four papers:

Next, there was a panel on “Organizational Strategies for Cultural Heritage Preservation”. Paul Logasa Bogen, II (Google, WA) (@plbogen) introduced four speakers in this panel. There were Katherine Skinner (Educopia Institute, Atlanta), Stacy Kowalczyk (Dominican University, IL)(@skowalcz), Piotr Adamczyk (Google Cultural Institute, Mountain View) and Unmil Karadkar (The University of Texas at Austin, Austin) (@unmil). In this panel they discussed the preservation goal, the challenges that faces organizations practice preservation centralized or decentralized preservation and how to balance these approaches. In the final minutes there were questions from the audience regarding privacy and ownership in Cultural Heritage collections.

Following that was Minute Madness, which was a session where each poster presenter has two chances (60 seconds then 30 seconds) to talk about their poster in attempt to lure attendees to come by during the poster session.

The final session of the day was the "Reception and Posters". Where posters/demos are viewed and everyone in the audience got three stickers that were used to vote for best poster/demo.

On Wednesday June 24, 2015, there was one session "Archiving, Repositories, and Content" and three different workshops: "4th International Workshop on Mining Scientific Publications (WOSP 2015)", "Digital Libraries for Musicology (DLfM)" and "Web Archiving and Digital Libraries (WADL 2015)".

The session of the day "Archiving, Repositories, and Content" had four papers. The first paper in the last session of the conference was Stacy Kowalczyk (Dominican University, IL)(@skowalcz) presenting “Before the Repository: Defining the Preservation Threats to Research Data in the Lab”. She mentioned that lost media is a big threat and this threat is required to be addressed by preservation. She conducted a survey to quantify the risk to the preservation of research data. By getting a sample of 724 National Science Foundation awardees completing the survey, she found that the human error was the greatest threat to preservation followed by equipment malfunction.

Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author) presented “How Well Are Arabic Websites Archived?”. In this work we focused on determining if Arabic websites are archived and indexed. We collected a simple of Arabic websites and discovered that 46% of the websites are not archived and that 31% are not indexed. We also analyzed the dataset to find that almost 15% had an Arabic country code top level domain and almost 11% had an Arabic geographical location. We recommend that if you want an Arabic webpage to be archived then you should list in DMOZ and host it outside an Arabic country.

Next, Ke Zhou (University of Edinburgh, UK) presented his paper “No More 404s: Predicting Referenced Link Rot in Scholarly Articles for Pro-Active Archiving” (from the Hiberlink project). This paper addresses the issue of having a 404 on a reference in a scholar article. They found that there are two types content drift and link rot, and that there are around 30% of rotten web references. This work suggests that authors to archive links that are more likely to be rotten.

Then Jacob Jett (University of Illinois at Urbana-Champaign, IL) presented his paper “The Problem of “Additional Content” in Video Games". In this work they first discuss the challenges that video games nowadays faces due to its additional content such as modification and downloadable contents. They try to address the challenges by proposing a solution by capturing additional contents.

After the final paper of the main conference lunch was served along with the announcement of best poster/demo by counting the number of the audience votes. This year there were two best poster/demo awards and they were to Ahmed Alsum (Stanford University, CA) (@aalsum) for “Reconstruction of the US First Website”, and to Mat Kelly (ODU, VA) (@machawk1) for “Mobile Mink: Merging Mobile and Desktop Archived Webs”, by Wesley Jordan, Mat Kelly, Justin F. Brunelle, Laura Vobrak, Michele C. Weigle, and Michael L. Nelson (learn more about Mobile Mink). 

Next, was the announcement for the awards for best student paper and best overall paper. The best student paper was awarded to Lulwah Alkwai (ODU, VA) (@LulwahMA) (your author), Michael L. Nelson, and Michele C. Weigle for our paper “How Well Are Arabic Websites Archived?”, and the Vannevar Bush best paper was awarded to Pertti Vakkari and Janna Pöntinen for their paper “Result List Actions in Fiction Search”.

After that there was the "Closing Plenary and Keynote", where J. Stephen Downie talked about “The HathiTrust Research Center Providing Analytic Access to the HathiTrust Digital Library’s 4.7 Billion Pages”. HathiTrust is trying to preserve the cultural records. It currently digitalized 13,496,147 volumes, 6,778,492 books and many more. There are any current projects that HathiTrust are working on, such as HathiTrust BookWorm which you can search for a specific term, the number of occurrence and its position. This presentation was similar to a presentation titled "The HathiTrust Research Center: Big Data Analytics in Secure Data Framework" presented in 2014 by Robert McDonald.

Finally, JCDL 2016 was announced to be located in Newark, NJ, June 19-23.

After that, I attended the "Web Archiving and Digital Libraries" workshop, where Sawood Alam (ODU, VA)(@ibnesayeed) will cover the details in a blog post.

by Lulwah Alkwai,

Special thanks to Mat Kelly for taking the videos and helping to edit this post.