Monday, September 28, 2015

2015-09-28: TPDL 2015 in Poznan, Poland

The Old Market Square in Poznan
On September 15, 2015, Sawood Alam and I (Yasmin AlNoamany) attended the 2015 Theory and Practice of Digital Libraries (TPDL) Conference in Poznan, Poland. This year, WS-DL had four accepted papers in TPDL by three students: Mohamed Aturban (who could not attend the conference because of visa issues), Sawood Alam, and Yasmin AlNoamany. Sawood and I arrived in Poznan on Monday, Sept. 14. Although we were tired from travel, we could not resist walking to the best area in Poznan, the old market square. It was fascinating to see the beautiful, colorful houses at night, reflected in the streets after the rain, while many street artists played beautiful European music.

The next morning we headed to the conference, which was held at the Poznań Supercomputing and Networking Center. The organization of the conference was amazing, and the general conference co-chairs, Marcin Werla and Cezary Mazurek, were always there to answer our questions. Furthermore, the people at the conference reception were there for us the whole time to help us with transportation, especially with communicating with taxi drivers; we did not speak Polish and they did not speak English. Every day of the conference there were coffee breaks where we had hot and cold drinks and snacks. It is worth mentioning that I had the best coffee I have ever tasted in Poland :-). The main part of the TPDL 2015 conference was streamed live and recorded. The recordings will be processed and made publicly available online on the PlatonTV portal.

Sawood (on the left) and Jose (on the right)
We met Jose Antonio Olvera, who interned in the WS-DL lab in summer 2014, at the entrance. At the conference, Jose had an accepted poster, “Evaluating Auction Mechanisms for the Preservation of Cost-Aware Digital Objects Under Constrained Digital Preservation Budgets”, which he presented in the poster session on the evening of the first day. It was nice meeting him, since I was not there when he interned in our lab.
The first day of the main conference, September 15th, started with a keynote speech, "Data – unbound by time or discipline – challenges and new skills needed", by David Giaretta, whom I was honored to speak with many times during the conference and who was among the audience of my presentations. At the beginning, Giaretta introduced himself with a summary of his background. His speech was mainly about data preservation and the challenges the field faces, such as link rot, which Giaretta considered a big horror. He mentioned many examples of the possibility of data loss. Giaretta talked about the big data world and presented the 7 (or 8 (or 9)) V’s of big data: volume, velocity, variety, volatility, veracity, validity, value, variability, and visualization. I loved these quotes from his speech:
  • "Preservation is judged by continuing usability, then come value". 
  • "Libraries are gateways to knowledge". 
  • "Metadata is classification".
  • "emulate or migrate".
He talked about how valuable and expensive it is to preserve scientific data, then raised an issue about reputation for keeping things over time and long-term funding. Funding is a big challenge in digital preservation, so he talked about vision and opportunities for funding. Giaretta concluded his keynote with the types of digital objects that need to be preserved, such as simple documents and images, scientific data, complex objects, and objects that change over time (such as annotations). He raised this question: "what questions can one ask when confronted with some completely unfamiliar digital objects?" Giaretta ended his speech with a piece of advice: "Step back and help the scientists to prepare data management plans, the current data management plan is very weak".

After the keynote we went to a coffee break, then the first session of the conference, "Social-technical perspectives of digital information", started. The session was led by WS-DL’s Sawood Alam presenting his work "Archive Profiling Through CDX Summarization", which is a product of an IIPC-funded project. He started with a brief introduction to the Memento aggregator and the need for profiling the long tail of archives to improve the efficiency of the aggregator. He described two earlier profiling efforts: the complete-knowledge profile by Sanderson and the minimalistic TLD-only profile by AlSum. He described the limitations of the two profiles and explored the middle ground of various other possibilities. He also talked about the newly introduced CDXJ serialization format for profiles and illustrated its usefulness in serializing profiles at scale, with the ability to merge and split arbitrary profiles easily. He evaluated his findings and concluded that his work so far gained up to 22% routing precision at less than 5% of the cost of the complete-knowledge profile, without any false negatives. The code to generate profiles and benchmarks can be found in a GitHub repository.
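To make the profiling idea concrete, here is a minimal, hypothetical sketch (not the actual IIPC project code; the toy profile shape and function names are my own): it collapses a list of captured URLs into per-host capture counts keyed by SURT-style reversed hostnames, the kind of compact summary an aggregator could use to decide whether an archive is worth querying.

```python
from collections import Counter
from urllib.parse import urlsplit

def surt_host(url):
    """Reverse a URL's hostname into SURT-like form, e.g. example.com -> com,example."""
    host = urlsplit(url).hostname or ""
    return ",".join(reversed(host.split(".")))

def profile_cdx(captured_urls):
    """Summarize captured URLs into per-host capture counts (a toy archive profile)."""
    return Counter(surt_host(u) for u in captured_urls)

captures = [
    "http://example.com/page1",
    "http://example.com/page2",
    "http://sub.example.com/",
    "http://archive.org/about",
]
profile = profile_cdx(captures)
print(dict(profile))
```

An aggregator could skip any archive whose profile shows no captures for a requested host, which is the source of the routing-efficiency gains this kind of profiling targets.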

Next, there was a switch between the second and the third presentations, and since Sawood was supposed to present on behalf of Mohamed Aturban, the chair of the session gave Sawood enough time to breathe between the two presentations.

The second presentation was "Query Expansion for Survey Question Retrieval in the Social Sciences" by Nadine Dulisch from GESIS and Andreas Oskar Kempf from ZBW. Andreas started with a case study of the usage of survey questions, which were developed by operational organizations, in social science. He presented the importance of social science survey data for social scientists. Then, Nadine talked about the approaches they applied for query expansion. They evaluated thesaurus-based and co-occurrence-based expansion approaches for query expansion to improve retrieval quality in digital libraries and research data archives, and presented the results of their experiments based on trec_eval. She showed that statistical expansion was better than intellectual expansion: automatically expanded queries using extracted co-occurring terms could provide better results than queries manually reformulated by a domain expert.

Sawood then presented "Quantifying Orphaned Annotations in Hypothes.is" on behalf of Mohamed Aturban. In this paper, Aturban et al. analyzed 6,281 highlighted text annotations in the Hypothes.is annotation system. They also used the Memento Aggregator to look for archived versions of the annotated pages. They found that 60% of the highlighted text annotations are orphans (i.e., annotations attached to neither the live web nor any memento) or in danger of being orphaned (i.e., annotations attached to the live web but not to any memento). They found that if a memento exists, there is a 90% chance that it recovers the annotated webpage. Using public archives, only 3% of all highlighted text annotations could be reattached; otherwise they would be orphaned. They found that for the majority of the annotations, no memento existed in the archives. Their findings highlight the need for archiving pages at the time of annotation.

After the end of the general session, we took a lunch break where we gathered with Jose Antonio Olvera and many of the conference attendees to exchange our research ideas.

After the lunch break, we attended the second session of the day, "Multimedia information management and retrieval and digital curation". The session started with "Practice-oriented Evaluation of Unsupervised Labeling of Audiovisual Content in an Archive Production Environment", presented by Victor de Boer. In their work, de Boer et al. evaluated the automatic labeling of audiovisual content to improve efficiency and inter-annotator agreement by generating annotation suggestions automatically from textual resources related to the documents to be archived. They performed pilot studies to evaluate term-suggestion methods through precision and recall, taking the terms assigned by archivists as ground truth. They found that the quality of the automatic term suggestions was sufficiently high.

The second presentation was "Measuring Quality in Metadata Repositories" by Dimitris Gavrilis. He started his presentation by mentioning that this is a hard topic, then explained why this research is important. He explained the specific criteria that determine data quality: completeness, validity, consistency, timeliness, appropriateness, and accuracy. In their paper, Gavrilis et al. introduced a metadata quality evaluation model (MQEM) that provides a set of metadata quality criteria as well as contextual parameters concerning metadata generation and use. The MQEM allows curators and metadata designers to assess the quality of their metadata and to run queries on existing datasets. They evaluated their framework on two different use cases: application design and content aggregation.

After the session, we took a break, and then I fell ill, which prevented me from attending the discussion panel session, entitled "Open Access to Research Data: is it a solution or a problem?", and the poster session. I went back to the hotel to rest and prepare for the next day's presentation. I am embedding the tweets about the panel and the poster session.

The next day I felt fine, so we went early to have breakfast in the beautiful old market square, then headed to the conference. The second day was opened by Cezary Mazurek, who introduced the day's sessions and thanked the sponsors of the conference. Then he left us with a beautiful piece of music, which was related to the second keynote speaker.

The keynote speech was "Digital Audio Asset Archival and Retrieval: A User's Perspective" by Joseph Cancellaro, an active composer, musician, and chair of the Interactive Art and Media Department of Columbia College in Chicago. Cancellaro started with a short bio about himself. The first part of his presentation addressed the issues of audio assets and the constant problems for sound designers in non-linear environments: naming conventions (meta tags), search tools, storage (failure), retrieval (failure), digital signal processing (DSP), and so on. He also mentioned how his department handles these issues; for example, for naming conventions, they add tags to the files. He explained the simple sound asset SR workflow. Preservation to Cancellaro is "not losing any more audio data". The second part of his presentation was about storage, retrieval, possible solutions, and content creation. He mentioned some facts about storage and retrieval:
  • The decrease in technology costs has reduced the local issues of storage capacity (this is always a concern in academia). 
  • Bandwidth is still an issue in real-time production. 
  • Non-linear sound production is a challenge for linear minded composers and sound designers.
He mentioned that searching for sound objects is a blocking point for many productions, then continued, "when I ask my students about the search options for the sound track they have, all what I hear are crickets". At the end, Dr. Cancellaro presented the agile concept as a solution for content management systems (CMS). He presented basic digital audio theory: sound as a continuous analog event is captured at specific data points.

After the keynote, we took a coffee break, then the sessions of the second day started with "Influence and Interrelationships among Chinese Library and Information Science Journals in Taiwan" by Ya-Ning Chen. In this research, the authors investigated the citation relations between different journals based on a dataset collected from 11 Chinese LIS journals in Taiwan (2,031 articles from 2001 to 2012). The authors measured the indegree, outdegree, and self-feeding ratios between the journals. They also measured the degree and betweenness centrality from social network analysis (SNA) to investigate the information flow between Chinese LIS journals in Taiwan. They created an 11 × 11 matrix that expresses the journal-to-journal analysis, and a sociogram of the interrelationships among Chinese LIS journals in Taiwan that summarizes the citation relations between the journals they studied.

Next was a presentation entitled "Tyranny of Distance: Understanding Academic Library Browsing by Refining the Neighbour Effect" by Dana McKay and George Buchanan.
Dana and George explained the importance of browsing books as a part of information seeking, and how this is not well supported for e-books. They used different datasets to examine patterns of co-borrowing, and examined different aspects of the neighbour effect on browsing behavior. Finally, they presented their findings for improving browsing in digital libraries.

The last presentation of this session was a study on Storify entitled "Characteristics of Social Media Stories" by Yasmin AlNoamany. Based upon analyzing 14,568 stories from Storify, AlNoamany et al. specified the structural characteristics of popular (i.e., receiving the most views) human-generated stories to build a template that will be used later in generating (semi-)automatic stories from archived collections. The study investigated many questions regarding the features of the stories, such as the length of the story, the number of elements, the decay rate of the stories, etc. At the end, the study differentiated the popular stories from the unpopular stories based on their main features. Based on the Kruskal-Wallis test, at the p ≤ 0.05 significance level, the popular and the unpopular stories differ in most of the features. Popular stories tend to have more web elements (medians of 28 vs. 21), a longer timespan (5 hours vs. 2 hours), longer editing time intervals, and a lower decay rate.
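As a side note, the Kruskal-Wallis H statistic used for such comparisons can be computed by ranking the pooled samples. The sketch below uses made-up numbers (not the study's data) and omits the tie correction for brevity:

```python
def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction) over k samples."""
    # Rank every value in the pooled, sorted data, remembering its group
    pooled = sorted((v, gi) for gi, g in enumerate(groups) for v in g)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    for rank, (_, gi) in enumerate(pooled, start=1):
        rank_sums[gi] += rank
    # H compares the per-group mean ranks against the overall mean rank
    return 12.0 / (n * (n + 1)) * sum(
        rs * rs / len(g) for rs, g in zip(rank_sums, groups)
    ) - 3 * (n + 1)

popular = [25, 30, 28, 35, 27]    # hypothetical web-element counts
unpopular = [18, 20, 22, 19, 21]
print(round(kruskal_h(popular, unpopular), 3))
```

Since the resulting H exceeds the chi-squared critical value of 3.84 (1 degree of freedom, p = 0.05), these two hypothetical groups would be judged significantly different.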


After the presentation, we had lunch, during which some attendees continued the conversation about my research. It was a useful discussion regarding the future of my research, especially integrating data from archived collections with storytelling services.

The "user studies for and evaluation of digital library systems and applications" session started after the break with "On the Impact of Academic Factors on Scholar Popularity: A Cross-Area Study” presentation by Marcos Gonçalves. Gonçalves et al. presented a cross-area study on the impact of key academic factors on scholar popularity for understanding how different factors impact scholar popularity. They conducted their study based on scholars affiliated to different graduate programs in Brazil and internationally, with more than 1,000 scholars and 880,000 citations, over a 16-year period. They found that scholars in technological programs (e.g., Computer Science, Electrical Engineering, Bioinformatics) tend to be the most "popular" ones in their universities. They also found that international popularity in still much higher than that obtained by Brazilian scholars.

After the first presentation, there was a panel on "User studies and Evaluation" by George Buchanan, Dana McKay, and Giannis Tsakonas, moderated by Seamus Ross, as a replacement for two presentations whose presenters were absent. The panel started with a question from Seamus Ross: are user studies in digital libraries soft? Each of the panelists presented their point of view on the importance of user studies. Buchanan said that user studies matter, and Dana followed up that we want to create something that all people can use. Tsakonas said he had done studies that never developed into systems. Seamus Ross asked the panelists: what makes a person a good user-study person? Dana answered with a joke: "choose someone like me". Dana works as User Experience Manager and Architect at the academic library of Swinburne University of Technology, so she has experience with users' needs and user studies. I followed up in the discussion that we do user studies to learn what people need or to evaluate a system, then asked whether Mechanical Turk (MTurk) experiments are a form of user study. At the end, Seamus Ross concluded the panel with some advice on conducting user studies, such as including a feedback loop in the user-study process.

After the panel, we had a coffee break. I had a great discussion about user evaluation in the context of my research with Brigitte Mathiak, who gave me much useful advice about evaluating the stories that will be created automatically from the web archives. Later on my European trip, I gave a presentation at Magdeburg-Stendal University of Applied Sciences that gives the big picture of my research.

In the last session, I attended Brigitte Mathiak's presentation "Are there any Differences in Data Set Retrieval compared to well-known Literature Retrieval?". In the beginning, Mathiak explained the motivation for their work. Based on two user studies, a lab study with seven participants and telephone interviews with 46 participants, they investigated the requirements that users have for a data set retrieval system in the social sciences and in digital libraries. They found that choosing a data set is more important to researchers than choosing a piece of literature. Moreover, metadata quality and quantity are even more important for data sets.

In the evening, we had the conference dinner, which was held at Concordia Design along with beautiful music. At the dinner, the conference chairs announced two awards: the best paper award to Matthias Geel and Moira Norrie for "Memsy: Keeping Track of Personal Digital Resources across Devices and Services" and the best poster/demo award to Clare Llewellyn, Claire Grover, Beatrice Alex, Jon Oberlander, and Richard Tobin for "Extracting a Topic Specific Dataset from a Twitter Archive".

The third day started early at 9:00 am with sessions about digital humanities, in which I presented my study “Detecting Off-Topic Pages in Web Archives”. The paper investigates different methods for automatically detecting when an archived page goes off-topic. It presents six different methods that mainly depend on comparing an archived copy of a page (a memento) with the first memento of that page. The methods were tested on different archived collections from Archive-It. The suggested best method was a combination of a textual method (cosine similarity using TF-IDF) and a structural method (word count). The best combined method for detecting off-topic pages gave an average precision of 0.92 on 11 different collections. The output of this research is a tool for detecting off-topic pages in the archive. The code can be downloaded and tested from GitHub, and more information can be found in my recent presentation at the Internet Archive.
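To illustrate the combined textual-plus-structural idea, here is a toy sketch; the thresholds are illustrative guesses rather than the paper's tuned values, and it uses raw term frequencies instead of TF-IDF to stay dependency-free:

```python
import math
from collections import Counter

def cosine(tf_a, tf_b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm = math.sqrt(sum(v * v for v in tf_a.values())) \
         * math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / norm if norm else 0.0

def is_off_topic(first_memento, later_memento, cos_threshold=0.15, wc_ratio=0.5):
    """Flag a memento as off-topic when it diverges from the first capture
    both textually (low cosine similarity) and structurally (word-count drop)."""
    words1, words2 = first_memento.split(), later_memento.split()
    low_similarity = cosine(Counter(words1), Counter(words2)) < cos_threshold
    shrunk = len(words2) < wc_ratio * len(words1)
    return low_similarity and shrunk

on_topic = "egypt revolution protest tahrir square news coverage analysis cairo demonstrations"
off_topic = "page not found"
print(is_off_topic(on_topic, off_topic))   # e.g. a hijacked or expired page
print(is_off_topic(on_topic, on_topic))    # the page is still on topic
```

Requiring both signals to fire is what keeps short but still-relevant updates from being flagged, which is the intuition behind combining a textual and a structural method.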


The next paper presented in the digital humanities session was "Supporting Exploration of Historical Perspectives across Collections" by Daan Odijk. In their work, Odijk et al. introduced tools for selecting, linking, and visualizing World War II (WWII) material from the collections of the NIOD, the National Library of the Netherlands, and Wikipedia. They also linked digital collections via implicit events, i.e., if two articles are close in time and similar in content, they are considered related. Furthermore, they provided an exploratory interface to explore the connected collections. They used Manhattan distance for textual similarity over document terms in a TF-IDF weighted vector space and measured temporal similarity using a Gaussian decay function. They found that textual similarity performed better than temporal similarity, and that combining textual and temporal similarity improved the nDCG score.
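The Gaussian decay for temporal similarity can be sketched as follows; the 30-day sigma is an arbitrary illustration rather than the paper's parameter:

```python
import math
from datetime import date

def temporal_similarity(d1, d2, sigma_days=30.0):
    """Gaussian decay: similarity falls smoothly from 1.0 as the date gap grows."""
    gap = (d1 - d2).days
    return math.exp(-(gap * gap) / (2 * sigma_days ** 2))

# Two hypothetical WWII article dates, one month apart
print(temporal_similarity(date(1944, 6, 6), date(1944, 6, 6)))  # identical dates -> 1.0
print(round(temporal_similarity(date(1944, 6, 6), date(1944, 7, 6)), 3))
```

Articles dated within a few weeks of each other score near 1, so the temporal signal can be combined smoothly with the textual one.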

The third paper, entitled "Impact Analysis of OCR Quality on Research Tasks in Digital Archives", was presented by Myriam C. Traub. Traub et al. performed user studies on digital archives to classify research tasks and describe the potential impact of OCR quality on those tasks through interviews with scholars from the digital humanities. They analyzed the questions and categorized the research tasks. Myriam said that few scholars could quantify the impact of OCR errors on their own research tasks. They found that OCR is unlikely to be perfect. They could not offer solutions, but they could suggest strategies that lead toward solutions. At the end, Myriam suggested that the tools should be open source and that there should be evaluation metrics.

At the end, I attended the last keynote speech, "The post-repository era: scholarly practice, information and systems in the digital continuum" by Costis Dallas, which was about digital humanists' practices in the age of curation. Then the conference ended with the closing session, in which TPDL 2016 in Hannover, Germany was announced.

After the conference, Sawood and I took the train from Poznan to Potsdam, Germany to meet Dr. Michael A. Herzog, the Vice Dean for Research and Technology Transfer, Department of Economics and head of Research Group SPiRIT. We were invited to talk about our research in a Digital Preservation lecture at Magdeburg-Stendal University of Applied Sciences in Magdeburg. Sawood wrote a nice blog post about our talks.


1 comment:

  1. Recording of the TPDL 2015 sessions is now available at

    "Archive Profiling Through CDX Summarization" by Alam et al. starts at 0:00:00 in

    "Quantifying Orphaned Annotations in" by Aturban et al. (presented by Sawood Alam) starts at 1:00:00 in

    "Characteristics of Social Media Stories" by AlNoamany et al. starts at 0:47:00 in

    "Detecting Off-Topic Pages in Web Archives" by AlNoamany et al. starts at 0:00:00