|The Old Market Square in Poznan|
Poznań Supercomputing and Networking Center. The organization of the conference was amazing, and the general conference co-chairs, Marcin Werla and Cezary Mazurek, were always there to answer our questions. Furthermore, the people at the conference reception desk helped us the whole time with transportation, especially with communicating with taxi drivers; we do not speak Polish and they do not speak English. Every day of the conference there were coffee breaks with hot and cold drinks and snacks. It is worth mentioning that I had the best coffee I have ever tasted in Poland :-). The main part of the TPDL 2015 conference was streamed live and recorded. The recordings will be processed and made publicly available online on the PlatonTV portal.
|Sawood (on the left) and Jose (on the right)|
David Giaretta, whom I was honored to speak with many times during the conference and who was in the audience for my presentations, talked about "Data – unbound by time or discipline – challenges and new skills needed". At the beginning, Giaretta introduced himself with a summary of his background. His speech was mainly about data preservation and the challenges the field faces, such as link rot, which Giaretta considered a big horror. He mentioned many examples of possible data loss. Giaretta talked about the big data world and presented the 7 (or 8 (or 9)) V's of big data: volume, velocity, variety, volatility, veracity, validity, value, variability, and visualization. I loved these quotes from his speech:
- "Preservation is judged by continuing usability, then come value".
- "Libraries are gateways to knowledge".
- "Metadata is classification".
- "emulate or migrate".
Sawood presented "Web Archive Profiling Through CDX Summarization", a product of an IIPC-funded project. He started with a brief introduction to the Memento aggregator and the need to profile the long tail of archives to improve the efficiency of the aggregator. He described two earlier profiling efforts: the complete knowledge profile by Sanderson and the minimalistic TLD-only profile by AlSum. He described the limitations of the two profiles and explored the middle ground for various other possibilities. He also talked about the newly introduced CDXJ serialization format for profiles and illustrated its usefulness in serializing profiles at scale, with the ability to merge and split arbitrary profiles easily. He evaluated his findings and concluded that his work so far achieved up to 22% routing precision at less than 5% of the cost of the complete knowledge profile, without any false negatives. The code to generate profiles and benchmarks can be found in a GitHub repository.
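The general idea of summarizing an archive's CDX index into a compact profile can be sketched as follows. This is a deliberately simplified illustration, not the paper's actual algorithm: it counts captures per truncated SURT prefix at a chosen depth, and merges profiles by summing counts (the depth parameter and function names are my own):

```python
from collections import Counter

def profile_from_cdx(surt_keys, depth=1):
    """Summarize SURT-form CDX keys (e.g. "com,example)/page") into a
    prefix profile: a count of captures per host prefix, truncated to
    `depth` host segments. A hypothetical simplification for illustration."""
    profile = Counter()
    for key in surt_keys:
        host = key.split(')')[0]          # "com,example)/page" -> "com,example"
        parts = host.split(',')
        prefix = ','.join(parts[:depth])  # truncate to the chosen depth
        profile[prefix] += 1
    return profile

def merge_profiles(a, b):
    # Profiles keyed by prefix merge trivially by summing counts,
    # which is what makes arbitrary merging and splitting cheap.
    return a + b
```

A deeper `depth` gives a more precise (but larger) profile; the TLD-only profile corresponds to `depth=1`, while the complete knowledge profile would keep full URI keys.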
Next, the second and third presentations were swapped; since Sawood was also presenting on behalf of Mohamed Aturban, the session chair gave him enough time to breathe between his two presentations.
The second presentation was "Query Expansion for Survey Question Retrieval in the Social Sciences" by Nadine Dulisch from GESIS and Andreas Oskar Kempf from ZBW. Andreas started with a case study on the usage of survey questions, developed by operational organizations, in the social sciences. He presented the importance of social science survey data for social scientists. Then, Nadine talked about the approaches they applied for query expansion. She showed that statistical expansion outperformed intellectual (manual) expansion. They presented the results of their experiments based on trec_eval. They evaluated thesaurus-based and co-occurrence-based approaches for query expansion to improve retrieval quality in digital libraries and research data archives. They found that queries automatically expanded with extracted co-occurring terms could provide better results than queries manually reformulated by a domain expert.
Sawood presented "Quantifying Orphaned Annotations in Hypothes.is". In this paper, Aturban et al. analyzed 6,281 highlighted-text annotations in the Hypothes.is annotation system. They also used the Memento Aggregator to look for archived versions of the annotated pages. They found that 60% of the highlighted-text annotations are orphaned (i.e., attached to neither the live web nor any memento) or in danger of being orphaned (i.e., attached to the live web but not to any memento). They found that if a memento exists, there is a 90% chance that it recovers the annotated web page. Using public archives, only 3% of all highlighted-text annotations could be reattached that would otherwise be orphaned. For the majority of the annotations, no memento existed in the archives. Their findings highlight the need for archiving pages at the time of annotation.
After the end of the general session, we took a lunch break where we gathered with Jose Antonio Olvera and many of the conference attendees to exchange our research ideas.
After the lunch break, we attended the second session of the day, "Multimedia information management and retrieval and digital curation". The session started with "Practice-oriented Evaluation of Unsupervised Labeling of Audiovisual Content in an Archive Production Environment", presented by Victor de Boer. In their work, Victor et al. evaluated the automatic labeling of audiovisual content to improve efficiency and inter-annotator agreement by generating annotation suggestions automatically from textual resources related to the documents to be archived. They performed pilot studies to evaluate term-suggestion methods through precision and recall, taking terms assigned by archivists as 'ground truth'. They found that the quality of the automatic term suggestions is sufficiently high.
The second presentation was "Measuring Quality in Metadata Repositories" by Dimitris Gavrilis. He started his presentation by mentioning that this is a hard topic, then explained why this research is important. He explained the specific criteria that determine data quality: completeness, validity, consistency, timeliness, appropriateness, and accuracy. In their paper, Dimitris et al. introduced a metadata quality evaluation model (MQEM) that provides a set of metadata quality criteria as well as contextual parameters concerning metadata generation and use. The MQEM allows curators and metadata designers to assess the quality of their metadata and to run queries on existing datasets. They evaluated their framework on two different use cases: application design and content aggregation.
After the session, we took a break, and I fell ill, which prevented me from attending the discussion panel, entitled "Open Access to Research Data: is it a solution or a problem?", and the poster session. I went back to the hotel to rest and prepare for the next day's presentation. I am embedding the tweets about the panel and the poster session.
Discussion Panel: Open Access to Research Data: is it a solution or a problem? #tpdl2015 pic.twitter.com/DYSWYHnLSr— Sawood Alam (@ibnesayeed) September 15, 2015
Joseph Cancellaro, an active composer, musician, and the chair of the Interactive Arts and Media Department of Columbia College Chicago. Cancellaro started with a short bio about himself. The first part of his presentation dealt with audio asset issues and the constant problems for sound designers in non-linear environments: naming conventions (meta tags), search tools, storage (failure), retrieval (failure), DSP (digital signal processing), etc. He also mentioned how they handle these issues in his department; for example, for naming conventions, they add tags to the files. He explained the simple sound asset SR workflow. Preservation, to Cancellaro, is "not losing any more audio data". The second part of his presentation was about storage, retrieval, possible solutions, and content creation. He mentioned some facts about storage and retrieval:
- The decrease in technology costs has reduced the local issues of storage capacity (this is always a concern in academia).
- Bandwidth is still an issue in real-time production.
- Non-linear sound production is a challenge for linear minded composers and sound designers.
Next was a presentation entitled "Tyranny of Distance: Understanding Academic Library Browsing by Refining the Neighbour Effect" by Dana Mckay and George Buchanan.
Dana and George explained the importance of browsing books as a part of information seeking, and how this is not well supported for e-books. They used different datasets to examine patterns of co-borrowing. They examined different aspects of the neighbour effect on browsing behaviour. Finally, they presented their findings on improving browsing in digital libraries.
The last presentation of this session was a study on Storify entitled "Characteristics of Social Media Stories" by Yasmin AlNoamany. Based upon analyzing 14,568 stories from Storify, AlNoamany et al. specified the structural characteristics of popular (i.e., receiving the most views) human-generated stories to build a template that will be used later in generating (semi-)automatic stories from archived collections. The study investigated many questions regarding the features of the stories, such as the length of the story, the number of elements, the decay rate of the stories, etc. At the end, the study differentiated popular stories from unpopular stories based on their main features. Based on the Kruskal-Wallis test at the p ≤ 0.05 significance level, popular and unpopular stories differ in most of the features. Popular stories tend to have more web elements (medians of 28 vs. 21), a longer timespan (5 hours vs. 2 hours), longer editing time intervals, and a lower decay rate.
After the presentation, we had lunch, during which some attendees continued the conversation about my research. It was a useful discussion regarding the future of my research, especially integrating data from archived collections with storytelling services.
Giannis Tsakonas, and moderated by Seamus Ross, which replaced two presentations whose presenters were absent. The panel started with a question from Seamus Ross: are user studies in digital libraries soft? Each of the panelists presented his or her point of view on the importance of user studies. Buchanan said that user studies matter, and Dana followed up that we want to create something that everyone can use. Tsakonas said he had done studies that never developed into systems. Seamus Ross then asked the panelists: what makes a person good at user studies? Dana answered with a joke: "choose someone like me". Dana works as User Experience Manager and Architect at the academic library of Swinburne University of Technology, so she has experience with users' needs and user studies. I followed up that we do user studies to learn what people need or to evaluate a system, then asked whether Mechanical Turk (MTurk) experiments are a form of user study. At the end, Seamus Ross concluded the panel with some advice on conducting user studies, such as including a feedback loop in the user study process.
After the panel, we had a coffee break. I had a great discussion about user evaluation in the context of my research with Brigitte Mathiak, who gave me much useful advice about evaluating the stories that will be created automatically from the web archives. Later on my European trip, I gave a presentation at Magdeburg-Stendal University of Applied Sciences that gives a big-picture view of my research.
Concordia Design, along with beautiful music. At the dinner, the conference chairs announced two awards: the best paper award to Matthias Geel and Moira Norrie for "Memsy: Keeping Track of Personal Digital Resources across Devices and Services", and the best poster/demo award to Clare Llewellyn, Claire Grover, Beatrice Alex, Jon Oberlander, and Richard Tobin for "Extracting a Topic Specific Dataset from a Twitter Archive".
The best paper in #tpdl2015 is "Memsy: Keeping Track of Personal Digital Resources across Devices and Services"— Yasmina Anwar (@yasmina_anwar) September 16, 2015
The third day started early at 9:00 am with sessions on digital humanities, in which I presented my study "Detecting Off-Topic Pages in Web Archives". The paper investigates different methods for automatically detecting when an archived page goes off-topic. It presents six methods that mainly depend on comparing an archived copy of a page (a memento) with the first memento of that page. The methods were tested on different archived collections from Archive-It. The suggested best method was a combination of a textual method (cosine similarity using TF-IDF) and a structural method (word count). The best combined method for detecting off-topic pages gave an average precision of 0.92 on 11 different collections. The output of this research is a tool for detecting off-topic pages in the archive. The code can be downloaded and tested from GitHub, and more information can be found in my recent presentation at the Internet Archive.
The next paper presented in the digital humanities session was "Supporting Exploration of Historical Perspectives across Collections" by Daan Odijk. In their work, Odijk et al. introduced tools for selecting, linking, and visualizing World War II (WWII) material from collections of the NIOD, the National Library of the Netherlands, and Wikipedia. They also linked digital collections via implicit events, i.e., if two articles are close in time and similar in content, they are considered related. Furthermore, they provided an exploratory interface to explore the connected collections. They used Manhattan distance for textual similarity over document terms in a TF-IDF weighted vector space and measured temporal similarity using a Gaussian decay function. They found that textual similarity performed better than temporal similarity, and that combining textual and temporal similarity improved the nDCG score.
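The temporal side of this linking can be sketched with a Gaussian decay over the time gap between two articles, combined with a textual similarity score. A minimal sketch, assuming a decay scale (sigma) and combination weight (alpha) that are illustrative placeholders rather than the paper's parameters:

```python
import math

def temporal_similarity(days_a, days_b, sigma=30.0):
    """Gaussian decay over the gap (in days) between two articles.
    Equal dates give 1.0; similarity falls off smoothly with the gap.
    sigma (30 days here) is an assumed scale parameter."""
    gap = days_a - days_b
    return math.exp(-(gap ** 2) / (2 * sigma ** 2))

def combined_similarity(textual_sim, days_a, days_b, alpha=0.7):
    # Convex combination of textual and temporal similarity;
    # alpha is an illustrative weight, not taken from the paper.
    return alpha * textual_sim + (1 - alpha) * temporal_similarity(days_a, days_b)
```

Two articles published on the same day get full temporal similarity, so a modest textual match can still push the combined score over a linking threshold, while a large time gap suppresses spurious content matches.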
Myriam C. Traub. Traub et al. performed user studies on digital archives to classify research tasks and describe the potential impact of OCR quality on those tasks by interviewing scholars from the digital humanities. They analyzed the questions and categorized the research tasks. Myriam said that few scholars could quantify the impact of OCR errors on their own research tasks. They found that OCR is unlikely ever to be perfect. They could not offer solutions, but they suggested strategies that could lead to solutions. At the end, Myriam suggested that the tools should be open source and that there should be evaluation metrics.
After the conference, Sawood and I took the train from Poznań to Potsdam, Germany, to meet Dr. Michael A. Herzog, Vice Dean for Research and Technology Transfer in the Department of Economics and head of the SPiRIT research group. We were invited to talk about our research in a digital preservation lecture at Magdeburg-Stendal University of Applied Sciences in Magdeburg. Sawood wrote a nice blog post about our talks.