2012-06-17: JCDL 2012 Conference
On Saturday, my colleague, Justin Brunelle, and I took off
on a road trip to attend this year’s JCDL conference in Washington, D.C. We
arrived at the nation’s capital earlier that evening and began preparing our
presentations after settling in at the George Washington University Inn. Both of us were
accepted to present our work at the conference’s Doctoral Consortium. Justin
has already blogged about the consortium and our experience in his brilliant
blog post.
The conference started on the following Monday (June 11th).
The registration went smoothly and we all took our seats at the Betts Theatre
in the Marvin Center which sits in the heart of George Washington University.
Barrie Howard and Karim Boughida (the conference co-chairs) gave the welcoming
remarks and were followed by Leo M. Chalupa, the Vice
President of Research at the university.
Michael Nelson opened the session
and introduced the keynote speaker, Jason Scott. Winning the award for the Most “Optimistic” Title (“All you Cared About Is Gone and All your Friends are Dead: Fun Frolic of Preservation Activism”), he described his work as a technology
historian starting from textfiles.com and computer bulletin board systems. Working
to migrate our perception from just collecting pieces of history to
collecting stories, he interviewed John Sheets from Bell Labs to tell his story.
Jason also discussed corporations closing down services with either no export
facilities or not enough time to migrate.
He further elaborated by giving examples of Yahoo closing Geocities and Yahoo
Pets; Magnolia bookmarking service going down; Tabblo social networking service
bought by HP; and finally MobileMe by Apple. He concluded by speaking about his
team (The Archive Team) and their initiative and efforts in preserving the web and
in crowdsourcing the archiving task by using Archive Team Warrior.
The conference in general was designed to host two tracks simultaneously. Thus, it was fairly hard to choose which sessions to attend. Even
though all the titles seemed interesting, I decided to attend the sessions that
are closer to my field of expertise. Robert Sanderson was the chair of the
first session, and he introduced Cathy Marshall from Microsoft Research. Cathy presented her full paper entitled, “On the Institutional Archiving of Social Media”. In it she described the work she
has done to shed light on the controversy surrounding archiving social media. She
started by examining a video, filmed by the Library of Congress, of interviews with young
students that depicted the problem. After
that, she conducted six surveys of users, inquiring about their attitudes and
practices and asking for their opinions. She utilized Amazon’s Mechanical Turk in this study and asked participants what they thought about the
announcement that the Library of Congress has all the Tweets from the day Twitter
started (in 2006). Following that, she investigated their opinions regarding
releasing this feed to researchers now, to the public, or to the public
after 50 years. Her study concluded that the concept of ownership in social
media is fuzzy and that social norms are evolving with regard to reuse.
The next presentation was by Maristella Agosti from Padua University in Italy. Her talk was entitled, “To Envisage and Design the Transition from a Digital Archive System Developed from Domain Experts to One for Non-Domain Users”.
As the title suggests, she stressed the need to shift our focus from
specialized users to the general public in digital cultural heritage collections. She
also suggested that these collection models should be open to researchers from
outside the field and that this culture should be made accessible to members of the general
public. She described the different
levels of the model and started from two different open collections from Dublin
and Padua in a pilot study. Following this presentation, our colleague, Kalpesh Padia, presented his paper entitled, “Visualizing Digital Collections at Archive-It”. In this presentation, he illustrated his work in producing a
visualization tool to describe collections on Archive-It (the subscription
service from the Internet Archive).
The next presentation was by Jillian Wallis and was entitled, “Data, Data Use & Inquiry: A New Point of View on Data Curation”. She presented
a case study of two branches of science (astronomy and environmental science)
and interviewed researchers from both fields about the data they collected and
used in their analysis. She also inquired about its size and where it was
collected from, in order to give a better idea of how this data will be reused
across different disciplines. She defined foreground and background data
and explained how background data for one researcher could be foreground for
another. She asked a question: Are we
undervaluing data by discarding and not citing it? Finally, she concluded by
stating that the use of data, and thus its citation, is highly dependent on research
type. Next, Dharitri Misra presented a paper entitled, “Digital Preservation and Knowledge Discovery Based on Documents from an International Health Science Program”. In this paper, she showed that biomedical documents are important for
the obvious data they contain. They are also important for detecting hidden trends
in masses of documents, and these trends can be very useful. Her work described
a case study of U.S.-Japan CMSP where they have more than 50 years of medical
data.
Following a break, another pair of sessions started. I chose
a panel session that attracted my attention which was entitled, “Big Data Is
Already Here, and It’s Not Always What We Think”. This session was conducted by
representatives from the Library of Congress. Leslie Johnston discussed the
migration of perception from the term “records” to “data” and how library
collections could be mined like data. She defined “big data” as the amount of
data that might be relatively hard to manage and she demonstrated how this
definition has become very fluid across the years. She gave three case studies,
including the Historic Newspaper collection which has compiled 5 million pages
from 25 states and has digitized them from microfilm. This collection gets 5
million views per day. She continued to the next case which was the most
anticipated and controversial collection, the Twitter dataset. Leslie said they
hold the complete Twitter dataset from 2006 to 2010, which spans 21 billion
tweets. They are stored in JSON format and reach 20TB in size. This collection
has been stored with a large amount of metadata that would be invaluable to
researchers in the field. She gladly announced that this data will be released
to the researchers in the field later this year. Following Leslie, James Snyder (the director of the Motion Picture and Broadcasting Division) discussed his work and his division’s collection of more than 6.75 million unique items in various file formats. He explained that his division
is responsible for preserving these collections for the
125-year duration of the copyright period. Jane Mandelbaum followed James and discussed
accessibility and how users eventually will be able to find the information
they need from the portals of the Library of Congress. She emphasized that we
need a better understanding of the data and the metadata in order to maximize
the benefit. Trevor Owens (also from the Library of Congress) said he is currently working
on the NDIIPP project and discussed visualization and exploration of data.
Finally, the panel took some questions regarding their projects.
After the final break of the day, I attended a session
directed by Sally Jo Cunningham. The first paper was presented by Timothy Cole
and was entitled, “Descriptive Metadata, Iconclass, and Digitized Emblem Literature”. He began by explaining what emblems are and
how they differ in size, form, and shape. The objectives of his paper were to
expand the corpus of digitized emblems and build a cross-repository portal for
emblem studies. He emphasized multi-granular types of discovery and access,
aiming for more international interoperability. Finally, he presented Emblematica Online, which is a web-based resource for searching and browsing
emblems. Next, Jin Ha Lee and Xiao Hu presented their paper entitled,
“Generating Ground Truth for Music Mood Classification Using Mechanical Turk”.
This is a really interesting study, which began by analyzing and differentiating
between mood, affect, and emotion. Jin Ha explained the MIREX project’s work in
classifying music mood since 2007 and its differences. Since mood
classification depends on users’ judgments, it is
difficult and hard to replicate, hence the need for ground
truth data. Xiao explained that they utilized 5 clusters and performed their
experiments on Amazon’s Mechanical Turk by selecting good workers. They picked
1,250 songs and collected 2 judgments for each. This experiment ran for 19 days and
cost approximately $60.00. They tested the performance against 9 algorithms and
displayed the results on Russell’s model of mood affect.
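To put those crowdsourcing numbers in perspective, here is a quick back-of-the-envelope calculation. The per-judgment cost and daily throughput are my own arithmetic from the figures reported above, not numbers given in the talk.

```python
# Figures as reported in the presentation summary above.
songs = 1250
judgments_per_song = 2
total_cost_usd = 60.00  # approximate, as stated in the talk
days = 19

# Derived values (my own arithmetic, not from the paper).
total_judgments = songs * judgments_per_song
cost_per_judgment = total_cost_usd / total_judgments
judgments_per_day = total_judgments / days

print(f"Total judgments: {total_judgments}")
print(f"Approximate cost per judgment: ${cost_per_judgment:.3f}")
print(f"Average judgments per day: {judgments_per_day:.1f}")
```

In other words, each judgment cost only a few cents, which helps explain why Mechanical Turk is attractive for building ground truth data.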
The next presentation was by Yinlin Chen and was entitled,
“Categorization of Computing Education Resources into ACM Computing Classification System”. In his presentation, Paul talked about the Ensemble Project and his approach which used existing ACM DL metadata to build
classifiers for harvested resources. Tadashi Nomoto followed Paul and presented
his paper entitled, “Re-ranking Bibliographic Records for Personalized Library Search”. He presented an approach to accessing bibliographic records in a way
that responds to the user’s preferences.
Shortly after the break, the minute madness started! The
minute madness is a 60-second teaser in which each author can “advertise”
their poster or demo for the next session.
The poster session was held side by side with the demo session, and it
was a great opportunity to speak with the authors and also to support our
colleague, Mat Kelly, in his demo/poster.
Day two followed the same pattern and after breakfast all
the attendees gathered at the Betts Theatre. Herbert Van de Sompel welcomed the
attendees and introduced the keynote speaker, Carole Goble. Carole is a
professor in the School of Computer Science at the University of Manchester, UK
and the Director of the myGrid Project. With a speech entitled, “The Reality of Reproducibility for in Silico”, she started by describing the reproducibility
concept in the scientific field. She argued that it is a form of virtual
witnessing of the scientific method and that its documentation incorporated the
materials used and the method of proof.
Citing Nature magazine, she noted that 47 out of 53 landmark
publications were irreproducible (an astonishing number). She also described the role of libraries in software sustainability, including providing special codes,
data collections, service-based science, and cloud-hosted services. Claiming
that dependency is the root of decay, she explained the difference between
replication/repetition and reproduction. She also described the areas where
librarianship is crucial in preserving the scientific workflow, restoring it,
and finally reconstructing it.
Shortly after the break, another pair of sessions started. I
attended the first two presentations from Session 7 and later switched to attend the panel
that ran concurrently with Session 7. The first presentation was by
Jurgen Bernard and was entitled, “Content-based Layouts for Exploratory Metadata Search in Scientific Research Data”. In this paper, Jurgen attempted to answer
the question: Can we build relations of metadata based on the content? He
conducted several experiments analyzing and measuring similarities between the
results of different scientists experimenting with the same phenomena. He
presented a way to visualize similarities of results based on the metadata in
the “Metadata Entity Glyph”. Paul Bogen led the next presentation entitled, “A Quantitative Evaluation of Techniques for Detection of Abnormal Change Events in Blogs”.
In this paper, Paul tried to analyze the patterns of change in blogs by
conducting a survey on popular blogs from a large collection of social
bookmarks to detect abnormal changes. Using his method, he argued that the
results show statistically significant improvement over traditional threshold
techniques for the same collection. Following this presentation, I hurried to
join the round table discussion, “Digging into Data”, in hopes of learning more about this international grant competition and hearing the questions posed to the
representatives of the eight research funding agencies from the US, UK, Canada, and the Netherlands.
Shortly after the lunch break, two simultaneous sessions
began. I attended Session 9 which was hosted by George Buchanan. He introduced Anderson Ferreira and his work entitled, “Active Association Sampling for Author Name Disambiguation”. In this presentation, Anderson described the problem of
author name ambiguity. He suggested a possible solution based on
supervised machine learning techniques aiming to reduce the set of
examples needed to produce the training data. Following Anderson, Madian Khabsa
gave a very interesting presentation entitled, “AckSeer: A Repository and Search Engine for Automatically Extracted Acknowledgements from Digital Libraries”. In his
research, Madian explained the architecture of the search engine and the
repository he developed to mine acknowledgements. Following Madian, Sujatha Das Gollapalli gave a presentation entitled, “Similar Researcher Search in Academic Environments”, which addressed the researcher recommendation and similarity problem. Nuno Freire followed Sujatha and presented his paper entitled, “An Analysis of the Named Entity Recognition Problem in Digital Libraries Metadata”. He discussed the task of
information extraction dealing with the references to entities made by names that occur in the texts.
After the break, the last group of sessions was started by Jannik Strötgen, who presented his award-nominated paper, “Event-centric Search and Exploration in Document Collections”. Jannik began his
presentation by arguing that current search engines are great at extracting a
set of documents, but fairly inadequate at extracting event-related documents. Temporal
and semantic events have always been treated as words, while time and space have been well
defined and could be compared to enhance the concept of “event”. He followed by
explaining that queries are multidimensional and that by adding the geo and temporal dimensions
to the textual dimension the results (if ranked by event) will be enhanced. After
the presentation, I switched sessions to attend the remaining portion of Session 12, which was directed
by WS-DL alumnus Martin Klein. Feng Liang then presented his paper entitled, “Exploiting Real-time Information Retrieval in Microblogosphere”. In this paper, Feng
discussed the problem of the semantic aspect (what is the most relevant?) versus the temporal
aspect (what is the most recent?). For example, he argued that in Twitter the problem is very challenging due to the short length of the
document, the abundance of shortened URLs, and the tradeoff between recency and
relevance. Utilizing the TREC2011 Tweets dataset and the ICL-divergence model, he
analyzed the query model along with the document model, which provided a
significant improvement in the re-ranking task. Prat Tanapaisankit spoke next with a presentation entitled, “Personalized Query Expansion in QIC System”, in which he tackled the query-in-context problem. Robert Mercer followed Prat with a presentation entitled, “Investigating Keyphrase Indexing with Text Denoising”, in which he discussed how indexing after removing the noisy parts of texts performed as well as or better than the benchmark indexer.
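To illustrate the recency versus relevance tradeoff discussed in the microblog retrieval talk, here is a minimal sketch of one generic way to blend the two signals. This is not the method from the paper: the `Tweet` class, the `relevance` scores, the 24-hour half-life, and the `alpha` weight are all hypothetical choices of mine, and the blend is a plain linear interpolation.

```python
from dataclasses import dataclass


@dataclass
class Tweet:
    text: str
    relevance: float  # hypothetical relevance score in [0, 1] from some retrieval model
    age_hours: float  # hours between posting time and query time


def recency(age_hours: float, half_life_hours: float = 24.0) -> float:
    """Exponential decay: 1.0 for a brand-new tweet, 0.5 after one half-life."""
    return 0.5 ** (age_hours / half_life_hours)


def combined_score(tweet: Tweet, alpha: float = 0.7) -> float:
    """Linear interpolation: alpha weights relevance, (1 - alpha) weights recency."""
    return alpha * tweet.relevance + (1 - alpha) * recency(tweet.age_hours)


def rerank(tweets, alpha: float = 0.7):
    """Return the tweets sorted by combined score, best first."""
    return sorted(tweets, key=lambda t: combined_score(t, alpha), reverse=True)


tweets = [
    Tweet("old but on-topic", relevance=0.9, age_hours=72.0),
    Tweet("fresh but vague", relevance=0.5, age_hours=1.0),
]

# Favoring relevance puts the on-topic tweet first;
# favoring recency puts the fresh tweet first.
print([t.text for t in rerank(tweets, alpha=0.7)])
print([t.text for t in rerank(tweets, alpha=0.3)])
```

Changing `alpha` flips the ordering of the two example tweets, which is the tension the talk described: the most relevant tweet is not necessarily the most recent one.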
After this session ended, the attendees headed to the awards banquet which was held at Sequoia Restaurant. Michael Nelson, along with Karim Boughida, was awarded the “Spark
Performance” Award. After that Michael Nelson announced
the winners of the awards for Best Full Paper, Short Paper, and Poster. Then we all spent a lovely evening by the river talking and socializing with the other
researchers and attendees.
The next morning marked the last day of the conference. It started
with the last pair of sessions, which were conducted by Edie Rasmussen and Pertti Vakkari. Sally Jo Cunningham presented the first paper entitled, “Book Selection Behavior in the Physical Library: Implications for the EBook Selection”. Her experiments aimed to gain insights into people’s book selection
strategies which may inform the design of software support for ebook selection.
Jesse Gozali followed Sally and presented a paper entitled, “How Do People Organize Their Photos in Each Event and How Does it Affect Storytelling, Searching, and Interpretation Tasks?” In this paper, he analyzed four different
methods of photo organization and conducted a study to inform this
analysis. Finally, George Buchanan presented the paper of his student, Jennifer Pearson, entitled, “Co-reading: Investigating Collaborative Group Reading”. In the group-reading setting, Jennifer investigated the differences between users’ behavior with paper and their behavior with the new
interface they developed for the iPad.
The closing keynote speech was presented by George Dyson and was entitled, “The Sensible Moment: 1680-2012”. He took us on an amazing historic
journey of technology. He began in the 1600s, when Francis Bacon encoded the alphabet
in five placements and paved the way for binary encoding, and when Thomas Hobbes declared that computing machines should have adding and
remembering capabilities. He mentioned Alfred Smee’s work in differentiating reality,
thought, and consciousness, and H. G. Wells’s book, “World Brain” (1938), in which he imagined the Internet. Finally, he spoke about the amazing efforts of Alan Turing and John von Neumann in creating the computing machines we have
today.
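For readers curious about the five-place encoding attributed to Francis Bacon in the keynote, here is a small sketch of the idea: every letter becomes a group of five symbols, each of which is one of two values (written here as `a` and `b`). Note the assumptions: this is a simplified modern 26-letter variant (Bacon's original alphabet had 24 letters, with I/J and U/V sharing codes), and the function names are my own.

```python
import string


def bacon_encode(text: str) -> str:
    """Encode letters as five-symbol groups of 'a'/'b' (modern 26-letter variant)."""
    groups = []
    for ch in text.upper():
        if ch in string.ascii_uppercase:
            index = ord(ch) - ord("A")   # A=0, B=1, ..., Z=25
            bits = format(index, "05b")  # five binary digits, zero-padded
            groups.append(bits.replace("0", "a").replace("1", "b"))
    return " ".join(groups)


def bacon_decode(code: str) -> str:
    """Decode space-separated five-symbol groups back into uppercase letters."""
    letters = []
    for group in code.split():
        bits = group.replace("a", "0").replace("b", "1")
        letters.append(chr(int(bits, 2) + ord("A")))
    return "".join(letters)


encoded = bacon_encode("JCDL")
print(encoded)
print(bacon_decode(encoded))
```

Five two-valued places give 32 possible combinations, which is enough to cover the alphabet. That is the same idea behind fixed-width binary character codes.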
Finally, Barrie Howard (the conference co-chair) thanked the
audience and presented the JCDL 2013 co-chairs, who announced that the venue will be Indianapolis, Indiana. I am eagerly anticipating the 2013 conference and I hope to be able to submit some of my work.
As a final treat, the conference organizers arranged a
private, guided tour of the Library of Congress. It was absolutely amazing! Now I
am on the train headed back to Norfolk to resume my research. I am completely
content with my experience at the conference, the people I met, and the
enlightening ideas I acquired.
Other Perspectives on JCDL 2012:
- Robin Camille Davis's notes on digital preservation and Jason Scott's keynote speech.
- THATCamp's article about Bridging the Gap between the CS DL community and the LIS DL community.
- The Library of Congress's notes from Behind the scenes of the JCDL2012 conference.
- The report about Digging into Data by CNI and another by JISC.
- Also you can follow news, insights, reports, and more on the Twitter Feed.
Hany SalahEldeen
Special thanks to Erin E. Ralston for editing and refining this post.