Monday, June 18, 2012

2012-06-17: JCDL 2012 Conference

On Saturday, my colleague, Justin Brunelle, and I took off on a road trip to attend this year’s JCDL conference in Washington, D.C. We arrived at the nation’s capital that evening and began preparing our presentations after settling in at the George Washington University Inn. Both of us had been accepted to present our work at the conference’s Doctoral Consortium. Justin has already blogged about the consortium and our experience in his brilliant blog post.

The conference started on the following Monday (June 11th). Registration went smoothly and we all took our seats at the Betts Theatre in the Marvin Center, which sits in the heart of George Washington University. Barrie Howard and Karim Boughida (the conference co-chairs) gave the welcoming remarks and were followed by Leo M. Chalupa, the Vice President of Research at the university.

Michael Nelson opened the session and introduced the keynote speaker, Jason Scott. Winning the award for the Most “Optimistic” Title (“All You Cared About Is Gone and All Your Friends Are Dead: Fun Frolic of Preservation Activism”), he described his work as a technology historian, starting with computer bulletin board systems. Working to shift our perception from just collecting pieces of history to collecting stories, he interviewed John Sheets from Bell Labs to tell his story. Jason also discussed corporations closing down services with either no export facilities or not enough time to migrate. He elaborated with examples: Yahoo closing GeoCities and Yahoo Pets; the Ma.gnolia bookmarking service going down; the Tabblo social networking service being bought by HP; and finally MobileMe by Apple. He concluded by speaking about his team (Archive Team) and their initiative and efforts in preserving the web and in crowdsourcing the archiving task using the Archive Team Warrior.
The conference was designed to host two tracks simultaneously, so it was fairly hard to choose which sessions to attend. Even though all the titles seemed interesting, I decided to attend the sessions closest to my field of expertise. Robert Sanderson chaired the first session and introduced Cathy Marshall from Microsoft Research. Cathy presented her full paper entitled, “On the Institutional Archiving of Social Media”. In it she described the work she has done to shed light on the controversy surrounding archiving social media. She started by examining a video, filmed by the Library of Congress, in which young students were interviewed about the problem. She then conducted six surveys of users, inquiring about their attitudes and practices and asking for their opinions. She utilized Amazon’s Mechanical Turk in this study and asked participants what they thought about the announcement that the Library of Congress has all the tweets since the day Twitter started (in 2006). Following that, she investigated their opinions on releasing this feed to researchers now, to the public now, or to the public after 50 years. Her study concluded that the concept of ownership in social media is fuzzy and that social norms are evolving with regard to reuse.
The next presentation was by Maristella Agosti from the University of Padua in Italy. The talk was entitled, “To Envisage and Design the Transition from a Digital Archive System Developed for Domain Experts to One for Non-Domain Users”. As the title suggests, she stressed the need to shift the focus of digital cultural heritage collections from specialized users to the general public. She also suggested that these collection models should be open to researchers from outside the field. She described the different levels of the model, starting from two different open collections, from Dublin and Padua, in a pilot study. Following this presentation, our colleague, Kalpesh Padia, presented his paper entitled, “Visualizing Digital Collections at Archive-It”. In this presentation, he illustrated his work in producing a visualization tool to describe collections on Archive-It (the subscription service from the Internet Archive).

The next presentation was by Jillian Wallis and was entitled, “Data, Data Use & Inquiry: A New Point of View on Data Curation”. She took a case study of two branches of science (astronomy and environmental science) and interviewed researchers from both fields about the data they collected and used in their analyses. She also inquired about its size and where it was collected from in order to give a better idea of how this data will be reused across different disciplines. She defined foreground and background data and explained how background data for one researcher could be foreground data for another. She asked: are we undervaluing data by discarding and not citing it? Finally, she concluded that the use of data, and thus its citation, is highly dependent on research type. Next, Dharitri Misra presented a paper entitled, “Digital Preservation and Knowledge Discovery Based on Documents from an International Health Science Program”. In this paper, she showed that biomedical documents are important not only for the explicit data they contain, but also for the hidden trends that can be detected across masses of documents, and how useful these trends are. Her work described a case study of the U.S.-Japan CMSP, which holds more than 50 years of medical data.

Following a break, another pair of sessions started. I chose a panel session that attracted my attention, entitled “Big Data Is Already Here, and It’s Not Always What We Think”. This session was conducted by representatives from the Library of Congress. Leslie Johnston discussed the shift in perception from the term “records” to “data” and how library collections could be mined like data. She defined “big data” as an amount of data that is relatively hard to manage, and demonstrated how this definition has become very fluid over the years. She gave three case studies, including the Historic Newspaper collection, which has compiled 5 million pages from 25 states, digitized from microfilm. This collection gets 5 million views per day. She continued to the next case, the most anticipated and controversial collection: the Twitter dataset. Leslie said the Library possesses the entire Twitter dataset from 2006 to 2010, spanning 21 billion tweets. The tweets are stored in JSON format and reach 20TB in size. This collection has been stored with a large amount of metadata that would be invaluable to researchers in the field. She gladly announced that this data will be released to researchers in the field later this year. Following Leslie, James Snyder (the director of the Motion Picture and Broadcasting Division) discussed his work and his division’s collection of more than 6.75 million unique items in various file formats. He explained that his division is responsible for preserving these collections for the 125-year copyright period. Jane Mandelbaum followed James and discussed accessibility and how users will eventually be able to find the information they need through the Library of Congress portals. She emphasized that we need a better understanding of the data and the metadata in order to maximize the benefit.
Trevor Owens (also from the Library of Congress) said he is currently working on the NDIIPP project and discussed visualization and exploration of data. Finally, the panel accepted some questions in regards to their projects.

After the final break of the day, I attended a session chaired by Sally Jo Cunningham. The first paper was presented by Timothy Cole and was entitled, “Descriptive Metadata, Iconclass, and Digitized Emblem Literature”. He began by explaining what emblems are and how they differ in size, form, and shape. The objectives of his paper were to expand the corpus of digitized emblems and build a cross-repository portal for emblem studies. He emphasized multi-granular types of discovery and access, aiming for more international interoperability. Finally, he presented Emblematica Online, a web-based resource for searching and browsing emblems. Next, Jin Ha Lee and Xiao Hu presented their paper entitled, “Generating Ground Truth for Music Mood Classification Using Mechanical Turk”. This really interesting study began by analyzing and differentiating between mood, affect, and emotion. Jin Ha explained the MIREX project's music mood classification task, running since 2007, and its differences. Since mood classification is user-based, it is more difficult and harder to replicate, hence the need for ground truth data. Xiao explained that they utilized 5 mood clusters and performed their experiments on Amazon’s Mechanical Turk, selecting good workers. They picked 1,250 songs and registered 2 judgments each. This experiment ran for 19 days and cost approximately $60.00. They tested the performance against 9 algorithms and displayed the results on Russell’s model of mood affect.

The first paper of the next session was presented by Yinlin Chen and was entitled, “Categorization of Computing Education Resources into ACM Computing Classification System”. In his presentation, Yinlin talked about the Ensemble Project and his approach, which used existing ACM DL metadata to build classifiers for harvested resources. Tadashi Nomoto followed Yinlin and presented his paper entitled, “Re-ranking Bibliographic Records for Personalized Library Search”. He presented an approach for accessing bibliographic records in a way that responds to the user’s preferences.

Shortly after the break, the minute madness started! The minute madness is a 60-second teaser in which each author can “advertise” their poster or demo for the following session. The poster session was held side by side with the demo session, and it was a great opportunity to speak with the authors and also to support our colleague, Mat Kelly, in his demo/poster.
Day two followed the same pattern, and after breakfast all the attendees gathered at the Betts Theatre. Herbert Van de Sompel welcomed the attendees and introduced the keynote speaker, Carole Goble. Carole is a professor in the School of Computer Science at the University of Manchester, UK and the Director of the myGrid Project. In a speech entitled, “The Reality of Reproducibility for in Silico”, she started by describing the concept of reproducibility in science. She argued that it is a form of virtual witnessing of the scientific method and that its documentation incorporates the materials used and the method of proof. Citing Nature, she noted that 47 out of 53 landmark publications were irreproducible (an astonishing number). She also outlined the role of libraries in software sustainability, including providing special codes, data collections, service-based science, and cloud-hosted services. Claiming that dependency is the root of decay, she explained the difference between replication/repetition and reproduction, and described the areas where librarianship is crucial in preserving the scientific workflow, restoring it, and finally reconstructing it.

Shortly after the break, another pair of sessions started. I attended the first two presentations of Session 7 and later switched to the panel running concurrently with it. The first presentation was by Jurgen Bernard and was entitled, “Content-based Layouts for Exploratory Metadata Search in Scientific Research Data”. In this paper, Jurgen attempted to answer the question: can we build relations between metadata based on content? He conducted several experiments analyzing and measuring similarities between the results of different scientists experimenting with the same phenomena, and presented a way to visualize these similarities, based on the metadata, in the “Metadata Entity Glyph”. Paul Bogen led the next presentation, entitled, “A Quantitative Evaluation of Techniques for Detection of Abnormal Change Events in Blogs”. In this paper, Paul analyzed the patterns of change in blogs by surveying popular blogs from a large collection of social bookmarks to detect abnormal changes. He argued that his method shows a statistically significant improvement over traditional threshold techniques on the same collection. Following this presentation, I hurried to join the round table discussion of “Digging into Data” in hopes of learning more about this international grant competition and hearing questions put to the representatives of the eight research funding agencies from the US, UK, Canada, and the Netherlands.
Shortly after the lunch break, two simultaneous sessions began. I attended Session 9, which was hosted by George Buchanan. He introduced Anderson Ferreira and his work entitled, “Active Association Sampling for Author Name Disambiguation”. In this presentation, Anderson described the problem of author name ambiguity. He proposed a solution based on supervised machine learning techniques, aiming to reduce the set of examples needed to produce the training data. Following Anderson, Madian Khabsa gave a very interesting presentation entitled, “AckSeer: A Repository and Search Engine for Automatically Extracted Acknowledgements from Digital Libraries”. In his research, Madian explained the architecture of the search engine and the repository he developed to mine acknowledgements. Following Madian, Sujatha Das Gollapalli gave a presentation entitled, “Similar Researcher Search in Academic Environments”, which addressed the researcher recommendation and similarity problem. Nuno Freire followed Sujatha and presented his paper entitled, “An Analysis of the Named Entity Recognition Problem in Digital Libraries Metadata”. He discussed the information extraction task of dealing with references to entities made by names that occur in texts.

After the break, the last group of sessions was started by Jannik Strötgen, who presented his award-nominated paper, “Event-centric Search and Exploration in Document Collections”. Jannik began by arguing that current search engines are great at retrieving a set of documents, but fairly inadequate at retrieving event-related documents. Temporal and spatial expressions have traditionally been treated as plain words, even though time and space are well defined and can be compared to build a richer concept of an “event”. He explained that queries are multidimensional, and that adding the geographic and temporal dimensions to the textual dimension enhances the results when ranked by event. After the presentation, I switched sessions to attend the remaining portion of Session 12, which was chaired by WS-DL alumnus Martin Klein. Feng Liang presented his paper entitled, “Exploiting Real-time Information Retrieval in Microblogosphere”. In this paper, Feng discussed the tension between the semantic aspect (what is most relevant?) and the temporal aspect (what is most recent?). In Twitter, for example, the problem is very challenging due to the short length of the documents, the abundance of shortened URLs, and the tradeoff between recency and relevance. Utilizing the TREC 2011 tweets dataset and a KL-divergence model, he analyzed the query model along with the document model, which provided a significant improvement in the re-ranking task. Prat Tanapaisankit spoke next with a presentation entitled, “Personalized Query Expansion in QIC System”, in which he tackled the query-in-context problem. Robert Mercer followed Prat with a presentation entitled, “Investigating Keyphrase Indexing with Text Denoising”, in which he discussed how removing the noisy parts of texts performed as well as or better than the benchmark indexer.
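KL-divergence ranking of this kind pairs a language model of the query with a language model of each document. The sketch below is only an illustration of the general technique, not the paper's implementation: it scores tweets by the KL divergence between a unigram query model and each tweet's model, with a small epsilon standing in for proper smoothing.

```python
import math

def kl_divergence(p, q, vocab, epsilon=1e-9):
    """D(p || q) over a shared vocabulary, with epsilon for unseen words."""
    return sum(p.get(w, epsilon) * math.log(p.get(w, epsilon) / q.get(w, epsilon))
               for w in vocab)

def language_model(text):
    """Unigram maximum-likelihood model of a text."""
    tokens = text.lower().split()
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return {w: c / len(tokens) for w, c in counts.items()}

def rerank(query, tweets):
    """Order tweets by ascending divergence from the query model (best first)."""
    q_model = language_model(query)
    scored = []
    for tweet in tweets:
        d_model = language_model(tweet)
        vocab = set(q_model) | set(d_model)
        scored.append((kl_divergence(q_model, d_model, vocab), tweet))
    return [t for _, t in sorted(scored)]

tweets = [
    "new phone released today",
    "bbc world news report on elections",
    "elections results breaking news",
]
print(rerank("elections news", tweets)[0])  # the tweet closest to the query model
```

In a real system the document model would be smoothed against a background collection model rather than a constant epsilon; this toy version only shows why a tweet sharing more of the query's vocabulary diverges less and ranks higher.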

After this session ended, the attendees headed to the awards banquet which was held at Sequoia Restaurant. Michael Nelson, along with Karim Boughida, was awarded the “Spark Performance” Award. After that Michael Nelson announced the winners of the awards for Best Full Paper, Short Paper, and Poster. Then we all spent a lovely evening by the river talking and socializing with the other researchers and attendees.

The next morning marked the last day of the conference. It started with the last pair of sessions, conducted by Edie Rasmussen and Pertti Vakkari. Sally Jo Cunningham presented the first paper entitled, “Book Selection Behavior in the Physical Library: Implications for the EBook Selection”. Her experiments aimed to gain insights into people’s book selection strategies, which may inform the design of software support for ebook selection. Jesse Gozali followed Sally and presented a paper entitled, “How Do People Organize Their Photos in Each Event and How Does It Affect Storytelling, Searching, and Interpretation Tasks?” In this paper, he analyzed four different methods of photo organization and conducted a study to inform this analysis. Finally, George Buchanan presented his student Jennifer Pearson's paper entitled, “Co-reading: Investigating Collaborative Group Reading”. In the group-reading setting, Jennifer investigated the differences between user behavior with paper and with the new interface they developed on the iPad.

The closing keynote was presented by George Dyson and was entitled, “The Sensible Moment: 1680-2012”. He took us on an amazing historical journey through technology. He began in the 1600s, when Francis Bacon encoded the alphabet in five two-valued placements and paved the way for binary encoding, and when Thomas Hobbes declared that computing machines should have adding and remembering capabilities. He mentioned Alfred Smee's work in differentiating reality, thought, and consciousness; H. G. Wells’s "World Brain" (1938), in which he imagined the Internet; and finally the amazing efforts of Alan Turing and John von Neumann in creating the computing machines we have today.
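Bacon's biliteral cipher, the encoding Dyson credited as a precursor of binary, maps each letter to five two-valued symbols (traditionally written "a" and "b"). A minimal sketch of the modern 26-letter variant:

```python
def bacon_encode(ch):
    """Encode one letter as Bacon's five-place two-symbol cipher (26-letter variant)."""
    idx = ord(ch.lower()) - ord('a')          # A=0, B=1, ... Z=25
    bits = format(idx, '05b')                  # five binary placements
    return bits.replace('0', 'a').replace('1', 'b')

print(bacon_encode('a'))  # → aaaaa
print(bacon_encode('c'))  # → aaaba
```

Five two-valued positions give 2^5 = 32 codes, enough for the alphabet, which is exactly the insight that makes it a forerunner of binary encoding.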

Finally, Barrie Howard (the conference co-chair) thanked the audience and introduced the JCDL 2013 co-chairs, who announced that the venue will be Indianapolis, Indiana. I am highly anticipating the 2013 conference and hope to be able to submit some of my work.

As a final treat, the conference organizers arranged a private, guided tour of the Library of Congress. It was purely amazing! Now I am on the train headed back to Norfolk to resume my research. I am completely content with my experience at the conference, the people I met, and the enlightening ideas I acquired.

Other Perspectives on JCDL 2012:

Hany SalahEldeen

Special thanks to Erin E. Ralston for editing and refining this post.

Tuesday, June 12, 2012

2012-06-12: JCDL 2012 Doctoral Consortium

The ODU WS-DL research group kicked off JCDL 2012 at The George Washington University by presenting the first two Doctoral Consortium papers on June 10th, 2012. The Doctoral Consortium is a workshop for PhD students that are in the early stages of defining their research. It is a venue for presenting a potential path through the PhD, as well as a way to receive feedback from peers and other researchers. Past WS-DL students have benefited from the workshop, including Joan Smith, Frank McCown, Martin Klein, Chuck Cartledge, and Ahmed Alsum. Hany SalahEldeen and I (Justin F. Brunelle) were honored and excited to be the next class of WS-DL students to participate.

The first session was the Data Preservation and Curation section, chaired by Maristella Agosti. I presented the first paper entitled "Filling in the Blanks: Capturing Dynamically Generated Content". My work will study capturing, sharing, and archiving Web 2.0 resources that traditional crawlers cannot archive. This will include studying the prevalence and characteristics of unarchivable resources, as well as the capture of client-side events and representations.

Hany presented the second paper, rounding out the one-two punch of Old Dominion University attendees. His paper entitled "User Intention Modeling in Temporal Navigation and Preservation of Shared Resources in Social Media" proposes work that will study how to ensure that resources shared over social media do not change between the time they are shared and the time they are viewed. This will produce a framework that will model user intention upon sharing resources.

Plato L. Smith II finished the first session with proposed research to define best practices, definitions, concepts, and terms within the digital preservation discipline. This work would provide methods of evaluating and measuring impact and risk of solutions, as well as improve relationships between stakeholders.

Session II, chaired by Kazunari Sugiyama, focused on Document Mining and Processing. James Creel began this session by presenting his work in metadata extraction and disambiguation for document labeling. His work utilizes feature extraction, Latent Semantic Analysis, disambiguation, and supervised learning of the system before depositing an item into a repository.

Jinsong Zhang was the second and final presenter. His work identifies "hot topics" in an academic author's field and points out the new and influential papers or publications in that area (ranked by an adapted PageRank algorithm). This work benefits student researchers by aiding their search for prior works.
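The post doesn't describe how the PageRank adaptation works, but the underlying algorithm on a citation graph can be sketched with plain power iteration (the graph and node names below are made up for illustration):

```python
def pagerank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over an adjacency dict {node: [outlinks]}."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - damping) / n for u in nodes}
        for u in nodes:
            out = links[u]
            if not out:
                # dangling node (a paper citing nothing): spread its rank evenly
                for v in nodes:
                    new[v] += damping * rank[u] / n
            else:
                for v in out:
                    new[v] += damping * rank[u] / len(out)
        rank = new
    return rank

# a paper cited by the other two should accumulate the highest rank
ranks = pagerank({"p1": ["p3"], "p2": ["p3"], "p3": []})
print(max(ranks, key=ranks.get))  # → p3
```

An "adapted" version for hot-topic detection would presumably weight edges by recency or topic relevance, but that adaptation is the paper's contribution, not shown here.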

Session III, immediately following a wonderful lunch, was chaired by Pertti Vakkari and focused on Information Search and Retrieval. Roberto Gonzalez-Ibanez presented his work on the effect of emotions in collaborative information retrieval tasks, and how different feedback influenced the efficiency of an information finding task.

Christopher G. Harris presented his work on applying crowdsourcing and serious games to information retrieval. His work begins with a framework for information retrieval tasks, and has identified and ranked potential stages at which gamification is most beneficial. Most importantly, he identifies aspects of the process that can benefit the most from human checking or input.

Michael Zarro finished Session III with his work on Health Information Search. This work will identify different health information seeking behaviors and help improve the search experience by improving the provided information. It takes each aspect of the experience, such as authoritative sources, user motivation, and search ability, and improves the search results provided by the system.

Session IV: Information Interaction and Use was chaired by Sally Jo Cunningham. HyunSeung Koh began the session with her research on user interactions in active reading of ebooks. Her work studies how users interact with a medium during active reading, and how the same behaviors can be translated into a "design of interactivity" for ebooks.

The last presenter of Session IV, and of the day, was Clare Llewellyn. Her work analyzes online arguments and defines a structure of online argumentation for media such as article comment sections or Twitter discussions. This structure is used to identify threads within an argument in order to present only relevant information to the reader.

To round out the day, Luanne Freund chaired an Advisor Panel of Edie Rasmussen, George Buchanan, and Rick Furuta. During a day of presented research, feedback, and collaboration, the panel provided a way to summarize the events of the day and provide broad feedback to the students. This feedback was not only directed at the day's presentations, but also included advice on successfully completing a Ph.D. and where the degree can take individuals after school. Without a doubt, the feedback provided during this event will allow the participants to improve the direction of their academic careers.

Before completing this post, we must thank the Doctoral Consortium Co-Chairs Luanne Freund and Mounia Lalmas, as well as the numerous reviewers and professors that provided feedback during the workshop. Without all of this help, we wouldn't have been able to produce such quality work. Thank you!

--Justin F. Brunelle

Monday, June 4, 2012

2012-06-04: Glue Conference 2012

Glue Conference 2012 took place at the Omni Interlocken Hotel in Broomfield, CO on May 23rd and 24th. Gluecon is an information-packed developer conference that focuses on cloud, mobile, APIs, big data, and most importantly, developers. Some of the topics included NoSQL, node.js, HTML5, backend-as-a-service, cloud management and security, cloud storage, Hadoop, DevOps, mobile app development, and cloud platforms.

I attended the conference with sponsorship (a full ride) from FullContact. These guys were unbelievably gracious and showed me a great time while I was out there. I first came in contact with them when Bart Lorang, CEO of FullContact, e-mailed me wanting to set up a time to talk with him and his engineering team about a paper I had published at a KDD'11 workshop. After meeting with the guys and talking shop, I found out that they are solving the same real-world problems (at world scale) that I was working on in my graduate research (at individual scale).

Some of the more interesting presentations/demos of the conference included:

FullContact's Dan Lynn gave a presentation on Storm
The title of the presentation was Storm - The Real-Time Layer Your Big Data Has Been Missing. The problem with big data that is constantly changing is that processing jobs are typically done in batches, and while this works and is usually perfectly acceptable, batch processing operates over a snapshot of your data at a point in time. If you want the most accurate, most up-to-date picture of your data, real-time processing is what you want. Storm is a new framework for real-time computation on big data that operates using the new concepts of streams, spouts, tuples, and bolts.
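Storm itself is a JVM framework, so the following is only a toy Python analogy of its model, not Storm's actual API: a spout emits a stream of tuples, and bolts transform the stream one tuple at a time, so results update continuously instead of waiting for a batch to finish.

```python
from collections import Counter

class SentenceSpout:
    """Spout: a source that emits a stream of tuples (here, sentences)."""
    def __init__(self, sentences):
        self.sentences = sentences
    def emit(self):
        for s in self.sentences:
            yield (s,)

class SplitBolt:
    """Bolt: consumes tuples and emits new ones (sentence -> words)."""
    def process(self, stream):
        for (sentence,) in stream:
            for word in sentence.split():
                yield (word,)

class CountBolt:
    """Bolt: maintains a running word count, updated per tuple."""
    def __init__(self):
        self.counts = Counter()
    def process(self, stream):
        for (word,) in stream:
            self.counts[word] += 1
            yield (word, self.counts[word])

# wire the topology: spout -> split bolt -> count bolt
spout = SentenceSpout(["big data is here", "big data is real time"])
counter = CountBolt()
for _ in counter.process(SplitBolt().process(spout.emit())):
    pass  # counts update as each tuple flows through, not after a batch
print(counter.counts["big"])  # → 2
```

In real Storm the spouts and bolts run distributed across a cluster and the streams are unbounded; the point of the sketch is just the shape of the computation.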

EmergentOne makes it ridiculously easy to launch an API. Generate a complete and customized REST API for an existing application in minutes using a GUI. I saw a demo of this hooked up to a world-country MySQL database. Within minutes the presenter had created an API that I could hit over the internet.
Tempo is a purpose-built database used to store and analyze massive streams of time-series data. Think the Internet of Things, where each thing generates data whose most important attribute is the timestamp. From their site: "TempoDB is the first purpose-built data layer that enables the scalable storage and instant analysis of your time-series streams, so that you can learn from the past, understand the present, and predict the future."
 is used to guard your website against unauthorized web scraping, competitor data mining, and more, without impeding your end users. It was the winner of the Demo Pod, which featured 12 new startups competing against each other for the title. (FullContact was the winner of the Demo Pod at last year's GlueCon.) While it will be welcomed by many a content generator on the internet, it flies in the face of web scrapers like myself and FullContact, who harness the massive amounts of information on the internet in order to aggregate data into a meaningful product. I'm still skeptical that they could prevent the scraping used in ArchiveFacebook.
Shout out to Robbie Jack and Kyle for showing me a great time in Boulder, CO the Friday after the conference. We had fun bar hopping and playing Werewolf at the TechStars Boulder HQ. I'm definitely going to have to try and come back for next year's GlueCon.

Carlton Northern