Monday, October 27, 2014

2014-10-27: 404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent

Herbert and I attended the "404/File Not Found: Link Rot, Legal Citation and Projects to Preserve Precedent" workshop at the Georgetown Law Library on Friday, October 24, 2014.  Although the origins for this workshop are many, catalysts for it probably include the recent Liebler & Liebert study about link rot in Supreme Court opinions, the paper by Zittrain, Albert, and Lessig about the problem of link rot in the scholarly and legal record, and the resulting popular media coverage (e.g., NPR and the NYT).

The speakers were naturally drawn from the legal community at large, but some notable exceptions included David Walls from the GPO, Jefferson Bailey from the Internet Archive, and Herbert Van de Sompel from LANL. The event was streamed and recorded, and videos + slides will be available from the Georgetown site soon so I will only hit the highlights below. 

After a welcome from Michelle Wu, the director of the Georgetown Law Library, the workshop started with an excellent keynote from the always entertaining Jonathan Zittrain, called "Cites and Sites: A Call To Arms".  The theme of the talk centered around "Core Purpose of .edu", which he broke down into:
  1. Cultivation of Scholarly Skills
  2. Access to the world's information
  3. Freely disseminating what we know
  4. Contributing actively and fiercely to the development of free information platforms

For each bullet he gave numerous anecdotes and examples; some innovative, and some humorous and/or sad.  For the last point he mentioned Memento and timed-release crypto.

Next up was a panel with David Walls (GPO), Karen Eltis (University of Ottawa), and Ed Walters (Fastcase).  David mentioned the Federal Depository Library Program Web Archive, Karen talked about the web giving us "Permanence where we don't want it and transience where we require longevity" (I tweeted about our TPDL 2011 paper that showed that for music videos on YouTube, individual URIs die all the time but the content just shows up elsewhere), and Ed generated a buzz in the audience when he announced that in rendering their pages they ignore the links because of the problem of link rot.  (Panel notes from Aaron Kirschenfeld.)

The next panel had Raizel Liebler (Yale), author of another legal link rot study mentioned above and an author of one of the useful handouts about links in the 2013-2014 Supreme Court documents.  Rod Wittenberg (Reed Tech) talked about the findings of the Chesapeake Digital Preservation Group and gave a data dump about link rot in Lexis-Nexis and the resulting commercial impact (wait for the slides).  (Panel notes from Aaron Kirschenfeld.)

After lunch, Roger Skalbeck (Georgetown) gave a webmaster's take on the problem, talking about best practices, URL rewriting, and other topics, as well as coining the wonderful phrase "link rot deniers".  During this talk I also tweeted TimBL's classic 1998 resource "Cool URIs Don't Change".

Next was Jefferson Bailey (IA) and Herbert.  Jefferson talked about web archiving, the IA, and won approval from the audience for his references to Lionel Hutz and HTTP status dogs.  Herbert's talk was entitled "Creating Pockets of Persistence", and covered a variety of topics, obviously including Memento and Hiberlink.

The point is to examine web archiving activities with an eye to the goal of making access to the past web:
  1. Persistent
  2. Precise
  3. Seamless
Even though this was a gathering of legal scholars, the point was to focus on technologies and approaches that are useful across all interested communities.  He also gave examples from our "Thoughts on Referencing, Linking, Reference Rot" (aka "missing link") document, which was also included in the list of handouts.  The point of this effort is to enhance existing links (with archived versions, mirror versions, etc.), but not at the expense of removing the link to the original URI and the datetime of the intended link.  See our previous blog post on this paper and a similar one for Wikipedia.

The closing session was Leah Prescott (Georgetown; subbing for Carolyn Cox), Kim Dulin (Harvard), and E. Dana Neacşu (Columbia).   Leah talked some more about the Chesapeake Digital Preservation Group and how their model of placing materials in a repository doesn't completely map to the model of web archiving (note: this actually has fascinating implications for Memento that are beyond the scope of this post).  Kim gave an overview of Harvard's archive, and Dana gave an overview of a prior archiving project at Columbia.  Note that Harvard's archive recently received a Mellon Foundation grant (via Columbia) to add Memento capability.

Thanks to Leah Prescott and everyone else that organized this event.  It was an engaging, relevant, and timely workshop.  Herbert and I met several possible collaborators that we will be following up with. 


-- Michael

Thursday, October 16, 2014

2014-10-16: Grace Hopper Celebration of Women in Computing (GHC) 2014

Photo credit to my friend Mona El Mahdy
I was thrilled and humbled to attend, for the second time, the Grace Hopper Celebration of Women in Computing (GHC) 2014, the world’s largest gathering of women technologists. GHC is presented by the Anita Borg Institute for Women and Technology, which was founded by Dr. Anita Borg and Dr. Telle Whitney in 1994 to bring together the research and career interests of women in computing and encourage the participation of women in computing. The twentieth anniversary of GHC was held in Phoenix, Arizona on October 8-10, 2014. This year, GHC almost doubled last year's attendance, bringing together 8,000 women with research and business interests from about 67 countries and about 900 organizations to get inspired, gain expertise, get connected, and have fun.

Aida Ghazizadeh from the Department of Computer Science at Old Dominion University was also awarded a travel scholarship to attend this year's GHC. I hope ODU will have more participation in the upcoming years.

The conference theme this year was "Everywhere. Everyone." Computer technologies are everywhere, and everyone should be included in driving innovation. There were multiple technical tracks featuring the latest technologies in many fields such as cloud computing, data science, security, and the Swift Playgrounds programming language by Apple. Conference presenters represented many different fields, such as academia, industry, and government. The non-profit organization "Computing Research Association Committee on Women in Computing (CRA-W)" also offered sessions targeted towards academics and business. I had a chance to attend the Graduate Cohort Workshop in 2013, which was held in Boston, MA, and created a blog post about it.

The first day started off with a welcome to the 8,000 conference attendees from Dr. Telle Whitney, the president and CEO of the Anita Borg Institute. She mentioned how GHC was first held in 1994 in Washington DC to bring together the research and career interests of women in computing and encourage the participation of women in computing. "Join in, connect with one another, be inspired by our speakers, be inspired by our award winners, develop your own skill and knowledge at the leadership workshops and at the technical sessions, let's all inspire and increase the ratio, and make technology for everyone everywhere," Whitney said. Then she introduced Alex Wolf, the President of the Association for Computing Machinery (ACM) and a professor in Computing at Imperial College London, UK, for opening remarks.

Ruthe Farmer
Barbara Birungi and Durdana Habib
After the opening keynote, the ABIE Awards for Social Impact and Change Agent were presented by the awards' sponsors. The recognitions went to Ruthe Farmer, Barbara Birungi, and Durdana Habib, who gave nice and motivating talks. Some highlights from Farmer's talk were:
  • "The next time you witness a technical woman doing something great, please tell her, or better tell others about her."
  • "The core of aspiration in computing is a powerful formula: recognition plus community."
  • "Technical Women are not outliers."
  • "Heads up to all of you employers out there. There is a legion of young women heading your way that will be negotiating their salaries ... so budget accordingly!"

The keynote of the first day was by Shafi Goldwasser, RSA Professor of Electrical Engineering and Computer Science at MIT and 2012 recipient of the Turing Award, who spoke about the history and benefits of cryptography and her own work in the field. She discussed the challenges in encryption and cloud computing. Here are some highlights from Goldwasser's talk:
  • "With the magic of cryptography, we can get the benefits of technology without the risks."
  • "Cryptography is not just about finding the bad guys, it is really about correctness, and privacy of computation"
  • "I believe that a lot of the challenges for the future of computer science are to think about new representations of data. And these new representations of data will enable us to solve the challenges of the future."

Picture taken from My Ramblings blog
After the opening keynote, we attended the Scholarship Recipients Lunch, which was sponsored this year by Apple. Engineers from Apple sat at each table to talk with us during the lunch.

The sessions started after the lunch break. I attended the CRA-W track session "Finding Your Dream Job Presentations", which had presentations by Jaeyeon Jung from Microsoft Research and Lana Yarosh from the University of Minnesota. The session targeted late-stage graduate students, helping them decide how to apply for jobs, how to prepare for interviews, and how to negotiate a job offer. The presenters allotted a big time slot for questions after they finished their presentations. For more information about the "Finding Your Dream Job Presentations" session and its highlights, here is an informative blog post:
GHC14 - Finding your Dream Job Presentations

A global community of women leaders panel
The next session I attended was the "A Global Community of Women Leaders" panel in the career track, moderated by Jody Mahoney (Anita Borg Institute). The panelists were Sana Odeh (New York University), Judith Owigar (Akirachix), Sheila Campbell (United States Peace Corps), and Andrea Villanes (North Carolina State University).  They explained their roles in increasing the number of women in computing and, drawing on their experience, the best ways to identify global technology leaders. At the end, they opened the floor to questions from the audience. "In the Middle East, the women in technology represents a big ratio of the people in computing," said Sana Odeh.

There were many interesting sessions, such as "Building Your Professional Persona Presentations" and "Building Your Professional Network Presentations", which presented how to build your professional image, how to promote yourself, and how to present your ideas in a concise and appealing way. Two blog posts cover these sessions in detail.
Facebook booth in the career fair #GHC14
In the meantime, the career fair was launched on the first day, Wednesday, October 8, from 4:30 to 6:30 p.m., and continued through the second day and part of the third day. The career fair is a great forum for facilitating open conversations about career positions in industry and academia. Many famous companies, such as Google, Facebook, Microsoft, IBM, Yahoo, and Thomson Reuters, many universities, such as Stanford University, Carnegie Mellon University, The George Washington University, and Virginia Tech University, and non-profit organizations such as CRA-W took part. Each company had many representatives to discuss the different opportunities they have for women. The poster session was held in the evening.

Cotton candy in the career fair #GHC14
Like last year, Thomson Reuters attracted many women's attention with a great promotion: bringing in caricature artists. Other companies used nice ideas to promote themselves, such as cotton candy. There were many representatives promoting each organization and also interviewing. I enjoyed being among all of these women in the career fair, which inspired me to think about how to direct my future so that I can contribute to computing and also encourage many other women to enter computing. My advice to anyone who will go to GHC next year: print many copies of your resume to be prepared for the career fair.

Day 2 started with a welcome to the audience from Barb Gee, the vice president of programs for the Anita Borg Institute. Gee presented the Girl Rising video clip "I'm not a number".

After the clip, Dr. Whitney introduced the special guest, the amazing Megan Smith, the new Chief Technology Officer of the United States and previously a vice president of Google[x]. Smith was a keynote speaker last year, when she gave a very inspiring talk entitled "Passion, Adventure and Heroic Engineering". Smith welcomed the audience and talked about her new position as the CTO of the United States. She expressed her happiness to serve the President of the United States and to serve her country. "Let’s work together to bring everyone along and to bring technology that we know how to solve the problems with," Smith said at the end of her short inspiring talk.

Dr. Whitney talked about the Building Recruiting And Inclusion for Diversity (BRAID) initiative between the Anita Borg Institute and Harvey Mudd College to increase diversity among computer science undergraduates. The BRAID initiative is funded by Facebook, Google, Intel, and Microsoft.

The 2014 GHC technical leadership ABIE award went to Anne Condon, a professor and the head of the Department of Computer Science at the University of British Columbia. Condon donated her award to Grace Hopper India and programs of the Computing Research Association (CRA).

Maria Klawe on the right, Satya Nadella on the left
Satya Nadella, the Chief Executive Officer (CEO) of Microsoft, in an interesting conversation with Maria Klawe, the president of Harvey Mudd College, was the second keynote of GHC 2014. Nadella is the first male speaker at GHC. Nadella was asked many interesting questions. One of them was "Microsoft has competitors like Apple, Google, Facebook, Amazon. What can Microsoft uniquely do in this new world?" Nadella answered that the two things that he believes Microsoft contributes to the world are productivity and platform. Maria continued, "it is not a competition, it is a partnership".

In answer to a tough question, "Why does Microsoft hire fewer female engineers than male?", Nadella said that all of the companies now have their numbers out there. Microsoft's number is about 17%, which is almost the same as Google and Facebook and a little below Apple. He said, "the real issue in our company is how to make sure that we are getting women who are very capable into the company and well represented".

In response to a question about how to ask for a raise in salary, Nadella said: "It’s not really about asking for a raise, but knowing and having faith that the system will give you the right raise." Nadella got a torrent of criticism and irate reactions on Twitter.

Nadella later apologized for his "inarticulate" remarks in a tweet, followed by a statement issued to Microsoft employees, which was published on the company's website.

"I answered that question completely wrong," said Nadella. "I believe men and women should get equal pay for equal work. And when it comes to career advice on getting a raise when you think it’s deserved, Maria’s advice was the right advice. If you think you deserve a raise, you should just ask."

Day 3 started with some announcements from the ABI board, then the best posters announcement and the Awards Presentation. The last keynote was by Dr. Arati Prabhakar, the Director of the Defense Advanced Research Projects Agency (DARPA). Dr. Prabhakar talked about "how do we shape our times with the technology that we work on and we are passionate about?". Dr. Prabhakar shared neat technologies with us in her keynote. She started with a video of a quadriplegic using her thoughts to control a robotic arm by plugging her brain into the computer. She talked about building technologies at DARPA. At the end, she answered many questions from the audience related to her work at DARPA. It is amazing to see a successful woman who creates technology that serves her country. The keynote ended with a nice video promoting GHC 2015.

Latest trends and technical challenges of big data panel
After the keynote, I attended the "Latest Trends and Technical Challenges of Big Data Analytics Panel", which was moderated by Amina Eladdadi (College of Saint Rose). The panelists were Dr. Bouchra Bouqata from GE, Dr. Kaoutar El Maghraoui from IBM, Dr. Francine Berman from RPI, and Dr. Deborah Agarwal from LBNL. This panel focused on discussing new data-driven Big Data Analytics technologies, infrastructure, and challenges. The panelists introduced use cases from industry and academia. There are many challenges facing big data: storage, security (specifically for cloud computing), the scale of the data, and bringing everything together to solve the problem.

ArabWIC lunch table
After the panel, I attended the career fair and then the Arab Women in Computing (ArabWIC) meeting during lunch. I had my first real experience with the ArabWIC organization at GHC 2013. ArabWIC had more participation this year. I also attended the ArabWIC reception, sponsored by Qatar Computing Research Institute (QCRI), on Wednesday night and got a chance to connect with many Arab women in computing in business and academia.

After that I attended the "Data Science in Social Media Analysis Presentations", which included three presentations that talk about data analysis. The three useful presentations were:
"How to be a data scientist?" by Christina Zou
The presenters talked about real-life projects. The highlights of the presentations were:

  • "Improve the accuracy is what we strove for."
  • "It’s important to understand the problem."
  • "Divide the problem into pieces."

After the presentations, I talked to Christina about my research, and she gave me some ideas that I'll apply.
    The picture taken from GHC Facebook page
    At the end of the day, the Friday celebration, which was sponsored by Google, Microsoft, and GoDaddy, began at 7:30. The dance floor was full of amazing ladies celebrating and dancing with glow sticks!

    It was fantastic meeting a large number of like-minded peers and future employers. I'm pleased to have had this great opportunity, which allowed me to network and communicate with many great women in computing. GHC allowed me to discuss my research ideas with many senior women and get positive feedback about them. I came back with multiple ideas that will help me shape the next phase of my research and my next career path.


    Tuesday, October 7, 2014

    2014-10-07: FluNet Visualization

    (Note: This wraps up the current series of posts about visualizations created either by students in our research group or in our classes. I'll post more after the Spring 2015 offering of the course.)

    I've been teaching the graduate Information Visualization course since Fall 2011.  In this series of posts, I'm highlighting a few of the projects from each course offering.  (Previous posts: Fall 2011, Fall 2012, 2013)

    The final visualization in this series is an interactive visualization of the World Health Organization's global influenza data, created by Ayush Khandelwal and Reid Rankin in the Fall 2013 InfoVis course. The visualization is currently available online and is best viewed in Chrome.

    The Global Influenza Surveillance and Response System (GISRS) has been in operation since 1995 and aggregates data weekly from laboratories and flu centers around the world. The FluNet website was constructed to provide access to this data, but does not include interactive visualizations.

    This project presents an interactive visualization of all of the GISRS data available through FluNet as of October 2013. The main visualization is an animated 3D choropleth globe where hue corresponds to virus lineage (influenza type A or type B) and color intensity corresponds to infection level. This shows the propagation of influenza across the globe over time.  The globe is also semi-transparent, so that the user can see how influenza infection rates change on the opposite hemisphere. The user may pick a specific time period or press the play button and watch the yearly cycle of infection play itself out on the globe's surface.

    The visualization also includes the option to show a 2D version of the globe, using the Natural Earth projection.

    There is a stacked area slider located under the globe for navigating through time (example of a "scented widget").  The stacked area chart provides a view of the progression of infection levels over time and is shown on a cubic-root scale to compensate for the peaks during the 2009 flu pandemic.
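To illustrate why a cube-root scale helps here, the following is a minimal Python sketch (not the project's D3.js code). The function name `cube_root_scale` and the weekly counts are hypothetical, chosen only to show how a single large peak squeezes ordinary weeks on a linear scale but leaves them visible on a cube-root scale.

```python
def cube_root_scale(value, domain_max, range_max):
    """Map a value in [0, domain_max] onto [0, range_max] using a cube root."""
    if domain_max <= 0:
        raise ValueError("domain_max must be positive")
    return range_max * (value / domain_max) ** (1.0 / 3.0)


# Hypothetical weekly infection counts; the last one stands in for a pandemic peak.
weekly_counts = [10, 100, 1000, 8000]
peak = max(weekly_counts)
chart_height = 100  # pixels, an arbitrary choice for this sketch

for count in weekly_counts:
    linear = chart_height * count / peak
    cubic = cube_root_scale(count, peak, chart_height)
    print(f"count={count:5d}  linear={linear:6.2f}px  cube-root={cubic:6.2f}px")
```

On the linear scale the smallest week is a fraction of a pixel tall, while the cube-root scale gives it roughly a tenth of the chart height, at the cost of no longer showing heights in direct proportion to the counts.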

    If the user clicks on a country, a popout chart will be displayed, showing a single year of data for that country, centered on the current point in time.  The default view is a stacked area chart, but there are options to show either a streamgraph or an expanded 100% stacked area chart.  The popout chart animates with the choropleth.

    The video below shows a demo:

    Although the data was freely available from the GISRS website, there was still a significant amount of data cleaning involved.  Both OpenRefine and Mr. Data Converter were used to clean and format the data into JSON.  The D3.js, NVD3, and TopoJSON libraries were used to create the visualization.

    Our future work on this project involves turning this into an extensible framework that can be used to show other global datasets over time.


    Friday, October 3, 2014

    2014-10-03: Integrating the Live and Archived Web Viewing Experience with Mink

    The goal of the Memento project is to provide a tighter integration between the past and current web.    There are a number of clients now that provide this functionality, but they remain silent about the archived page until the user remembers to invoke them (e.g., by right-clicking on a link).

    We have created another approach based on persistently reminding the user just how well archived (or not) are the pages they visit.  The Chrome extension Mink (short for Minkowski Space) queries all the public web archives (via the Memento aggregator) in the background and will display the number of mementos (that is, the number of captures of the web page) available at the bottom right of the page.  Selecting the indicator allows quick access to the mementos through a dropdown.  Once in the archives, returning to the live web is as simple as clicking the "Back to Live Web" button.
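As a rough illustration of the counting step (this is a Python sketch, not Mink's actual JavaScript), the snippet below parses a link-format TimeMap of the kind described in the Memento specification and counts the entries whose `rel` value includes `memento`. The sample TimeMap, the `archive.example.org` URIs, and the function names are all made up for the example, and the regular expressions assume every attribute is a simple quoted string.

```python
import re

# Each entry looks like: <URI>;rel="...";datetime="..."
LINK_RE = re.compile(r'<([^>]+)>((?:\s*;\s*\w+="[^"]*")*)')
PARAM_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_timemap(text):
    """Return a list of (uri, params) pairs from a link-format TimeMap."""
    entries = []
    for uri, raw_params in LINK_RE.findall(text):
        entries.append((uri, dict(PARAM_RE.findall(raw_params))))
    return entries


def count_mementos(text):
    """Count entries whose rel attribute contains the 'memento' relation."""
    return sum(
        1
        for _, params in parse_timemap(text)
        if "memento" in params.get("rel", "").split()
    )


# Hypothetical TimeMap for http://example.com/ with three captures.
SAMPLE_TIMEMAP = (
    '<http://example.com/>;rel="original",\n'
    '<http://archive.example.org/timemap/link/http://example.com/>'
    ';rel="self";type="application/link-format",\n'
    '<http://archive.example.org/timegate/http://example.com/>;rel="timegate",\n'
    '<http://archive.example.org/web/20100101000000/http://example.com/>'
    ';rel="first memento";datetime="Fri, 01 Jan 2010 00:00:00 GMT",\n'
    '<http://archive.example.org/web/20120101000000/http://example.com/>'
    ';rel="memento";datetime="Sun, 01 Jan 2012 00:00:00 GMT",\n'
    '<http://archive.example.org/web/20140101000000/http://example.com/>'
    ';rel="last memento";datetime="Wed, 01 Jan 2014 00:00:00 GMT"\n'
)

print(count_mementos(SAMPLE_TIMEMAP))  # prints 3
```

Splitting `rel` on whitespace matters because the first and last captures carry compound values such as `first memento`, and because `timemap` and `timegate` should not be mistaken for mementos.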

    For the case where there are too many mementos to make navigating an extensive list useable (think captures), we have provided a "Miller Columns" interface that allows hierarchical navigation and is common in many operating systems (though most don't know it by name).
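One way to picture the hierarchy behind such an interface is to bucket capture times by year, month, and day, so that each column only lists the children of the selection in the column to its left. The Python sketch below is an illustration under assumptions, not Mink's code: the function names are hypothetical, and the 14-digit `YYYYMMDDhhmmss` timestamps are assumed input.

```python
from collections import defaultdict
from datetime import datetime


def build_columns(timestamps):
    """Group 14-digit capture timestamps into year -> month -> day -> [times]."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for ts in sorted(timestamps):
        dt = datetime.strptime(ts, "%Y%m%d%H%M%S")
        tree[dt.year][dt.month][dt.day].append(dt.strftime("%H:%M:%S"))
    return tree


def column(tree, *path):
    """Return the entries to show in the column reached by following path."""
    node = tree
    for key in path:
        node = node[key]
    return sorted(node) if isinstance(node, dict) else list(node)


# Hypothetical capture times for a single page.
captures = ["20140101120000", "20140101180000", "20140215093000", "20131231235959"]
tree = build_columns(captures)

print(column(tree))              # years:  [2013, 2014]
print(column(tree, 2014))        # months: [1, 2]
print(column(tree, 2014, 1))     # days:   [1]
print(column(tree, 2014, 1, 1))  # times:  ['12:00:00', '18:00:00']
```

With thousands of captures, each column stays short because the user only ever sees one level of the hierarchy at a time.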

    For the opposite case where there are no mementos for a page, Mink provides a one-click interface to submit the page to Internet Archive or for immediate preservation and provides just-as-quick access to the archived page.

    Mink can be used concurrently with Memento for Chrome, which provides a different modality of letting the user specify desired Memento-Datetime as well as reading cues provided by the HTML pages themselves.  For those familiar with Memento terminology, Memento for Chrome operates on TimeGates and Mink operates on TimeMaps.  We also presented a poster about Mink at JCDL 2014 in London (proceedings, poster, video).

    Mink is for Chrome, free, publicly available (go ahead and try it now!), and open source (so you know there's no funny business going on).

    —Mat (@machawk1)

    Thursday, September 25, 2014

    2014-09-25: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

    The Internet Archive (IA) and Open Library offer over 6 million fully accessible public domain eBooks. I searched for the term "dictionary" while I was casually browsing the scanned book collection to see how many dictionaries they have. I found several dictionaries in various languages. I randomly picked A Dictionary of the English Language (1828) - Samuel Johnson, John Walker, Robert S. Jameson from the search results. I opened the dictionary in fullscreen mode using IA's open source online BookReader application. This book reader application has common tools for browsing an image-based book, such as flipping pages, seeking a page, zooming, and changing the layout. In the toolbar it has some interesting features like reading aloud and full-text searching. I wondered how it could possibly perform text searching and read aloud a scanned, raster-image-based book. I peeked into the page source code, which pointed me to some documentation pages. I realized it is using an Optical Character Recognition (OCR) engine called ABBYY FineReader to power these features.

    I was curious to find out how the term "dictionary" was defined in a dictionary of the early 19th century. So I gave the "search inside" feature of IA's book reader a try and searched for the term "dictionary" there. It took about 40 seconds to search for the lookup term in a book with 850 pages and returned three results. Unfortunately, they were pointing to the title and advertisement pages where this term appeared, but not the page where it was defined. After this failed OCR attempt, I manually flipped pages in the BookReader back and forth, the way word lookup is performed in printed dictionaries, until I reached the appropriate page. Then I located the term on the page and the definition there was, "A book containing the words of any language in alphabetical order, with explanations of their meaning; a lexicon; a vocabulary; a word-book." I thought I would give the "search inside" feature another try. According to the definition above, a dictionary is a book, hence I chose "book" as the next lookup term. This time the BookReader took about 50 seconds to search and returned 174 possible places where the term was highlighted in the entire book. These matches include derived words and definitions or examples of other words where the term "book" appeared. Although the OCR engine did work, the goal of finding the definition of the lookup term was still not achieved.

    After experimenting with an English dictionary, I was tempted to give another language a try. When it comes to a non-Latin language, there is no better choice for me than Urdu. Urdu is a Right-to-Left (RTL) complex script language influenced by Arabic and Persian; it shares a lot of vocabulary and grammar rules with Hindi, is spoken by more than 100 million people globally (the majority in Pakistan and India), and happens to be my mother tongue as well. I picked an old dictionary entitled Farhang-e-Asifia (1908) - Sayed Ahmad Dehlavi (four volumes). I searched for several terms one after the other, but every time the response was "No matches were found.", although I verified their existence in the book. It turns out that ABBYY FineReader claims OCR support for about 190 languages, but it does not support more than 60% of the world's 100 most popular languages, and the recognition accuracy for the supported languages is not reliable.

    Dictionaries are a condensed collection of the words and definitions of a language and capture the essence of the cultural vocabulary of the era in which they are prepared; hence they have great archival value and are of equal interest to linguists and archivists. Improving accessibility of the preserved scanned dictionaries will make them more useful not only for linguists and archivists, but for general users too. Unlike general literature books, dictionaries have some special characteristics: they are sorted to make the lookup of words easy, and lookup in dictionaries is fielded searching as opposed to full-text searching. These special properties can be leveraged when developing an application for accessing scanned dictionaries.

    To solve the scanned dictionary exploration and word lookup problem, we chose a crowdsourced manual approach that works well for every language irrespective of how poorly it is supported by OCR engines. In our approach, pages or words of each dictionary are indexed manually to load the appropriate pages that correspond to the lookup word. Our indexing approach is progressive, hence it increases the usefulness and ease of lookup as more crowdsourced energy is put into the system, starting from the base case, "Ordered Pages", which is at least as good as IA's current BookReader. In the next stage the dictionary can go into the "Sparse Index" state, in which the first lookup word of each page is indexed; this is sufficient to determine the page where any arbitrary lookup word can be found if it exists in the dictionary. To further improve the accessibility of these dictionaries, an exhaustive "Full Index" is prepared that indexes every single lookup word found in the dictionary with its corresponding pages, as opposed to just the first lookup word of each page. This index is very helpful in certain dictionaries where the sorting of words is not linear. To determine the exact location of the lookup word on the page, we have a "Location Index" that highlights the place on the page where the lookup word is located to draw the user's attention there. Apart from indexing, we have introduced an annotation feature where users can link various resources to words on dictionary pages. Users are encouraged to help improve the various indexes and annotations as they use the application. For a more detailed description of our approach, please read our technical report:
    Sawood Alam, Fateh ud din B Mehmood, Michael L. Nelson. Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages. Technical Report arXiv:1409.1284, 2014.
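To make the "Sparse Index" idea concrete, here is a minimal Python sketch (not the Dictionary Explorer implementation). It assumes the index is a list of (first word on the page, page number) pairs sorted in the dictionary's own order, and that this order matches Python's default string comparison; a real dictionary would need language-specific collation. The words, page numbers, and the function name `find_page` are hypothetical.

```python
from bisect import bisect_right


def find_page(sparse_index, lookup):
    """Return the page that would contain lookup, or None if it sorts before page one.

    sparse_index is a list of (first_word_on_page, page_number) pairs,
    sorted by first_word_on_page.
    """
    first_words = [word for word, _ in sparse_index]
    # The target page is the last one whose first word is <= the lookup word.
    position = bisect_right(first_words, lookup) - 1
    if position < 0:
        return None
    return sparse_index[position][1]


# Hypothetical sparse index for a four-page dictionary.
sparse_index = [("aardvark", 1), ("book", 2), ("dictionary", 3), ("lexicon", 4)]

print(find_page(sparse_index, "cat"))         # prints 2
print(find_page(sparse_index, "dictionary"))  # prints 3
print(find_page(sparse_index, "zebra"))       # prints 4
print(find_page(sparse_index, "a"))           # prints None
```

The lookup only says where a word would be if it exists; confirming that it is actually on the page is what the "Full Index" and "Location Index" stages add.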
    We have built an online application called "Dictionary Explorer" that utilizes the indexing described above and has an interface suitable for dictionaries. The application serves as an explorer of various dictionaries in various languages, and at the same time it presents various context-aware controls for feedback to contribute to indexes and annotations. In the Dictionary Explorer the user selects a lookup language, which loads a tree-like word index in the sidebar for the selected language and various tabs in the center region; each tab corresponds to one monolingual or multilingual dictionary that has indexes in the selected language. The user can then either directly input the lookup term in the search field or locate the search term in the sidebar by expanding the corresponding prefixes. Once the lookup is performed, all the tabs are loaded simultaneously with the appropriate pages corresponding to the lookup term in each dictionary. If the location index is available for the lookup word, a pin is placed on the page where the word appears, which allows interaction with the word and its annotations. A special tab accumulates all the related resources such as user-contributed definitions, audio, video, images, examples, and resources from third-party online dictionaries and services.
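The sidebar's tree-like word index can be pictured as a prefix tree (trie), where expanding a prefix reveals the characters that can follow it. The Python sketch below is an assumption about how such a structure might look, not the application's actual code; the word list and function names are hypothetical.

```python
END = ""  # marker key for "a word ends here"; never collides with a real character


def build_trie(words):
    """Build a nested-dict prefix tree from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def expand(trie, prefix):
    """Return the sorted characters that can follow prefix, or [] if it is unknown."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    return sorted(key for key in node if key != END)


def is_word(trie, word):
    """Return True if word was indexed as a complete lookup word."""
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


# Hypothetical indexed lookup words.
trie = build_trie(["book", "bookish", "boat", "lexicon"])

print(expand(trie, ""))      # prints ['b', 'l']
print(expand(trie, "bo"))    # prints ['a', 'o']
print(expand(trie, "book"))  # prints ['i']
print(is_word(trie, "boo"))  # prints False
```

Because the tree works on individual characters, the same structure serves left-to-right and right-to-left scripts alike; only the rendering direction differs.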

    Following are some feature highlights to summarize the Dictionary Explorer application:
    • Support for various indexing stages.
    • Indexes in multiple languages and multiple monolingual and multilingual dictionaries in each language.
    • Bidirectional (right-to-left and left-to-right) language support.
    • Multiple input methods such as keyboard input, on screen keyboard, and word prefix tree.
    • Simultaneous lookup in multiple dictionaries.
    • Pagination and zoom controls.
    • Interactive location marker pins.
    • Context aware user feedback and annotations.
    • Separate tab for related resources such as user contributions, related media, and third-party resources.
    • API for third-party applications.
    We have successfully developed a progressive approach to indexing that enables lookup in scanned dictionaries of any language with very little initial effort and improves over time as more people interact with the dictionaries. In the future we want to explore the specific challenges of indexing and interaction in several other languages, such as Mandarin or Japanese, where dictionaries are not sorted alphabetically because of their huge character sets. We also want to utilize our current indexes, which were developed by users over time, to predict pages for lookup terms in dictionaries that are not indexed yet or are only partially indexed. Our intuition is that we can automatically predict pages of an arbitrary dictionary for a lookup term with acceptable variance by aligning pages of the dictionary with one or more resources such as indexes of other dictionaries in the same language, a corpus of the language, the most popular words in the language, and partial indexes of the dictionary.


    Sawood Alam

    Thursday, September 18, 2014

    2014-09-18: A tale of two questions

    (with apologies to Charles Dickens, Robert Frost, and Dr. Seuss)

    "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, ..." (A Tale of Two Cities, by Charles Dickens).

    At the end of this part of my journey, it is time to reflect on how I got here, and what the future may hold.

    Looking back, I am here because of answering two simple questions.  One from a man who is no longer here, one from a man who still poses new and interesting questions.  Along the way, I've formed a few questions of my own.

    The first question was posed by my paternal uncle, Bertram Winston.  Uncle Bert was a classic type A personality.  Everything in his life was organized and regimented.  When planning a road trip across the US, he would hand write the daily itinerary: when to leave a specific hotel, how many miles to the next hotel, phone numbers along the way, people to visit in each city, and sites to see.  He would snail-mail a copy of the itinerary to each friend along the way, so they would know when to expect him and Aunt Artie to arrive (and to depart).  He did this all before MapQuest and Google Maps.  He did all of this without a computer, using paper maps and AAA tour books.
    Uncle Bert and Aunt Artie

    Bert took this attention to detail to the final phase of his life.  As he made preparations for his end, he went through their house and boxed up pictures and mementos for friends and family.  These boxes would arrive unannounced, and were full of treasures.  After receiving, opening, and sharing this detritus with Mary and our son Lane, I thanked Bert for helping to answer some of the questions that had plagued me since I was a child.  During the conversation, he posed the first question to me.  Bert said that he had been through his house many times and still had lots of stuff left that he didn't know what to do with.  He said, "what will I do with the rest?"  I said that I would take it, all of it, and that I would take care of each piece.

    I continued to receive boxes until his death.  With each, Mary, Lane, and I would sit in our living room and I would explain the history behind each memento.  One of these mementos was a picture of Josie McClure.  She became my muse for answering the second question.
    Josie McClure, my muse.

    Dr. Michael L. Nelson, my academic parent.
    The second question was posed by my academic "parent," Michael L. Nelson.  One day in 2007, he stopped me in the Engineering and Computational Sciences Building on the Old Dominion University campus, and posed the question "Are you interested in solving a little programming problem?"  I said "yes" not having any idea about the question, the possible difficulties involved, the level of commitment that would be necessary, or the incredible highs and lows that would torment my soul.  But I did know that I liked the way he thought, his outlook on life, and his willingness to explore new ideas.

    The combination of answering two simple questions resulted in a long journey, filled with incredible highs brought on by discovering things that no one else in the world knew or understood, and incredible lows brought on by no one else in the world knowing or understanding what I was doing.  My long and tortuous trail can be found here.

    While on this journey, I have accreted a few things that I hope will serve me well.

    My own set of questions:

    1.  What is the problem??  Sometimes just formulating the question is enough to see the solution, or puts the topic into perspective and makes it non-interesting.  Formulating the problem statement can be an iterative process where constant refining reveals the essence of the problem.

    2.  Why is it important??  The world is full of questions.  Some are important, others are less so.  Everyone has the same number of hours per day, so you have to choose which questions are important in order to maximize your return on the time you spend.

    3.  What have others done to try and solve the problem??  If the problem is good and worthy, then take a page from Newton and see what others have done about the problem.  It may be that they have solved the problem and you just hadn't been able to spend the time trying to find an existing solution.  If they haven't solved the problem, then you might be able to say (as Newton was wont to say) "If I have seen further it is by standing on the shoulders of giants."

    4.  What will I do to solve the problem??  If no one has solved the problem, then how will you attack it??  How will your approach be different or better than everything done  by everyone else??

    5.  What did I do to prove I solved the problem??  How to show that your approach really solved the problem??

    6.  What is the conclusion??  After you have labored long and hard on a problem, what do you do with the knowledge you have created??

    Be an active reader.

    Read everything closely to ensure that I understand what the author was (and was not) saying.  Making notes in the margins on what has been written.  Noting the good, the bad, and the ugly.  If it is important enough, track down the author and speak to them about the ideas and thoughts they had written.  Imagine if you will, receiving a call from a total stranger about something that you've published a few years before.  It means that someone has read your stuff, has questions about it, and that it was important enough to talk directly to you.  How would you feel if that happened to you??  I've made those calls and you can almost feel the excitement radiating through the phone.

    Understand all the data you collect.

    In keeping with Isaac Asimov's view on data: "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'"  When we conduct experiments, we collect data of some sort.  Be that memento temporal coverage, public digital longevity, digital usage patterns, data of all sorts and types.  Then we analyze the data, and try to glean a deeper understanding.  Watch for the outliers; the data that "look funny" have additional things to say.

    Everyone has stories to tell.  

    Our stories are the threads of the fabric of our lives.  Revel in stories from other people.  Those stories they choose to share, are an intimate part of what makes them who they are.  Treat their stories with care and reverence, and they will treat yours the same way.

    Don't be afraid to go where others have not.  

     During our apprenticeship, all our training and work point us to new and uncharted territories.  To wit:
    "Two roads diverged in a wood, and I,
    I took the one less traveled by,
    And that has made all the difference."
    (The Road Not Taken, by Robert Frost)

    Remember through it all;

    The highs are incredible, the lows will crush your soul, others have survived, and that you are not alone.

    And in the end,

    "be your name Buxbaum or Bixby or Bray
    or Mordecai Ali Van Allen O'Shea,
    you're off to Great Places!
    Today is your day!
    Your mountain is waiting.
    So...get on your way!"
    (Oh, the Places You'll Go!, by Dr. Seuss)

    With great fondness and affection,

    Chuck Cartledge
    The III. A rapscallion.  A husband.  A father.  A USN CAPT.  A PhD.  A simple man.

    Thanks to Sawood Alam, Mat Kelly, and Hany SalahEldeen for their comments and review of "my 6 questions."  They were appreciated and incorporated.

    2014-09-18: Digital Libraries 2014 (DL2014) Trip Report

    Mat Kelly, Justin F. Brunelle and Dr. Michael L. Nelson travel to London, UK to report on the Digital Libraries 2014 Conference.                           

    On September 9th through 11th, 2014, Dr. Nelson (@phonedude_mln), Justin (@justinfbrunelle), and I (@machawk1) attended the Digital Libraries 2014 conference (a composite of the JCDL (see trip reports for 2013, 2012, 2011) and TPDL (see trip reports for 2013 and 2012) conferences this year) in London, England. Prior to the conference, Justin and I attended the DL2014 Doctoral Consortium, which occurred on September 8th.

    The main conference on September 9th opened with George Buchanan (@GeorgeRBuchanan) indicating that this year's conference was a combination of both TPDL and JCDL from previous years. With the first digital libraries conference being in 1994, this year marked the 20th anniversary of the conference. George celebrated this by singing Happy Birthday to the conference and introduced Ian Locks, the Master of the Worshipful Company of Stationers and Newspaper Makers, to continue the introduction.

    Ian first gave a primer on and history of his organization as a "chain gang that dated back 1000 years" that "reduced the level of corruption when people could not read or write". Originally, his organization became Britain's first library of deposit, wherein printed works needed to be deposited with them, and the organization was central to the first copyright in the world in 1710.

    Ian then gave way to Martin Klein (@mart1nkle1n), who gave insight into the behind-the-scenes dynamics of the conference. He stated that the programming committee had 183 members to allocate reviews for every submission received. The committee's goal was to have four first level reviews. Of the papers received, 38 countries were represented with the largest number coming from the U.S. followed by the U.K. then Germany. The acceptance rate for full papers was 29% while the rate for short papers was 32%. 33 posters and 12 demos were also accepted. Interestingly, the country with the highest acceptance rate that submitted over five papers was Brazil, with over half of their papers accepted.

    Martin then segued to introducing the keynote speaker, Professor Dieter Fellner of the Fraunhofer Institute. Dieter's theme consisted mostly of the different means and issues in digitally preserving 3-dimensional objects. He described the digitization as, "A grand opportunity for society, research, and economy but a grand challenge for digital libraries." In reference to object recovery for preservation before or after an act of loss he said, "if we cannot physically preserve an object, having a digital artifact is second best." Dieter then went on to tell of the inaccuracies of preserving artifacts under a single lighting condition or insufficient lighting conditions. To evaluate how well an object is preserved, he spoke of a "Digital Artifact Turing Test" wherein, he said, you first create photos of a 3D artifact and then make a 3D model. If you can't tell the difference, then the capture can be deemed successful and representative.

    Dieter continued with some approaches they have used to achieve better lighting conditions and how varying the lighting conditions has provided instances of uncovering data that previously was hard to accurately capture. As an example, he showed a piece of driftwood from Easter Island that had an ancient etched message that was very hard to see, and thus the piece would likely be unknowingly used as firewood. By varying the light conditions when preserving the object, the ancient writing was exposed and preserved for later translation once more is known about the language.

    Another example he gave concerned how accurately we can replicate the original color of ancient objects from scans made today, citing the discolored bust of Nefertiti. Further inspection using lighting of various colors during scanning produced potentially better results for a capture.

    After a short break, the meeting resumed with simultaneous sessions. I attended the "Web archives and memory" session, where WS-DL's Michael Nelson led off with "When Should I Make Preservation Copies of Myself?", a work related to WS-DL's recent alumnus Chuck Cartledge's PhD dissertation. In his presentation, Michael spoke of the preservation of objects, particularly of Chuck's now-famous ancestor "Josie", of whom he had a physical photo over a hundred years old with some small bits of metadata hand-written on the back. With respect to modeling the self-preservation of the corresponding digital object of the Josie photo (e.g., a scanned image on Flickr), Michael described how this image should propagate in the model in a way akin to Craig Reynolds's Boids, with the desired behaviors of collision avoidance, velocity matching, and flock centering. The set of duplicated objects in a variety of locations can be described with the "small world" concept and will not create a lattice structure in its propagation scheme.

    Chuck's implementation work includes adding a linked image embedded on the web page (using the HTML "link" tag and not the "a" tag) that allows the user to specify that they would like the object preserved. Michael then described the three policies used for duplication to ensure optimal spread of a resource, which included one-at-a-time until a limit is hit, as aggressively as possible until a soft limit then one-at-a-time until a hard limit, or a super aggressive policy of duplication until the hard limit is hit. From Chuck's work, Michael said, "It pays to be aggressive, conservation doesn't work for preservation. What we envisioned", he continued, "was to create objects that live longer than the people that created them."
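    The three duplication policies can be sketched roughly as follows. This is only an illustration of the policies as summarized in the talk, not Chuck's implementation: the policy names, the function names, and the idea of a fixed number of reachable hosts per step are all assumptions made for the example.

```python
def copies_to_make(policy, current, soft_limit, hard_limit, available):
    """Return how many new copies to create in one step.

    current   -- number of copies that already exist
    available -- hosts reachable in this step (hypothetical simplification)
    """
    remaining = max(hard_limit - current, 0)
    if remaining == 0 or available <= 0:
        return 0
    if policy == "one_at_a_time":
        # Conservative: a single copy per step until the limit is hit.
        return 1
    if policy == "aggressive_then_cautious":
        # As fast as possible up to the soft limit, then one per step.
        if current < soft_limit:
            return min(available, soft_limit - current, remaining)
        return 1
    if policy == "super_aggressive":
        # As fast as possible all the way to the hard limit.
        return min(available, remaining)
    raise ValueError(f"unknown policy: {policy}")


def simulate(policy, soft_limit, hard_limit, available_per_step):
    """Return the total number of copies after each step."""
    current = 0
    history = []
    while current < hard_limit:
        made = copies_to_make(policy, current, soft_limit, hard_limit,
                              available_per_step)
        if made == 0:
            break  # nothing can be created, so stop instead of looping
        current += made
        history.append(current)
    return history
```

    With a soft limit of 3, a hard limit of 6, and 4 reachable hosts per step, the conservative policy needs six steps to reach the hard limit, the mixed policy needs four, and the super aggressive policy needs two, which matches the intuition behind "it pays to be aggressive".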

    Cathy Marshall (@ccmarshall) followed Michael with "An Argument for Archiving Facebook as a Heterogeneous Personal Store". From her previous studies, she found that users were apathetic about preserving their Facebook contents, which led her to ask, "Why we should archive Facebook despite the users not caring about archiving it?" She continued, "Evidence has suggested that people are not going to archive their stuff in any kind of consistent way in the long term." Most users in her study could not think of anything to save on Facebook, assuming that if Facebook died, those files would also live somewhere else and could be recovered. Despite this, she attempted to identify what users found most important in their Facebook contents: 50% of the users said that they find the most value in their photos, 35% said they would carry over their contacts if they needed to, and other than that, they did not care about much else.

    Michael Day (@bindonlane) followed Cathy with a review of recent work at the British Library. His group has been making attempts to implement preservation concepts for extremely large collections of digital material. They have published the British Library Content Strategy, which attempts to guide their efforts.

    When Michael was done presenting, I (Mat Kelly, @machawk1) presented my paper "The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript". The purpose of the work was to evaluate different web archiving tools and web sites in a way similar to the Acid Tests originally designed for web standards, but with more clarity, and to test specific facets of the web with which web archiving tools have trouble.

    After my presentation came lunch and another set of concurrent sessions. For this session I attended the "Digital Libraries: Evolving from Collections to Communities?" panel with Brian Beaton, Jill Cousins (@JilCos), and Herbert Van de Sompel (@hvdsomp), with Deanna Marcum and Karen Calhoun as moderators.

    Jill stated, "Europeana can't be everything to everybody but we can provide the data that everyone reuses." The group spoke about the Europeana 2020 strategy and how it aims to fulfill 3 principles: "share data, improve access conditions to data and create value in what we're doing". Brian asked, "Can laypeople do managed forms of expert work? That question has been answered. The real issue is determining how crowd-sourced projects can remain sustainable.", referencing previous discussion on the integration of crowd sourcing to fund preservation studies and efforts. He continued, "I think we're at the moment where we are at a lot of competitors entering race [for crowd sourcing platforms]. Lots of non-profits are turning to integration of crowd-funded platforms. I'm curious to see what happens where more competition emerges for crowd-sourcing."

    Following the panel and a short break, Alexander Ororbia of Penn State presented "Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities" relating to CiteSeerX, "The scholarly big data platform". The application he described relates to research management, collaboration discovery, citation recommendation, expert search, etc. and uses technologies like a private cloud, HDFS, NoSQL, MapReduce, and crawl scheduling.

    Following Alex, Per Møldrup-Dalum (@perdalum) presented "Bridging the Gap Between Real World Repositories and Scalable Preservation Environments". In his presentation, Per spoke of Hadoop and his work in digitizing 32 million scanned newspaper pages using OCR and ensuring the digitization was valid according to his group's preservation policy. In accomplishing this, he created a "stager" and "loader" as proof-of-concept implementations using the SCAPE APIs. Per wanted to emphasize the reusability of the products he produced, as his work was mostly based on the reuse of other projects.

    After Per, Yinlin Chen described their work on utilizing the ACM Digital Library (DL) data set as the basis for a project on finding good feature representations that minimize the differences between source and target domains of selected documents.

    C. Lee Giles of Penn State came next with his presentation "The Feasibility of Investing in Manual Correction of Metadata for a Large-Scale Digital Library". In this work, he sought to build a classifier through a truth discovery process using metadata from Google Scholar. He found that a "web labeling system" seemed more promising than simple models of crowdsourcing the classification.

    This finished the presentations for the first day of the conference and the poster session followed. In the poster session, I presented my work on developing a Google Chrome extension called Mink (now publicly available) that attempts to integrate the live and archived web viewing experience.

    Day 2

    George Buchanan started the second day by introducing Professor Jane Ohlmeyer of Trinity College, Dublin (@tcddublin) and her work relating to the 1641 Depositions, the records of massacre, atrocity and ethnic cleansing in seventeenth-century Ireland. These testimonies relate to the Irish Rebellion of around the 22nd of October, 1641, in which Catholics robbed, murdered, and pillaged their Protestant neighbors. From what's documented of the conflict, Jane noted that "we only hear one side of the suffering and we don't have reports of how the Catholics suffered, were massacred, etc.", referring to the accounts being mostly collected from a single perspective of the conflict. Jane highlighted one particular account by Anne Butler from the 7th of September, 1642, in which Anne first explained who she was and then described how neighbors with whom she had previously interacted daily in the market threatened her during the conflict solely because she was Protestant. "It's as if the depositions are allowing us to hear the story of those that suffered through fear and conflicts.", Jane said, referencing the accounts. The depositions had originally been donated in 1741 and were held in Trinity College, locked away due to their controversial documentation of the conflict. The accounts consist of over 19000 pages (about 3.5 million words) and include 8000 witness testimonies of events related to the 1641 rebellion. Multiple parties had attempted to publish the accounts in the past (including an attempt in 1930), but publication had been censored by the Irish government because of the accounts' graphic nature. Now that the parties involved in the conflict are at peace, further work is being done by Jane's group to preserve the accounts while dealing with various features of the writing (e.g., multiple spellings, document bleed through, inconsistent data collection patterns) that might otherwise be lost in the process were the documents naively digitized.

    Jane's group has since launched a website (in 2010) to ensure that the documents are accessible to the public, and the site currently has over 17,000 registered users. All of the data they have added is open source. For the launch, they brought together both Mary McAleese and Ian Paisley (who was notoriously anti-Catholic), with Paisley surprisingly saying that he advocated the publication of the documents, as it "promoted learning", and he encouraged that the documents be made accessible in the classroom to 14, 15, and 16 year olds so that society could "remember the past but not bound by the past". Through the digitization process, Jane's group has looked to other more recent (and some currently ongoing) controversial conflicts and how the accounts of those conflicts can be documented and released in a way that is appropriate to the respective affected society.

    Following Jane's keynote (and a coffee break), the conference split into concurrent sessions. I attended the "Browsing and Searching" session, where Edie Rasmussen introduced Dana Mckay (@girlfromthenaki), who began her presentation, "Borrowing rates of neighbouring books as evidence for browsing". In her work, she sought to explore the concept of browsing on the various digital platforms for doing so (e.g., for books on Amazon) vs. the analog of browsing in a library. With library-based browsing, a patron is able to maintain physical context and see other nearby books as shelved by the library. "Browsing is part of the human information seeking process", she said, following with the quote, "The experience of browsing a physical library is enough to dissuade people to use e-books." In her work, she used 6 physical libraries as a sample set and measured how likely physically nearby books were to be borrowed relative to an initially checked out book. In preliminary research, she found that over 50% of the books in her sample set had ever been borrowed and just 12% had been borrowed in the last year. In an attempt to quantify browsing, she first split her set of libraries into two sets consisting of those that used the Dewey Decimal system and those that used the Library of Congress system of organization. She then tested 100 random books, checked if they had been borrowed on day Y, and then checked the physically nearby books to see if they had been borrowed the day before. From her study, Dana found that there is definitely a causal effect on the location of books borrowed and that, especially in libraries, browsing has an effect on usage.

    Javier Lacasta followed Dana with "Improving the visibility of geospatial data on the Web". In the work, his group's objective was to identify, classify, interrelate, and facilitate access to geospatial web services. They wanted to create an automatic process that identified existing geospatial services on the web by using an XML specification. From the service discovery, Javier wanted to extract the content of fields containing the resource's title, description, thematic keywords, content date, spatial coordinates, textual descriptions of the place, and the creator of the service. By doing this, they hoped to harmonize the content for consistency between services. Further, they wanted a means of classifying the services by assigning values from a controlled vocabulary. The study, he said, ought to be applicable to other fields, though his discovery of services was largely limited by the lack of content for these types of services on the web.

    Martyn Harris was next with "The Anatomy of a Search and Mining System for Digital Humanities", where he looked at the barriers to tool adoption in the digital humanities spectrum. He found that documentation and usability evaluation were mostly neglected, so he looked toward "dogfooding" in developing his own tool using context-dependent toolsets. An initial prototype uses a treemap for navigating the Old Testament and considers the probability of querying each document.

    Õnne Mets followed Martyn with "Increasing the visibility of library records via a consortial search engine". The target for the study was the search engine behind the National Library of Estonia, which provides an e-books on-demand (EOD) service as well as a service for digitizing public domain books into e-book form. Their service has been implemented in 37 libraries in 12 countries and provides an "EOD button" that sends the request to the respective library to scan and transfer the images from the physical book. Their service provides a central point for users to discover EOD eligible books and uses OAI-PMH to harvest and batch upload the book files via FTP. Despite the service's interface, Õnne said that 89% of the hits on their search interface came directly to their landing pages via Google. From this, Õnne concluded that collaboration with a consortial search engine does in fact make collections of digitized books more visible, which increases the potential audience.

    The conference then broke for lunch but returned with Daniel Hasan Dalip's presentation of "Quality Assessment of Collaborative Content With Minimal Information". In this work, Daniel investigated how users can better utilize the large amount of information created in web documents using a multi-view approach that indicates a meaningful view of quality. As a use case, he divided a Wikipedia article into different views representing the evidence conveyed in the article. Using Support Vector Regression (SVR), he worked to identify features within the document with a low prediction error. He concluded that using the algorithm allows the feature set of 68 features to be reduced by 15%, 18%, and 25% for three sample data articles on Wikipedia for "MUPPET", "STARWAR", and "WIKIPEDIA", respectively.

    As Daniel's presentation was going on, Justin viewed Adam Jatowt's presentation "A Framework for Analyzing Semantic Change of Words Across Time". In this work, Adam showed that words change meaning over time, using tools to verify words' evolution. He first took 5-grams from The Corpus of Historical American English (COHA) on Google Books and measured both the frequency and the temporal entropy of each 5-gram. He found that if a word is popular in one decade, it's usually popular in the next decade. He also investigated similarity based on context (i.e., the position in a sentence). Through the study he discovered word similarities, as was the case with the word "nice" being synonymous with "pleasant" around the year 1850.
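    As a rough illustration of what a temporal entropy measure can look like, here is a minimal sketch. The talk summary does not give the exact definition Adam used, so this assumes plain Shannon entropy over a term's normalized per-decade counts; the function name and the sample counts are hypothetical.

```python
import math


def temporal_entropy(counts_by_decade):
    """Shannon entropy (in bits) of a term's distribution over decades.

    counts_by_decade maps a decade label to the number of occurrences.
    Low entropy means the term is concentrated in a few decades; high
    entropy means it is spread evenly across time.
    """
    total = sum(counts_by_decade.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts_by_decade.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts: one term used evenly, one used in a single decade.
steady_term = {1850: 10, 1860: 10, 1870: 10, 1880: 10}
bursty_term = {1850: 40, 1860: 0, 1870: 0, 1880: 0}
```

    Under this assumption, a term spread evenly over four decades has an entropy of 2 bits, while a term that only appears in one decade has an entropy of 0.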

    I then joined Justin for Nikolaos Aletras's (@nikaletras) presentation, "Representing Topics Labels for Exploring Digital Libraries". In this work, Nikolaos stated that the problem with online documents is that they have no structure, metadata, or manually created classification system accompanying them, which makes it difficult to explore and find specific information. He created unsupervised topic models that were data-driven and captured the themes discussed within the documents. The documents were then represented as a distribution over various topics. To accomplish this, he developed a topic model pipeline where a set of documents acted as the input, with the output consisting of two matrices: topic-word (probability of each word given a topic) and topic-document (probability of each document given the topic). He then used his trained model to identify as many documents relevant to a set of queries as possible within 3 minutes in a document collection using document models. The data set used was a subset of the Reuters Corpus from Rose et al. 2002. This data set had already been manually classified, so it could be used for model verification. From the data set, 20 subject categories were used to generate a topic model. 84 topics were produced and provided as an alternative means of browsing the documents.

    Han Xu presented next with "Topical Establishment Leveraging Literature Evolution", where he attempted to discover research topics from a collection of papers and to measure how well a given topic is recognized by the community. First, Han's group identified research topics whose recognition can be described as either persistent, withering, or booming. Their approach was inspired by bidirectional mutual enforcement between papers and topical recognition. By using the weight of a topic as a sum of its recognitions in papers, he could compare against PageRank and RALEX (their previous work using random walks) and show that their own approach was more suitable, as it was designed to take literature evolution into account, unlike PageRank.

    Fuminori Kimura was next with "A Method to Support Analysis of Personal Relationship through Place Names Extracted from Documents", a followup study on previous research for extracting personal relationships through place names. In this work, they extracted personal names and place names and counted the co-occurrences between them. Next, they created a feature vector for each person, then calculated the personal relationships and stored the result in a database for further analysis. When a personal name and a place name appear in the same paragraph, they hypothesized, it is an indicator of a relationship between the person and the location. Using cosine similarity and clustering, Fuminori found in initial tests of their work on Japanese historical documents that they could produce a relationship network graph of closely related people backed by their common relationships with locations.
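    The co-occurrence and cosine similarity steps can be sketched as follows. This is a simplified illustration, not the authors' pipeline: it assumes the personal and place names have already been extracted per paragraph, and the function names and sample data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def build_person_vectors(paragraphs):
    """Build one place-name count vector per person.

    paragraphs is an iterable of (person_names, place_names) pairs,
    one pair per paragraph, with names already extracted.
    """
    vectors = defaultdict(Counter)
    for persons, places in paragraphs:
        for person in set(persons):
            for place in places:
                # Person and place co-occur in the same paragraph.
                vectors[person][place] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical extracted names: persons A and B share Kyoto, C does not.
sample_paragraphs = [
    (["A", "B"], ["Kyoto"]),
    (["A"], ["Kyoto", "Edo"]),
    (["B"], ["Kyoto"]),
    (["C"], ["Osaka"]),
]
person_vectors = build_person_vectors(sample_paragraphs)
```

    In this sample, persons A and B both co-occur with Kyoto and so score a high similarity, while A and C share no places and score 0. Pairwise scores like these could then feed a clustering step or a relationship graph.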

    After a short break, the final set of concurrent sessions started, where I attended Christine Borgman's (@scitechprof) presentation of "The Ups and Downs of Knowledge Infrastructures in Science: Implications for Data Management". In this work, she spoke of how countries in Europe, the U.S., and other parts of the world are now requiring scholars to release the data from their studies, and she questioned what sort of digital libraries we should be building for this data. Her work reported on the progress of the Alfred P. Sloan Foundation's study of 4 different scientific processes and how they make and use data. "What kind of new professionals should be prepared for data mining", she asked. She described four different projects in a 2x2 matrix where two had large amounts of data and two were projects that were just ramping up (with each of the four projects holding a unique combination of these traits). The four projects (Center for Embedded Networked Sensing (CENS), Sloan Digital Sky Survey (SDSS), Center for Dark Energy Biosphere Investigations (C-DEBI), and the Large Synoptic Survey Telescope (LSST)) each either had previous methods of storing the data or were proposing ways to handle, store, and filter the large amount of data to come. "You don't just trickle the data out as it comes across the instruments. You must clean, filter, document, and release very specific blocks.", she said of some projects releasing the cleaned data sets while others were planning to release the raw data to the public. "Each data is accompanied by a paper with 250 authors", she said, highlighting that they were greatly used as a basis for much further research.

    Carl Lagoze of the University of Michigan presented next with "CED2AR: The Comprehensive Extensible Data Documentation and Access Repository", which he described as "yet another metadata repository collection system." In a deal between the NSF and the Census Bureau, he worked to make better use of the Census Bureau's huge amount of data. The aim of the further work on the data was to put more emphasis on having scientists make data available on the network and on making the data useful for replicating methods, verifying/validating studies, and taking advantage of the results. A key facet of the census data is that it is highly controlled and confidential, with the latter describing both the content itself and the metadata of the content. Because of this, both identity and provenance were key issues that had to be dealt with in the controlled data study. Regarding the mixing of this confidential data with public data, Carl said, "Taking controlled data spaces and mixing it with uncontrolled data spaces creates a new data problem in data integrity and scientific integrity."

    David Bainbridge presented next with "Big Brother is Watching You - But in a Good Way". He opened with the use case of having seen some text on his screen earlier but being unable to remember its specifics. His group has created a system that records and remembers text that has been displayed on a machine running X Windows (think: Linux) and allows the collected data to be searchable with graphical recall.

    During the presentation, David gave a live demo wherein he visited a website, which was immediately indexed and became searchable as well as showing results from earlier relevant browsing sessions.

    Rachael Kotarski (@RachPK) presented next with "A comparative analysis of the HSS & HEP data submission workflows", where she worked with a UK data archive looking for social science data. She noted that having users register for an ORCID greatly helps with the mining process and takes only five minutes.

    Nikos Houssos (@nhoussos) presented the last paper of the day with "An Open Cultural Digital Content Infrastructure" where he spoke of 70 cultural heritage projects costing about 60 million Euros and how his group has helped associate successful validation with funding cash flows. By building a suite of services for repositories, they have provided a single point of access for these services through aggregation and harvesting. Much of the back-end, he said, is largely automated checking and compliance for safe keeping.

    Nikos closed out the sessions for Day 2. Following the sessions, the conference dinner was held at The Mermaid Function Centre.

    Day Three

    The third day of Digital Libraries was short, but leading off was ODU WS-DL's own Justin Brunelle (@justinfbrunelle) with "Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources". In this paper, we (I was a co-author) investigated the effects of missing resources on an archived web page and how treating every resource as having equal weight does not sufficiently capture a resource's impact. Instead, using measures of size, position, and centrality, Justin developed an algorithm to weight a missing resource's impact (i.e., "Damage") on a web page if it is not captured by the archive. He initially used the example of the web comic XKCD and how a single resource (the main comic) has much more importance for the page's purpose than all other resources on the page. When a stylesheet is missing, the algorithm considers the background color of the page and the concentration of content, with the assumption that if the stylesheet is missing and important, most of the content will be in the left third of the page.
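
    As a rough illustration of why equal weighting understates the loss, here is a toy sketch that contrasts a naive "fraction of resources missing" score with a weighted one. This is not the formula from the paper: the Resource fields, the weight function, and all of the numbers are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    area: float        # assumed: fraction of the viewport covered, 0..1
    centrality: float  # assumed: 1.0 = centre of the page, 0.0 = edge
    missing: bool


def weight(resource):
    """Hypothetical weight: larger and more central resources matter more."""
    return resource.area * (0.5 + 0.5 * resource.centrality)


def weighted_damage(resources):
    """Share of the page's total weight that belongs to missing resources."""
    total = sum(weight(r) for r in resources)
    if total == 0:
        return 0.0
    return sum(weight(r) for r in resources if r.missing) / total


def naive_damage(resources):
    """Equal-weight baseline: fraction of resources that are missing."""
    if not resources:
        return 0.0
    return sum(1 for r in resources if r.missing) / len(resources)


# Invented page: one large central image is missing, small peripheral ones survive.
page = [
    Resource("main-comic.png", area=0.40, centrality=1.0, missing=True),
    Resource("logo.png", area=0.05, centrality=0.2, missing=False),
    Resource("ad.png", area=0.05, centrality=0.2, missing=False),
    Resource("footer-icon.png", area=0.01, centrality=0.0, missing=False),
]

print(f"naive: {naive_damage(page):.2f}, weighted: {weighted_damage(page):.2f}")
```

    With these made-up numbers the naive score says a quarter of the page is damaged, while the weighted score says most of it is, which matches the intuition behind the XKCD example.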

    Hugo Huurdeman (@TimelessFuture) followed Justin with "Finding Pages on the Unarchived Web" by first asking, "Given that we cannot crawl lost web pages, how can we recover the content lost?" Working with data from the National Library of the Netherlands, which consists of about 10 terabytes of data from 2007, they focused on a one-year subset of this data from 2012. From this they extracted the data for processing and sought to answer three research questions:

    1. Can we recover a significant fraction of unarchived pages?
    2. How rich are the representations for the unarchived pages?
    3. Are these representations rich enough to characterize the content?

    Using a measure involving Mean Reciprocal Rank, they averaged the scores of the first correct result of each query, utilizing keywords within the URLs for non-homepages. A second measure, "Success Rate", allowed them to determine that 46.7% of homepages and 46% of non-homepages could have a summary generated even though they were never preserved. Their approach claimed to "reconstruct significant parts of the unarchived web" based on descriptions and link evidence pointing to the unpreserved pages.
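
    For readers unfamiliar with these measures, here is a small sketch of Mean Reciprocal Rank and a success rate at a cutoff k. The MRR definition is the standard one; the exact success-rate definition and cutoff used in the paper are not given above, so the version here (correct result appears in the top k) is an assumption, and the queries are invented.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first correct result, or 0.0 if it never appears."""
    for rank, item in enumerate(ranked, start=1):
        if item == relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_results, correct_result) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


def success_rate(runs, k):
    """Assumed definition: fraction of queries with the correct result in the top k."""
    hits = sum(1 for ranked, rel in runs if rel in ranked[:k])
    return hits / len(runs)


# Invented queries: correct result at rank 1, at rank 2, and not retrieved at all.
runs = [
    (["a", "b", "c"], "a"),
    (["x", "y", "z"], "y"),
    (["p", "q", "r"], "s"),
]

mrr = mean_reciprocal_rank(runs)
print(f"MRR: {mrr:.3f}, success@1: {success_rate(runs, 1):.3f}")
```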

    Nattiya Kanhabua presented last in the session with "What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia", where she investigated the scenario of a computer that forgets intentionally and how that plays into digital preservation. "Forgetting plays a crucial role for human remembering and life," she said. Nattiya spoke of "managed forgetting", i.e., remembering the right information. "Individuals' memories are subject to a fast forgetting process." She referenced various psychological studies to correlate the preservation process with "flashbulb memories". For a case study, they looked at the Wikipedia view logs as a signal for collective memory, as they provide publicly available traffic data over a long span of time. "Looking at page views does not directly reflect how people forget; significant patterns are a good estimate for public remembering," she said. Their approach developed a "remembering score" to rank related past events and identify features (e.g., time, location) as having a high correlation with remembering.

    Following a short final break, the final paper presentations of the conference commenced. I was able to attend the last two presentations of the conference, where C. Lee Giles of Penn State University presented "RefSeer: A Citation Recommendation System", a citation recommendation system based on the content of an entire manuscript query. His work served as an example of how to build a tool on top of other systems through integration. To further facilitate this, the system contains a novel language translation method and is intended to help users write papers better.

    Hamed Alhoori presented the last paper of the conference with "Do Altmetrics Follow the Crowd or Does the Crowd Follow Altmetrics?" where he used bookmarks as metrics. His work found that journal-level altmetrics have significant correlation among themselves compared with the weak correlations within article-level altmetrics. Further, they found that Mendeley and Twitter have the highest usage and coverage of scholarly activities.

    Following Hamed's presentation, George Buchanan provided information on the next year's JCDL 2015 and TPDL 2015 (which would again be split into two locations) and what ODU WS-DL was waiting for: the announcements for best papers. For best student paper, the nominees were:

    • Glauber Dias Gonçalves, Flavio Vinicius Diniz de Figueiredo, Marcos Andre Goncalves and Jussara Marques de Almeida. Characterizing Scholar Popularity: A Case Study in the Computer Science Research Community
    • Daniel Hasan Dalip, Harlley Lima, Marcos Gonçalves, Marco Cristo and Pável Calado. Quality Assessment of Collaborative Content With Minimal Information
    • Justin F. Brunelle, Mat Kelly, Hany Salaheldeen, Michele C. Weigle and Michael Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources

    For best paper, the nominees were:

    • Chuck Cartledge and Michael Nelson. When Should I Make Preservation Copies of Myself?
    • David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, John Wilkerson and Nick Stramp. Detecting and Modeling Local Text Reuse
    • Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar and Arjen P. de Vries. Finding Pages on the Unarchived Web

    The results (above tweet) served as a great finish to a conference with many fantastic papers that we will be exploring in-depth for the next year.

    — Mat