Thursday, September 25, 2014

2014-09-25: Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages

The Internet Archive (IA) and Open Library offer over 6 million fully accessible public domain eBooks. While casually browsing the scanned book collection, I searched for the term "dictionary" to see how many dictionaries they have, and found several in various languages. From the search results I randomly picked A Dictionary of the English Language (1828) - Samuel Johnson, John Walker, Robert S. Jameson. I opened the dictionary in fullscreen mode using IA's open source online BookReader application. This book reader has the common tools for browsing an image-based book, such as flipping pages, seeking a page, zooming, and changing the layout. Its toolbar also has some interesting features like reading aloud and full-text searching. I wondered how it could possibly search the text of, or read aloud, a book made of scanned raster images. I peeked into the page source code, which pointed me to some documentation pages. I realized it uses an Optical Character Recognition (OCR) engine called ABBYY FineReader to power these features.

I was curious to find out how an early 19th century dictionary defines the term "dictionary". So I gave the "search inside" feature of IA's book reader a try and searched for the term "dictionary" there. It took about 40 seconds to search the 850-page book and returned three results. Unfortunately, they pointed to the title and advertisement pages where the term appeared, but not to the page where it was defined. After this failed attempt, I manually flipped pages in the BookReader back and forth, the way word lookup is performed in printed dictionaries, until I reached the appropriate page. Then I located the term on the page and the definition there was, "A book containing the words of any language in alphabetical order, with explanations of their meaning; a lexicon; a vocabulary; a word-book." I thought I would give the "search inside" feature another try. According to the definition above, a dictionary is a book, hence I chose "book" as the next lookup term. This time the BookReader took about 50 seconds to search and returned 174 places in the book where the term was highlighted. These matches include derived words and definitions or examples of other words where the term "book" appeared. Although the OCR engine did work, the goal of finding the definition of the lookup term was still not achieved.

After experimenting with an English dictionary, I was tempted to give another language a try. When it comes to a non-Latin language, there is no better choice for me than Urdu. Urdu is a Right-to-Left (RTL) complex script language influenced by Arabic and Persian; it shares a lot of vocabulary and grammar rules with Hindi, is spoken by more than 100 million people globally (the majority in Pakistan and India), and happens to be my mother tongue as well. I picked an old dictionary entitled Farhang-e-Asifia (1908) - Sayed Ahmad Dehlavi (four volumes). I searched for several terms one after the other, but every time the response was "No matches were found.", although I had verified their existence in the book. It turns out that ABBYY FineReader claims OCR support for about 190 languages, but it does not support more than 60% of the world's 100 most popular languages, and the recognition accuracy of the supported languages is not reliable.

Dictionaries are condensed collections of the words and definitions of a language and capture the essence of the cultural vocabulary of the era in which they were prepared; hence they have great archival value and are of equal interest to linguists and archivists. Improving the accessibility of preserved scanned dictionaries will make them more useful not only for linguists and archivists, but for general users too. Unlike general literature, dictionaries have some special characteristics: they are sorted to make the lookup of words easy, and lookup in a dictionary is fielded searching as opposed to full-text searching. These special properties can be leveraged when developing an application for accessing scanned dictionaries.

To solve the scanned dictionary exploration and word lookup problem, we chose a crowdsourced manual approach that works well for every language irrespective of how poorly it is supported by OCR engines. In our approach, pages or words of each dictionary are indexed manually so that the appropriate pages can be loaded for a lookup word. Our indexing approach is progressive, so the usefulness and ease of lookup increase as more crowdsourced energy is put into the system, starting from the base case, "Ordered Pages", which is at least as good as IA's current BookReader. In the next stage the dictionary can go into the "Sparse Index" state, in which the first lookup word of each page is indexed; that is sufficient to determine the page where any arbitrary lookup word can be found if it exists in the dictionary. To further improve the accessibility of these dictionaries, an exhaustive "Full Index" is prepared that indexes every single lookup word found in the dictionary with its corresponding pages, as opposed to just the first lookup word of each page. This index is very helpful in certain dictionaries where the sorting of words is not linear. To determine the exact location of the lookup word on the page we have the "Location Index", which highlights the place on the page where the lookup word is located to draw the user's attention there. Apart from indexing, we have introduced an annotation feature where users can link various resources to words on dictionary pages. Users are encouraged to help improve the various indexes and annotations as they use the application. For a more detailed description of our approach, please read our technical report:
Sawood Alam, Fateh ud din B Mehmood, Michael L. Nelson. Improving Accessibility of Archived Raster Dictionaries of Complex Script Languages. Technical Report arXiv:1409.1284, 2014.
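To make the "Sparse Index" stage concrete, here is a minimal sketch of how the first word of each page can be enough to find the page for any lookup word. It is an illustration only, not the Dictionary Explorer implementation: the sample index, the function name `find_page`, and the use of plain string comparison as the sort order are all assumptions. A real dictionary, especially one in a complex script such as Urdu, would need a language-specific collation key.

```python
from bisect import bisect_right

# Hypothetical sparse index: the first headword on each page, in dictionary order.
SPARSE_INDEX = [
    ("aardvark", 1),
    ("banana", 2),
    ("cherry", 3),
    ("dictionary", 4),
    ("eagle", 5),
]


def find_page(sparse_index, word):
    """Return the page on which `word` would appear, if it is in the dictionary.

    The page is the last one whose first headword sorts at or before `word`.
    Returns None when `word` sorts before the first headword of the first page.
    Assumes plain string comparison matches the dictionary's sort order.
    """
    headwords = [headword for headword, _ in sparse_index]
    position = bisect_right(headwords, word)
    if position == 0:
        return None
    return sparse_index[position - 1][1]


if __name__ == "__main__":
    for term in ["book", "dictionary", "zebra"]:
        print(term, "->", find_page(SPARSE_INDEX, term))
```

Because the lookup is a binary search over one entry per page, it stays cheap even for a dictionary with hundreds of pages, and it only tells us where the word would be, not whether it is actually present.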
We have built an online application called "Dictionary Explorer" that utilizes the indexing described above and has an interface suitable for dictionaries. The application serves as an explorer of various dictionaries in various languages and, at the same time, presents various context-aware controls for feedback to contribute to indexes and annotations. In the Dictionary Explorer the user selects a lookup language, which loads a tree-like word index for the selected language in the sidebar and various tabs in the center region; each tab corresponds to one monolingual or multilingual dictionary that has indexes in the selected language. The user can then either directly input the lookup term in the search field or locate it in the sidebar by expanding the corresponding prefixes. Once the lookup is performed, all the tabs are loaded simultaneously with the appropriate pages corresponding to the lookup term in each dictionary. If the location index is available for the lookup word, a pin is placed where the word appears on the page, which allows interaction with the word and its annotations. A special tab accumulates all the related resources such as user contributed definitions, audio, video, images, examples, and resources from third party online dictionaries and services.
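The tree-like word index in the sidebar can be pictured as a prefix tree (trie) built from the indexed words. The sketch below is a hypothetical illustration of that data structure, not the application's actual code; the names `build_prefix_tree`, `words_under`, and the `END` marker are assumptions made for the example.

```python
END = ""  # end-of-word marker; cannot collide with a one-character key


def build_prefix_tree(words):
    """Build a nested-dict prefix tree from an iterable of words."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[END] = True
    return root


def words_under(tree, prefix):
    """List, in sorted order, every indexed word that starts with `prefix`."""
    node = tree
    for char in prefix:
        if char not in node:
            return []
        node = node[char]

    results = []

    def walk(current, so_far):
        if END in current:
            results.append(so_far)
        for char in sorted(key for key in current if key != END):
            walk(current[char], so_far + char)

    walk(node, prefix)
    return results


if __name__ == "__main__":
    tree = build_prefix_tree(["book", "boot", "bone", "dictionary"])
    print(words_under(tree, "bo"))
```

Expanding a prefix in the sidebar corresponds to descending one level in this tree, so the user never has to type in a script for which no keyboard layout is at hand.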

Following are some feature highlights to summarize the Dictionary Explorer application:
  • Support for various indexing stages.
  • Indexes in multiple languages and multiple monolingual and multilingual dictionaries in each language.
  • Bidirectional (right-to-left and left-to-right) language support.
  • Multiple input methods such as keyboard input, on screen keyboard, and word prefix tree.
  • Simultaneous lookup in multiple dictionaries.
  • Pagination and zoom controls.
  • Interactive location marker pins.
  • Context aware user feedback and annotations.
  • Separate tab for related resources such as user contributions, related media, and third-party resources.
  • API for third-party applications.
We have successfully developed a progressive approach to indexing that enables lookup in scanned dictionaries of any language with very little initial effort and improves over time as more people interact with the dictionaries. In the future we want to explore the specific challenges of indexing and interaction in several other languages, such as Mandarin or Japanese, where dictionaries are not sorted in a simple alphabetical order because of their very large character sets. We also want to utilize our current indexes, developed by users over time, to predict pages for lookup terms in dictionaries that are not indexed yet or have only partial indexing. Our intuition is that we can automatically predict the page of an arbitrary dictionary for a lookup term with acceptable variance by aligning the pages of the dictionary with one or more resources such as indexes of other dictionaries in the same language, a corpus of the language, the most popular words in the language, and partial indexes of the dictionary.


Sawood Alam

Thursday, September 18, 2014

2014-09-18: A tale of two questions

(with apologies to Charles Dickens, Robert Frost, and Dr. Seuss)

"It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, ..." (A Tale of Two Cities, by Charles Dickens).

At the end of this part of my journey, it is time to reflect on how I got here and what the future may hold.

Looking back, I am here because of answering two simple questions.  One from a man who is no longer here, one from a man who still poses new and interesting questions.  Along the way, I've formed a few questions of my own.

The first question was posed by my paternal uncle, Bertram Winston.  Uncle Bert was a classic type A personality.  Everything in his life was organized and regimented.  When planning a road trip across the US, he would hand write the daily itinerary: when to leave a specific hotel, how many miles to the next hotel, phone numbers along the way, people to visit in each city, and sites to see.  He would snail-mail a copy of the itinerary to each friend along the way, so they would know when to expect him and Aunt Artie to arrive (and to depart).  He did this all before MapQuest and Google Maps.  He did all of this without a computer, using paper maps and AAA tour books.

[Photo: Uncle Bert and Aunt Artie]

Bert took this attention to detail to the final phase of his life.  As he made preparations for his end, he went through their house and boxed up pictures and mementos for friends and family.  These boxes would arrive unannounced, and were full of treasures.  After receiving, opening, and sharing this detritus with Mary and our son Lane, I thanked Bert for helping to answer some of the questions that had plagued me since I was a child.  During the conversation, he posed the first question to me.  Bert said that he had been through his house many times and still had lots of stuff left that he didn't know what to do with.  He said, "What will I do with the rest?"  I said that I would take it, all of it, and that I would take care of each piece.

I continued to receive boxes until his death.  With each box, Mary, Lane, and I would sit in our living room and I would explain the history behind each memento.  One of these mementos was a picture of Josie McClure.  She became my muse for answering the second question.

[Photo: Josie McClure, my muse.]

The second question was posed by my academic "parent," Michael L. Nelson.  One day in 2007, he stopped me in the Engineering and Computational Sciences Building on the Old Dominion University campus and posed the question, "Are you interested in solving a little programming problem?"  I said "yes," not having any idea about the question, the possible difficulties involved, the level of commitment that would be necessary, or the incredible highs and lows that would torment my soul.  But I did know that I liked the way he thought, his outlook on life, and his willingness to explore new ideas.

[Photo: Dr. Michael L. Nelson, my academic parent.]

The combination of answering two simple questions resulted in a long journey, filled with incredible highs brought on by discovering things that no one else in the world knew or understood, and incredible lows brought on by no one else in the world knowing or understanding what I was doing.  My long and tortuous trail can be found here.

While on this journey, I have accreted a few things that I hope will serve me well.

My own set of questions:

1.  What is the problem??  Sometimes just formulating the question is enough to see the solution, or puts the topic into perspective and makes it non-interesting.  Formulating the problem statement can be an iterative process where constant refining reveals the essence of the problem.

2.  Why is it important??  The world is full of questions.  Some are important, others are less so.  Everyone has the same number of hours per day, so you have to choose which questions are important in order to maximize your return on the time you spend.

3.  What have others done to try and solve the problem??  If the problem is good and worthy, then take a page from Newton and see what others have done about the problem.  It may be that they have solved the problem and you just hadn't been able to spend the time trying to find an existing solution.  If they haven't solved the problem, then you might be able to say (as Newton was wont to say) "If I have seen further it is by standing on the shoulders of giants."

4.  What will I do to solve the problem??  If no one has solved the problem, then how will you attack it??  How will your approach be different or better than everything done  by everyone else??

5.  What did I do to prove I solved the problem??  How to show that your approach really solved the problem??

6.  What is the conclusion??  After you have labored long and hard on a problem, what do you do with the knowledge you have created??

Be an active reader.

Read everything closely to ensure that I understand what the author was (and was not) saying.  Making notes in the margins on what has been written.  Noting the good, the bad, and the ugly.  If it is important enough, track down the author and speak to them about the ideas and thoughts they had written.  Imagine if you will, receiving a call from a total stranger about something that you've published a few years before.  It means that someone has read your stuff, has questions about it, and that it was important enough to talk directly to you.  How would you feel if that happened to you??  I've made those calls and you can almost feel the excitement radiating through the phone.

Understand all the data you collect.

In keeping with Isaac Asimov's view on data: "The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'"  When we conduct experiments, we collect data of some sort, be that memento temporal coverage, public digital longevity, digital usage patterns, or data of all sorts and types.  Then we analyze the data, and try to glean a deeper understanding.  Watch for the outliers; the data that "looks funny" has additional things to say.

Everyone has stories to tell.  

Our stories are the threads of the fabric of our lives.  Revel in stories from other people.  Those stories they choose to share, are an intimate part of what makes them who they are.  Treat their stories with care and reverence, and they will treat yours the same way.

Don't be afraid to go where others have not.  

 During our apprenticeship, all our training and work point us to new and uncharted territories.  To wit:
"Two roads diverged in a wood, and I,
I took the one less traveled by,
And that has made all the difference."
(The Road Not Taken, by Robert Frost)

Remember through it all;

The highs are incredible, the lows will crush your soul, others have survived, and you are not alone.

And in the end,

"be your name Buxbaum or Bixby or Bray
or Mordecai Ali Van Allen O'Shea,
you're off to Great Places!
Today is your day!
Your mountain is waiting.
So...get on your way!"
(Oh, the Places You'll Go!, by Dr. Seuss)

With great fondness and affection,

Chuck Cartledge
The III. A rapscallion.  A husband.  A father.  A USN CAPT.  A PhD.  A simple man.

Thanks to Sawood Alam, Mat Kelly, and Hany SalahEldeen for their comments and review of "my 6 questions."  They were appreciated and incorporated.

2014-09-18: Digital Libraries 2014 (DL2014) Trip Report

Mat Kelly, Justin F. Brunelle and Dr. Michael L. Nelson travel to London, UK to report on the Digital Libraries 2014 Conference.

On September 9th through 11th, 2014, Dr. Nelson (@phonedude_mln), Justin (@justinfbrunelle), and I (@machawk1) attended the Digital Libraries 2014 conference (this year a composite of the JCDL (see trip reports for 2013, 2012, 2011) and TPDL (see trip reports for 2013 and 2012) conferences) in London, England. Prior to the conference, Justin and I attended the DL2014 Doctoral Consortium, which took place on September 8th.

The main conference on September 9th opened with George Buchanan (@GeorgeRBuchanan) indicating that this year's conference was a combination of both TPDL and JCDL from previous years. With the first digital libraries conference being in 1994, this year marked the 20th anniversary of the conference. George celebrated this by singing Happy Birthday to the conference and then introduced Ian Locks, Master of the Worshipful Company of Stationers and Newspaper Makers, to continue the introduction.

Ian first gave a primer on and history of his organization as a "chain gang that dated back 1000 years" that "reduced the level of corruption when people could not read or write". His organization became Britain's first library of deposit, wherein printed works needed to be deposited with them, and it was central to the world's first copyright in 1710.

Ian then gave way to Martin Klein (@mart1nkle1n), who gave insight into the behind-the-scenes dynamics of the conference. He stated that the programming committee had 183 members to allocate reviews for every submission received. The committee's goal was to have four first level reviews. Of the papers received, 38 countries were represented with the largest number coming from the U.S. followed by the U.K. then Germany. The acceptance rate for full papers was 29% while the rate for short papers was 32%. 33 posters and 12 demos were also accepted. Interestingly, the country with the highest acceptance rate that submitted over five papers was Brazil, with over half of their papers accepted.

Martin then segued to introducing the keynote speaker, Professor Dieter Fellner of the Fraunhofer Institute. Dieter's theme consisted mostly of the different means of, and issues in, digitally preserving 3-dimensional objects. He described the digitization as, "A grand opportunity for society, research, and economy but a grand challenge for digital libraries." In reference to object recovery for preservation before or after an act of loss he said, "if we cannot physically preserve an object, having a digital artifact is second best." Dieter then went on to describe the inaccuracies that arise when artifacts are captured under a single lighting condition or under insufficient lighting. To evaluate how well an object is preserved, he spoke of a "Digital Artifact Turing Test": first create photos of a 3D artifact, then make a 3D model. If you can't tell the difference, then the capture can be deemed successful and representative.

Dieter continued with some approaches they have used to achieve better lighting conditions and how varying the lighting conditions has uncovered data that previously was hard to capture accurately. As an example, he showed a piece of driftwood from Easter Island that had an ancient etched message so subtle that the piece could easily have been used unknowingly as firewood. By varying the light conditions when preserving the object, the ancient writing was exposed and preserved for later translation once more is known about the language.

Another example he gave concerned how accurately scans made today of ancient objects can replicate the original color, citing the discolored bust of Nefertiti. Further inspection, scanning under variously colored lighting, produced potentially better results for a capture.

After a short break, the meeting resumed with simultaneous sessions. I attended the "Web archives and memory" session, where WS-DL's Michael Nelson led off with "When Should I Make Preservation Copies of Myself?", a work related to recent WS-DL alumnus Chuck Cartledge's PhD dissertation. In his presentation, Michael spoke of the preservation of objects, particularly of Chuck's now-famous ancestor "Josie", of whom he had a physical photo over a hundred years old with some small bits of metadata hand-written on the back. With respect to modeling the self-preservation of the corresponding digital object of the Josie photo (e.g., a scanned image on Flickr), Michael described how this image should propagate in the model in a way akin to Craig Reynolds' Boids, with the desired behaviors of collision avoidance, velocity matching, and flock centering. The set of duplicated objects in a variety of locations can be described with the "small world" concept and will not create a lattice structure in its propagation scheme.

Chuck's implementation work includes adding a linked image embedded on the web page (using the HTML "link" tag and not the "a" tag) that allows the user to specify that they would like the object preserved. Michael then described the three policies used for duplication to ensure optimal spread of a resource, which included one-at-a-time until a limit is hit, as aggressively as possible until a soft limit then one-at-a-time until a hard limit, or a super aggressive policy of duplication until the hard limit is hit. From Chuck's work, Michael said, "It pays to be aggressive, conservation doesn't work for preservation. What we envisioned", he continued, "was to create objects that live longer than the people that created them."
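The three duplication policies can be compared with a small simulation. This is only a sketch under stated assumptions, not Chuck's implementation: the policy names ("conservative", "moderate", "aggressive"), the function names, and the limit values are hypothetical, and the "limit" in the one-at-a-time policy is assumed here to be the hard limit.

```python
def copies_to_make(policy, current, soft_limit, hard_limit):
    """Return how many new copies to create in one step under a policy.

    Policy names are hypothetical labels for the three behaviors described:
      conservative: one at a time until the (assumed hard) limit
      moderate:     as many as possible up to the soft limit, then one at a time
      aggressive:   as many as possible up to the hard limit
    """
    if soft_limit > hard_limit:
        raise ValueError("soft_limit must not exceed hard_limit")
    if current >= hard_limit:
        return 0
    if policy == "conservative":
        return 1
    if policy == "moderate":
        return soft_limit - current if current < soft_limit else 1
    if policy == "aggressive":
        return hard_limit - current
    raise ValueError(f"unknown policy: {policy}")


def simulate(policy, soft_limit, hard_limit):
    """Return the number of copies held after each step, starting from zero."""
    current = 0
    history = []
    while True:
        step = copies_to_make(policy, current, soft_limit, hard_limit)
        if step == 0:
            break
        current += step
        history.append(current)
    return history


if __name__ == "__main__":
    for name in ("conservative", "moderate", "aggressive"):
        print(name, simulate(name, soft_limit=3, hard_limit=5))
```

In this toy model the aggressive policy reaches the hard limit in a single step, which is one way to read the remark that "it pays to be aggressive"; the real model also has to account for where copies are placed, which this sketch ignores.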

Cathy Marshall (@ccmarshall) followed Michael with "An Argument for Archiving Facebook as a Heterogeneous Personal Store". From her previous studies, she found that users were apathetic about preserving their Facebook contents. "Why we should archive Facebook despite the users not caring about archiving it?", she said. "Evidence has suggested that people are not going to archive their stuff in any kind of consistent way in the long term." Most users in her study could not think of anything to save on Facebook, assuming that if Facebook died, those files would also live somewhere else and could be recovered. Despite this, she attempted to identify what users found most important in their Facebook contents: 50% of the users said that they find the most value in their photos, 35% said they would carry over their contacts if they needed to, and other than that, they did not care about much else.

Michael Day (@bindonlane) followed Cathy with a review of recent work at the British Library. His group has been making attempts to implement preservation concepts from extremely large collections of digital material. They have published the British Library Content Strategy, which attempts to guide their efforts.

When Michael was done presenting, I (Mat Kelly, @machawk1) presented my paper "The Archival Acid Test: Evaluating Archive Performance on Advanced HTML and JavaScript". The purpose of the work was to evaluate different web archiving tools and web sites in a way similar to the Acid Tests originally designed for web standards, but with more clarity, and to test specific facets of the web with which web archiving tools have trouble.

After my presentation came lunch and another set of concurrent sessions. For this session I attended the "Digital Libraries: Evolving from Collections to Communities?" panel with Brian Beaton, Jill Cousins (@JilCos), and Herbert Van de Sompel (@hvdsomp), with Deanna Marcum and Karen Calhoun as moderators.

Jill stated, "Europeana can't be everything to everybody but we can provide the data that everyone reuses." The group spoke about the Europeana 2020 strategy and how it aims to fulfill 3 principles: "share data, improve access conditions to data and create value in what we're doing". Brian asked, "Can laypeople do managed forms of expert work? That question has been answered. The real issue is determining how crowd-sourced projects can remain sustainable", referencing previous discussion on the integration of crowd sourcing to fund preservation studies and efforts. He continued, "I think we're at the moment where a lot of competitors are entering the race [for crowd sourcing platforms]. Lots of non-profits are turning to integration of crowd-funded platforms. I'm curious to see what happens where more competition emerges for crowd-sourcing."

Following the panel and a short break, Alexander Ororbia of Penn State presented "Towards Building a Scholarly Big Data Platform: Challenges, Lessons and Opportunities" relating to CiteSeerX, "The scholarly big data platform". The application he described relates to research management, collaboration discovery, citation recommendation, expert search, etc. and uses technologies like a private cloud, HDFS, NoSQL, MapReduce, and crawl scheduling.

Following Alex, Per Møldrup-Dalum (@perdalum) presented "Bridging the Gap Between Real World Repositories and Scalable Preservation Environments". In his presentation, Per spoke of Hadoop and his work in digitizing 32 million scanned newspaper pages using OCR and ensuring the digitization was valid according to his group's preservation policy. In accomplishing this, he created a "stager" and a "loader" as proof-of-concept implementations using the SCAPE APIs. Per wanted to emphasize the reusability of the products he produced, as his work was mostly based on the reuse of other projects.

After Per, Yinlin Chen described their work on utilizing the ACM Digital Library (DL) data set as the basis for a project on finding good feature representations that minimize the differences between source and target domains of selected documents.

C. Lee Giles of Penn State came next with his presentation "The Feasibility of Investing in Manual Correction of Metadata for a Large-Scale Digital Library". In this work, he sought to build a classifier through a truth discovery process using metadata from Google Scholar. He found that a "web labeling system" seemed more promising than simple models of crowdsourcing the classification.

This finished the presentations for the first day of the conference and the poster session followed. In the poster session, I presented my work on developing a Google Chrome extension called Mink (now publicly available) that attempts to integrate the live and archived web viewing experience.

Day 2

George Buchanan started the second day by introducing Professor Jane Ohlmeyer of Trinity College, Dublin (@tcddublin) and her work relating to the 1641 Depositions, the records of massacre, atrocity and ethnic cleansing in seventeenth-century Ireland. These testimonies relate to the Irish Rebellion around the 22nd of October in 1641, when Catholics robbed, murdered, and pillaged their Protestant neighbors. From what's documented of the conflict, Jane noted that "we only hear one side of the suffering and we don't have reports of how the Catholics suffered, were massacred, etc.", referring to the accounts being mostly collected from a single perspective of the conflict. Jane highlighted one particular account by Anne Butler from the 7th of September, 1642, in which Anne first explained who she was and then described how neighbors with whom she had previously interacted daily in the market threatened her during the conflict solely because she was Protestant. "It's as if the depositions are allowing us to hear the story of those that suffered through fear and conflicts.", Jane said, referencing the accounts. The depositions had originally been donated in 1741 and were held at Trinity College, locked away due to their controversial documentation of the conflict. The accounts consist of over 19000 pages (about 3.5 million words) and include 8000 witness testimonies of events related to the 1641 rebellion. Multiple parties had attempted to publish the accounts in the past (including an attempt in 1930), but publication had been censored by the Irish government because of the accounts' graphic nature. Now that the parties involved in the conflict are at peace, Jane's group is doing further work to preserve the accounts while dealing with various features of the writing (e.g., multiple spellings, document bleed through, inconsistent data collection patterns) that might otherwise be lost were the documents naively digitized.

Jane's group has since launched a website (in 2010) to ensure that the documents are accessible to the public, and it currently has over 17,000 registered users. All of the data they have added is open source. For the launch, they had both Mary McAleese and Ian Paisley (who was notoriously anti-Catholic) together, with Paisley surprisingly saying that he advocated the publication of the documents, as it "promoted learning", and he encouraged that the documents be made accessible in the classroom to 14, 15, and 16 year olds so that society could "remember the past but not bound by the past". Through the digitization process, Jane's group has looked to other more recent (and some currently ongoing) controversial conflicts and how the accounts of those conflicts can be documented and released in a way that is appropriate to the respective affected society.

Following Jane's keynote (and a coffee break), the conference split into concurrent sessions. I attended the "Browsing and Searching" session, where Edie Rasmussen introduced Dana Mckay (@girlfromthenaki), who began her presentation, "Borrowing rates of neighbouring books as evidence for browsing". In her work, she sought to explore the concept of browsing with respect to the various digital platforms for doing so (e.g., for books on Amazon) vs. the analog of browsing in a library. With library-based browsing, a patron is able to maintain physical context and see other nearby books as shelved by the library. "Browsing is part of the human information seeking process", she said, following with the quote, "The experience of browsing a physical library is enough to dissuade people to use e-books." In her work, she used 6 physical libraries as a sample set and checked how frequently books physically near an initially checked-out book had also been borrowed. In preliminary research, she found that, in her sample set, over 50% of the books had ever been borrowed and just 12% had been borrowed in the last year. In an attempt to quantify browsing, she first split her set of libraries into those that used the Dewey Decimal system and those that used the Library of Congress system of organization. She then tested 100 random books, checking whether each had been borrowed on day Y and then whether the physically nearby books had been borrowed the day before. From her study, Dana found that there is definitely a causal effect of the location of books on borrowing and that, especially in libraries, browsing has an effect on usage.

Javier Lacasta followed Dana with "Improving the visibility of geospatial data on the Web". In the work, his group's objective was to identify, classify, interrelate, and facilitate access to geospatial web services. They wanted to create an automatic process that identified existing geospatial services on the web by using an XML specification. From the service discovery, Javier wanted to extract the content of fields containing the resource's title, description, thematic keywords, content date, spatial coordinates, textual descriptions of the place, and the creator of the service. By doing this, they hoped to harmonize the content for consistency between services. Further, they wanted a means of classifying the services by assigning values from a controlled vocabulary. The study, he said, ought to be applicable to other fields, though his discovery of services was largely limited by the lack of content for these types of services on the web.

Martyn Harris was next with "The Anatomy of a Search and Mining System for Digital Humanities" where he looked at the barriers to tool adoption in the digital humanities spectrum. He found that documentation and usability evaluation were mostly neglected, so he looked toward "dogfooding" in developing his own tool using context-dependent toolsets. An initial prototype uses a treemap for navigating the Old Testament and considers the probability of querying each document.

Õnne Mets followed Martyn with "Increasing the visibility of library records via a consortial search engine". The target for the study was the search engine behind the National Library of Estonia, which provides an e-books on-demand (EOD) service as well as a service for digitizing public domain books into e-book form. Their service has been implemented in 37 libraries in 12 countries and provides an "EOD button" that sends the request to the respective library to scan and transfer the images from the physical book. Their service provides a central point for users to discover EOD eligible books and uses OAI-PMH to harvest and batch upload the book files via FTP. Despite the service's interface, Õnne said that 89% of the hits on their search interface came directly to their landing pages via Google. From this, Õnne concluded that collaboration with a consortial search engine does in fact make collections of digitized books more visible, which increases the potential audience.

The conference then broke for lunch but returned with Daniel Hasan Dalip's presentation of "Quality Assessment of Collaborative Content With Minimal Information". In this work, Daniel investigated how users can better utilize the large amount of information created in web documents using a multi-view approach that indicates a meaningful view of quality. As a use case, he divided a Wikipedia article into different views representing the evidence conveyed in the article. Using Support Vector Regression (SVR), he worked to identify features within the document with a low prediction error. He concluded that using the algorithm allows the feature set of 68 features to be reduced by 15%, 18%, and 25% for the three sample data sets "MUPPET", "STARWAR", and "WIKIPEDIA", respectively.

As Daniel's presentation was going on, Justin viewed Adam Jatowt's presentation "A Framework for Analyzing Semantic Change of Words across Time". In this work, Adam showed that words change meaning over time, using tools to verify words' evolution. He first took 5-grams from The Corpus of Historical American English (COHA) on Google Books and measured both the frequency and the temporal entropy of each 5-gram. He found that if a word is popular in one decade, it's usually popular in the next decade. He also investigated similarity based on context (i.e., the position in a sentence). Through the study he discovered word similarities, as was the case with the word "nice" being synonymous with "pleasant" around the year 1850.
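To make the idea of "temporal entropy" more concrete, here is a minimal sketch in Python. It assumes temporal entropy means the Shannon entropy of an n-gram's counts spread across decades; the talk did not spell out the exact formula, so the function name, the input layout, and the sample counts are all hypothetical.

```python
import math


def temporal_entropy(counts_by_decade):
    """Shannon entropy (in bits) of an n-gram's usage across decades.

    counts_by_decade maps a decade label (e.g. 1850) to a raw count.
    This is an assumed definition, not necessarily the one used in the talk.
    """
    total = sum(counts_by_decade.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts_by_decade.values():
        if count > 0:
            p = count / total  # share of all usage that falls in this decade
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical counts: one n-gram used evenly over time, one confined to a decade.
evenly_used = {1850: 10, 1860: 10, 1870: 10, 1880: 10}
short_lived = {1850: 40, 1860: 0, 1870: 0, 1880: 0}
```

Under this definition, an n-gram used evenly across decades scores high, while one concentrated in a single decade scores zero, so the measure separates steady vocabulary from short-lived usage.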

I then joined Justin for Nikolaos Aletras's (@nikaletras) presentation, "Representing Topics Labels for Exploring Digital Libraries". In this work, Nikolaos stated that the problem with online documents is that they have no structure, metadata, or manually created classification system accompanying them, which makes it difficult to explore and find specific information. He created unsupervised topic models that were data-driven and captured the themes discussed within the documents. The documents were then represented as a distribution over various topics. To accomplish this, he developed a topic model pipeline where a set of documents acted as the input with the output consisting of two matrices: topic-word (probability of each word given a topic) and topic-document (probability of each document given the topic). He then used his trained model to identify as many documents relevant to a set of queries as possible within 3 minutes in a document collection using document models. The data set used was a subset of the Reuters Corpus from Rose et al. 2002. This data set had already been manually classified, so it could be used for model verification. From the data set, 20 subject categories were used to generate a topic model. 84 topics were produced and provided as an alternative means of browsing the documents.

Han Xu presented next with "Topical Establishment Leveraging Literature Evolution" where he attempted to discover research topics from a collection of papers and to measure how well a given topic is recognized by the community. First, Han's group identified research topics whose recognition can be described as either persistent, withering, or booming. Their approach was inspired by bidirectional mutual reinforcement between papers and topical recognition. By using the weight of a topic as a sum of its recognitions in papers, he could compare against PageRank and RALEX (their previous work using random walks) and show that their own approach was more suitable, as it was designed to take into account literature evolution, unlike PageRank.

Fuminori Kimura was next with "A Method to Support Analysis of Personal Relationship through Place Names Extracted from Documents", a followup study on previous research for extracting personal relationships through place names. In this work, they extracted personal names and place names and counted the co-occurrence between them. Next, they created a feature vector for each person, then calculated the personal relationships and stored the result in a database for further analysis. When a personal name and a place name appeared in the same paragraph, they hypothesized, it is an indicator of the relationship between the person and the location. Using cosine similarity and clustering, Fuminori found that initial tests of their work on Japanese historical documents could produce a relationship network graph of closely related people backed by their common relationships with locations.
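The co-occurrence and cosine similarity steps described above can be sketched roughly as follows. This is only an illustration under simple assumptions (names are already extracted per paragraph, and raw co-occurrence counts serve as vector weights); the function names and sample data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def build_person_vectors(paragraphs):
    """Build one place-name count vector per person.

    paragraphs is a list of (person_names, place_names) pairs, one pair per
    paragraph. A person and a place co-occur when they share a paragraph.
    """
    vectors = {}
    for persons, places in paragraphs:
        for person in persons:
            vectors.setdefault(person, Counter()).update(places)
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, already-extracted names for three short paragraphs.
paragraphs = [
    (["Person A", "Person B"], ["Kyoto"]),
    (["Person A", "Person B"], ["Kyoto", "Nara"]),
    (["Person C"], ["Osaka"]),
]
vectors = build_person_vectors(paragraphs)
```

People who appear alongside the same places end up with similar vectors, which is what lets a clustering step group them into a relationship network.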

After a short break, the final set of concurrent sessions started, where I attended Christine Borgman's (@scitechprof) presentation of "The Ups and Downs of Knowledge Infrastructures in Science: Implications for Data Management". In this work, she spoke of how countries in Europe, the U.S., and other parts of the world are now requiring scholars to release the data from their studies and questioned what sort of digital libraries we should be building for this data. Her work reported on the progress of the Alfred P. Sloan Foundation's study of 4 different scientific processes and how they make and use data. "What kind of new professionals should be prepared for data mining", she asked. She described four different projects in a 2x2 matrix where two had large amounts of data and two were projects that were just ramping up (with each project of the four holding a unique combination of these traits). The four projects (Center for Embedded Networked Sensing (CENS), Sloan Digital Sky Survey (SDSS), Center for Dark Energy Biosphere Investigations (C-DEBI), and the Large Synoptic Survey Telescope (LSST)) each either had previous methods of storing the data or were proposing ways to handle, store, and filter the large amount of data to come. "You don't just trickle the data out as it comes across the instruments. You must clean, filter, document, and release very specific blocks.", she said of some projects releasing the cleaned data sets while others were planning to release the raw data to the public. "Each data is accompanied by a paper with 250 authors", she said, highlighting that they were greatly used as a basis for much further research.

Carl Lagoze of the University of Michigan presented next with "CED2AR: The Comprehensive Extensible Data Documentation and Access Repository", which he described as "yet another metadata repository collection system." In a deal between the NSF and the Census Bureau, he worked to make better use of the Census Bureau's huge amount of data. The further work on the data aimed to put more emphasis on having scientists make data available on the network and on making the data useful for replicating methods, verifying/validating studies, and taking advantage of the results. Key facets of the census data are that it is highly controlled and confidential, with the latter describing both the content itself as well as the metadata of the content. Because of this, both identity and provenance were key issues that had to be dealt with in the controlled data study. Regarding the mixing of this confidential data with public data, Carl said, "Taking controlled data spaces and mixing it with uncontrolled data spaces creates a new data problem in data integrity and scientific integrity."

David Bainbridge presented next with "Big Brother is Watching You - But in a Good Way" where he initially presented the use case of having had something on his screen earlier for which he could not remember the specifics of some text. His group has created a system that records and remembers text that has been displayed on a machine running XWindows (think: Linux) and allows the collected data to be searchable with graphical recall.

During the presentation, David gave a live demo wherein he visited a website, which was immediately indexed and became searchable as well as showing results from earlier relevant browsing sessions.

Rachael Kotarski (@RachPK) presented next with "A comparative analysis of the HSS & HEP data submission workflows" where she worked with a UK data archive looking for social science data. She noted that having users register for an ORCID greatly helps with the mining process and takes only five minutes.

Nikos Houssos (@nhoussos) presented the last paper of the day with "An Open Cultural Digital Content Infrastructure" where he spoke of 70 cultural heritage projects costing about 60 million Euros and how his group has helped associate successful validation with funding cash flows. By building a suite of services for repositories, they have provided a single point of access for these services through aggregation and harvesting. Much of the back-end, he said, is largely automated checking and compliance for safe keeping.

Nikos closed out the sessions for Day 2. Following the sessions, the conference dinner was held at The Mermaid Function Centre.

Day Three

The third day of Digital Libraries 2014 was short, but leading off was ODU WS-DL's own Justin Brunelle (@justinfbrunelle) with "Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources". In this paper, we (I was a co-author) investigated the effects of missing resources on an archived web page and how the impact of a resource is not sufficiently evaluated on an each-resource-has-equal-weight basis. Instead, using measures of size, position, and centrality, Justin developed an algorithm to weight a missing resource's impact (i.e., "Damage") to a web page if it is not captured in the archive. He initially used the example of the web comic XKCD and how a single resource (the main comic) has much more importance for the page's purpose than all other resources on the page. When a stylesheet is missing, the algorithm considers the background color of the page and the concentration of content, with the assumption that if the stylesheet is missing and important, most of the content will be in the left third of the page.

Hugo Huurdeman (@TimelessFuture) followed Justin with "Finding Pages on the Unarchived Web" by first asking, "Given that we cannot crawl lost web pages, how can we recover the content lost?" Working with the web archive of the National Library of the Netherlands, which consists of about 10 terabytes of data dating from 2007, they focused on a one-year subset of this data from 2012. From this they extracted the data for processing and sought to answer three research questions:

  1. Can we recover a significant fraction of unarchived pages?
  2. How rich are the representations for the unarchived pages?
  3. Are these representations rich enough to characterize the content?

Using a measure involving Mean Reciprocal Rank, they took the average scores of the first correct result of each query while utilizing keywords within the URLs for non-homepages. A second measure of "Success Rate" allowed them to evaluate that 46.7% of homepages and 46% of non-homepages could have a summary generated if never preserved. Their approach claimed to "Reconstruct significant parts of the unarchived web." based on descriptions and link evidence pointing to the unpreserved pages.
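For readers unfamiliar with the first measure, Mean Reciprocal Rank can be sketched in a few lines of Python. This is the textbook definition rather than the authors' evaluation code, and the `success_rate` helper is one common reading of "success rate" (the share of queries whose first correct result falls within the top k), which may differ from the paper's exact definition. The sample ranks are made up.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Average of 1/rank of the first correct result over all queries.

    Each entry is the 1-based rank of the first correct result for a query,
    or None when no correct result was returned (contributing 0).
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)


def success_rate(first_correct_ranks, k):
    """Fraction of queries whose first correct result is within the top k.

    Assumed definition for illustration only.
    """
    if not first_correct_ranks:
        return 0.0
    hits = sum(1 for rank in first_correct_ranks if rank is not None and rank <= k)
    return hits / len(first_correct_ranks)


# Hypothetical ranks for four queries; the third query found nothing correct.
ranks = [1, 2, None, 4]
```

MRR rewards systems that put the right page near the top of the list, while a success rate only asks whether it appears within the cutoff at all.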

Nattiya Kanhabua presented last in the session with "What Triggers Human Remembering of Events? A Large-Scale Analysis of Catalysts for Collective Memory in Wikipedia" where she investigated the scenario of a computer that forgets intentionally and how that plays into digital preservation. "Forgetting plays a crucial role for human remembering and life.", she said. Nattiya spoke of "managed forgetting", i.e., to remember the right information. "Individuals' memories are subject to a fast forgetting process." She referenced various psychological studies to correlate the preservation process with "flashbulb memories". For a case study, they looked at the Wikipedia view logs as signal for collective memory, as they're publicly available traffic over a long span of time. "Looking at page views does not directly reflect how people forget; significant patterns are a good estimate for public remembering.", she said. Their approach developed a "remembering score" to rank related past events and identify features (e.g., time, location) as having a high correlation with remembering.

Following a short final break, the final paper presentations of the conference commenced. I was able to attend the last two presentations of the conference where C. Lee Giles of Penn State University presented "RefSeer: A Citation Recommendation System", a citation recommendation system based on the content of an entire manuscript query. His work served as an example of how to build a tool on top of other systems through integration. To further facilitate this, the system contains a novel language translation method and is intended to help users write papers better.

Hamed Alhoori presented the last paper of the conference with "Do Altmetrics Follow the Crowd or Does the Crowd Follow Altmetrics?" where he used bookmarks as metrics. His work found that journal-level altmetrics have significant correlation among themselves compared with the weak correlations within article-level altmetrics. Further, they found that Mendeley and Twitter have the highest usage and coverage of scholarly activities.

Following Hamed's presentation, George Buchanan provided information on the next year's JCDL 2015 and TPDL 2015 (which would again be split into two locations) and what ODU WS-DL was waiting for: the announcements for best papers. For best student paper, the nominees were:

  • Glauber Dias Gonçalves, Flavio Vinicius Diniz de Figueiredo, Marcos Andre Goncalves and Jussara Marques de Almeida. Characterizing Scholar Popularity: A Case Study in the Computer Science Research Community
  • Daniel Hasan Dalip, Harlley Lima, Marcos Gonçalves, Marco Cristo and Pável Calado. Quality Assessment of Collaborative Content With Minimal Information
  • Justin F. Brunelle, Mat Kelly, Hany Salaheldeen, Michele C. Weigle and Michael Nelson. Not All Mementos Are Created Equal: Measuring The Impact Of Missing Resources

For best paper, the nominees were:

  • Chuck Cartledge and Michael Nelson. When Should I Make Preservation Copies of Myself?
  • David A. Smith, Ryan Cordell, Elizabeth Maddock Dillon, John Wilkerson and Nick Stramp. Detecting and Modeling Local Text Reuse
  • Hugo Huurdeman, Anat Ben-David, Jaap Kamps, Thaer Samar and Arjen P. de Vries. Finding Pages on the Unarchived Web

The results (above tweet) served as a great finish to a conference with many fantastic papers that we will be exploring in-depth for the next year.

— Mat

Wednesday, September 17, 2014

2014-09-17: NEH ODH Project Directors' Meeting

On Monday (Sep 15), Michael and I attended the NEH Office of Digital Humanities Project Directors' Meeting at their new location in the Constitution Center in Washington, DC. We were invited based on our "Archive What I See Now" project being funded as a Digital Humanities Implementation Grant.

There were two main goals of the meeting: 1) provide administrative information and advice to project directors and 2) allow project directors to give a 3 minute overview of their project to the general public.

The morning was devoted to the first goal.  One highlight for me was ODH Director Brett Bobley's welcome in which he talked a bit about the history of the NEH (NEH's 50th anniversary is coming up in 2015).  The agency is currently in the process of digitizing their historical documents, including records of all of the grants that have been awarded (originally stored on McBee Key Sort cards). He also mentioned the recent article "The Rise of the Machines" that describes the history of NEH and digital humanities. Bottom line, digital humanities is not a new thing.

The public afternoon session was kicked off with a welcome from the new NEH Chairman, Bro Adams.

The keynote address was given by Michael Witmore, Director of the Folger Shakespeare Library.  He talked about how adjacency in libraries allows people to easily find books with similar subjects ("virtuous adjacency").  But, if you look deeper into a book and are looking for items similar to a specific part of the book (his example was the use of the word "ape"), then the adjacent books in the stacks probably aren't relevant ("vicious adjacency"). In a physical library, it's not easy to rearrange the stacks, but in a digital library, you can have the "bookshelf rearrange itself". 

His work uses Docuscope to analyze types of words in Shakespeare's plays.  The algorithm classifies words according to type (imperative, dialogue, anger, abstract nouns, ...) and then uses PCA to cluster plays according to these descriptors. One of the things learned through this visual analysis is that Shakespeare used more sentence-starting imperatives than his peers. Another project mentioned was Visualizing English Print, 1530-1799.  The project visualized topics in 1080 texts with 40 texts from each decade. The visualization tool, Serendip, will be presented at IEEE VAST 2014 in Paris (30-second video).

After the keynote, it was time for the lightning rounds.  Each project director was allowed 3 slides and 3 minutes to present an overview of their newly funded work.  There were 33 projects presented, so I'll just mention and give links to a few here.  (2015-07-24 update: links to videos of all lightning talks are available at and

Lightning Round 1 - Special Projects and Start-Up Grants
Lightning Round 2 - Implementation Grants
  • Pop Up Archive, PRX, Inc. - archiving, tagging, transcribing audio
  • Bookworm, Illinois at Urbana-Champaign - uses HathiTrust Corpus and is essentially an open-source version of Google n-gram viewer
The program ended with a panel on how to move projects beyond the start-up phase.

Thanks to the ODH staff (Brett Bobley, Perry Collins, Jason Rhody, Jen Serventi, and Ann Sneesby-Koch) for organizing a great meeting!

For another take on the meeting, see the article "Something Old, Something New" at Inside Higher Ed. Also, the community has some active tweeters, so there's more commentary at #ODH2014.

The lightning presentations were recorded, so I expect to see a set of videos available in the future, as was done with the 2011 meeting.

One great side thing I learned from the trip is that mussels and fries (or, moules-frites) is a traditional Belgian dish (and is quite yummy).

2014-10-01 Edit:

Links to the tools from our Archive What I See Now project (work still in progress, we welcome feedback)

Tuesday, September 16, 2014

2014-09-16: A long and tortuous trail to a PhD

(or how I learned to embrace the new)

I am reaching the end of this part of my professional, academic, and personal life.  It is time to reflect and consider how I got here.

The trail ahead.
When I started, I thought that I knew the path, the direction, and the work that it would take.  I was wrong.  The path was rugged, steep, and covered with roots and stones that lay in wait to trip the unwary.  The direction was not straightforward.  At times I wasn't sure how to set my compass, and which way to steer.  In the end, there was more work than I thought in the beginning.  But the end is nigh.  The path has been long.  At times the direction was confusing.  The work seemed never ending.  This is a story of how I got to the end, using a little help from "a friend" at the end of this post.

Bringing the initially disparate disciplines of graph theory, digital preservation, and emergent behavior together to solve a particular class of problem, is/was non-trivial.  Sometimes you have to believe in a solution before you can see it.

Graph theory is: the study of graphs, the mathematical structures used to model pairwise relations between objects.  In my world, I focused on the application of graph theory as it applied to the creation of graphs that had the small-world properties of a high clustering coefficient and a low average path length.
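The two small-world properties named above can be computed for a small undirected graph with a few lines of Python. This is only an illustrative sketch using the standard definitions (average local clustering coefficient, and mean shortest-path length over reachable pairs); it is not code from the USW simulator, and the function names and sample graph are hypothetical.

```python
from collections import deque


def clustering_coefficient(graph):
    """Average local clustering coefficient of an undirected graph.

    graph maps each node to the set of its neighbours. Nodes with fewer
    than two neighbours contribute zero to the average.
    """
    if not graph:
        return 0.0
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue
        ordered = list(neighbours)
        # Count the edges that exist among this node's neighbours.
        links = sum(
            1
            for i, u in enumerate(ordered)
            for v in ordered[i + 1:]
            if v in graph[u]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total_distance = 0
    pair_count = 0
    for source in graph:
        # Breadth-first search gives shortest hop counts from source.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    queue.append(neighbour)
        total_distance += sum(distance.values())
        pair_count += len(distance) - 1
    return total_distance / pair_count if pair_count else 0.0


# A triangle (fully clustered) and a three-node path (no clustering).
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

A graph is usually called small-world when the first number is high (close to that of a regular lattice) while the second stays low (close to that of a random graph of the same size).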

Digital preservation is: a series of managed activities necessary to ensure continued access to digital materials for as long as they are needed.  In my world, I focused on preserving the "essence" of a web object (WO), not the entire object.  WOs can include links to resources and capabilities that are protected and not visible on the "surface web."  While this web "dark matter" could contain unknown wealth and information,  I was interested in the essence of the WO and preserving that for the long term.

Emergent behavior is: unanticipated behavior shown by a system.  In my world, I took Craig Reynolds' axiom of imbuing objects with a small set of rules, turning them loose, and seeing what happens.  My rules guided the WOs through their explorations of the Unsupervised Small-World (USW) graph, how they made decisions about which other WOs to connect to, and when and where to make preservation copies.

Graph theory, digital preservation, and emergent behavior are brought together in the USW process; the heart of my dissertation.

At the end of a very long climb, there is:

A video of the USW process in action:

My PhD Defense PowerPoint presentation on SlideShare.

 A video of my dissertation defense can be found here.

 My dissertation in two different sized files.
A small (19 MB) version of my dissertation.

A much larger (619 MB) version of my dissertation can be found here.

A simple chronology from the Start in 2007 through the PhD in 2014 (with a little help from my friend).

2007: I started down this trail
The "story" of my dissertation. (My friend.)

2007 - 2013: The Unsupervised Small-World (USW) simulator (on GitHub) directly supported almost all phases of my work.  It went through many iterations from its first inception until its final form.  What started as a simple way to create simple graphs in Python, and passed through a couple of other scripting languages, stabilized as a message-driven, 5K-line C++ program.  The program served as a way to generate USW graphs to test different theories and ideas.  The simulator generated data, while offline R scripts did the heavy-lift analysis.  One of my favorite graphs was a by-product of the simulator (and it didn't have anything to do with USW).

2008: Emergent behavior: a poster entitled "Self-Arranging Preservation Networks."

2009: Emergent behavior and graph theory: a short paper entitled "Unsupervised Creation of Small World Networks for the Preservation of Digital Objects."

2009: Graph theory: Doctoral consortium

2010: Digital preservation: a long paper entitled: "Analysis of Graphs for Digital Preservation Suitability."

2011: Graph theory: an arXiv technical report entitled: "Connectivity Damage to a Graph by the Removal of an Edge or Vertex."

2011: Graph theory: a WS-DL blog article: "Grasshopper, prepare yourself. It is time to speak of graphs and digital libraries and other things."

2012: Digital preservation: a long paper entitled: "When Should I Make Preservation Copies of Myself?"

2013: Digital preservation: a WS-DL blog article: "Preserve Me! (... if you can, using Unsupervised Small-World graphs.)"

2013: The USW robot, my own Marvin, (on GitHub) grew from the lessons learned from the simulator.  Marvin worked with Sawood Alam's HTTP Mailbox application to actually create USW graphs based on data in the USW instrumented Web Pages.

2013 - 2014: Emergent behavior: working with Sawood Alam and his HTTP Mailbox application.  The Mailbox was the communication mechanism used by USW Web Objects.

2014: Digital preservation: an updated long paper entitled: "When Should I Make Preservation Copies of Myself?"

2014: My PhD defense (link to set of slides).

2014: LaTeX: a WS-DL blog article: LaTeX References, and how to control them.

2014: LaTeX: a WS-DL blog article: An ode to the "Margin Police," or how I learned to love LaTeX margins.

2014: Dissertation submitted and accepted by the Office of the Registrar.

In many movies, there is one line that stands out.  One line that resonates.  One line that sums up many things.  The one that comes to my mind was uttered by Sean Connery as William Forrester in the movie "Finding Forrester" when he pointed to the faded photograph on the wall and said: "I'm that one."

The trail and the road were long and trying, with many places where things could have gone awry. But in the end, like Kwai Chang Caine and his brazier, the way out of the temple was shown and the last trial was completed.


Published works (ready for copying and pasting):
  • Sawood Alam, Charles L. Cartledge, and Michael L. Nelson. HTTP Mailbox - Asynchronous RESTful Communication. Technical report, arXiv:1305.1992, Old Dominion University, Computer Science Department, Norfolk, VA, 2013.
  • Sawood Alam, Charles L. Cartledge, and Michael L. Nelson. Support for Various HTTP Methods on the Web. Technical report, arXiv:1405.2330, Old Dominion University, Computer Science Department, Norfolk, VA, 2014.
  • Charles Cartledge. Preserve Me! (... if you can, using Unsupervised Small-World graphs.), 2013.
  • Charles L. Cartledge and Michael L. Nelson. Self-Arranging Preservation Networks. In Proc. of the 8th ACM/IEEE-CS Joint Conf. on Digital Libraries, pages 445 – 445, 2008.
  • Charles L. Cartledge and Michael L. Nelson. Unsupervised Creation of Small World Networks for the Preservation of Digital Objects. In Proc. of the 9th ACM/IEEE-CS Joint Conf. on Digital Libraries, pages 349 – 352, 2009.
  • Charles L. Cartledge and Michael L. Nelson. Analysis of Graphs for Digital Preservation Suitability. In Proc. of the 21st ACM conference on Hypertext and hypermedia, pages 109 – 118. ACM, 2010.
  • Charles L. Cartledge and Michael L. Nelson. Connectivity Damage to a Graph by the Removal of an Edge or Vertex. Technical report, arXiv:1103.3075, Old Dominion University, Computer Science Department, Norfolk, VA, 2011.
  • Charles L. Cartledge and Michael L. Nelson. When Should I Make Preservation Copies of Myself? Tech. Report arXiv:1202.4185, 2012.
  • Charles L. Cartledge and Michael L. Nelson. When Should I Make Preservation Copies of Myself? In Proc. of the 14th ACM/IEEE-CS Joint Conf. on Digital Libraries, page TBD, 2014.

Tuesday, September 9, 2014

2014-09-09: DL2014 Doctoral Consortium

After exploring London on Sunday, I attended the first DL2014 session: the Doctoral Consortium. Held in the College Building at the City University London, the Doctoral Consortium offered early-career Ph.D. students the opportunity to present their research and academic plans and receive feedback from digital libraries professors and researchers.

Edie Rasmussen chaired the Doctoral Consortium. I was a presenter at the Doctoral Consortium in 2012 with Hany SalahEldeen, but I attended this year as a Ph.D. student observer.

Session I: User Interaction was chaired by José Borbinha. Hugo Huurdeman was first to present his work entitled "Adaptive Search Systems for Web archive research". His work focuses on information retrieval and discovery in the archives. He explained the challenge with searching not only across documents but also across time.

Georgina Hibberd presented her work entitled "Metaphors for discovery: how interfaces shape our relationship with library collections." Georgina is working on digitally modeling the physical inputs library users receive when interacting with books and physical library media to allow the same information to be available when interacting with digital representations of the collection. For example, how can we incorporate physical proximity and accidental discovery in the digital systems, or how can we demonstrate frequency of use that would previously be shown in the condition of a book's spine?

Yan Ru Guo presented her work entitled "Designing and Evaluating an Affective Information Literacy Game" in which she proposes serious games to help tertiary students improve their ability to perform searches and information discovery in digital environments.

After a break to re-caffeinate, Session II: Working with Digital Collections began. Dion Goh chaired the session. Vincent Barrallon presented his work entitled "Collaborative Construction of Updatable Digital Critical Editions: A Generic Approach." This work aims to establish an updatable data structure to represent the collaborative flow of annotation, especially with respect to editorial efforts. He proposes using bidirectional graphs, triple graphs, or annotated graphs as representatives, and proposes methods of identifying graph similarity.

Hui Li finished the session with her presentation entitled "Social Network Extraction and Exploration of Historic Correspondences" in which she is working to use Named Entity Extraction to create a social network from digitized historical documents. Her effort utilizes topic modeling and event extraction to construct the network.

Due to a scheduling audible, lunch and Session III: Social Factors overlapped slightly. Ray Larson chaired this session, and Mat Kelly was able to attend after landing in LHR and navigating to our hotel. Maliheh Farrokhnia presented her work entitled "A request-based framework for FRBRoo graphical representation: With emphasis on heterogeneous cultural information needs." Her work takes user interests (through adaptive selection of target information) to present relational graphs of digital library content.

Abdulmumin Isah presented his work entitled "The Adoption and Usage of Digital Library Resources by Academic Staff in Nigerian Universities: A Case Study of University of Ilorin." His work highlights a developing country's use of digital resources in academia and cites factors influencing the success of digital libraries.

João Aguiar Castro presented his work entitled "Multi-domain Research Data Description -- Fostering the participation of researchers in an ontology based data management environment." His work with Dendro uses metadata and ontologies to aid in long-term preservation of research data.

The last hour of the consortium was dedicated to an open mic session chaired by Cathy Marshall with the goal of having the student observers present their current work. I presented first and explained my work that aims to mitigate the impact of JavaScript on archived web pages. Mat went next and discussed his work about integrating public and private web archives with tools like WAIL and WARCreate.

Alexander Ororbia presented his work on using artificial intelligence and deep learning for error correcting crowd sourced data from scholarly texts. Md Arafat Sultan discussed his work on using natural language processing to detect similarity in text to identify text segments that adhered to set standards (e.g. educational standards). Kahyun Choi discussed her work on perceived mood in eastern music from the point of view of western listeners. Finally, Fawaz Alarfaj discussed his work using entity extraction, information retrieval, and natural language processing to identify experts within a specified field.

As usual, the Doctoral Consortium was full of interesting ideas, valuable recommendations, and highly motivated Ph.D. students. Tomorrow marks the official beginning of DL2014.

--Justin F. Brunelle