Tuesday, October 1, 2019

2019-10-01: Attending the Machine Learning + Libraries Summit at the Library of Congress

On September 20, 2019, I attended the Machine Learning + Libraries Summit at the Library of Congress. The aim of the meeting was to gather computer and information scientists, engineers, data scientists, and librarians from universities, government agencies, and industry to generate ideas on questions such as how to expand the services of digital libraries, how to establish good collaborations with other groups on machine learning projects, and what factors to consider when designing a sustainable machine learning project, especially in the digital library domain. In the initial solicitation, the focus was cultural heritage, but the discussion went far beyond that.

The meeting featured many interesting lightning talks. Unfortunately, because of the relatively short time allocated, many questions and discussions had to go offline. The organizers also arranged several activities to stimulate brainstorming and teamwork among people from different places. I took notes on the speakers and presentations most relevant to my research.

The summit organizers solicited many other potentially interesting topics, but because there was not enough time, they opened a Google Doc to create a "look book" in which people could post 3 slides highlighting their research and potential contribution to the project. The presentations fell into 3 sections.

Section 1: Existing Projects
* Leen-Kiat Soh, Liz Lorang: University of Nebraska-Lincoln
  They focus on newspaper and manuscript collections. It is an exploratory project in image segmentation and visual content extraction. The project is called Aida.

* Thu Phuong 'Lisa' Nguyen, Hoover Institution Library & Archives, Stanford University
  They are processing a digital collection of scanned documents published in the USA from 1898 to 1945. They are working toward extracting meaningful data through document layout analysis, page-level segmentation, and article segmentation. The text can be arranged in different directions (from left to right or from top to bottom), and some pages mix scripts, e.g., English and Japanese.

* Kurt Luther:  Assistant Professor of Computer Science and (by courtesy) History, Virginia Tech
  Kurt leads a group working on a project called Civil War Photo Sleuth, which combines crowdsourcing and face recognition to identify historical portraits. There are about 4 million portraits today, but only 10-20% are identified. They have developed a crowdsourcing platform with about 10k registered users.

* Ben Lee + Michael Haley Goldman: United States Holocaust Memorial Museum
  Ben and Michael are working on a project that involves 190 million images from WWII. Their goal is to trace missing family members. This dataset is an invaluable resource for Holocaust survivors and their families, as well as for Holocaust researchers. They mostly use random forest models and template matching methods.

* Harish Maringanti: Associate Dean for IT & Digital Library Services; J. Willard Marriott Library at University of Utah

* David Smith: Associate Professor, Computer Science, Northeastern University
  David introduced his work on Networked Texts.

* Helena Sarin: Founder, Neural Bricolage
* Nick Adams: Goodly Labs

Section 2: Partnerships
* Mark Williams: Media Ecology Lab, Dartmouth College
  Mark mentioned an annotation tool called SAT (Semantic Annotation Tool).
* Karen Cariani: WGBH Media Library and Archives
* Josh Hadro + Tom Cramer: IIIF, Stanford Libraries
* Mia Ridge + Adam Farquhar: British Library
* Kate Murray: Library of Congress
* Representatives from the Smithsonian Institution Data Lab, National Museum of American History
  Rebecca from the Smithsonian OCIO Data Science Lab talked about machine learning at the Smithsonian. Some interesting and potentially useful tools include the Google Vision API, ResNet50, and VGG. Their experiments indicate that the Google tool achieves high performance but is not customizable, whereas ResNet and VGG have far lower success rates but can be customized and retrained.

* Jon Dunn + Shawn Averkamp: Indiana University Bloomington, AVP
* Peter Leonard: Yale DH Lab
  Peter talked about their project called PixPlot, which is a web interface to visualize about 30k images from Lund, Sweden. The source code is at https://github.com/YaleDHLab/pix-plot. The website is https://dhlab.yale.edu/projects/pixplot/.

Section 3: Future Possibilities & Challenges
* Michael Lesk: Rutgers University
  Michael talked about a duplicate image detector tool at NMAH, which holds between 1 and 2 TB of images stored on legacy hardware and network directories. The goal is to determine whether there are duplicates and, if so, which copies have higher quality.

* Heather Yager: MIT Libraries
* Audrey Altman: Digital Public Library of America
* Hannah Davis: Independent Research Artist
  Hannah mentioned an interesting tagger: https://www.tagtog.net/
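
To make the duplicate-detection question from Michael's talk more concrete, here is a minimal sketch of one common approach, an average hash compared by Hamming distance. It is not the NMAH tool, whose method was not described in my notes. The function names, the `max_distance` threshold, and the use of pixel count as a stand-in for "quality" are all my own assumptions, and images are represented as plain 2D lists of grayscale values so that the example runs without any imaging library.

```python
def average_hash(pixels, size=8):
    """Reduce a grayscale image (2D list of rows) to size*size bits.

    Each bit says whether one block of the image is brighter than the
    mean of all blocks. Assumes the image is at least size x size.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for i in range(size):
        r0, r1 = i * height // size, (i + 1) * height // size
        for j in range(size):
            c0, c1 = j * width // size, (j + 1) * width // size
            block = [pixels[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    return [1 if value > mean else 0 for value in cells]


def hamming_distance(hash_a, hash_b):
    """Count the bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(hash_a, hash_b))


def compare_images(a, b, max_distance=5):
    """Return (is_duplicate, label of the higher-resolution image or None).

    max_distance is a hypothetical threshold; pixel count is only a rough
    proxy for quality (assumption, not from the talk).
    """
    distance = hamming_distance(average_hash(a), average_hash(b))
    if distance > max_distance:
        return False, None
    pixels_a = len(a) * len(a[0])
    pixels_b = len(b) * len(b[0])
    return True, ("a" if pixels_a >= pixels_b else "b")


# Synthetic test images: a 16x16 gradient, the same picture upscaled to
# 32x32, and an inverted version that should not match.
small = [[r * 16 + c for c in range(16)] for r in range(16)]
large = [[small[r // 2][c // 2] for c in range(32)] for r in range(32)]
inverted = [[255 - v for v in row] for row in small]

print(compare_images(small, large))
print(compare_images(small, inverted))
```

In practice one would load real images with an imaging library and tune the threshold on known duplicates; resolution alone would also not capture compression artifacts or scan quality.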

The summit also arranged open discussions and activities to stimulate the attendees' thinking and conversation. Some of the questions noted were
* How do we communicate machine learning results/outputs to end-users?
* How does one get ML products from the pilot to production?
* Do you know of existing practical guidelines for successful machine learning project agreements?
* How can we overcome the challenges of accessing variable resources across varying contexts, such as infrastructure, licensing, and copyright structure?
* Which criteria would you use to evaluate a service, whether from external providers or for internal/external use?

Another activity asked attendees at different tables to form groups and discuss factors to consider when collaborating on machine learning projects. Some of the points noted include
* Standardizing and documenting data
* Clarity of roles and communication
* User expectations, and regularly shared documents of progress
* Organizational and political factors in getting the project done
* Having the right reasons, the right people, and the right plan, and making sure the project has value

Below are the people I met, both old and new friends.

* Stephen Downie from UIUC. He introduced me to some useful tools in HathiTrust that I can borrow for my ETD project.
* Tom Cramer from Stanford. Tom was leading a team to work on a similar project on mining ETDs. He also introduced the yewno.com website, which they are working with, to transform information in ETDs into knowledge.
* Kurt Luther from Virginia Tech at Arlington. Kurt was doing a historical portrait study.
* Wayne Graham from CLIR.
* Heather Yager from MIT. Heather and I had a brief chat on accessing ETDs from DSpace in MIT libraries.
* David Smith from Northeastern. David is an expert on image processing. He introduced me to hOCR, an OCR output format that is exactly what I was looking for to identify the bounding boxes of text in a scanned document.
* Michael Lesk from Rutgers. A senior but still energetic information scientist. He knew Dr. C. Lee Giles.
* Kate Zwaard, the chief of National Digital Initiatives at the Library of Congress.
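
As a note to myself on hOCR: it is an HTML-based format in which each recognized element carries its bounding box in the `title` attribute as `bbox x0 y0 x1 y1`, and word-level elements usually have the class `ocrx_word`. Below is a minimal sketch, using only the Python standard library, of how word boxes could be pulled out of such a file. The class name `HocrWordParser`, the helper `extract_word_boxes`, and the sample markup are my own illustrations, not part of any tool David showed.

```python
import re
from html.parser import HTMLParser

# hOCR stores coordinates in the title attribute as "bbox x0 y0 x1 y1".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read
        self._text = []     # text pieces of the current word

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested <span> elements.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_word_boxes(hocr_text):
    """Hypothetical helper: parse hOCR text and return the word boxes."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


# Made-up sample in the general shape of hOCR output.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 800 600'>
<span class='ocr_line' title='bbox 10 20 300 50'>
<span class='ocrx_word' title='bbox 10 20 120 50; x_wconf 95'>Machine</span>
<span class='ocrx_word' title='bbox 130 20 300 50; x_wconf 91'>Learning</span>
</span></div>"""

for text, bbox in extract_word_boxes(SAMPLE):
    print(text, bbox)
```

For real documents, a full HTML/XML parser and handling for line- and paragraph-level elements would likely be needed; this only shows where the bounding boxes live.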

Overall, the summit was very successful. The attendees presented real-world problems and discussed very practical questions. The logistics were also good. Eileen Jakeway did an excellent job communicating with people before and after the summit, including a follow-up survey. I thank Dr. Michael Nelson for telling me to register for this meeting.

I made a wise decision to stay overnight before the meeting. The traffic from Springfield to the Library of Congress was terrible, with 3 accidents in the morning. I was lucky to find a parking spot near LOC costing $18 a day. The return trip took 1 hour longer than the map estimate because of construction. But the weather was fine and the people were friendly!

Jian Wu