Thursday, June 20, 2019

2019-06-20: Web Archiving and Digital Libraries Workshop 2019 Trip Report



A subset of JCDL 2019 attendees assembled on June 6, 2019, at the Illini Union for the Web Archiving and Digital Libraries workshop (WADL 2019). As in previous years, this year's workshop was organized by Dr. Martin Klein, Dr. Zhiwu Xie, and Dr. Edward A. Fox. Martin inaugurated the session by welcoming everyone and introducing the schedule for the day. He observed that WADL 2019 had equal representation of women and men, among both attendees and presenters. The Web Science and Digital Libraries Research Group (WS-DL) from Old Dominion University was represented by Dr. Michele C. Weigle, Alexander C. Nwala, and Sawood Alam (me), with two accepted talks.



Cathy Marshall from Texas A&M University presented her keynote talk entitled, "In The Reading Room: what we can learn about web archiving from historical research". She told many fascinating stories and described the process she went through to collect the bits and pieces behind them. Her talk shed light on many problems similar to those we see in web archiving, and it reminded me of her presentation at the IIPC General Assembly 2015 entitled, "Should we archive Facebook? Why the users are wrong and the NSA is right".


Corinna Breitinger from the University of Konstanz (now moved to the University of Wuppertal) presented her team's work entitled, "Securing the integrity of time series data in open science projects using blockchain-based trusted timestamping". She discussed a service called OriginStamp that allows people to create a tamper-proof record of ownership of digital data at the current time by creating a record in a blockchain. She mentioned the Blockchain_Pi project, which allows connecting a Raspberry Pi to a blockchain for timestamping various sensor data. A remarkable achievement of their project was being cited in a German Supreme Court ruling on a dashcam recording; the dashcam was configured to trigger a timestamping call on a short clip whenever something unusual happened on the road.
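(As a side note, the core client-side idea of trusted timestamping is simple enough to sketch: only a cryptographic hash of the data leaves the device, and that hash is what gets anchored in a blockchain. The Python sketch below is illustrative; the filename is hypothetical, and OriginStamp's actual API is not shown.)

```python
import hashlib

def fingerprint(path: str) -> str:
    """Compute the SHA-256 digest of a file (e.g., a dashcam clip)."""
    sha256 = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            sha256.update(chunk)
    return sha256.hexdigest()

digest = fingerprint("dashcam-clip.mp4")  # hypothetical file
# A timestamping client would now submit `digest` to a service such as
# OriginStamp, which anchors it in a blockchain. Later, anyone holding the
# original file can recompute the digest and compare it against the
# blockchain record to show the data existed, unaltered, at that time.
```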


I, Sawood Alam, presented "Impact of HTTP Cookie Violations in Web Archives". This was a summary of two of our blog posts entitled "Cookies Are Why Your Archived Twitter Page Is Not in English" and "Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay in Multiple Languages" in which we performed detailed investigation of two HTTP cookie related issues in web archives. We found that long-lasting cookies in web archives have undesired consequences in both crawling and replay.


Ed Fox from Virginia Tech presented his team's work entitled, "Users, User Roles, and Topics in School Shooting Collections of Tweets". They attempted to identify patterns in user engagement on Twitter regarding school shootings. They also created a tool called TwiRole (source code) that classifies a Twitter handle as "Male", "Female", or "Brand" using multiple techniques.


Ian Milligan from the University of Waterloo presented his talk entitled, "From Archiving to Analysis: Current Trends and Future Developments in Web Archive Use". He emphasized that historians of the future writing the history of the post-1996 era will need to understand the Web. Web archives will play a big role in writing the history of today, so it is important that there are tools beyond the Wayback Machine that historians can use to interact with web archives and understand their holdings. He mentioned the Archives Unleashed Cloud as a step in that direction.


Jasmine Mulliken from Stanford University Press (SUP) presented her talk entitled, "Web Archive as Scholarly Communication". She described various SUP projects and related stories, spending a fair amount of time on the use of Webrecorder at SUP in projects like Enchanting the Desert. She also explained that SUP is in peril and mentioned the Save SUP site, which documents the timeline of recent events threatening the existence of SUP. While talking about this, she played a clip from the finale of Game of Thrones in which the dragon burns the Iron Throne.


Brenda Reyes Ayala from the University of Alberta presented her talk entitled, "Using Image Similarity Metrics to Measure Visual Quality in Web Archives" (slides). Automated quality assurance of archival collections is a topic of interest for many IIPC members. Brenda shared her team's initial findings using image similarities of captures with and without archival banners. She concluded that their results showed significant success in identifying poor- and high-quality captures, but there is a lot more to be done to improve quality assurance automation.
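As a rough illustration of the general approach (not Brenda's exact pipeline, metrics, or thresholds), an off-the-shelf metric such as SSIM can score how visually similar two screenshots of a capture are; the filenames below are hypothetical.

```python
from skimage import io
from skimage.metrics import structural_similarity as ssim

# Hypothetical screenshots of the same capture; both must share dimensions.
reference = io.imread("capture_without_banner.png", as_gray=True)
candidate = io.imread("capture_with_banner.png", as_gray=True)

score = ssim(reference, candidate, data_range=1.0)  # 1.0 means identical
print(f"visual similarity: {score:.3f}")  # low scores flag poor captures
```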


Sergej Wildemann from the L3S Research Center presented his talk entitled, "A Collaborative Named Entity Focused URI Collection to Explore Web Archives". He started his talk by describing how the temporal aspect of named entities is often neglected when indexing the live web. Temporal changes associated with an entity become more important when exploring an archival collection related to that entity. He mentioned the beta prototype of the Wayback Machine that the Internet Archive released in 2016, which provided text search indexed on the anchor text pointing to sites. Towards the end of his talk he showcased his tool Tempurion, which allows archived named entity search with a temporal dimension for filtering search results by date range.


I, Sawood Alam, presented my second talk (and the last talk of the day) entitled, "MementoMap: An Archive Profile Dissemination Framework". This talk was primarily based on our JCDL submission that was nominated for the best paper award, but in the WADL presentation we focused more on technical details, use cases, and possible extensions, instead of experimental results. We also talked about the Universal Key Value Store (UKVS) format with some examples.
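To give a flavor of the format, here are a few illustrative MementoMap-style entries in a UKVS layout: a sort-friendly key (a SURT-form URI prefix, optionally with a wildcard), a value (here, a memento count), and an optional single-line JSON block for extra attributes. The keys, counts, and attributes below are made up for illustration and are not from the paper.

```
com,example)/* 1520
com,example,blog)/2019/* 42 {"updated": "20190606"}
pt,arquivo)/ 7
```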


Once all the formal presentations were over, we discussed post-workshop plans. The first matter was making the proceedings available online or as a special issue of a journal. In previous years (except last year) WADL proceedings were published in the IEEE-TCDL Bulletin, which has since been discontinued. Potential fallback options include: 1) compiling all submissions with an introduction and publishing them as a single document on arXiv, though citing individual work would be an issue, 2) publishing on OSF Preprints, and 3) utilizing GitHub Pages, with the added advantage of providing supplementary material such as slides. To enable more effective communication, a proposal was made to create a mailing list (e.g., using Google Groups) for the WADL community. It was proposed that posters should not be included in the call for papers, because the number of submissions is usually small enough to give a presentation slot to everyone. Fun fact: only Corinna brought a poster this time. We discussed the possibility of more than one WADL event per year, which may or may not be associated with a conference. Since the next JCDL event will be in China, people had some interest in holding an independent WADL workshop in the US. Finally, we discussed the possibility of adding a day ahead of JCDL for a hackathon and a day after for the workshop, where hackathon results can be discussed in addition to the usual talks.

It was indeed a fun week of #JCDL2019 and #WADL2019 where we got to meet many familiar and new faces and explored the spacious campus of the University of Illinois. You may also want to read our detailed trip report of JCDL 2019. We would like to thank the organizers and sponsors of both JCDL and WADL for making it happen. We would like to extend special thanks to Dr. Stephen Downie, without whom this event would not have been as organized and fun as it was. We would also like to thank NSF, AMF, and ODU for funding our travel expenses. Last but not least, I would personally like to thank the "WADL DongleNet", which made it possible for me to connect my laptop to the projector twice.


--
Sawood Alam

Wednesday, June 19, 2019

2019-06-19: Use of Cognitive Memory to Improve the Accessibility of Digital Collections

Eye Tracking Scenario
(source - https://imotions.com/blog/eye-tracking-work/)
Since I joined ODU, I have been working with eye tracking data recorded while participants completed a Working Memory Capacity (WMC) measure, to predict a diagnosis of Attention-Deficit/Hyperactivity Disorder (ADHD). People with ADHD can be restless and hyperactive, with distinct behavioral symptoms such as difficulty paying attention and controlling impulsive behaviors. WM is the cognitive system that makes it possible for humans to hold and manipulate information simultaneously. Greater WMC means a greater ability to use attention to avoid distraction. Theoretically, adults with ADHD have reduced working memory compared with their peers, demonstrating significant differences in WMC.

Among the many tasks used to measure WMC (O-Span, R-Span, N-Back), the reading span task (R-Span) is a valid measure of working memory yielding a WMC score. In R-Span, participants are asked to read a sentence and a letter they see on a computer screen. Sentences are presented in varying sets of 2-5 sentences. Participants are asked to judge sentence coherency by saying 'yes' or 'no' at the end of each sentence. Then, participants are asked to remember the letter printed at the end of the sentence. After a 2-5 sentence set, participants are asked to recall all the letters they can remember from that set. R-Span scores are generated from the number of letters accurately recalled, divided by the total number of possible letters recalled in order. This task represents a person's ability to hold and manipulate information simultaneously.
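In code, one plausible reading of that scoring rule looks like this (a minimal sketch, not the instrument's official scoring script):

```python
def rspan_score(recalled: list[str], presented: list[str]) -> float:
    """Credit a letter only if it is recalled in its presented position."""
    correct = sum(1 for r, p in zip(recalled, presented) if r == p)
    return correct / len(presented)

# A 4-sentence set where the participant recalls 3 of the 4 letters in order:
print(rspan_score(["F", "K", "B", "Q"], ["F", "K", "B", "R"]))  # 0.75
```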

We investigated eye gaze metrics collected during this task to differentiate the performance of adults with and without ADHD. This is important because it reveals differences in eye movement features between typical and complex attention systems. Precise measurements of eye movements during cognitively demanding tasks provide a window into the underlying brain systems affected by ADHD or other learning disabilities.
Fig 1: Comparison of Eye Fixations for ADHD (Left) and Non-ADHD (Right) participant during WMC Task (source - https://www.igi-global.com/chapter/predicting-adhd-using-eye-gaze-metrics-indexing-working-memory-capacity/227272)
We chose standard information retrieval evaluation metrics such as precision, recall, F-measure, and accuracy to evaluate our work. We developed three detailed feature sets based on saccades (rapid changes of gaze) and fixations. Saccades are eye movements used to jump rapidly from one point to another. Fixations are the times when our eyes stop scanning and hold the vision in place to process what is being looked at. Features include the qualifiers: gender, number of fixations, fixation duration measured in milliseconds, average fixation duration in milliseconds, fixation standard deviation in milliseconds, pupil diameter left, pupil diameter right, and diagnosis label or class. The three feature sets, categorized according to metric type, are:
1) fixation feature set
2) saccade feature set
3) saccade and fixation combination feature set

Fig 2: Classification of Eye Saccade Features during WMC (source - https://www.igi-global.com/chapter/predicting-adhd-using-eye-gaze-metrics-indexing-working-memory-capacity/227272)

Fig 3: Classification of Eye Fixation and Saccade Features during WMC (source - https://www.igi-global.com/chapter/predicting-adhd-using-eye-gaze-metrics-indexing-working-memory-capacity/227272)
The purpose of our research is to determine if eye gaze patterns during a WMC task can help us create an objective measuring system to differentiate a diagnosis of ADHD for adults. We identified the six top-performing classifiers for each of the three feature sets: J48, LMT, RandomForest, REPTree, KStar, and Bagging. While fixation features, saccade features, and a combination of saccade and fixation features each predicted the classification of ADHD with an accuracy greater than 78%, saccade features were the best predictors, with an accuracy of 91%.
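(For readers who want to experiment, here is a minimal sketch of this kind of classification setup. We used Weka implementations of the classifiers named above; the sketch below substitutes scikit-learn's RandomForestClassifier as a stand-in, and the feature rows are hypothetical values shaped like the fixation qualifiers listed earlier.)

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Rows: participants. Columns (hypothetical values): number of fixations,
# avg fixation duration (ms), fixation std dev (ms), pupil diameter left,
# pupil diameter right.
X = np.array([
    [312, 245.0, 80.1, 3.4, 3.5],
    [280, 230.5, 75.3, 3.6, 3.6],
    [405, 290.2, 95.8, 3.2, 3.3],
    [390, 300.7, 99.4, 3.1, 3.2],
])
y = np.array([0, 0, 1, 1])  # diagnosis label: 1 = ADHD, 0 = non-ADHD

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=2)  # cross-validated accuracy
print(scores.mean())
```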
We published our work as an IGI Global book chapter:
Anne M. P. Michalek, *Gavindya Jayawardena, and Sampath Jayarathna. "Predicting ADHD Using Eye Gaze Metrics Indexing Working Memory Capacity", Computational Models for Biomedical Reasoning and Problem Solving, IGI Global, pp. 66-88. 2019
An extended version of the paper is published on arXiv and elaborates on the use of areas of interest (AOIs) in ADHD diagnosis with eye tracking measures.
Use of Working Memory Capacity in the Wild...
Research shows that learning disabilities may be present from birth or may develop later in life due to dementia or injuries. Regardless of declining cognitive abilities, people are interested in learning new things. For instance, older adults love to read books and learn new things after retirement to make use of their free time. However, physical disabilities and declining cognitive abilities might restrict people from accessing library materials. The Library of Congress Digital Collections is an excellent place for people to do their research, as all they need is a computer and an internet connection. Therefore, it is essential to make these public digital collections accessible.

Fig 4: The Library Of Congress Digital Collection Home Page.
Most of the time, web developers focus on regular users and tend to forget to cater to all types of users. Digital collections require careful consideration of the web UI to make them accessible, and we believe, based on our eye tracking research on WMC, that we can help content creators such as the Library of Congress Digital Collections achieve that.
In Dr. Jayarathna's HCI course, I learned to understand people, to be mindful of different perspectives, and to design for clarity and consistency. But as you can see, the application of these rules may differ with requirements. Since we predicted a diagnosis of ADHD with an accuracy greater than 78% using eye gaze data, there is potential to identify people with and without declined cognitive abilities. This would allow us to dynamically determine the complexity of a user's attentional system (whether typical or complex) and provide variations of the UI (similar to how a language translator works: a click of a button changes the UI to an accessible variant).

In the Future...
Consider an example scenario in which a person with ADHD views the content of the Library of Congress Digital Collections. With a click of a button, the web UI can change the presentation of the content. If the person has ADHD or some other learning disability, the content could be arranged in a different layout which allows the user to interact with it differently.
Expanding on our results, our goal is to explore how we can generalize our study to improve content accessibility for people with learning disabilities without overloading their cognitive memory. We plan to use the Library of Congress or other similar online platforms to start our exploration.
There is a real opportunity for us to help content creators of digital collections make these collections accessible for people regardless of their cognitive abilities.

-- Gavindya Jayawardena  

Tuesday, June 18, 2019

2019-06-18: It is time to go back home!

On May 11, 2019 I officially obtained my PhD in Computer Science from Old Dominion University. My graduate studies journey started when I received a full scholarship from the University of Hail in Saudi Arabia, where I had worked for two years as a teaching assistant. I came to the USA, specifically to San Francisco, in 2010 with my husband and my three-month-old daughter. I attended Kaplan Institute, where I took English classes and a GRE course for almost a year. After that I was accepted to ODU as a CS Master's student in 2011. In July 2013 I welcomed my second baby girl, Jenna, and in August I graduated from the Master's program and joined the PhD program to work with the Web Science and Digital Libraries (WS-DL) research group.
On April 4, 2019, I defended my dissertation research, “Expanding the usage of web archives by recommending archived webpages using only the URI” (slides, video).

The goal of my work was to build a model for selecting and ranking possible recommended webpages at a Web archive. This enhances both the archive's HTTP 404 and HTTP 200 responses by surfacing webpages in the archive that the user may not know existed. An example is when a user requests a Virginia Tech football webpage from the archive. The user knows about the popular Virginia Tech football webpage http://hokiesports.com/football/ and will request that webpage. This webpage is currently on the live Web and archived. However, the user does not know that the webpage http://techsideline.com exists in the archive; since 2013, requesting it from the live Web redirects to https://virginiatech.sportswar.com. If the user did not have a link to that webpage on the live Web, the user would never know it existed.

To accomplish this, we first detect the semantics in the requested Uniform Resource Identifier (URI). Next, we classify the URI using an ontology, such as DMOZ or any website directory. Finally, we filter and rank candidates based on several features, such as archival quality, webpage popularity, temporal similarity, and URI similarity. Archival quality refers to measuring memento damage by evaluating the number and impact of missing resources in a webpage. Webpage popularity considers how often the webpage has been archived and its popularity on the live Web. A special case of popularity is webpages in “cold spots”, which are pages that are not on the live Web and are not currently popular, but are archived. Temporal similarity refers to how close the candidate webpage’s Memento-Datetime is to the requested URI. URI similarity assesses the similarity of candidate URI tokens to the requested URI tokens. We tested the model using human evaluation to determine if we could classify and find recommendations for a sample of requests from the Internet Archive’s Wayback Machine access log. Overall, when selecting the full categorization, reviewers agreed with 80.3% of the recommendations, which is much higher than “do not agree” and “I do not know”. This indicates that reviewers are more likely to agree with the recommendations when the full categorization is selected. When selecting the first level only, reviewers agreed with only 25.5% of the recommendations. This indicates that deep-level categorization improves the performance of finding relevant recommendations.
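To make the first step concrete, here is a minimal sketch of URI tokenization (an illustration of the idea, not my dissertation's exact implementation):

```python
import re
from urllib.parse import urlparse

def tokenize_uri(uri: str) -> list[str]:
    """Split a URI's host and path into lowercase word tokens."""
    parsed = urlparse(uri)
    text = f"{parsed.netloc} {parsed.path}"
    tokens = re.split(r"[^a-z0-9]+", text.lower())
    stop = {"www", "com", "org", "net", "html", "index", ""}
    return [t for t in tokens if t not in stop]

print(tokenize_uri("http://hokiesports.com/football/"))
# ['hokiesports', 'football'] -- a fuller implementation would also
# segment compound domain tokens (e.g., "hokiesports" -> "hokie",
# "sports") before classifying them against ontology categories.
```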
My life as a graduate student, and especially as a PhD student, was not an easy one. Trying to juggle family responsibilities with academic work is a hard task, and it took me some time to figure out a way to balance and handle both. There are some lessons learned that I think could be helpful to other graduate students. First, working on a research concentration that interests you, with an advisor who is committed and productive, is a key to success. It may take time to find the exact topic you are going to work on, but the right guidance from your advisor, a lot of reading of other people’s research, and some experiments along the way will help you get there. Second, working with a group that is energized will keep you motivated. It is important to have meetings with the other group members and talk about what was accomplished and what the future work is. Not only does this keep you energized, but it may also lead to research contributions. Third, try to find a balance between your personal life and academic life. It is not easy to have kids and do graduate studies; having great family and friends support is important. Finally, being a graduate student requires patience and hard work. You need to be self-motivated during this journey and believe in yourself.
After 9 long, hard, and beautiful years as a graduate student in the US, I will be heading home in June to where it all started, the University of Hail, to work as an assistant professor in the College of Computer Science and Engineering.

-Lulwah M. Alkwai

Sunday, June 9, 2019

2019-06-09: How to Become a Tenure-Track Assistant Professor - Part I (publications, research, teaching and service)

This is a three-part write-up. In this first post, I’ll talk about what you need to do in the next 2 to 3 years to prepare yourself for a tenure-track assistant professor job. I’ll do another blog post about how to find tenure-track positions, how to shortlist your target schools, and the CV, teaching statement, research statement, and cover letters. I’ll do a third blog post later about interview prep (Skype/phone, onsite), what to do and not to do during your on-campus interview, offer negotiations, the two-body problem, etc.
If you are considering a tenure-track job, start working on your teaching, research topics, and publications as early as possible (2-3 years before your intended job search). Some of these need careful planning: to accumulate enough publications, you need to start publishing as early as possible in your doctoral research, and as often as you can each year. If possible, do real teaching, not just a teaching assistantship. Let me get into each of these and explain how I went about them during my tenure-track job search.
Publications:
There is no hard and fast rule that says if you publish this many papers you are guaranteed a tenure-track job, so you need to figure this out. When I went on the job market in 2016, I had about 9 conference papers and 2 journal publications. Take a look at my CV: the publications are all over the place, some in not-so-good venues and a few in top venues (CIKM, JCDL, and journals like IEEE TBE and ACM TAP). Some of them are from my MS degree from another school (in a different research topic). My recommendation: publish early, collaborate, and find multiple research topics (I’ll get back to this when I talk about research topics below). As you can see from my not-so-stellar publication track record, I didn’t do all this, but I tried to publish as early in my PhD as I could, which is easier said than done. I was the only PhD student in my research group on this particular topic, and I was the last generation of my adviser’s PhD students working on this particular project. I’d suggest you ask your adviser for multiple projects, or to pair you with other students in the group. Also talk to other faculty in the department, especially new assistant professors; they always have new topics and need people to jump-start their own publication track record for the tenure process. You can also start your own side project from something interesting, but always think about who is going to pay for your publications (conferences are expensive!). Talk to others from industry, research labs, and universities whom you meet at conferences and meetings. Things don’t always work out as you think; in particular, people rarely stay true to their word about collaborating (when you meet them at conferences and workshops) if you are just a PhD student. So I’d stick with internal collaborators, like faculty from other departments.
If you are serious about getting a good tenure-track job in a PhD-granting CS department, I’d say you need about 15-20 solid publications (a combination of 2-page poster papers, short and full papers, and journals), better if they are from good venues. Again, not a rule of thumb, just my personal opinion from going through the tenure-track job market multiple times and successfully getting offers each time. Another tip: if you have a list of universities/departments you are interested in applying to, take a look at the most recently recruited assistant professors and their track records. You can compare the publications, where they are coming from (tenure-track somewhere else, postdoc, or direct ABD), and the place of their PhDs, and if you are lucky they might have a CV posted so you can take a peek and see what else they did (service, teaching) to get in. Again, tenure-track hiring dynamics change with every search committee and also depend on the area the department wants to hire in. So don’t dwell much on these, but you can get a good sense of the hiring process and what sort of people the department hires. If your target is primarily teaching institutions (non-PhD granting institutions, liberal arts schools, 4-year colleges, community colleges), you should be fine with maybe fewer than 10 publications, but you never know.
Teaching Experience:
You need solid teaching experience to get a tenure-track offer, or at least an interview. Your teaching assistant (TA) experience is probably not going to cut it. In my first 5 years of the PhD, I was pretty much a research assistant (RA) doing research-related work. I was lucky enough to get into a teaching fellowship during my final year to do real teaching. I taught for about 2 semesters as the instructor of record, and I’m sure my first job offers (teaching-oriented schools) and all the interviews I got were pretty much because of my solid experience teaching undergraduates. The second time around, I had enough teaching experience as an assistant professor at a primarily teaching school. I recommend all aspiring tenure-track applicants find solid teaching experiences.
Here’s few tips, talk to your department chair (or the faculty responsible for handling teaching assignments), they may need someone to take over lower level programming courses to each. Also talk to other schools around your area, like community colleges, technical schools and state schools, they always hire adjunct faculty to teach. If you are an international student, only option is to find a teaching position on-campus. If none of the above works, talk to your adviser, see whether you can pair teach. During my fellowship, I pair-taught with my adviser and also with another professor who was also a member of my dissertation committee, this I believe helped immensely to get a solid letter of recommendations. If this doesn’t work, may be ask your adviser whether you can deliver couple of invited lectures, I did this couple of times covering a class session when my adviser was on travel for a conference. You can ask other faculty in the department from any related areas too, who doesn’t love to skip a teaching duty once in a while!
Also don’t forget to collect some informal feedback from your classes. Take a look at my teaching portfolio, I regularly take feedback and keep track of what students say about my teaching in class. Especially this is important at the time when you write your teaching statement, you can talk about your formal evaluations (if you get any), and if not, some of the informal feedback you got. Also this is a good practice for your tenure-package, you can talk about what student say and quote from these informal feedbacks to write how you go about improving your classroom experiences.
Research Experience:
You need multiple research directions to pursue when you start as a tenure-track assistant professor. Remember, your dissertation topic is not going to be enough; you need topics that can span several years and are interesting enough to get publications, grant money, and students. Talk to your adviser or other faculty and collaborate. I was fortunate enough that my adviser got me into a secondary topic area that helped me use my experience in areas beyond my PhD topic. This is important when you write your research statement: having multiple topics will show the search committee that you have a well-planned direction for your future research agenda.

Letter of recommendation writers: You need at least 3 solid letters of recommendation. Obviously your PhD adviser will be one of them, and having a good relationship with him/her is imperative. Make sure you are making good progress, and this should be exemplified by your work as an independent researcher. Having multiple topics in different areas and collaborating with other faculty should give you those solid writers; you need people who can vouch for your collegiality, dedication, and expertise. Remember, faculty life is sometimes more about how you work with your colleagues, so if you get an on-campus interview, this is something the faculty who have one-on-ones with you will probably look for (more about this later in my on-campus interview blog post).
Service:
This won’t be something as highly important as other points that I discussed earlier like publications, research and teaching, but having a solid service experiences work wonders. Service is part of the faculty life, so getting to do stuff other than just research and teaching help build your portfolio. Early you learn to juggle multiple things, will help you survive the busy faculty life of managing multiple roles. These opportunities won’t come to you, so reach out and talk to others and ask around. In early days of my PhD, I reached out to student communities like University Graduate Associations (part of student government) for possible representations. One time I held multiple university-wide positions representing the department, college and the university. Again, these are time consuming tasks, I advise you to participate for these early during your PhD career so you can focus on other important aspects (publications and research) later in your PhD life. I’d suggest you to focus more on things related to student advising (most of these committees require some student participants, see what is available on your university), these will give you good points to talk about in your teaching statement like how the life on the other side as a faculty. Also reach out and see if you have any opportunities to help the department search committee in some capacity, you get to see your potential competitors in the job market. Bonus, you get to see some of the wining candidates.
Another service activity you need to start early is reviewing for conferences/journals and organizing committee activities for conferences and workshops. Do student volunteering early in your PhD career, get to know the organizing committee members for future years, and talk to them about volunteering for organizing committee activities. If you are from a CS background, https://chisv.org/ lists a bunch of conferences where you can volunteer your time. Most of these conferences offer a free conference registration, food, and free goodies/t-shirts, and some will even give you travel grants. Also look out for ACM-W, TAPIA, and Grace Hopper travel scholarships for women in computing. If you don’t get to do student volunteering, talk to your adviser or other faculty; I got my first organizing committee volunteering opportunity by asking around. Jump into any vacant position, help out, and make a name for yourself. You don’t always need to travel to conferences if you become part of the organizing committee; some positions require you to get things done before the conference starts, like publication chair (responsible for setting up the conference publishing activities) and publicity chair (Twitter, and posting CFPs to mailing lists). Most of these positions also come with co-chairing opportunities, so you are not the only one responsible for the activity; you get to share the work with several others. Also ask your colleagues, advisers, and other faculty to recommend you to a program committee; you’ll get a chance to help the conference by reviewing papers. You can also ask your adviser for reviewing opportunities as a sub-reviewer. Go to journal websites; most of them have a place for you to create a reviewer profile, and you’ll get requests when papers relevant to your area of expertise are submitted. Again, these are time-consuming tasks; you cannot build your profile overnight, and they need careful planning and an early start in your PhD life.
Other Volunteering:
Volunteer for any opportunity that comes your way: departmental, college, university, or external. There may be opportunities to be a judge at a high school or undergraduate poster competition, to volunteer to teach something or give a lecture at a local public school, to organize an educational event, or to present a poster. Find things; all of these are CV points that bring substance to your portfolio.
Internships:
From the get-go, apply for internships; talk to your faculty, friends, and collaborators and see if they can refer you to a company summer position. Don’t be afraid to shoot high: apply to top companies like Google and Microsoft, and also look out for positions in places like research labs (Department of Energy, DoD, etc.). The tenure-track market is very competitive; there are hundreds of people applying for a handful of vacant positions. Keep your options open by having some industry experience.
Student Advising:
See whether your adviser will let you co-advise a few undergraduate and other graduate students in the group. You can help them with their day-to-day research work, show them around during their initial years, and help them acclimate to academic life. Volunteer for departmental and college-wide opportunities to mentor students. These will come in handy when you write your teaching statement.

--Sampath Jayarathna (@openmaze)

Thursday, June 6, 2019

2019-06-05: Joint Conference on Digital Libraries (JCDL) 2019 Trip Report

Alma Mater, a bronze statue at the University of Illinois by sculptor Lorado Taft. Photo by Illinois Library, used under CC BY 2.0 / Cropped from original
It's June, so this means it's time for the 19th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2019). This year's JCDL was held at the University of Illinois at Urbana-Champaign (UIUC), June 2-6. Similar to last year's conference, we (members of WSDL) attended paper sessions, workshops (Web Archiving and Digital Libraries), tutorials, and panels, in which researchers from multiple disciplines presented the findings or progress of their respective research efforts. Unlike previous years, we did not feature any students or faculty in this year's JCDL doctoral consortium. We regret this and hope to resume next year.


Day 1



Following a welcome statement by Dr. Stephen Downie, Professor and Associate Dean for Research at the School of Information Sciences at UIUC, Day 1 began with a keynote from Dr. Patricia Hswe (pronounced "sway"), the program officer for Scholarly Communications at The Andrew W. Mellon Foundation. The title of her keynote was: Innovation is Dead! Long Live Innovation!
Her keynote proposed rethinking the purpose of innovation in the Digital Libraries domain: what is being built does not have to be entirely new, and innovation should include adaptation, reuse, recovery, etc., instead of rushing to build the "Next New Shiny Thing."

Three parallel paper sessions followed the keynote after a break:
  1. Generation and Linking
  2. Analysis and Curation, and 
  3. Search Logs

Generation and Linking Session


Pablo Figueira began this paper session with a full paper presentation titled: Automatic Generation of Initial Reading Lists: Requirements and Solutions. They proposed an automatic method for generating reading lists of scientific articles to help researchers familiarize themselves with existing literature, presenting four existing requirements and one novel requirement for generating reading lists.
Next, Lucy McKenna, a PhD student at Trinity College Dublin, presented a full paper titled: NAISC: An Authoritative Linked Data Interlinking Approach for the Library Domain. They showed that Information Professionals such as librarians, archivists, and cataloguers have difficulty in creating five-star Linked Data. Consequently, they proposed NAISC, an approach for assisting Information Professionals in the Linked Data creation process.
Next, Rohit Sharma presented a short paper titled: BioGen: Automated Biography Generation. They proposed BioGen, a system that automatically creates biographies of people by generating short sets of biographical sentences related to multiple life events. They also showed their system produced biographies similar to those manually created on Wikipedia.
The Generation and Linking session ended with a short paper presentation by Tinghui Duan, a PhD student at the University of Jena, titled: Corpus Assembly as Text Data Integration from Digital Libraries and the Web. Their work proposes a method of building Digital Humanities corpora by searching for and extracting fragments of high-quality digitized versions of artifacts from the Web.

Analysis and Curation Session


Dr. Antoine Doucet, professor of Computer Science at the University of La Rochelle, France, began the session by presenting their full paper: Deep Analysis of OCR Errors for Effective Post-OCR Processing. They presented the results of a study of five general types of Optical Character Recognition (OCR) errors: misspellings (real-word and non-word errors), edit operations, length effects, character position errors, and word boundaries. Subsequently, they recommended different approaches to design and implement effective OCR post-processing systems.
Next,  Colin Post, a doctoral candidate in the Information and Library Science program at the University of North Carolina, Chapel Hill, presented a full paper (best paper nominee) titled: Digital curation at work: Modeling workflows for digital archival materials. This research provides insight about digital curation in practice by studying and comparing the digital curation workflows of 12 cultural heritage institutions, and focusing on the use of open-source software in their workflows.
Next was a presentation from Julianna Pakstis, Metadata Librarian at the Department of Biomedical and Health Informatics (DBHi) at the Children's Hospital of Philadelphia (CHOP), and Christiana Dobrzynski, Digital Archivist at DBHi. Their short paper presentation was titled: Advancing Reproducibility Through Shared Data: Bridging Archival and Library Practice. This research highlights the work of a team of librarians and archivists at CHOP who implemented Arcus, an initiative of the CHOP Research Institute with the purpose of making the biomedical research data archive and discovery catalog more broadly available within their institution.
The session concluded with Ana Lucic's short paper presentation titled: Unsupervised Clustering with Smoothing for Detecting Paratext Boundaries in Scanned Documents. This research addresses the problem of separating the main text of a work from its surrounding paratext, a task common to the processing of large collections of scanned text in the Digital Humanities domain. The paratext often needs to be removed to avoid distorting word counts, locating references incorrectly, etc. They proposed a method for detecting paratext based on a smoothed unsupervised clustering technique, and showed that their method improved subsequent text processing after removal of the paratext.

Search Logs Session


This session began with the first of three full paper presentations (a best paper nominee) from Behrooz Mansouri, a Computer Science PhD student at the Rochester Institute of Technology, titled: Toward math-enabled digital libraries: Characterizing searches for mathematical concepts. The work explores what queries people use to search for mathematical concepts (e.g., "Taylor series") by studying a dataset of 392,586 queries from a two-year query log. Their results show that math search sessions are typically longer and less successful than general search sessions, and their queries are more diverse. They claim these findings could aid in the design of search engines for processing mathematical notation.
Next, Maram Barifah presented a full paper titled: Exploring Usage Patterns of a Large-scale Digital Library, in which they proposed a framework for assisting librarians and webmasters in exploring the usage patterns of Digital Libraries.
Finally, Yasunobu Sumikawa presented the final full paper of the session, titled: Large Scale Analysis of Semantic and Temporal Aspects in Cultural Heritage Collection's Search. In this presentation they reported the results of a study of a 15-month snapshot of query logs of the online portal of the National Library of France to understand the interests of users and how users find cultural heritage content.

Classification, Discovery and Recommendation Sessions


Following a lunch break, Abel Elekes presented the first full paper, titled: Learning from Few Samples: Lexical Substitution with Word Embeddings for Short Text Classification. To help with the classification of short text, this paper proposes clustering semantically similar terms when training data is scarce to improve the performance of text classification tasks.
Next, Andrew Collins, a researcher at Trinity College Dublin, presented a short paper titled: Document Embeddings vs. Keyphrases vs. Terms for Recommender Systems: A Large-Scale Online Evaluation. They compared a standard term-based recommendation approach to document embeddings and keyphrases, two methods used for related-article recommendation in digital libraries, by applying the algorithms to multiple recommender systems.
Next, Corinna Breitinger, a PhD student at the University of Konstanz, presented her short paper titled: 'Too Late to Collaborate': Challenges to the Discovery of in-Progress Research. She presented findings from an investigation into how computer science researchers from four disciplines currently identify ongoing research projects within their respective fields. Additionally, she outlined the challenges researchers face, such as avoiding duplicate research while protecting the progress of their own research for fear of idea plagiarism.
Finally, Norman Meuschke, a PhD candidate at the University of Wuppertal, presented a full paper titled: Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations. He presented their approach for detecting concealed plagiarism (heavy paraphrasing, translation, etc.) in scholarly text, which consists of a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text, as well as similarity measures that consider the order of mathematical features.
Minute Madness followed Norman's presentation, wrapping up the scholarly activities of Day 1 of JCDL. In Minute Madness, poster presenters were given one minute to advertise their respective posters to the conference attendees. The poster session began after the Minute Madness.

Minute Madness



Day 2



Day 2 of JCDL 2019 began with a keynote from Dr. Robert Sanderson, the Semantic Architect for the J. Paul Getty Trust: Standards and Communities: Connected People, Consistent Data, Usable Applications. The keynote highlighted the value of Web/Internet standards in providing the underlying foundation that makes the connected world possible. Additionally, the keynote explored the relationship between standards and their target communities, and some common trade-offs, such as between completeness and usability, production and consumption, etc.
The Web Archives session followed the keynote.


Web Archives 1 Session


Sawood Alam, a PhD student at Old Dominion University and member of the WSDL group, presented a full paper on behalf of Mohamed Aturban: Archive Assisted Archival Fixity Verification Framework. Sawood presented two approaches, Atomic and Block, to establish and check the fixity (testing whether an archived resource has remained unaltered since its capture time) of archived resources. The Atomic approach involves storing the fixity information of web pages in a JSON file and publishing the fixity content before it is disseminated to multiple on-demand Web archives. In contrast, the Block approach involves merging the fixity information of multiple archived pages into a single file before its publication and dissemination to the archives.
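As a rough sketch of the Atomic idea (field names are illustrative, not the paper's exact schema), a fixity record for one memento might be built like this:

```python
import hashlib
import json
from datetime import datetime, timezone

def fixity_record(uri_m: str, content: bytes) -> str:
    """Build a JSON fixity record for one memento."""
    return json.dumps({
        "memento": uri_m,
        "hash": "sha256:" + hashlib.sha256(content).hexdigest(),
        "created": datetime.now(timezone.utc).isoformat(),
    })

record = fixity_record(
    "https://web.archive.org/web/20190101000000/https://example.com/",
    b"<html>...archived page bytes...</html>",  # placeholder content
)
print(record)
# Verification later re-fetches the memento, recomputes the hash, and
# compares it with the published record; the Block approach would batch
# many such records into one file before disseminating it to archives.
```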

Next, Dr. Martin Klein, a research scientist at the Los Alamos National Laboratory, presented a short paper titled: Evaluating Memento Service Optimizations. He explained the problem of long response times experienced by services that utilize the Memento Aggregator. This problem arises because search requests are broadcast to all Web archives connected to the Aggregator, irrespective of the fact that some URI requests can only be fulfilled by certain Web archives. He subsequently reported results of some performance optimizations for the Memento Aggregator, such as caching and machine learning-based predictions.
Finally, Sawood Alam, again, presented a full paper (best paper nominee) titled: MementoMap Framework for Flexible and Adaptive Web Archive Profiling. Sawood proposed the MementoMap framework as a flexible and adaptive means of efficiently summarizing the holdings of a Web archive, showing its application in summarizing the holdings of the Portuguese Web archive (http://arquivo.pt/), a collection consisting of 5 billion mementos (archived copies of web pages).

Other papers were presented concurrently in the Analysis and Processing session.


Analysis and Processing Session


In this session, Felix Hamborg, a PhD candidate at the University of Konstanz, presented a full paper titled: Automated Identification of Media Bias by Word Choice and Labeling in News Articles. Felix presented their research on an automatic method to detect a specific form of news bias, Word Choice and Labeling (WCL). WCL often occurs when journalists use different terms (e.g., "economic migrants" vs. "refugees") to refer to the same concept.
Next, Drahomira Herrmannova presented a full paper (Vannevar Bush best paper award winner) titled: Do Authors Deposit on Time? Tracking Open Access Policy Compliance. This paper presented findings from an analysis of 800,000 research papers published over a 5-year period. They investigated whether the time lag between the publication date of research papers and the dates the papers were deposited in a repository can be tracked across thousands of repositories globally.
Following a break, the paper sessions continued.

Web Archives 2 Session


Sergej Wildemann, a researcher at the L3S Research Center, began with a full paper presentation titled: Tempurion: A Collaborative Temporal URI Collection for Named Entities, where he introduced Tempurion, a collaborative service for enriching entities (e.g., People, Places, and Creative Work) by linking them with URLs that best describe them. The URLs are dynamic in nature and change as the associated entities change.
Next, I (Alexander Nwala) presented a full paper (best paper nominee) titled: Using Micro-collections in Social Media to Generate Seeds for Web Archive Collections. I highlighted the importance of Web archive collections as a means of traveling back in time to study events (e.g., the Ebola Virus Outbreak and the Flint Water Crisis) that may not be properly represented on the live Web due to link rot. These archived collections begin with seed URLs that are often manually selected by experts or crowdsourced. Because collecting seed URLs is time consuming, it is common for major news events to occur without the creation of a Web archive collection to memorialize them, justifying the need for automatically generating seed URLs. I showed that social media micro-collections (curated lists created by social media users) provide an opportunity for generating seeds and produce collections with properties distinct from conventional collections generated by scraping Web and social media Search Engine Result Pages (SERPs).

Next, Dr. Ian Milligan, history professor at the University of Waterloo, presented a short paper titled: The Cost of a WARC: Analyzing Web Archives in the Cloud. Dr. Milligan explored and answered (US$7 per TB) the question he posed: "How much does it cost to analyze Web archives in the cloud?" He used the Archives Unleashed platform as an example to show some of the infrastructural and financial costs associated with supporting scholarship in the humanities and social sciences.
Finally, Dr. Ian Milligan presented another short paper titled: Building Community and Tools for Analyzing Web Archives through Datathons. In his second talk of the session, Dr. Milligan highlighted lessons learned from conducting the Archives Unleashed Datathons. The Archives Unleashed Datathons started in March 2016 as collaborative data hackathons in which social scientists, humanists, archivists, librarians, computer scientists, etc., work together for 2-3 days analyzing Web archive data.
Another series of paper sessions followed after a break.

User Interface and Behavior Session


Dr. George Buchanan and Dr. Dana Mckay, researchers at the University of Melbourne School of Computing and Information Systems, presented a full paper titled: One Way or Another I'm Gonna Find Ya: The Influence of Input Mechanism on Scrolling in Complex Digital Collections. They presented findings from comparing the effect of input modality (touch vs. scrolling) on navigation in book browsing interfaces, reporting user satisfaction associated with horizontal and two-dimensional scrolling.
Next, Dr. Dagmar Kern, a Human Computer Interaction and User Interface Engineering researcher at GESIS, presented a short paper titled: Recognizing Topic Change in Search Sessions of Digital Libraries Based on Thesaurus and Classification System. She presented their thesaurus- and classification-based solution for segmenting user sessions in a social science digital library into their topical components.
Finally, Cole Freeman, a researcher at Northern Illinois University, presented the last short paper of the session, titled: Shared Feelings: Understanding Facebook Reactions to Scholarly Articles, in which he presented a new dataset of Facebook Reactions to research papers and the results of analyzing it.

Citation Session


Dattatreya Mohapatra, a recent Computer Science graduate of the Indraprastha Institute of Information Technology, presented a full paper (best student paper award winner) titled: Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees. He presented a novel data structure, the Influence Dispersion Tree (IDT), to model the impact of a scientific paper without relying on citation counts, instead capturing the relationships of follow-up papers and their citation dependencies.
Next, Leonid Keselman, a researcher at Carnegie Mellon University, presented a full paper titled: Venue Analytics: A Simple Alternative to Citation-Based Metrics. He presented a means of automatically organizing and evaluating the quality of Computer Science publishing venues, producing venue scores for conferences and journals by formulating venue authorship as a regression problem.
Day 2 ended with the conference banquet and awards presentation at the Memorial football stadium.
The best demo award was given to MELD: a Linked Data Framework for Multimedia Access to Music Digital Libraries, by Dr. Kevin Page, David Lewis, and Dr. David M. Weigl
The best student paper award was given to Go Wide, Go Deep: Quantifying the Impact of Scientific Papers through Influence Dispersion Trees, by Dattatreya Mohapatra, Abhishek Maiti, Dr. Sumit Bhatia and Dr. Tanmoy Chakraborty
The Vannevar Bush best paper award was given to Do Authors Deposit on Time? Tracking Open Access Policy Compliance by Drahomira Herrmannova, Nancy Pontika and Dr. Petr Knoth


Day 3



Day 3 of JCDL 2019 began with a keynote from Dr. John Wilkin, the Dean of Libraries and University Librarian at the University of Illinois at Urbana-Champaign. His keynote was titled: How do you lift an elephant with one hand? and explored the challenges overcome in building the HathiTrust Digital Library, a large-scale digital repository that offers millions of titles digitized from libraries around the world.
Following the keynote was an ACM Digital Library (DL) panel session titled: Towards a DL by the Communities and for the Communities. The ACM Digital Library & Technology Committee is headed by Dr. Michael Nelson and Dr. Ed Fox, and the panel session featured talks from Dr. Daqing He, Dr. Dan Wu, Wayne Graves, and Dr. Martin Klein. During the panel, Dr. Daqing He presented usage statistics of the ACM DL; Wayne Graves, Director of Information Systems at ACM, presented the redesigned ACM DL website (available soon) and received feedback on existing and future services; and Dr. Martin Klein presented Piloting a ResourceSync Interface for the ACM Digital Library. Dr. Dan Wu invited the researchers to Wuhan University, the host of the JCDL 2020 conference, and introduced the audience to the city. Subsequently, Dr. Stephen Downie gave the conference closing remarks.



I would like to thank the organizers and sponsors of the conference, the hosts, Dr. Stephen Downie and the University of Illinois at Urbana-Champaign (UIUC), and Corinna Breitinger for taking and uploading additional photos of the conference.

-- Alexander C. Nwala (@acnwala)