Monday, November 12, 2018

2018-11-12: Google Scholar May Need To Look Into Its Citation Rate



Google Scholar has long been regarded as the digital library with the most complete collection of scholarly papers and patents. For a digital library, completeness is very important because otherwise you cannot accurately count the citations of a paper, or equivalently the in-links of its node in the citation graph. That is probably why Google Scholar is still more widely used and trusted than other digital libraries with fancier features.
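To make the graph analogy concrete, here is a toy Python sketch (the paper IDs and edges are made up for illustration) showing that a paper's citation count is simply the in-degree of its node in the citation graph, so a missing paper means missing in-links:

# Toy citation graph: each paper maps to the papers it cites (its out-links).
# Paper IDs are hypothetical.
citations = {
    "paper_A": ["paper_B", "paper_C"],
    "paper_B": ["paper_C"],
    "paper_C": [],
}

# A paper's citation count is the in-degree of its node.
in_degree = {paper: 0 for paper in citations}
for paper, cited_papers in citations.items():
    for cited in cited_papers:
        in_degree[cited] += 1

print(in_degree)  # {'paper_A': 0, 'paper_B': 1, 'paper_C': 2}

If a citing paper (say paper_A) were missing from the library, paper_B and paper_C would each lose a citation, which is why completeness matters.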

Today I found two very interesting aspects of Google Scholar: one clever and one silly. The clever side is that Google Scholar distinguishes papers, preprints, and slides and counts their citations separately.

If you search for "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", you may see the same view as in the attached screenshot. Note that there are three results. The first is the paper published by IEEE. The second lists a completely different set of authors, who probably gave a presentation of that paper. The third is the preprint on arXiv. The three have different citation counts, as they should.


The silly side is also reflected in the search results. How does a paper published less than a year ago receive more than 1900 citations? You may say it is simply a very popular paper, but if you look into the citing papers, some do not make sense. For example, the first paper that "cites" the DeepLab paper was published in 2015! How could it cite a paper published in 2018?

Actually, that first citing paper's own citation count is also problematic: a paper published in 2015 has been cited more than 6500 times, and another paper published in 2014 has been cited more than 16660 times!

Something must be wrong with Google Scholar! The good news is that the numbers look higher, which makes everyone happy! :)


Jian Wu

Saturday, November 10, 2018

2018-11-11: More than 7000 retracted abstracts from IEEE. Can we find them from IA?


According to Science magazine, more than 7000 abstracts have been quietly retracted from the IEEE database. Most of these abstracts are from IEEE conferences that took place between 2009 and 2011. The plot below clearly shows when the retractions happened. The stated reason is oddly vague:
"After careful and considered review of the content of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE’s Publication Principles."
Similar things have happened in a Nature subsidiary journal (link) and other journals (link).


The question is: can we find these retracted papers in the Internet Archive? Can they still be legally hosted in a digital library like CiteSeerX? If so, they could provide a unique training dataset for fraud and/or plagiarism detection, assuming that one of those is the reason behind the retractions.

Jian Wu

2018-11-10: Scientific news and reports should cite original papers


I highly encourage all scientific news stories and reports to cite the corresponding research articles. ScienceAlert usually does a good job of this. This piece of news from ScienceAlert reports the discovery of two rogue planets. Most planets we have discovered orbit a star. A rogue planet does not orbit a star; it orbits the center of the galaxy instead. Because planets do not emit light, rogue planets are extremely hard to detect. The news piece cites a recently published paper on arXiv. Although anybody can post papers on arXiv, papers posted by reputable organizations should be reliable.

A reliable citation benefits all parties: it makes the news more trustworthy, gives credit to the original authors, and connects readers to a place where they can explore other interesting science.

Jian Wu



Friday, November 9, 2018

2018-11-09: Grok Pattern

Grok is a way to match a text line against a regular expression, map specific parts of the line into dedicated fields, and perform actions based on this mapping. Grok patterns are (usually long) regular expressions that are widely used in log parsing. With tons of search engine logs, effectively parsing them and extracting useful metadata for analytics, training, and prediction has become a key problem in mining big text data.
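To make the mapping idea concrete, here is a minimal Python sketch. It is not Logstash itself, and the pattern definitions below are simplified stand-ins for the stock Grok patterns, but it shows how a Grok-style token such as %{IP:client} expands into a regular expression with a named capture group:

import re

# Simplified stand-ins for a few stock Grok patterns (illustration only).
GROK_PATTERNS = {
    'IP': r'\d{1,3}(?:\.\d{1,3}){3}',
    'WORD': r'\w+',
    'URIPATHPARAM': r'\S+',
    'NUMBER': r'\d+(?:\.\d+)?',
}

def grok_to_regex(grok_pattern):
    """Expand %{NAME:field} tokens into named regex capture groups."""
    def expand(match):
        name, field = match.group(1), match.group(2)
        return '(?P<%s>%s)' % (field, GROK_PATTERNS[name])
    return re.sub(r'%\{(\w+):(\w+)\}', expand, grok_pattern)

log_line = '55.3.244.1 GET /index.html 15824'
regex = grok_to_regex('%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes}')
print(re.match(regex, log_line).groupdict())
# {'client': '55.3.244.1', 'method': 'GET', 'request': '/index.html', 'bytes': '15824'}

In practice, Logstash ships with a large library of such named patterns, so you compose them rather than writing raw regular expressions by hand.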

In this article (linked below), Ran Ramati gives a beginner's guide to the Grok patterns used in Logstash, one of the powerful tools in the Elastic Stack (the other two being Kibana and Elasticsearch):

https://logz.io/blog/logstash-grok/

The StreamSets webpage gives a list of Grok pattern examples: 

https://streamsets.com/documentation/datacollector/3.4.3/help/datacollector/UserGuide/Apx-GrokPatterns/GrokPatterns_title.html

A recent paper from a Huawei research lab in China summarizes and compares a number of log parsing tools:

https://arxiv.org/abs/1811.03509

I am kind of surprised that although they cited the Logstash website, they did not compare Logstash with its peers.

 Jian Wu

Thursday, November 8, 2018

2018-11-08: Decentralized Web Summit: Shaping the Next Web


In my wallet I have a few ₹500 Indian currency notes that say, "I PROMISE TO PAY THE BEARER THE SUM OF FIVE HUNDRED RUPEES", followed by the signature of the Governor of the Reserve Bank of India. However, that promise was broken two years ago today; since then, these bills in my pocket have been nothing more than rectangular pieces of printed paper. So I decided to put my origami skills to use and turn them into butterflies.

On November 8, 2016, at 8:00 PM (Indian Standard Time), Indian Prime Minister Narendra Modi announced the demonetization (effective in four hours, after midnight) of the two largest currency notes (₹1,000 and ₹500) in circulation at that time. Together these two notes represented about 86% of the total cash economy of India. More than 65% of the Indian population still lives in rural and remote areas where the availability of electricity, the Internet, and other utilities is not yet reliable; hence, cash is a very common means of doing business in daily life there. It was morning in Norfolk (USA) and I was going through the news headlines when I saw this announcement. For a while I could not believe the news was real and not a hoax. I did not even know there was a concept called demonetization that governments could practice. Irrespective of my political views and of the intents and goals behind the decision (good or bad as they might have been), I was shocked to realize that the system has so much centralization of power that a single person can, overnight, impose suffering on about 18% of the global population and throw the system into chaos. I wished for a better and more resilient system, one with decentralization of power by design, where no single entity has a significant share of power and influence. I wanted a DECENTRALIZED SYSTEM!

When the Internet Archive (IA) announced plans for the Decentralized Web (DWeb) Summit, I was on board to explore what we can do to eliminate centralization of control and power in systems on the Web. With generous support from Protocol Labs, AMF, and NSF grant IIS-1526700, I was able to travel to the West Coast for four days full of fun and many exciting events. I got the opportunity to meet many big names who brought us the Web we experience today and many of those who are working towards shaping the future of the Web with their vision, ideas, experience, code, art, legal understanding, education, or social values. They each had a different perspective to share, but all seemed to agree on one goal: fixing the current Web, where freedom of expression is under an ever-growing threat, governments control the voice of dissent, big corporations use the personal data of Internet users for monetary benefit and political influence, and those in power try to suppress the history they are uncomfortable with.

There was so much going on in parallel that perhaps no two people experienced the same sequence of events. I am not even pretending to describe everything I observed there. In this post I will briefly describe my experience of the following four related events, which happened between July 31 and August 3, 2018.

  • IndieWebCamp SF
  • Science Fair
  • Decentralized Web Summit
  • IPFS Lab Day

IndieWebCamp SF


The IndieWeb is a people-focused alternative to the "corporate web". Its objectives include: 1) your content is yours, 2) you are better connected, and 3) you are in control. Some IndieWeb people at Mozilla decided to host IndieWebCamp SF, a bootcamp on the day before #DWebSummit started, and shared an open invitation with all participants. I was quick to RSVP for what was going to be my first interaction with the IndieWeb.

On my way from the hotel to Mozilla's SF office, the Uber driver asked me why I had come to SF. I replied, "to participate in an effort to decentralize the Web". She seemed puzzled and said, "my son was mentioning something about it, but I don't know much". "Have you heard about Bitcoin?", I asked, to get an idea of how to explain it. "I have heard the term in the news, but don't really know much about it", she said. So I started the elevator pitch, and in the next eight or so minutes (about four round trips of the Burj Khalifa's elevator from the ground to the observation deck) I was able to explain some of the potential dangers of centralization in different aspects of our social life and some of the alternatives.




The bootcamp had both on-site and remote participants and was well organized. We started with keynotes from Miriam Avery, Dietrich Ayala, and Ryan Barrett; then some people introduced themselves, explained why they were attending the DWeb Summit, and shared their ideas for the IndieWeb bootcamp. Some people gave lightning demos; I demonstrated InterPlanetary Wayback (IPWB) briefly. I got to meet the people behind some projects I was well aware of (such as Universal Viewer and the Dat Project) and also learned about some projects I did not know before (such as Webmention and Scuttlebutt). We then scheduled BarCamp breakout sessions and had lunch.

During and after the lunch I had an interesting discussion and exchanged ideas with Edward Silverton from the British Library and a couple of people from Mozilla's Mixed Reality team about the Universal Viewer, IIIF, Memento, and multi-dimensional XR on the Web.




Later I participated in two sessions, "Decentralized Web Archiving" and "Free Software + Indieweb" (see the schedule for notes on the various sessions). The first one was proposed by me; in it I explained the state of Web archiving, its current limitations and threats, and the need to move it to a more persistent and decentralized infrastructure. I also talked about IPWB and how it can help with distributed web archiving (see the notes for details and references). In the latter session we talked about different ways to support Free Software and open-source developers (for example, bug bounties, crowdfunding, and recurring funding), and compared and contrasted different models and their sustainability with that of closed-source software backed by for-profit organizations. We also briefly touched on some licensing complications.

I had to participate in the Science Fair at IA, so I needed to get there a little before it started. With that in mind, Dietrich (from the Firefox team) and I left the session a little before it was formally wrapped up, as the afternoon SF traffic was going to make for a rather long commute.

Science Fair


The taxi driver was an interesting person with whom Dietrich and I shared the ride from the Mozilla SF office to the Internet Archive, talking about national and international politics, history, languages, music, and whatnot until we reached our destination, where food trucks and stalls were serving dinner. It was windier and chillier out there than I had anticipated in my rather thin jacket. Brewster Kahle, the founder of the IA, who had just come out of the IA building, welcomed us and guided us to the registration desk, where a very helpful team of volunteers gave us our name badges and project sign holders. I claimed a table right outside the entrance of the IA building, placed the InterPlanetary Wayback sign on it, and went to a food truck to grab my dinner. When I came back I found that the wind had blown my project sign off the table, so I moved inside the building, where it was a lot cozier, though crowded.

The Science Fair event was full of interesting projects. You may explore the list of all the Science Fair projects along with their description and other details. Alternatively, flip through the pages of the following photo albums of the day.






Many familiar and new faces visited my table, discussed the project, and asked about its functionality, architecture, and technologies. On the one hand I met people who were already familiar with our work, and on the other hand some needed a more detailed explanation from scratch. I even met people who asked with surprise, "why would you make your software available to everyone for free?" This called for a brief overview of how the Open Source Software ecosystem works and why one would participate in it.




This is not a random video. This clip was played to invite Mike Judge, co-creator of HBO's Silicon Valley, onto the stage for a conversation with Cory Doctorow at the Opening Night Party after Brewster's welcome note (due to a streaming rights issue the clip is missing from IA's full session recording). I can't think of a better way to begin the DWeb Summit. This was my first introduction to Mike (yes, I had not watched the Silicon Valley show before). After an interesting Q&A session on stage, I got the opportunity to talk to him in person, took a low-light blurred selfie with him, mentioned the Indian demonetization story (which, apparently, he was unaware of), and asked him to make a future show about potential threats to the DWeb. Web 1.0 emerged with a few entities controlling publishing and the rest of the people consuming that content. Web 2.0 enabled everyone to participate in the Web as both creators and consumers, but privacy and censorship controls ended up in the hands of governments and a few Internet giants. If Web 3.0 (or the DWeb) could fix this issue too, what would the next threat potentially be? There should be something we may or may not be able to think of just yet, right?


Mike Judge and Sawood Alam


Decentralized Web Summit


For the next two days (August 1–2) the main DWeb Summit was held in the historic San Francisco Mint building. There were numerous parallel sessions going on all day long. At any given moment there was probably a session to suit every taste, and no one could attend everything they would have wished to; a quick look at the full event schedule confirms this. Luckily, the event was recorded and the recordings have been made available, so one can watch the various talks asynchronously. However, being there in person to participate in fun activities, observe artistic creations, experience AR/VR setups, and interact with many enthusiastic people full of hardware, software, and social ideas is not something that can be captured in recorded videos.





If the father of the Internet, with his eyes closed, trying to build a network with many other participants with the help of yellow yarn, some people trying to figure out what to do with colored cardboard shapes, and others trying to focus their energy with the help of specific postures are not enough, then flip through these photo albums of the event for a glimpse of the many other fun activities we had there.





Initially, I tried to plan my agenda, but I soon realized it was not going to work. So I randomly picked one of the many parallel sessions of interest, spent an hour or two there, and moved on to another room. In the process I interacted with many people from different backgrounds, participating in either an individual or an organizational capacity. Apart from the usual talk sessions, we discussed various decentralization challenges and their potential technical and social solutions in one-to-one or small-group conversations. An interesting mention of an additive economy (a non-zero-sum economy where transactions are never negative) reminded me of the gamification idea we explored when working on the Preserve Me! project, and I ended up having a long conversation with a couple of people about it during a breakout session.




If Google Glass was not cool enough, then meet Abhik Chowdhury, a graduate student working on a smart-hat prototype with a handful of sensors, batteries, and low-power computer boards placed in a 3D-printed frame. He is trying to balance on-board data processing, battery usage, and periodic data transfer to an off-the-hat server efficiently, while also wrestling with the privacy implications of the product.

It was a conference where "Crypto" meant "Cryptocurrency", not "Cryptography", and every other participant was talking about blockchains, distributed/decentralized systems, content-addressable filesystems, IPFS, protocols, browsers, and a handful of other buzzwords. Many demos there were of the form "XXX, but decentralized". Participants included the pioneers and veterans of the Web and the Internet, browser vendors, blockchain and cryptocurrency leaders, developers, researchers, librarians, students, artists, educators, activists, and whatnot.

I gave a lightning talk entitled "InterPlanetary Wayback: A Distributed and Persistent Archival Replay System Using IPFS" in the "New Discoveries" session. Apart from that, I spent a fair amount of my time there talking about Memento and its potential role in making decentralized and content-addressable filesystems history-aware. During a protocol-related panel discussion, I worked with a team of four people (including members from the Internet Archive and MuleSoft) to pitch the need for a decentralized naming system that is time-aware (along the lines of IPNS-Blockchain) and can resolve a version of a resource at a given time in the past. I also talked to many people from Google Chrome, Mozilla Firefox, and other browser vendors and tried to emphasize the need for native support of Memento in web browsers.

Cory Doctorow's closing keynote, "Big Tech's problem is Big, not Tech", was perhaps the most talked-about talk of the event and received many reactions and much commentary. The recorded video of his talk is worth watching. Among many other things, he encouraged people to learn programming and to understand the functions of each piece of software we use. After his talk, an artist asked me how she, or anyone else, could learn programming. I told her that if one can learn a natural language, then programming languages are far more systematic, less ambiguous, and easier to learn. There are really only three basic constructs in a programming language: variable assignments, conditionals, and loops. Then I verbally gave her a very brief example of a mail merge that uses all three constructs to produce gender-aware invitations from a message template for a list of friends invited to a party (a sketch along those lines is shown below). She seemed enlightened and delighted (enthusiastically sharing her freshly learned knowledge with other members of her team) and exchanged contacts with me to learn more about learning resources.
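For the curious, here is a minimal Python sketch of that mail-merge example, using exactly those three constructs (the guest list and template are made up for illustration):

# Assignment: a message template and a made-up guest list.
template = "Dear {title} {name}, you are invited to the party!"
guests = [
    {"name": "Ada", "gender": "female"},
    {"name": "Alan", "gender": "male"},
]

# Loop over the guests; a conditional picks a gender-aware title.
for guest in guests:
    if guest["gender"] == "female":
        title = "Ms."
    else:
        title = "Mr."
    invitation = template.format(title=title, name=guest["name"])  # another assignment
    print(invitation)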

IPFS Lab Day


It seems people were too energetic to be tired out by such jam-packed and eventful days, as some of them had planned post-DWeb events for special interest groups. I was invited by Protocol Labs to give an extended talk at one such IPFS-centric post-DWeb event called Lab Day 2018, on August 3. Their invitation arrived the day after I had booked my tickets and reserved the hotel room, so I ended up updating my reservations. This event was at a different location, and the venue was decorated with a more casual touch, with bean bags, couches, chairs, and benches near the stage and some containers for group discussions. You can get a glimpse of the venue in these pictures.








They welcomed us with new badges, some T-shirts, and some best-selling books to take home. The event had a good lineup of lightning talks and some relatively longer presentations, mostly extended forms of similar presentations from the main DWeb event. Many projects and ideas presented there were in their early stages. The sessions were recorded and published later, after the necessary editing.

I presented my extended talk, entitled "InterPlanetary Wayback: The Next Step Towards Decentralized Archiving". Along with the work already done and published on IPWB, I also talked about what is yet to be done. I explored the possibility of an index-free, fully decentralized, collaborative web archiving system as the next step. I proposed some solutions that would require changes in IPFS, IPNS, IPLD, and surrounding technologies to accommodate the use case, and I encouraged people to talk to me if they had better ideas to help solve these challenges. The purpose was to spread the word so that people keep web archiving use cases in mind while shaping the next Web. Some people from the core IPFS/IPNS/IPLD developer community approached me, and we had an extended discussion after my talk. The recording of my talk and the slides are available online.




It was a fantastic event to be part of, and I am looking forward to more such events in the future. The IPFS community and the people at Protocol Labs are full of fresh ideas and enthusiasm, and they are a pleasure to work with.

Conclusions


The Decentralized Web has a long way to go, and the DWeb Summit is a good place to bring people from various disciplines with different perspectives together every once in a while to synchronize all the distributed efforts and to identify the next set of challenges. While I could not attend the first summit (in 2016), I really enjoyed the second one and would love to participate in future events. Those two short days of the main event had more material than I could perhaps digest in two weeks, so my only advice would be to extend the duration of the event instead of having multiple parallel sessions with overlapping interests.

I extend my heartiest thanks to the organizers, volunteers, funders, and everyone involved in making this event happen and making it a success. I hope that, going forward, not just the Web but many other organizations, including governments, become more decentralized, so that I never again open my wallet to find worthless currency bills that were demonetized overnight.

Resources




--
Sawood Alam

Friday, October 19, 2018

2018-10-19: Some tricks to parse XML files

Recently I was parsing ACM DL metadata stored in XML files. I thought parsing XML would be a very straightforward job, given that Python has long had sophisticated packages such as BeautifulSoup and lxml. But I still encountered some problems, and it took me quite a bit of time to figure out how to handle all of them. Here I share some tricks I learned. They are not meant to be a complete list, but the solutions are general enough to serve as starting points for future XML parsing jobs.


  • CDATA. CDATA sections appear in the values of many XML fields. CDATA stands for Character Data. Strings inside a CDATA section are not parsed; in other words, they are kept exactly as they are, including markup. One example is
    <script>
    <![CDATA[ <message> Welcome to TutorialsPoint </message> ]]>
    </script>
  • Encoding. Encoding is a pain in text processing. The problem is that there is no way to know the encoding of the text before opening and reading it (at least in Python). So we must sniff it by trying to open and read the file with a candidate encoding; if the encoding is wrong, the program usually throws an error, and in that case we try another possible encoding. The "file" command in Linux gives encoding information, so I know there are two encodings among the ACM DL XML files: ASCII and ISO-8859.
  • HTML entities, such as &auml;. The only five built-in entities in XML are &quot;, &amp;, &apos;, &lt;, and &gt;. Any other entity should be defined in a DTD file to indicate what it means. For example, the DBLP.xml file comes with a DTD file. The ACM DL XML should have associated DTD files, proceedings.dtd and periodicals.dtd, but they are not in my dataset.
The following snippet of Python code solves all three problems above and gives me the correct parsing results.

import codecs
import logging

from bs4 import BeautifulSoup

encodings = ['ISO-8859-1', 'ascii']
for e in encodings:
    try:
        # Sniff the encoding: the decode error is only raised when we actually read.
        fh = codecs.open(confc['xmlfile'], 'r', encoding=e)
        fh.read()
        fh.close()
    except UnicodeDecodeError:
        logging.debug('got unicode error with %s, trying a different encoding' % e)
    else:
        logging.debug('opening the file with encoding: %s' % e)
        break

# Re-open the file with the encoding that worked and parse it.
f = codecs.open(confc['xmlfile'], 'r', encoding=e)
soup = BeautifulSoup(f.read(), 'html.parser')


Note that we use codecs.open() instead of Python's built-in open(), and that we open the file twice: the first time only to check the encoding, and the second time to pass the whole file to a handle before it is parsed by BeautifulSoup. I found BeautifulSoup better for handling XML parsing than lxml, not only because it is easier to use but also because it lets you pick the parser. Note that I chose html.parser instead of the lxml parser, because the lxml parser is not able to parse all entries (for some unknown reason), as reported by other users on Stack Overflow.

Jian Wu

Thursday, October 11, 2018

2018-10-11: iPRES 2018 Trip Report

September 24th marked the beginning of iPRES 2018, held in Boston, MA, where both Shawn Jones and I traveled from New Mexico to present our accepted papers: Measuring News Similarity Across Ten U.S. News Sites, The Off-Topic Memento Toolkit, and The Many Shapes of Archive-It.

iPRES ran paper and workshop sessions in parallel, so I will focus on the sessions I was able to attend. However, this year the organizers created and shared collaborative notes for all sessions with all attendees to help those who could not attend individual sessions. All the presentation materials and associated papers were also made available via Google Drive.

Day 1 (September 24, 2018): Workshops & Tutorials

On the first day of iPRES, attendees gathered at the Joseph B. Martin Conference Center at Harvard Medical School to pick up their registration lanyards and iPRES swag.

Afterwards, there were workshops and tutorials scheduled throughout the day. Registrants needed to sign up early to get into these workshops. Many different topics were available for attendees to choose from, listed on the Open Science Framework event page. Shawn and I chose to attend: 
  • Archiving Email: Strategies, Tools, Techniques. A tutorial by: Christopher John Prom and Tricia Patterson.
  • Human Scale Web Collecting for Individuals and Institutions (Webrecorder Workshop). A workshop by: Anna Perricci.
Our first session, on archiving email, consisted of talks and small-group discussions on various topics and tools for archiving email. It started with talks on the adoption of email preservation systems in our organizations. In our group discussion, we found that few organizations have email preservation systems. I found the research ideas and topics stemming from these talks very interesting, especially the study of natural language in email content.
Many of the difficulties of archiving email unsurprisingly revolve around privacy. They range from actually requesting and acquiring emails from users, to discovering and disclosing sensitive information inside emails, to other ethical decisions involved in preserving emails.

Email preservation also has the challenge of curating at scale. As one can imagine, going through millions of emails in a collection can be time-consuming and redundant, which calls for the development of new tools to address these challenges.
This workshop also introduced many interesting tools for archiving and exploring emails, including:



Many different workflows for archiving email using the aforementioned tools were explained thoroughly at the end of the session. These workflows covered migration with different tools, accessing disk images of stored emails and attachments via emulation, and bit-level preservation.

Following the email archiving session, we continued to the Human Scale Web Collecting for Individuals and Institutions session, presented by Anna Perricci from the Webrecorder team.


Having used Webrecorder before, I was very excited about this session. Anna walked through the process of registering and starting your first collection. She explained how to start sessions and how collections are formed as easily as clicking different links on a website. Webrecorder can handle JavaScript replay very effectively. For example, videos streamed from a website like Vine or YouTube are recorded from a user's perspective and are then available for replay later in time. Other examples included automated scrolling through Twitter feeds and capturing interactive news stories from The New York Times.
During the presentation, Anna showed Webrecorder's capability of extracting mementos from other web archives to repair missing content. For example, it managed to take CNN mementos from the Internet Archive from after November 1, 2016 and fix their replay by aggregating resources from other web archives and the live web, although this could also be potentially harmful. This is an example of Time Travel Reconstruct as implemented in pywb.

Ilya Kreymer presented the use of Docker containers for emulating different browser environments and how they could play an important role in replaying specific content like Flash. He demonstrated various tools available as open source on GitHub, including pywb, Webrecorder WARC player, warcio, and warcit.
Ilya also teased Webrecorder's Auto Archiver prototype, a system that understands how Scalar websites work and can anticipate URI patterns and other behaviors of these platforms. Auto Archiver automates the capture of many different web resources on a website, including video and other sources.
Webrecorder automation demo for a Scalar website

To finish the first day, attendees were transported to a reception hosted at the MIT Samberg Conference Center accompanied by a great view of Boston.

Day 2 (September 25, 2018): Paper Presentations and Lightning Talks

To start the day, attendees gathered for the plenary session, which was opened with a statement by Chris Bourg.



Eve Blau then continued the session by presenting Urban Intermedia: City, Archive, Narrative, the capstone project of a Mellon grant. The talk was about a Mellon Foundation project, the Harvard Mellon Urban Initiative, a collaborative effort across multiple institutions spanning architecture, design, and the humanities. Using multimedia and visual constructs, it looked at processes and practices that shape geographical boundaries, focusing on blind spots in:
  • Planned / unplanned - informal processes
  • Migration / mobility, patterns, modalities of inclusion & exclusion
  • Dynamic of nature & technology, urban ecologies
After the keynote, I hurried over to open the Web Preservation session with my paper Measuring News Similarity Across Ten U.S. News Sites. I explained our methodology for selecting archived news sites, the top-news-selectors tool we created for mining archived news, how the similarity of news collections was calculated, the events that showed peaks in similarity, and how the U.S. election was recognized as a significant event across many of the news sites.


Following my presentation, Shawn Jones presented his paper The Off-Topic Memento Toolkit. Shawn's presentation focused on the many different use cases of Archive-It and then detailed how many of these collections can go off topic: for example, pages missing resources at a point in time, content drift causing different languages to appear in a collection, site redesigns, etc. This led to the development of the Off-Topic Memento Toolkit, which detects off-topic mementos inside a collection by collecting each memento and assigning it a score, testing multiple different measures. The study showed that Word Count had the highest accuracy and best F1 score for detecting off-topic mementos.

Shawn also presented his paper The Many Shapes of Archive-It. He explained how to understand Archive-It collections using their content, metadata (Dublin Core and custom fields), and collection structure, as well as the issues that come with these methods. Using 9,351 Archive-It collections as data, Shawn explained the concept of growth curves for collections, which compare seed counts, memento counts, and memento-datetimes. Using different classifiers, Shawn showed that the structural features of a collection can be used to predict its semantic category, with the best classifier found to be Random Forest.


Following lunch, I headed to the amphitheater to see Dragan Espenschied's short paper presentation Fencing Apparently Infinite Objects. Dragan questioned how objects, here synonymous with a file or a collection of files, are bounded in digital preservation. He introduced the concept of "performative boundaries" to describe the different potentials of an object: bound, blurry, and boundless, with examples from early software such as early-2000s Microsoft Word (bound), Apple's QuickTime (blurry), and Instagram (boundless). He shared productive approaches for the future replay of these objects:

  • Emulation of auxiliary machines
  • Synthetic stub services or simulations
  • Capture network traffic and re-enact on access 

Dragan Espenschied presenting on Apparently Infinite Objects
The next presentation was Digital Preservation in Contemporary Visual Art Workflows by Laura Molloy, who presented remotely. This presentation showed that digital preservation of one's own work is generally not taught at art school, and it should be. Digital technologies are widely used today to create art in a variety of formats. When various artists were asked about digital preservation, this is how they answered:
“It’s not the kind of thing that gets taught in art school, is it?”
“You don’t need to be trained in [using and preserving digital objects]. It’s got to be instinctive and you just need to keep it very simple. Those technical things are invented by IT guys who don’t have any imagination.” 
The third presentation was by Morgane Stricot on her short paper Open the museum's gates to pirates: Hacking for the sake of digital art preservation. Morgane explained that software dependency is a major threat to digital art and that media archaeology is needed to preserve some forms of digital art. Backups of older operating systems (OS) on disk help avoid incompatibility issues. She also explained that, because of copyright restrictions, older OS versions (for example, older Mac OS) are difficult to find, and that many pirates as well as "harmless hackers" have cracks that give access to these OS environments, while some remain unsalvageable.
The final paper presentation was by Claudia Roeck on her long paper Evaluation of preservation strategies for an interactive, software-based artwork with complex behavior, using the case study Horizons (2008) by Geert Mul. Claudia explored possible preservation strategies for software, such as reprogramming in a different programming language, migration of the software, virtualization, and emulation, as well as the significant properties that determine which qualities one would want to preserve. Using Horizons as an example project, she determined that reprogramming was the option they found suitable for it, although there was no clear winner for the best mid-term preservation strategy for the work.
For the rest of the day, lightning talks were available to attendees, and the room became packed with viewers. Some of these talks introduced preservation games to be held the next day, such as Save my Bits, Obsolescence, the Digital Preservation Storage Criteria Game, and more. Ilya, from Webrecorder, gave a lightning talk showing a demo of the new Auto Archiver prototype for Webrecorder.


After the proceedings another fantastic reception was held, this time at the Harvard Art Museum.

Harvard Art Museum at night

Day 3 (September 26, 2018): Minute Madness, Poster Sessions, and Awards 

The day opened with a review of iPRES's achievements and challenges over the past 15 years in a panel discussion with William Kilbride, Eld Zierau, Cal Lee, and Barbara Sierman. Achievements included the innovation of new research as well as the courage to share and collaborate among peers with similar research interests, which led to iPRES's adoption of cross-domain preservation in libraries, archives, and digital art. Some of the challenges include archivists' decisions about what to do with past and future data, as well as conforming to the OAIS standard.
After talking about the past 15 years, it was time to talk about the next 15 in a panel discussion with Corey Davis, Micky Lindlar, Sally Vermaaten, and Paul Wheatley. This panel discussed what could be done so that more people are able to attend in the future. They discussed possible organizational models to emulate for regional meetings, such as code4lib and NDSR. There were also suggestions for updates to the Code of Conduct and the value it should hold for the future.
After the discussion panels, it was time for the minute madness. I had seen videos of this before, but it was the first time I had seen it in person, and I found it somewhat theatrical. Most presenters had to pitch their research in a minute so that we would later visit them during the poster session, and some of them put on quite a show, like Remco van Veenendaal. The topics ranged from workflow integration and new portals for preserving digital materials to code ethics and timelines detailing file formats.

After the minute madness, attendees wandered around to view the posters. Conveniently, the first poster I visited referenced work from our WSDL group!
Another interesting poster consisted of research into file format usage over time.
I was also surprised at the number of tools and technologies behind some of the new preservation platforms for government agencies that have emerged, like Vitam, the French government's IT program for digital archiving.

Vitam poster presentation for their digital archiving architecture
Following the poster session, I was back at the paper presentations, where Tomasz Miksa presented his long paper Defining requirements for machine-actionable data management plans. This talk was about machine-actionable data management plans (maDMPs), living documents automated by information collection and notification systems. He showed how currently formatted data management plans could be transformed to reuse existing standards such as Dublin Core and PREMIS.
Alex Green then presented her short paper Using blockchain to engender trust in public digital archives. She explained that archivists alter, migrate, normalize, and otherwise make changes to digital files, but there is little proof that a researcher receives an authentic copy of a digital file. The ARCHANGEL project proposes to use blockchain to verify the integrity of these files and their provenance. It is still unknown whether blockchain will prevail as a lasting technology, as it is still very new. David Rosenthal wrote a review of this paper, which can be found on his blog.
I then went to the Storage Organization and Integrity session to see the long paper presentation Checksums on Modern Filesystems, or: On the virtuous consumption of CPU cycles by Alex Garnett and Mike Winter. The focus of the talk was computing checksums on files to detect bit rot in digital objects, and it compared different approaches to verifying bit-level preservation. It showed that data integrity can be achieved when hardware, such as filesystems using ZFS, is dedicated to digital preservation. This work builds a bridge between digital preservation practices and high-performance computing for detecting bit rot.
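The basic fixity-checking idea behind such work can be sketched in a few lines of Python. This is only an illustration of the general approach (the filename below is hypothetical), not the filesystem-level mechanism the paper evaluates: compute and record a checksum at ingest, then periodically recompute it and compare.

import hashlib

# A minimal fixity-check sketch: hash the file at ingest, store the digest,
# and recompute it later; a mismatch signals possible bit rot.
# "object.warc.gz" is a hypothetical file name used for illustration.
def sha256_of(path, chunk_size=1 << 20):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

recorded = sha256_of("object.warc.gz")   # stored alongside the object at ingest
# ... later, during an audit ...
if sha256_of("object.warc.gz") != recorded:
    print("Fixity check failed: possible bit rot detected")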

After this presentation, I stayed for the short paper presentation The Oxford Common File Layout by David Wilcox. The Oxford Common File Layout (OCFL) is an effort to define a shared approach to file hierarchies for long-term preservation. The goals of this layout are to provide structure at scale, to be ready for migrations while minimizing file transfers, and to be manageable by many different applications. With a set of defined principles for the file layout, such as the ability to log transactions on digital objects, a draft spec release is planned for some time near the end of 2018.
The day closed with the award ceremony for best poster, best short paper, and best long paper. My paper, Measuring News Similarity Across Ten U.S. News Sites, was nominated for best long paper but did not win. The winners were as follows:
  • Best short paper: PREMIS 3 OWL Ontology: Engaging Sets of Linked Data
  • Best long paper: The Rescue of the Danish Bits - A case study of the rescue of bits and how the digital preservation community supported it  by Eld Zierau
  • Best poster award: Precise & Persistent Web Archive References by Eld Zierau



Day 4 (September 27, 2018): Conference Wrap-up

The final day of iPRES 2018 was composed of paper presentations, discussion panels, community discussions, and games. I chose to attend the paper presentations.

The first paper presentation I viewed was Between creators and keepers: How HNI builds its digital archive by Ania Molenda. Over 4 million documents were recorded to track progressive thinking in Dutch architecture. When converting and moving these materials into a digital archive, many issues were observed, such as duplicate materials, file formats with complex dependencies, the time and effort needed to digitize the multitude of documents, and the knowledge needed to access these documents being lost over time with no standards in place.

Afterwards, I watched the presentation on Data Recovery and Investigation from 8-inch Floppy Disk Media: Three Use Cases by Abigail Adams. It covered the acquisition of three different floppy disk collections dating from 1977 to 1989! The presentation introduced me to some unfamiliar hardware, software, and encodings required to attempt data recovery from floppy disk media, as well as a workflow for recovering data from these floppies.

The last paper presentation I viewed was Email Preservation at Scale: Preliminary Findings Supporting the Use of Predictive Coding by Joanne Kaczmarek and Brent West. Having already attended the email preservation workshop, I was excited for this presentation, and I was not let down. Using 20 GB of publicly available emails, they applied two different methods, a capstone approach and a predictive coding approach, to discover sensitive content inside emails. With the predictive coding approach, which uses machine learning to train on and classify documents, they showed preliminary results indicating that automatic email classification is capable of handling emails at scale.

As a final farewell, attendees were handed bags of tulip bulbs and told this:
"An Honorary Award will be presented to the people with the best tulip pictures."
It seems William Kilbride, among others, has already got a leg up on the competition.
This marked the end of my first academic conference as well as my first visit to Boston, Massachusetts. It was an enjoyable experience with a lot of exposure to diverse research fields in digital preservation. I look forward to submitting work to this conference again and to hearing about future research in the realm of digital preservation.


Resources for iPRES 2018: