Tuesday, September 27, 2016

2016-09-27: Introducing Web Archiving in the Summer Workshop

For the last few years, the Department of Computer Science at Old Dominion University has invited a group of undergraduate students from India and hosted them over the summer. They work closely with a research group on relevant projects. Additionally, researchers from different research groups in the department present their work to the guest students twice a week and introduce the various projects they are working on. The goal of this practice is to allow the students to collaborate with graduate students of the department and to encourage them to pursue research. The invited students also act as ambassadors, sharing their experience with their colleagues and spreading the word when they go back to India.

This year a group of 16 students from Acharya Institute of Technology and B.N.M. Institute of Technology visited Old Dominion University, where they were hosted under the supervision of Ajay Gupta. They worked in the areas of Sensor Networks and Mobile Application Development. They researched ways to integrate mobile devices with low-cost sensors to solve problems in health care-related areas and vehicular networks.

I (Sawood Alam) was selected to represent our Web Science and Digital Libraries Research Group this year, on July 28. Mat and Hany represented the group in the past. I happened to be the last presenter before the students returned to India, by which time they were overloaded with scholarly information. Additionally, the students were not primarily from a Web science or digital libraries background. So, I decided to keep my talk semi-formal and engaging rather than purely scientific. The slides were inspired by my talk last year in Germany on "Web Archiving: A Brief Introduction".

I began with my presentation slides entitled "Introducing Web Archiving and WSDL Research Group". I briefly introduced myself with the help of my academic footprint and lexical signature. I described the agenda of the talk and established the motivation for Web archiving. From there, I followed the agenda as laid out, covering issues and challenges in Web archiving; existing tools, services, and research efforts; my own research on Web archive profiling; and some open research topics in the field. Then I introduced the WSDL Research Group along with all the fun things we do in the lab. Being Indian, I was able to pull in some cultural references from India to keep the audience engaged and entertained while still staying on the agenda of the talk.

I heard encouraging words from Ajay Gupta, Ariel Sturtevant, and some of the invited students after my talk; they described it as one of the most engaging talks of the entire summer workshop. I would like to thank everyone who was involved in organizing this summer workshop and who gave me the opportunity to introduce my field of interest and the WSDL research group.

Sawood Alam

Monday, September 26, 2016

2016-09-26: IIPC Building Better Crawlers Hackathon Trip Report

Trip Report for the IIPC Building Better Crawlers Hackathon in London, UK.

On September 22-23, 2016, I attended the IIPC Building Better Crawlers Hackathon (#iipchack) at the British Library in London, UK. Having been to London almost exactly 2 years ago for the Digital Libraries 2014 conference, I was excited to go back, but was more so anticipating collaborating with some folks I had long been in contact with during my tenure as a PhD student researcher at ODU.

The event was a well-organized yet loosely scheduled meeting that resembled more of an "Unconference" than a Hackathon in that the discussion topics were defined as the event progressed rather than a larger portion being devoted to implementation (see the recent Archives Unleashed 1.0 and 2.0 trip reports). The represented organizations were:

Day 0

As everyone arrived at the event from abroad and locally, the event organizer Olga Holownia invited the attendees to an informal get-together meeting at The Skinners Arms. There the conversation was casual but frequently veered into aspects of web archiving and brain picking, which we were repeatedly encouraged to "Save for Tomorrow".

Day 1

The first day began with Andy Jackson (@anjacks0n) welcoming everyone and thanking them for coming despite the short notice and the announcement of the event over the summer. He and Gil Hoggarth (@grhggrth), both of the British Library, kept detailed notes of the conference happenings as they progressed, with Andy keeping an open, editable document that other attendees could help build.

Tom Cramer (@tcramer) of Stanford, who mentioned he had organized hackathons in the past, encouraged everyone in attendance (14 in number) to introduce themselves and give a synopsis of their role and their previous work at their respective institutions. He also asked how we could go about making crawling tools accessible to non-web archiving specialists to stimulate conversation.

The responding discussion initiated a theme that ran throughout the hackathon -- that of web archiving from a web browser.

One tool to accomplish this is Brozzler from Internet Archive, which combines warcprox and Chromium to preserve HTTP content sent over the wire into the WARC format. I had previously attempted to get Brozzler (originally forked from Umbra) up and running but was not successful. Other attendees had either previously tried it or had not heard of the software. This later led to Noah Levitt (of Internet Archive) giving an interactive, audience-participation walkthrough of installing, setting up, and using Brozzler.

Prior to the interactive portion of the event, however, Jefferson Bailey (@jefferson_bail) of Internet Archive started a presentation by speaking about WASAPI (Web Archiving Systems API), a specification for defining data transfer of web archives. The specification is a collaboration with University of North Texas, Rutgers, Stanford via LOCKSS, and other organizations. Jefferson emphasized that the specification is not implementation specific; it does not get into issues like access control, parameters of a specific path, etc. The rationale behind this was that the spec should be not just a preservation data transport tool but also a means of data transfer for researchers. Their in-development implementation takes WARCs, pulls out data to generate a derivative WARC, then defines a Hadoop job using Pig syntax. Noah Levitt added that the Jobs API requires you to supply an operation like "Build CDX" and the WARCs on which you want to perform the operation.

In typical non-linear unconference fashion (also exhibited in this blog post), Noah then gave details on Brozzler (presentation slides). With a room full of Mac and Linux users, installation proved particularly challenging. One issue I had previously run into was latency in starting RethinkDB. Colin Rosenthal (@colinrosenthal) ran into the same issue on Linux, as did I on Mac. Noah's machine, which he showed in a demo as having the exact same versions of all the dependencies I had installed, did not show this latency, so Your Mileage May Vary with installation. In the end, though, both Colin and I (and possibly others) were successful in crawling a few URIs using Brozzler.

Andy added to Noah's interactive session by referencing his effort in Dockerizing Brozzler and his other work in componentizing and Dockerizing the other roles and tasks of the web archiving process with his project Wren. While one such component is the Archival Acid Test project I had created for Digital Libraries 2014, the project's other sub-projects ease the use of other tools that are otherwise difficult or time-consuming to configure.

One such tool that was lauded throughout the conference was Alex Osborne's (@atosborne) tinycdxserver. Andy has also created a Dockerized version of tinycdxserver. This tool was new to me, but the reported statistics on CDX querying speed and storage suggest the potential for significant improvement for large web archives. Per Alex's description of the tool, the indexes are stored compressed using Facebook's RocksDB and are about a fifth of the size of a flat CDX file. Further, Wayback instances can simply be pointed at a tinycdxserver instance using the built-in RemoteResourceIndex field in the Wayback configuration file, which makes for easy integration.


A wholly unconference discussion then commenced on the topics we wanted to cover in the second part of the day. After coming up with and classifying various ideas, Andy defined three groups: the Heritrix Wish List, Brozzler, and Automated QA.

Each attendee could join any of the three for further discussion. I chose "Automated QA", given that archival integrity is related to my research topic.

The Heritrix group expressed challenges that the members had encountered in transitioning from Heritrix version 1 to version 3. "The Heritrix 3 console is a step back from Heritrix 1's. Building and running scripts in Heritrix 3 is a pain." was the general sentiment from the group. Another concern was scarce documentation, which might be remedied with funded efforts to improve it, as deep knowledge of the tool's workings is needed to accurately represent its capabilities. Kristinn Sigurðsson (@kristsi), who was involved in the development of H3 (and declined to give a history documenting the non-existence of H2), has since resolved some issues. I was encouraged to use his fork of Heritrix 3 by him and others, my own inadvertent recommendation included:

The Brozzler group first identified how Brozzler differs from a crawler: it handles one page or site at a time (a la WARCreate) instead of adding discovered URIs to a frontier and seeding those URIs for subsequent crawls. Per above, Brozzler's use of RethinkDB as both the crawl frontier and the CDX service makes it especially appealing and more scalable. Brozzler allows multiple workers to pull URIs from a pool and report back to a RethinkDB instance. This worked fairly well in my limited but successful testing at the hackathon.
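To make the frontier idea concrete, here is a minimal sketch of several workers claiming URIs from a shared pool and reporting discovered links back to it. It is not Brozzler's implementation: the `Frontier` class, `run_workers`, and the toy link graph are hypothetical names made up for illustration, and an in-memory queue stands in for the shared state that Brozzler keeps in RethinkDB.

```python
from collections import deque


class Frontier:
    """In-memory stand-in for a shared crawl frontier (hypothetical)."""

    def __init__(self, seeds):
        self.pending = deque(seeds)  # URIs waiting to be claimed
        self.seen = set(seeds)       # every URI ever queued
        self.done = []               # URIs already processed, in order

    def claim(self):
        """Hand the next pending URI to a worker, or None if empty."""
        return self.pending.popleft() if self.pending else None

    def report(self, uri, discovered):
        """Record a finished URI and queue any links not seen before."""
        self.done.append(uri)
        for link in discovered:
            if link not in self.seen:
                self.seen.add(link)
                self.pending.append(link)


def run_workers(frontier, fetch_links, n_workers=2):
    """Simulate n_workers taking turns claiming URIs from the frontier."""
    log = []
    worker = 0
    while True:
        uri = frontier.claim()
        if uri is None:
            break
        frontier.report(uri, fetch_links(uri))
        log.append((worker, uri))
        worker = (worker + 1) % n_workers
    return log


# Toy link graph used instead of real HTTP requests (an assumption).
TOY_SITE = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b"],
    "http://example.com/b": [],
}

frontier = Frontier(["http://example.com/"])
log = run_workers(frontier, lambda uri: TOY_SITE.get(uri, []))
for worker, uri in log:
    print(f"worker {worker} crawled {uri}")
```

Because the workers share nothing except the frontier, adding capacity is a matter of starting more workers against the same pool.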

The Automated QA group first spoke about the National Library of Australia's Bamboo project. The tool consumes Heritrix's (and/or warcprox's) crawl output folder and provides in-progress indexes from WARC files prior to a crawl finishing. Other statistics can also be added, as can automated generation of screenshots for on-the-fly comparison of the captures. We also highlighted some particular items that crawlers and browser-based preservation tools have trouble capturing: for example, video formats that vary in support between browsers, URIs defined in the "srcset" attribute, responsive design behaviors, etc. I also referenced my work on Ahmed AlSum's (@aalsum) Thumbnail Summarization using SimHash, as presented at the Web Archiving Collaboration meeting.

After presentations by the groups, the attendees called it a day and continued the discussions at a nearby pub.

Day 2

The second day commenced with a few questions that we had agreed upon at the pub as good discussion topics for the next day. These questions were:

  1. Given five engineers and two years, what would you build?
  2. What are the barriers in training for the current and future crawling software and tools?
Given Five...

Responses to the first included something like Brozzler's frontier but redesigned to allow for continuous crawling instead of crawling a single URI. With a segue toward Heritrix, Kristinn verbally considered the relationship between configurability and scalability. "You typically don't install Heritrix on a virtual machine", he said, "usually a machine for this use requires at least 64 gigabytes of RAM." Also discussed was getting the raw data for a crawl versus being able to get the data needed to replicate the experience, and the particular importance of the latter.

Additionally, there was talk of adapting the scheme used by Brozzler for an Electron application meant for browsing, with the ability to toggle archiving through warcprox (related: see recent post on WAIL). On the flip side, Kristinn mentioned that it surprised him that we can potentially create a browser of this sort that can interact with a proxy but not build another crawler -- highlighting the lack of options among other Heritrix-like robust archival crawlers.

Barriers in Training

For the second question, those involved with institutional archives seemed to agree that if one was going to hire a crawl engineer, Java and Python experience is a prerequisite to exposure to some of the archive-specific concepts. For current institutional training practice, Andy stated that he turns new developers in his organization loose on ACT, which is simply a CRUD application, to introduce them to the web archiving domain. Others said it would be useful to have staff exchanges and internships for collaboration and for getting more employees familiar with web archiving.


Another topic arose from the previous conversation about future methods of collaboration. For future work on writing documentation, more Getting Started fundamental guides as well as test sites for tools would be welcomed. For future communication, the IIPC Slack Channel as well as the newly created IIPC GitHub wiki will be the next iteration of the outdated IIPC Tools page and the COPTR initiative.

The whole-group discussion wrapped up with identifying concrete next steps from what was discussed at the event. These included creating setup guides for Brozzler, testing of any further use cases of Umbra versus Brozzler, future work on access control considerations as currently done by institutions and next steps regarding that, and a few other TODOs. A monthly online meeting is also planned to facilitate collaboration between meetings as well as more continued interaction via Slack instead of a number of outdated, obsolete, or noisy e-mail channels.

In Conclusion...

Attending the IIPC Building Better Crawlers Hackathon was invaluable for establishing contacts and gaining more exposure to the field and to the efforts of others. Many of the conversations were open-ended, which led to numerous other topics being discussed and opened the doors to potential new collaborations. I gained a lot of insight from discussing my research topic and others' projects and endeavors. I hope to be involved with future Hackathons-turned-Unconferences from IIPC and appreciate the opportunity I had to attend.

—Mat (@machawk1)

Kristinn Sigurðsson has also written a post about his takeaways from the event.

Tom Cramer also published his report on the Hackathon since the publication of this post.

Wednesday, September 21, 2016

2016-09-20: The promising scene at the end of Ph.D. trail

From right to left, Dr. Nelson (my advisor),
Yousof (my son), Yasmin (myself), Ahmed (my husband)
August 26th marked my last day as a Ph.D. student in the Computer Science department at ODU, while September 26 marks my first day as a Postdoctoral Scholar in Data Curation for the Sciences and Social Sciences at UC Berkeley. I will lead research in the areas of software curation, data science, and digital research methods. I will be honored to work under the supervision of Dr. Erik Mitchell, the Associate University Librarian and Director of Digital Initiatives and Collaborative Services at the University of California, Berkeley. I will have an opportunity to collaborate with many institutions across UC Berkeley, including the Berkeley Institute for Data Science (BIDS) research unit. It is amazing to see the light at the end of the long tunnel. Below, I talk about the long trail I took to reach my academic dream position. I'll recap the topic of my dissertation, then I'll summarize lessons learned at the end.

I started my Ph.D. in January 2011 at the same time that the uprisings of the Jan 25 Egyptian Revolution began. I was witnessing what was happening in Egypt while I was in Norfolk, Virginia. I could not do anything during the 18 days except watch all the news and social media channels, witnessing the events. I wished that my son Yousof, who was less than 2 years old at that time, could know what was happening as I saw it. Luckily, I knew about Archive-It, a subscription service by the Internet Archive that allows institutions to develop, curate, and preserve collections of Web resources. Each collection in Archive-It has two dimensions: time and URI. Understanding the contents and boundaries of these archived collections is a challenge for most people, resulting in the paradox of the larger the collection, the harder it is to understand.

There are multiple collections in Archive-It about the Jan. 25 Egyptian Revolution 

There is more than one collection documenting the Arab Spring and particularly the Egyptian Revolution. Documenting long-running events such as the Egyptian Revolution results in large collections that have thousands of URIs, each of which has thousands of copies through time. It is challenging for my son to pick a specific collection from which to learn the key events of the Egyptian Revolution. The topic of my dissertation, which was entitled "Using Web Archives to Enrich the Live Web Experience Through Storytelling", focused on understanding the holdings of the archived collections.
Inspired by “It was a dark and stormy night”, a well-known storytelling trope: https://en.wikipedia.org/wiki/It_was_a_dark_and_stormy_night/  
We named the proposed framework the Dark and Stormy Archive (DSA) framework, in which we integrate “storytelling” social media and Web archives. In the DSA framework, we identify, evaluate, and select candidate Web pages from archived collections that summarize the holdings of these collections, arrange them in chronological order, and then visualize these pages using tools that users are already familiar with, such as Storify. An example of the output is below. It shows three stories for the three collections about the Egyptian Revolution. The user can gain an understanding of the holdings of each collection from the snippets of each story.

The story of the Arab Spring Collection

The story of the North Africa and the Middle East collection

The story of the Egyptian Revolution collection

With the help of Archive-It team and partners, we obtained a ground truth data set for evaluating the generated stories by the DSA framework. We used Amazon Mechanical Turk to evaluate the automatically generated stories against the stories that were created by domain experts. The results show that the automatically generated stories by the DSA are indistinguishable from those created by human subject domain experts, while at the same time both kinds of stories (automatic and human) are easily distinguished from randomly generated stories. I successfully defended my Ph.D. dissertation on 06/16/2016.

Generating persistent stories from themed archived collections will ensure that future generations can browse the past easily. I’m glad that Yousof and future generations will be able to understand the past through generated stories that summarize the holdings of the archived collections.


To continue WS-DLers’ habit of providing recaps, lessons learned, and recommendations, I will list some of the lessons I learned about what it takes to be a successful Ph.D. student, along with advice for applying in academia. I hope these lessons and advice will be useful for future WS-DLers and grad students. Lessons learned and advice:
  • The first one, and the one I always keep in front of me: You can do ANYTHING!!

  • Getting involved in communities in addition to your academic life is useful in many ways. I have participated in many women in technology communities, such as the Anita Borg Institute and the Arab Women in Computing (ArabWIC), to increase the inclusion of women in technology. I was awarded travel scholarships to attend several well-known women in tech conferences: CRA-W (Graduate Cohort 2013), Grace Hopper Celebration of Women in Computing (GHC) 2013, GHC 2014, GHC 2015, and ArabWIC 2015. I am a member of the leadership committee of ArabWIC. Attending these meetings builds maturity, expands personal connections, and supports the personal development that prepares students for future careers. I also gained leadership skills from being part of the leadership committee of ArabWIC.
  • Publications matter! If you are in WS-DL, you will have to get the targeted score 😉. You can learn more about the point system on the wiki. If you plan to apply in academia, your list of publications matters a great deal.
  • Teaching is important for applying in academia.
  • Collaboration is key to increasing your connections and will also help in developing your skills for working in teams.
  • And lastly, being a mom holding a Ph.D. is not easy at all!!
The trail was not easy, but it was worth it. I have learned and changed much since I started the program. Having enthusiastic and great advisors like Dr. Nelson and Dr. Weigle is a huge support that results in a happy ending and an achievement to be proud of.


Tuesday, September 20, 2016

2016-09-20: Carbon Dating the Web, version 3.0

Due to API changes, the old Carbon Date tool is out of date and some modules, such as Topsy, no longer work. I have taken up the responsibility of maintaining and extending the service, beginning with the following changes, now available in Carbon Date v3.0.

Carbon date 3.0

What's new

New services have been added, such as Bing search, Twitter search, and pubdate parsing.

The new software architecture enables us to load given scripts or disable given services at runtime.

The server framework has been changed from CherryPy to Tornado, which is still a minimalist Python web server, with better performance.

How to use the Carbon Date service

  • Through the website, http://carbondate.cs.odu.edu: Given that carbon dating is computationally intensive, the site can only hold 50 concurrent requests, and thus the web service should be used just for small tests as a courtesy to other users. If you need to Carbon Date a large number of URLs, you should install the application locally. Note that the old link http://cd.cs.odu.edu still works.
  • Through local installation: The project source can be found at the following repository: https://github.com/oduwsdl/CarbonDate. Consult README.md for instructions on how to install the application.

Dockerizing the Carbon Date Tool

Carbon Date now only supports Python 3. Due to potential package conflicts between Python 2 and Python 3 (most machines have Python 2 installed by default), we recommend running Carbon Date in Docker.

Build docker image from source
  1. Install Docker.
  2. Clone the GitHub source to a local directory.
  3. Run 
  4. Then you can choose either server or local mode
    • server mode

      Don't forget to map your port to the server port in the container.
      Then in the browser visit

      for index page or
      in the terminal

      for direct query
    • local mode
or get the deployed image automatically from Docker Hub:

System Design

In order to make the Carbon Date tool easier to maintain and develop, the structure of the application has been refactored. The system now has four layers:

When a query is sent to the application, it proceeds as follows:

Adding a new module to Carbon Date

Now all the modules are loaded and executed automatically. The module manipulator will try to search for and call the entry function of each module. A new module can be loaded and executed automatically without altering other scripts if it defines the function in the way described below.

Name the module main script as cdGet<Module name>.py
And ensure the entry function is named:

or customize your own entry function name by assigning a string value to the 'entry' variable at the beginning of your script.
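The naming convention above lends itself to automatic discovery. The following is a minimal sketch of how such a loader could work, not Carbon Date's actual module manipulator: `load_modules`, the default entry name `get_date`, and the example modules written to a temporary directory are all hypothetical names chosen for illustration. Only the `cdGet<Module name>.py` file pattern and the string-valued `entry` variable come from the description above.

```python
import importlib.util
import tempfile
from pathlib import Path

# Hypothetical default entry-function name, used only for this sketch.
DEFAULT_ENTRY = "get_date"
PREFIX = "cdGet"


def load_modules(directory, default_entry=DEFAULT_ENTRY):
    """Map each cdGet<Name>.py in `directory` to its entry function."""
    entries = {}
    for path in sorted(Path(directory).glob(PREFIX + "*.py")):
        spec = importlib.util.spec_from_file_location(path.stem, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        # A module may override the entry name via a string named `entry`.
        entry_name = getattr(module, "entry", default_entry)
        if not isinstance(entry_name, str):
            continue
        func = getattr(module, entry_name, None)
        if callable(func):
            entries[path.stem[len(PREFIX):]] = func
    return entries


# Demonstration with throwaway modules (hypothetical, not real services).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "cdGetAlpha.py").write_text(
        "def get_date(url, **kwargs):\n    return 'alpha:' + url\n"
    )
    Path(tmp, "cdGetBeta.py").write_text(
        "entry = 'lookup'\n"
        "def lookup(url, **kwargs):\n    return 'beta:' + url\n"
    )
    # Ignored: the file name does not follow the cdGet<Name>.py pattern.
    Path(tmp, "helper.py").write_text(
        "def get_date(url, **kwargs):\n    return 'ignored'\n"
    )
    modules = load_modules(tmp)
    results = {name: func("http://example.com/")
               for name, func in modules.items()}

print(results)
```

With this kind of design, dropping a new `cdGet<Name>.py` file into the modules directory is enough to register a service, and removing the file disables it.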

For example, consider a new module that uses baidu.com as a search engine to find the potential creation date of a URI. The script should be named cdGetBaidu.py, and the entry function should be:

core.py will pass outputArray, indexOfOutputArray, and displayArray to the function in the kwargs. Note that outputArray is what core.py uses to compute the earliest creation date, so only one value should be assigned there. displayArray holds the return value; it can be the same as the resulting creation date or anything else in the form of an array of tuples.

In this example, when we get the result from baidu.com, the code to return these values is:

Source maintenance

Web services may change, so some modules need to be updated frequently.

For example, the Twitter module must be updated whenever Twitter changes its page hierarchy, because cdGetTwitter.py currently crawls the Twitter search page and parses the timestamp of each tweet in the results. The current algorithm may stop working if Twitter moves the tweets' timestamps to other tags in the future.

Thus the Twitter script should be updated periodically until Twitter allows users to retrieve tweets older than one week through the Twitter API.

I am grateful to everyone who helped me with Carbon Date, especially Sawood Alam, who helped greatly with deploying the server and offered countless pieces of advice about refactoring the application, and John Berlin, who advised me to use Tornado instead of CherryPy. Further recommendations or comments about how this service can be improved are welcome and will be appreciated.


Tuesday, September 13, 2016

2016-09-13: Memento and Web Archiving Colloquium at UVa

Yesterday, September 12, I went to the University of Virginia to give a colloquium at the invitation of Robin Ruggaber to talk with her staff about Memento, Web Archiving, and related technologies. I also had the pleasure of meeting with Worthy Martin of the CS department and the Institute for Advanced Technology in the Humanities. I met Robin at CNI Spring 2016; she was intrigued by our work on using storytelling to summarize archival collections and was hoping to apply it to their Archive-It collections (which are currently not public). My presentation yesterday was more of an overview of web archiving, although the discussion did cover various details, including a proposal for Memento versioning in Fedora.


Sunday, September 11, 2016

2016-09-11: Web Archiving in Popular Media

At the Old Dominion University Web Science and Digital Libraries Research Group we have been studying web archiving for a long time.  In the past few years, we have noticed a significant uptick in the use of web archives in mainstream media, both to support stories and as the subject.  This post presents articles from the popular media that use web archive holdings (mementos) as evidence and concludes with articles about web archives.

Articles that Reference Web Archives

'Fake News' And How The Washington Post Rewrote Its Story On Russian Hacking Of The Power Grid

What the Washington Post's rush to be the first to report on Russian hackers breaching the US power grid teaches us about how "breaking news" can all too often become "fake news" when we over-trust government sources and fail to verify facts.
Forbes.com uses an Internet Archive memento as evidence that the Washington Post edited an article with updated information instead of performing proper research prior to publication.
2016-12-31 03:06:08
Now-modified Washington Post web page.

2016-10-04 • GOP Says Pence Won the VP Debate-Hours Before It Starts

The Republican National Committee may have added a few fortune tellers to its staff. Either that, or the RNC just published all its post-debate spin by accident hours before the debate even began. Yeah, we're going to go ahead and say it was the second one. Around 7 pm ET, two hours before Gov.
Wired.com uses an Internet Archive memento as evidence that the Republican Party published its declaration that Pence won the vice presidential debate hours before the debate started.
2016-10-04 23:06:52
Now-deleted GOP web page declaring Pence the winner.

Tabloid Facing $100 Million Lawsuit Pulls Michael Jackson Abuse Story

2016-09-06 • Radar Online is known for a lot of things in the tabloid world, but factual reporting apparently isn't one of them. We first reported back in June a laundry list of items supposedly found in Michael Jackson's Neverland Ranch by the Santa Barbara County Sheriff's Department back in 2003.
This article uses an Internet Archive memento as evidence that a tabloid (Radar Online) might be attempting to "bury this piece to avoid a huge payout…".
2016-06-21 19:36:45
PDF of Santa Barbara County Sheriff's Department report.
Warning: although redacted, the photographs in this report are unsuitable for children and most workplaces.

Clinton's Website Deleted Statement Saying Rape Victims Have the 'Right to Be Believed'

2016-08-15 • Hillary Clinton's presidential campaign has deleted a statement on its website that said that all rape victims have the "right to be believed." BuzzFeed reported Sunday that the change was ma
This article uses a memento from archive.is to show that "right to be believed" was removed from Hillary Clinton's speech.
2015-11-30 01:45:14
Campus sexual assault page from Hillary Clinton's website.

2016-08-23 • The Epidemic Archives of the Future Will Be Born Digital

Colorful AIDS education posters from the 1980s. Black-and-white photos of mid-20th-century anatomy lessons for midwives. Eighteenth-century instructions for the administration of patent medicines. While a paper archival collection in the U.S. National Library of Medicine might contain items like these-handwritten or typed journals, correspondence, educational materials, and official reports, some digitized many years after their creation-the next generation of health information lives online.
This article describes how a National Library of Medicine (NLM) team uses the Archive-It web archive to collect webpages, blog posts, and social media streams to capture online health information generated during health crises.

Panic Mode: Khizr Khan Deletes Law Firm Website that Specialized in Muslim Immigration - Breitbart

2016-08-02 • This development is significant, as his website proved – as Breitbart News and others have reported – that he financially benefits from unfettered pay-to-play Muslim migration into America. A snapshot of his now deleted website, as captured by the Wayback Machine which takes snapshots archiving various websites on the Internet, shows that as a lawyer he engages in procurement of EB5 immigration visas and other "Related Immigration Services."
This article uses Internet Archive mementos to bolster the claim that Khizr Khan deleted his website (currently accessible – 2016-09-04) from the Internet to avoid publicizing that he financially benefits from Muslim migration to America.
2016-08-02 12:14:11
Khan's E2, EB5 Immigration Services page, no longer present on Khan's website. https://web.archive.org/web/20160801212033/http://www.kmkhanlaw.com/International_Business.html
The deletion of the Khan law firm website may have been an administrative oversight. The memento below shows GoDaddy offering the domain name for sale.
2016-08-04 18:40:47
GoDaddy landing page for abandoned domains.

Vote Leave wipes homepage after Brexit result

2016-06-27 • In the wake of the EU referendum, the Vote Leave campaign has wiped its homepage. Visitors to the site are now greeted by the above image. The only active links are to the campaign's Privacy Policy and contact details.
This article shows that the UK Vote Leave campaign's deleted speeches can still be found in the Internet Archive.
2016-06-27 12:35:31
GoDaddy landing page for abandoned domains.
https://web.archive.org/web/20160627123531/http://www.voteleavetakecontrol.org/letter_to_the_prime_minister_and_foreign_secretary_getting_the_facts_clear_on_turkey

Melania Trump's Website, Biography Have Disappeared From The Internet

2016-07-28 • The professional website of Melania Trump, wife of the Republican presidential nominee, has apparently been deleted from the internet as of Wednesday afternoon. The disappearance of Trump's elaborate website comes just days after news outlets, including The Huffington Post, raised serious questions about whether she actually earned an undergraduate degree in architecture from the University of Ljubljana, which is in Trump's native Slovenia.
Just days after news outlets raised questions about the veracity of Melania Trump's undergraduate degree, her website and biography were taken down. However, the Internet Archive had already captured her website over 250 times and her biography page 150 times.

2013-04-04 07:12:55

Melania Trump's undergraduate degree claim

Web evidence points to pro-Russia rebels in downing of MH17 (+video)

2014-07-14 • Igor Girkin, a Ukrainian separatist leader also known as Strelkov, claimed responsibility on a popular Russian social-networking site for the downing of what he thought was a Ukrainian military transport plane shortly before reports that Malaysian Airlines Flight MH17 had crashed near the rebel held Ukrainian city of Donetsk.
This article uses Internet Archive mementos of a Ukrainian separatist leader's (Igor Girkin) social media page as evidence that the separatists shot down Malaysian Airlines flight MH17.  The mementos below show the changes in Girkin's social media page as the news about MH17 unfolded.
2014-07-17 15:22:22
Shoot down claim of a Ukrainian AN-26 military transport. http://web.archive.org/web/20140717152222/http://vk.com/strelkov_info
2014-07-17 16:10:58
Both the original shoot down claim and denial of responsibility. http://web.archive.org/web/20140717161058/http://vk.com/strelkov_info
2014-07-17 16:56:38
Shoot down claim removed; denial of responsibility remains. http://web.archive.org/web/20140717165638/http://vk.com/strelkov_info
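The capture times listed above are encoded directly in each memento's URL: the Wayback Machine places a 14-digit YYYYMMDDhhmmss timestamp (in UTC) between /web/ and the original URL. As a minimal sketch, the Python below splits such a URL into its capture time and original URL. The helper name parse_wayback_url is hypothetical, and the pattern only covers the plain URL form that appears in this post.

```python
import re
from datetime import datetime, timezone

# Matches the plain Wayback Machine memento URL form used in this post:
#   http(s)://web.archive.org/web/<14-digit timestamp>/<original URL>
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})/(.+)$")


def parse_wayback_url(memento_url):
    """Return (capture_datetime_utc, original_url) for a memento URL.

    Hypothetical helper for illustration only; raises ValueError for
    URLs that do not match the plain form above.
    """
    match = _WAYBACK.match(memento_url)
    if match is None:
        raise ValueError("not a plain Wayback Machine memento URL")
    stamp, original = match.groups()
    # The 14 digits are YYYYMMDDhhmmss, interpreted as UTC.
    captured = datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    return captured, original


captured, original = parse_wayback_url(
    "http://web.archive.org/web/20140717152222/http://vk.com/strelkov_info"
)
print(captured.isoformat(), original)
```

Applied to the three mementos above, this recovers the 15:22:22, 16:10:58, and 16:56:38 capture times for the same original URL, which is what makes the sequence of edits to the page visible.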

2013-11-21: The Conservative Party Speeches and Why We Need Multiple Web Archives

2013-11-21 • @Conservatives put speeches in Streisand's house: http://t.co/6aRiOsHwxO @UKWebArchive: http://t.co/BGD3tYavEx via @lljohnston @hhockx - Michael L. Nelson (@phonedude_mln) November 13, 2013 Circulating the web last week the story of the UK's Conservative Party (aka the "Tories") removing speeches from their website (see Note 1 below).
This blog post discusses the UK Conservative Party's attempt to delete history by removing old speeches from their website.  The party also tried blocking display of the speeches using robots.txt.  However, as the post points out, several archives already had copies.

David Cameron 2009 speech returned a 404 (not found) on 2013-11-21.
Archive.is is one of several archives with copies of the Conservative Party's speeches.

Online Retailer Says If You Give It A Negative Review It Can Fine You $3,500

2013-11-13 • Lots of quasi-legal action has been taken over negative reviews left by customers at sites like Ripoff Report and Yelp. Usually, it takes the form of post-review threats about defamation and libel. Every so often, though, a company will make proactive...
The article discusses a lawsuit brought by Kleargear.com against a customer who left negative feedback. The Internet Archive memento cited in the article has since been excluded (probably using robots.txt):
2013-08-17 14:44:17
Kleargear Terms of Use has been excluded from the Internet Archive.
2013-08-17 14:44:17
Fortunately, Archive.is has a copy.

Articles about Web Archives

Web archives have proven their use in journalism, law, and other research areas to the point that the New Yorker, Forbes, The Atlantic, U.S. News, and others have all published insightful articles recently.

2016-08-17 • U.S. News
Wayback Machine Won’t Censor Archive for Taste, Director Says After Olympics Article Scrubbed

Internet Archive removed article for safety of Olympians.
Screenshot of Forbes "Reimagining Libraries in the Digital Era" article.
2016-03-19 • Forbes
Reimagining Libraries In The Digital Era

Lessons From Data Mining The Internet Archive.
Screenshot of New Yorker "Cobweb" article.
2015-01-26 • New Yorker
The Cobweb

Can the Internet be archived?
Screenshot of The Atlantic "Raiders of the Lost Web" article.
2015-10-14 • The Atlantic
Raiders of the Lost Web
If a Pulitzer-finalist 34-part series of investigative journalism can vanish from the web, anything can.

These lists of articles are just a beginning and will be expanded as new articles are discovered. Contributions and suggestions are welcome. Please email them to sainswor@cs.odu.edu or tweet them to @galsondor with hashtag #mementoinmedia.

Update History

Friday, September 9, 2016

2016-09-09: Summer Fellowship at the Harvard Library Innovation Lab Trip Report

Myself standing at the main entrance of Langdell Hall
I was honored with the great opportunity of collaborating with the Harvard Library Innovation Lab (LIL) as a Fellow this Summer. Located at Langdell Hall, Harvard Law School, the Library Innovation Lab develops solutions to serious problems facing libraries. It consists of an eclectic group of Lawyers, Librarians, and Software Developers engaged in projects such as Perma.cc, the Caselaw Access Project (CAP), and The Nuremberg Project, among many others.
The LIL Team
To help prevent link rot, Perma.cc creates permanent, reliable links for web resources. The Caselaw Access Project is an ambitious project which strives to make all US case law freely accessible online. The current collection to be digitized stands at over 42,000 volumes (nearly 40 million pages). The Nuremberg Project is concerned with the digitization of LIL's collection about the Nuremberg trials.
I started work on June 6, 2016 (through August 24) as one of seven Summer Fellows, and was supervised by Adam Ziegler, LIL’s Managing Director. During the first week of the fellowship, we (Summer Fellows) were given a tour around the Harvard Law School Library and had the opportunity to share our research plans in the first Fellows hour - a session in which Fellows reported research progress, and received feedback from the LIL team as well as other Fellows. The Fellowship program was structured such that we had the flexibility to research subjects that interested us.
Harvard LIL 2016 Summer Fellows (See LIL's blog)
1. Neel Agrawal: Neel is a Law Librarian at LA Law Library in Los Angeles, California. He is also a professional percussionist in various musical contexts such as Fusion, Indian, and Western classical. He spent the Summer researching African drumming laws to understand why and how colonial governments controlled, criminalized, and regulated drumming in Western/Northern Nigeria, Ghana, Uganda, Malawi, The Gambia, and Seychelles.
2. Jay Edwards: Jay was the lead database engineer for Obama for America in 2012 and also the ninth employee at Twitter. He spent the Summer working on the Caselaw Access Project, building a platform to enable non-programmers to use Caselaw data.
3. Sara Frug: Sara is the Associate Director of the Cornell Law School Legal Information Institute, where she manages the engineering team which designs various tools that improve the accessibility and usability of legal text. Sara spent the Summer further researching how to improve the accessibility of legal text by developing a legal text data model.
4. Ilya Kreymer: Ilya is the creator of Webrecorder and oldweb.today. Webrecorder is an interactive archiving tool which helps users create high-fidelity web archives of websites by simply browsing through the tool. Ilya spent the Summer improving Webrecorder.
5. Muira McCammon: Muira just concluded her M.A. in Comparative Literature/Translation Studies at the University of Massachusetts-Amherst and received her B.A. in International Relations and French from Carleton College. Her M.A. thesis was about the history of the Guantanamo Bay Detainee Library. She spent the Summer expanding her GiTMO research by drafting a narrative nonfiction book about it, designing a tabletop wargame to model the interaction dynamics of various parties at GiTMO, and organizing a GiTMO conference.
6. Alexander Nwala: I am a computer science Ph.D. student at Old Dominion University under the supervision of Dr. Michael Nelson. I worked on projects such as Carbon date, What did it look like?, and I Can Haz Memento. Carbon date helps you estimate the birth date of a website, and What did it look like? renders an animated GIF which shows how a website changed over time. I spent the Summer expanding my current research, which is concerned with building collections for stories and events.
7. Tiffany Tseng: Tiffany is the creator of Spin and a Ph.D. graduate of the LiFELONG KiNDERGARTEN group of the MIT Media Lab. Spin is a photography turntable system used for capturing animations of the evolution of design projects. Her research at MIT primarily focused on helping designers and makers document and share their design process. Tiffany also has a comprehensive knowledge about a wide range of snacks.
Interesting things happen when you bring together a group of scholars from different fields with different interests. The opportunity to learn about one another's research from the different perspectives offered by the Fellows and the LIL team was constant. Progress was constant, as were scrum and button making.
A few buttons assembled during one of the many button making rituals at LIL
The 2016 LIL Summer Fellowship concluded with a Fellows share event in which the seven Summer Fellows presented the outcome of their work during the Fellowship.

During the presentation, Neel talked about his interactive African drumming laws website.

A paid permit was required by law in order to drum in the Western Nigeria District Councils
The website provides an online education experience by tracing the creation of about 100 drumming laws between the 1950s and 1970s in District Councils throughout Western Nigeria.

88 CPU Cores processing the CAP XML data
Jay talked about the steps he took to make the dense XML Caselaw data searchable. First, he validated the Caselaw XML files. Second, he converted the files to a columnar data store format (Parquet). Third, he loaded the preprocessed Caselaw data into Apache Drill in order to provide query capabilities.

Examples of different classification systems for legal text: Eurovoc (Left), Library of Congress Subject Headings (Center), and Legislative Indexing Vocabulary (Right)
Sara talked about a general data model she developed which enables developers to harness information available in different legal text classification systems, without having to understand the specific details of each system. 

Ilya demonstrated the new capabilities in the new version of Webrecorder.
Muira talked about her investigation of GiTMO and other detainee libraries. She highlighted her work with the Harvard Law School Library to create a Twitter archive for the tweets of Carol Rosenberg (Miami Herald journalist). She also talked about her experiences in filing Freedom of Information Act (FOIA) requests.
I presented the Geo derivative of the Local Memory Project, which maps zip codes to local news media outlets. I also presented a non-public prototype of the Local Memory Project Google Chrome extension. The extension helps users build, archive, and share collections about local events or stories collected from local news media outlets.

Tiffany's work at Hatch Makerspace: Spin setup (left), PIx documentation station (center), and PIx whiteboard for sharing projects (right)
The presentations concluded with Tiffany's talk about her collaboration with HATCH - a makerspace run by Watertown Public Library. She also talked about her work improving Spin (a turntable system she created).

I will link to the Fellows share video presentation and booklet when LIL posts them.

-- Nwala (@acnwala)