Friday, August 30, 2019

2019-08-30: Where did the archive go? Part 1: Library and Archives Canada


Web archives are established with the objective of providing permanent access to archived web pages, or mementos. However, in our 14-month study of 16,627 mementos from 17 public web archives, we found that three web archives changed their base URLs and did not leave a machine-readable method of locating their new URLs. We were able to manually discover the new URLs for all three archives. A fourth archive has partially ceased operations.

(1) Library and Archives Canada (collectionscanada.gc.ca)
Around May 2018, mementos in this archive were moved to a new archive (webarchive.bac-lac.gc.ca) which has a different domain name. We noticed that 49 mementos (out of 351) cannot be found in the new archive.

(2) The National Library of Ireland (NLI) 
Around May 2018, the European Archive (europarchive.org) was shut down and the domain name was purchased by another entity. The National Library of Ireland (NLI) collection preserved by this archive was moved to another archive (internetmemory.org). All 979 mementos could be retrieved from this new archive (i.e., no missing mementos). Around September 2018, the archive internetmemory.org became unreachable (timeout error). The NLI collection preserved by this archive was moved to another archive (archive-it.org). The other archived collections in internetmemory.org may also have been moved to archive-it.org or to other archives. The number of missing NLI mementos is 192 (out of 979).

(3) Public Record Office of Northern Ireland (PRONI) (webarchive.proni.gov.uk)
Around October 2018, all mementos preserved by this archive were moved to archive-it.org. The PRONI archive's homepage is still online and shows a list of web pages' URLs (not mementos' URLs). Clicking on any of these URLs redirects to an HTML page in archive-it.org that shows the available mementos (i.e., the TimeMap) associated with the selected URL. The number of missing mementos from PRONI is 114 (out of 469).

(4) WebCite (webcitation.org)
The archive has been unreachable (timeout error) for about a month (from June 06, 2019 to July 08, 2019). The archive no longer accepts any new archiving requests, but it still provides access to all preserved mementos.

Library and Archives Canada 

In this post, we provide some details about changes in the archive Library and Archives Canada. Changes in the other three archives will be described in upcoming posts.

We refer to the archive from which mementos have moved as the "original archive", and we use the "new archive" to refer to the archive to which the mementos have moved. A memento is identified by a URI-M as defined in the Memento framework. 

In our study we have 351 mementos from collectionscanada.gc.ca. Around May 2018, 302 of those mementos were moved to webarchive.bac-lac.gc.ca (the 49 remaining mementos are lost). For instance, the memento:

http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/

is now available at:

http://webarchive.bac-lac.gc.ca:8080/wayback/20051228174058/http://nationalatlas.gov/

The representations of both mementos are illustrated in the figure below. The original archive uses the green banner (left) while the new archive uses the yellow banner (right):



We have several observations about the change in the archive Library and Archives Canada:


Observation 1: The HTTP request of a URI-M from the original archive does not redirect to the corresponding URI-M in the new archive

The institution (Library and Archives Canada) that has developed the new archive (webarchive.bac-lac.gc.ca) still controls and maintains the domain name of the original archive (www.collectionscanada.gc.ca). Thus, it would be possible for requests of mementos (URI-Ms) to the original archive to redirect to the corresponding URI-Ms in the new archive. However, we found that every memento request to the original archive redirected to the home page of the new archive as shown below:

$ curl --head --location --silent http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/ | egrep -i "(HTTP/1.1|^location:)"

Location: http://www.bac-lac.gc.ca/eng/discover/archives-web-government/Pages/web-archives.aspx
HTTP/1.1 302 Found
Location: http://webarchive.bac-lac.gc.ca/?lang=en
HTTP/1.1 200

Here is the representation of the home page of the new archive:



We had to manually intervene to detect the corresponding URI-Ms of the mementos in the new archive. This can be done by replacing "www.collectionscanada.gc.ca/webarchives" with "webarchive.bac-lac.gc.ca:8080/wayback" in the URI-Ms of the original archive.
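The prefix replacement described above can be sketched in a few lines of Python. This is only an illustration of the manual mapping; the function name `to_new_urim` is hypothetical and the two prefixes are the ones quoted in this post.

```python
# Prefixes quoted in the post for the original and the new archive.
OLD_PREFIX = "www.collectionscanada.gc.ca/webarchives"
NEW_PREFIX = "webarchive.bac-lac.gc.ca:8080/wayback"


def to_new_urim(original_urim: str) -> str:
    """Map a URI-M in the original archive to the expected URI-M in the new archive."""
    if OLD_PREFIX not in original_urim:
        raise ValueError(f"not a URI-M from the original archive: {original_urim}")
    # Replace only the first occurrence so the embedded URI-R is left untouched.
    return original_urim.replace(OLD_PREFIX, NEW_PREFIX, 1)


example = to_new_urim(
    "http://www.collectionscanada.gc.ca/webarchives/20051228174058/http://nationalatlas.gov/"
)
print(example)
```

Note that the resulting URI-M is only the expected location; as discussed in Observation 2, the new archive may still redirect it to a different memento.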

This reminds us of The End of Term Archive (eot.us.archive.org) which was established with the goal of preserving the United States government web (.gov). The domain name (eot.us.archive.org) is still under the control of the Internet Archive (archive.org). The example below shows how the HTTP request to a URI-M in the End of Term Archive redirects to the corresponding URI-M in the Internet Archive. This practice maintains link integrity via "follow-your-nose" from the old URI-M to the new URI-M.

$ curl --head --location --silent http://eot.us.archive.org/eot/20120520120841/http://www2.ed.gov/espanol/parents/academic/matematicas/brochure.pdf | egrep -i "(HTTP/|^location:)"

HTTP/1.1 302 Found
Location: https://web.archive.org/web/20120520120841/http://www2.ed.gov/espanol/parents/academic/matematicas/brochure.pdf
HTTP/2 200

We can rewrite URI-Ms of the original archive and have them redirect (301 Moved Permanently) to their corresponding URI-Ms in the new archive. For example, for the Apache web server, the mod_rewrite module can be used to perform automatic redirects and rewrite requested URIs on the fly. Here is a rewrite rule example that the original archive could use to redirect requests to the new archive:

# With mod_rewrite
RewriteEngine on
RewriteRule   "^/webarchives/(\d{14})/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1/$2  [L,R=301]

If the original archive serves only mementos under /webarchives, then the mod_rewrite rule would be even simpler:

# With mod_rewrite
RewriteEngine on
RewriteRule   "^/webarchives/(.+)" http://webarchive.bac-lac.gc.ca:8080/wayback/$1  [L,R=301]


Observation 2: Not all mementos are available in the new archive

Each memento (URI-M) represents a prior version of an original web page (URI-R) at a particular datetime (Memento-Datetime). The timestamp, usually included in a URI-M, is identical to the value of the response header Memento-Datetime. 

For example, for:

URI-M = http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/

we have:

Memento-Datetime = Wed, 08 Feb 2006 07:50:19 GMT
URI-R = http://www.cdc.gov/

For a URI-M from the original archive, if the values of the Memento-Datetime, the URI-R, and the final HTTP status code are not all identical to the values of the corresponding URI-M from the new archive, we consider it a missing memento.
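As a rough sketch of this definition, the comparison can be written in Python. It assumes, as noted above, that the 14-digit timestamp in the final URI-M matches the Memento-Datetime response header; in practice the header value itself should be used. The names `Memento`, `parse_urim`, and `is_missing` are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple

# A 14-digit timestamp followed by the URI-R, as in the URI-Ms shown in this post.
URIM_PATTERN = re.compile(r"/(\d{14})/(.+)$")


class Memento(NamedTuple):
    memento_datetime: datetime
    uri_r: str
    status_code: int


def parse_urim(final_urim: str, status_code: int) -> Memento:
    """Split a final URI-M into its Memento-Datetime and URI-R."""
    match = URIM_PATTERN.search(final_urim)
    if match is None:
        raise ValueError(f"no 14-digit timestamp found in {final_urim}")
    captured = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return Memento(captured.replace(tzinfo=timezone.utc), match.group(2), status_code)


def is_missing(original: Memento, new: Memento) -> bool:
    """A memento is missing if any of the three values differs."""
    return original != new


original = parse_urim(
    "http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/", 200
)
new = parse_urim(
    "http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/", 200
)
print(is_missing(original, new))
```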

In this study, we found that 49 mementos (out of 351) cannot be retrieved from the new archive. Instead, the archive responds with other mementos that have different Memento-Datetimes. Those mementos may (or may not) have the same content as the content returned by the original archive. For example, when we requested the URI-M:

http://www.collectionscanada.gc.ca/webarchives/20060208075019/http://www.cdc.gov/

from the original archive (www.collectionscanada.gc.ca) on February 27, 2018, we received the HTTP status "200 OK" with the following representation (the Memento-Datetime of this memento is Wed, 08 Feb 2006 07:50:19 GMT):


In www.collectionscanada.gc.ca
Then, we requested the corresponding URI-M:

http://webarchive.bac-lac.gc.ca:8080/wayback/20060208075019/http://www.cdc.gov/

from the new archive. As shown in the cURL session below, the request redirected to another URI-M:

http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/

This memento has a different Memento-Datetime (Thu, 26 Oct 2006 06:02:47 GMT) for a delta of about 260 days. The content of this memento (the figure below) in the new archive is different from the content of the memento that used to be available in the original archive (the figure above).
In webarchive.bac-lac.gc.ca
$ curl --head --location --silent http://webarchive.bac-lac.gc.ca:8080/wayback/20060208075019/http://www.cdc.gov/ | egrep -i "(HTTP/|^location:|^Memento-Datetime)"

HTTP/1.1 302 Found
Location: http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/
HTTP/1.1 200 OK
Memento-Datetime: Thu, 26 Oct 2006 06:02:47 GMT
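The difference between the two Memento-Datetimes in this example can be checked with a short Python sketch. The helper name `memento_delta` is hypothetical; the two header values are the ones reported above.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def memento_delta(original_header: str, new_header: str) -> timedelta:
    """Return the time difference between two Memento-Datetime header values."""
    return parsedate_to_datetime(new_header) - parsedate_to_datetime(original_header)


delta = memento_delta(
    "Wed, 08 Feb 2006 07:50:19 GMT",  # memento served by the original archive
    "Thu, 26 Oct 2006 06:02:47 GMT",  # memento the new archive redirects to
)
print(delta.days, delta.total_seconds())
```

The result is just under 260 days, which is consistent with the delta quoted above.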

The figure below shows a set of screenshots taken of the memento over 14 months. The screenshots with a blue border are representations of the memento in the original archive (www.collectionscanada.gc.ca) before it was moved to the new archive. The screenshots with a red border show the home page of the new archive, captured before we manually detected the corresponding URI-Ms in the new archive. The screenshots with a green border show the representations resulting from requesting the memento from the new archive (webarchive.bac-lac.gc.ca). The representation before the archive's change (blue border) is different from the representation of the memento after the change (green border).


Replayed the memento 33 times within 14 months.

Observation 3: New features available in the new archive because of the upgraded replay tool

The new archive (webarchive.bac-lac.gc.ca) uses an updated version of OpenWayback (i.e., OpenWayback Release 1.6.0 or later) that enables new features, such as raw mementos and Memento support. These features were not supported by the original archive, which was running OpenWayback Release 1.4 (or earlier).

Raw mementos

At replay time, archives transform the original content of web pages to appropriately replay them (e.g., in a user’s browser). Archives add their own banners to provide metadata about both the memento being viewed and the original page. Archives also rewrite links of embedded resources in a page so that these resources are retrieved from the archive, not from the original server.

Many archives allow accessing unaltered, or raw, archived content (i.e., retrieving the archived original content without any type of transformation by the archive). The most common mechanism to retrieve the raw mementos is by adding "id_" after the timestamp in the requested URI-M.

The feature of retrieving the raw mementos was not provided by the original archive (www.collectionscanada.gc.ca). However, it is supported by the new archive. For example, to retrieve the raw content of the memento identified by the URI-M

http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/

we add "id_" after the timestamp as shown in the cURL session below:

$ curl --head --location --silent http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247id_/http://www.cdc.gov/ | egrep -i "(HTTP/|^Memento-Datetime)"

HTTP/1.1 200 OK
Memento-Datetime: Thu, 26 Oct 2006 06:02:47 GMT
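Constructing the raw URI-M from a regular URI-M is a simple string transformation. The following Python sketch assumes the URI-M contains a 14-digit timestamp path segment, as in the examples in this post; the function name `to_raw_urim` is hypothetical.

```python
import re


def to_raw_urim(urim: str) -> str:
    """Insert the "id_" flag after the first 14-digit timestamp in a URI-M."""
    raw, replaced = re.subn(r"/(\d{14})/", r"/\1id_/", urim, count=1)
    if replaced == 0:
        raise ValueError(f"no 14-digit timestamp found in {urim}")
    return raw


raw_urim = to_raw_urim(
    "http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/"
)
print(raw_urim)
```

Whether an archive honors the "id_" flag depends on its replay tool and configuration, so the resulting URI-M should still be verified with a request like the one above.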

Memento support

The Memento protocol is supported by most public web archives including the Internet Archive. The protocol introduces two HTTP headers for content negotiation. First, Accept-Datetime is an HTTP Request header through which a client can request a prior state of a web resource by providing the preferred datetime, for example,

Accept-Datetime: Mon, 09 Jan 2017 11:21:57 GMT.

Second, the Memento-Datetime HTTP Response header is sent by a server to indicate the datetime at which the resource was captured, for instance,

Memento-Datetime: Sun, 08 Jan 2017 09:15:41 GMT.

The Memento protocol also defines:

  • TimeMap: A resource that provides a list of mementos (URI-Ms) for a particular original resource, 
  • TimeGate: A resource that supports content negotiation based on datetime to access prior versions of an original resource. 
The cURL session below shows the TimeMap of the original resource (http://www.cdc.gov/) available in the new archive. The TimeMap indicates that the memento with the Memento-Datetime Wed, 08 Feb 2006 07:50:19 GMT (as described above) is not available in the new archive.

$ curl http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/

<http://www.cdc.gov/>; rel="original",
<http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/>; rel="self"; type="application/link-format"; from="Thu, 26 Oct 2006 06:02:47 GMT"; until="Fri, 09 Oct 2015 13:26:42 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/http://www.cdc.gov/>; rel="timegate",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/>; rel="first memento"; datetime="Thu, 26 Oct 2006 06:02:47 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20151009132642/http://www.cdc.gov/>; rel="last memento"; datetime="Fri, 09 Oct 2015 13:26:42 GMT"
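Checking a TimeMap for a particular Memento-Datetime can also be automated. The Python sketch below is a simplified parser for the link-format response shown above, not a complete implementation of the format; the names `parse_timemap` and `memento_datetimes` are hypothetical, and the sample text is copied from the cURL session.

```python
import re

# One entry looks like: <URI>; rel="..."; datetime="..."
ENTRY = re.compile(r'<([^>]*)>((?:\s*;\s*\w+="[^"]*")*)')
PARAM = re.compile(r'(\w+)="([^"]*)"')

TIMEMAP = '''<http://www.cdc.gov/>; rel="original",
<http://webarchive.bac-lac.gc.ca:8080/wayback/timemap/link/http://www.cdc.gov/>; rel="self"; type="application/link-format"; from="Thu, 26 Oct 2006 06:02:47 GMT"; until="Fri, 09 Oct 2015 13:26:42 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/http://www.cdc.gov/>; rel="timegate",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20061026060247/http://www.cdc.gov/>; rel="first memento"; datetime="Thu, 26 Oct 2006 06:02:47 GMT",
<http://webarchive.bac-lac.gc.ca:8080/wayback/20151009132642/http://www.cdc.gov/>; rel="last memento"; datetime="Fri, 09 Oct 2015 13:26:42 GMT"
'''


def parse_timemap(text: str) -> list:
    """Return one dict per link entry, holding its URI and attributes."""
    entries = []
    for uri, params in ENTRY.findall(text):
        attributes = dict(PARAM.findall(params))
        attributes["uri"] = uri
        entries.append(attributes)
    return entries


def memento_datetimes(text: str) -> list:
    """List the datetime of every entry whose rel includes "memento"."""
    return [
        entry["datetime"]
        for entry in parse_timemap(text)
        if "memento" in entry.get("rel", "").split()
    ]


datetimes = memento_datetimes(TIMEMAP)
print("Wed, 08 Feb 2006 07:50:19 GMT" in datetimes)
```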

It is possible that two archives use the same version of OpenWayback but with different configuration options, such as whether to support the Memento framework or not:

 <bean name="standardaccesspoint" class="org.archive.wayback.webapp.AccessPoint">
  <property name="accessPointPath" value="${wayback.url.prefix}/wayback/"/>
  <property name="internalPort" value="${wayback.url.port}"/>
  <property name="serveStatic" value="true" />
  <property name="bounceToReplayPrefix" value="false" />
  <property name="bounceToQueryPrefix" value="false" />
  <property name="enableMemento" value="true" />

or how to respond to (raw) archival redirects (thanks to Alex Osborne for help in locating this information):

<!-- WARN CLIENT ABOUT PATH REDIRECTS -->
<bean class="org.archive.wayback.replay.selector.RedirectSelector">
 <property name="renderer">
   <bean class="org.archive.wayback.replay.JSPReplayRenderer">
     <property name="targetJsp" value="/WEB-INF/replay/UrlRedirectNotice.jsp" />
   </bean>
 </property>
</bean>
...
<!-- Explicit (via "id_" flag) IDENTITY/RAW REPLAY -->
<bean class="org.archive.wayback.replay.selector.IdentityRequestSelector">
  <property name="renderer" ref="identityreplayrenderer"/>
</bean>


Observation 4: The HTTP status code may change in the new archive 

The HTTP status codes of URI-Ms in the new archive might not be identical to the HTTP status code of the corresponding URI-Ms in the original archive. For example, the HTTP request of the URI-M:

http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.berlin.gc.ca/

to the original archive resulted in the following "302" redirects before it ended up with the HTTP status code "404":

http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.berlin.gc.ca/ (302)
http://www.collectionscanada.gc.ca/webarchives/20070220181041/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (302)
http://www.collectionscanada.gc.ca/webarchives/20070220181204/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (302)
http://www.collectionscanada.gc.ca/webarchives/20070220181204/http://www.international.gc.ca/global/errors/404.asp?404%3Bhttp://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (404)

When we requested the corresponding URI-M from the new archive, it ended up with the HTTP status code "200":

http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181041/http://www.berlin.gc.ca/ (Redirect by JavaScript (JS))
http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181041/http://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (Redirect by JS)
http://webarchive.bac-lac.gc.ca:8080/wayback/20070220181204/http://www.international.gc.ca/global/errors/404.asp?404%3Bhttp://www.dfait-maeci.gc.ca/canadaeuropa/germany/ (Redirect by JS)
http://webarchive.bac-lac.gc.ca:8080/wayback/20071115025620/http://www.international.gc.ca/canada-europa/germany/ (302) 
http://webarchive.bac-lac.gc.ca:8080/wayback/20071115023828/http://www.international.gc.ca/canada-europa/germany/ (200)

The list of all 351 URI-Ms is shown below. The file contains the following information:
  • The URI-M from the original archive (original_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the original archive (final_original_URI-M). 
  • The HTTP status code of the final URI-M from the original archive (final_original_URI-M_status_code).
  • The URI-M from the new archive (new_URI-M).
  • The final URI-M after following redirects, if any, of the URI-M from the new archive (final_new_URI-M).
  • The HTTP status code of the final URI-M from the new archive (final_new_URI-M_status_code).
  • The difference (in seconds) between the Memento-Datetimes of the final URI-Ms (delta).
  • Whether the URI-Rs of the final URI-Ms are identical or not (same_final_URI-Rs). The different URI-Rs are labeled with "No", otherwise "-".
  • Whether the status codes of the final URI-Ms are identical or not (same_final_URI-Ms_status_code). The different status codes are labeled with "No", otherwise "-".
  • The first 49 rows contain the information of the missing mementos.


Conclusions

When Library and Archives Canada migrated their archive around May 2018, 49 of the 351 mementos we were tracking resurfaced in the new archive with a change in Memento-Datetime, URI-R, or the final HTTP status code. In three cases, the HTTP status codes of mementos in the new archive differ from the status codes in the original archive. Also, updating or upgrading a web archival replay tool (e.g., OpenWayback or PyWb) may affect how migrated mementos are indexed and replayed. In general, with any memento migration, we recommend that, when possible, requests for mementos in the original archive be redirected to their corresponding mementos in the new archive (e.g., the case of The End of Term Archive explained above).

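In the upcoming posts we will provide some details about changes in the other three archives: the National Library of Ireland (NLI), the Public Record Office of Northern Ireland (PRONI), and WebCite.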

--Mohamed Aturban

Saturday, August 24, 2019

2019-08-24: Six WS-DL Classes Offered for Fall 2019

https://xkcd.com/2180/

A record six WS-DL courses are offered for Fall 2019:
I am on research leave for Fall 2019 and will not be teaching.

Dr. Brunelle's CS 891 is especially suitable for incoming graduate students who would like an introduction to reading research papers and giving presentations.

If you're interested in these classes, you need to take them this semester.  Although subject to change, a likely offering of WS-DL classes for Spring 2020 is:
  • CS 395 Research Methods in Data and Web Science, Dr. Michael L. Nelson
  • CS 480/580 Introduction to AI, Dr. Vikas Ashok
  • CS 495/595 Introduction to Data Mining, Dr. Sampath Jayarathna
  • CS 800 Research Methods, Dr. Michele C. Weigle
Dr. Wu has a course buyout and will not be teaching in Spring 2020.

--Michael

Wednesday, August 14, 2019

2019-08-14: Building the Better Crowdsourced Study - Literature on Mechanical Turk

The XKCD comic "Study" parodies the challenges of recruiting study participants.

As part of "Social Cards Probably Provide For Better Understanding Of Web Archive Collections" (recently accepted for publication by CIKM2019), I had to learn how to conduct user studies. One of the most challenging problems to solve while conducting user studies is recruiting participants. Amazon's Mechanical Turk (MT) solves this problem by providing a marketplace where participants can earn money by completing studies for researchers. This blog post summarizes the lessons I have learned from other studies that have successfully employed MT. I have found parts of this information scattered throughout different bodies of knowledge, but not gathered in one place; thus, I hope it is a useful starting place for future researchers.

MT is by far the largest source of study participants, with over 100,000 available participants. MT is an automated system that facilitates the interaction of two actors: the requester and the worker. A worker signs up for an Amazon account and must wait a few days to be approved. Once approved, MT provides the worker with a list of assignments to choose from. A Human Intelligence Task (HIT) is an MT assignment. Workers perform HITs for anywhere from $0.01 up to $5.00 or more. Workers earn as much as $50 per week completing these HITs. Workers are the equivalents of subjects or participants found in research studies.

Workers can browse HITs to complete via Amazon's Mechanical Turk.
Requesters are the creators of HITs. After a worker completes a HIT, the requester decides whether or not to accept the HIT and thus pay the worker. Requesters use the MT interface to specify the amount to be paid for a HIT, how many unique workers per HIT, how much time to allot to workers, and when the HIT will no longer be available for work (expire). Also, requesters can specify that they only want workers with specific qualifications, such as age, gender, employment history, or handedness. The Master Qualification is assigned automatically by the MT system based on the behavior of the workers. Requesters can also specify that they only want workers with precise approval rates.

Requesters can create HITs using the MT interface, which provides a variety of templates.
The HITs themselves are HTML forms entered into the MT system. Requesters have much freedom within the interface to design HITs to meet their needs, even including JavaScript. Once the requester has entered the HTML into the system, they can preview the HIT to ensure that it looks and responds as expected. When the requester is done creating the HIT, they can then save it for use. HITs may contain variables for links to visualizations or other external information. When the requester is ready to publish a HIT for workers to perform, they can submit a CSV file containing the values for these variables. MT will create one HIT per row in the CSV file. Amazon will require that the requester deposit enough money into their account to pay for the number of HITs they have specified. After the requester pays for the HITs, workers can see the HIT and then begin their submissions. The requester then reviews each submission as it comes in and pays workers.
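As an illustration of the CSV step, the Python sketch below builds a one-column file in which each row supplies the value for a single HIT. The column name `visualization_url` and the example URLs are hypothetical; the sketch assumes the CSV header must match the variable name used in the HIT's HTML.

```python
import csv
import io


def build_hit_csv(variable_name: str, values: list) -> str:
    """Build CSV text with one header row and one row (one HIT) per value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([variable_name])
    for value in values:
        writer.writerow([value])
    return buffer.getvalue()


# Hypothetical variable name and URLs, for illustration only.
hit_csv = build_hit_csv(
    "visualization_url",
    ["https://example.com/card-1", "https://example.com/card-2"],
)
print(hit_csv)
```

With two data rows, this file would correspond to two HITs, since MT creates one HIT per row.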

The MT environment is different from that used in traditional user studies. MT participants can use their own devices to complete the study wherever they have a connection to the Internet. Requesters are limited in the amount of data that they can collect on MT participants. For each completed HIT, the MT system supplies the completion time and the responses provided by the MT participant. A requester may also employ JavaScript in the HIT to record additional information.

In contrast, traditional user studies allow a researcher to completely control the environment and record the participant's physical behavior. Because of these differences, some scholars have questioned the effectiveness of MT's participants. To assuage this doubt, Heer et al. reproduced the results of a classic visualization experiment. The original experiment used participants recruited using traditional methods. Heer recruited participants via MT and demonstrated that the results were consistent with the original study. Kosara and Ziemkiewicz reproduced one of their previous visualization studies and discovered that MT results were equally consistent with the earlier study. Bartneck et al. conducted the same experiment with both traditionally recruited participants and MT workers. They also confirmed consistent results between these groups.

MT is not without its criticism. Fort, Adda, and Cohen raise questions on the ethical use of MT, focusing on the potentially low wages offered by requesters. In their overview of MT as a research tool, Mason and Suri further discuss such ethical issues as informed consent, privacy, and compensation. Turkopticon is a system developed by Irani and Silberman that helps workers safely voice grievances about requesters, including issues with payment and overall treatment.

In traditional user studies, the presence of the researcher may engender some social motivation to complete a task accurately. MT participants are motivated to maximize their revenue over time by completing tasks quickly, leading some MT participants to not exercise the same level of care as a traditional participant. Because of the differences in motivation and environments, MT studies require specialized design. Based on the work of multiple academic studies, we have the following advice for requesters developing meaningful tasks with Mechanical Turk:
  • complex concepts, like understanding, can be broken into smaller tasks that collectively provide a proxy for the broader concept (Kittur 2008)
  • successful studies ensure that each task has questions with verifiable answers (Kittur 2008)
  • limiting participants by their acceptance score has been successful for ensuring higher quality responses (Micallef 2012, Borkin 2013)
  • participants can repeat a task – make sure each set of responses corresponds to a unique participant by using tools such as Unique Turker (Paolacci 2010)
  • be fair to participants; because MT is a competitive market for participants, they can refuse to complete a task, and thus a requester's actions lead to a reputation that causes participants to avoid them (Paolacci 2010)
  • better payment may improve results on tasks with factually correct answers (Paolacci 2010, Borkin 2013, PARC 2009) – and can address the ethical issue of proper compensation
  • being up front with participants and explaining why they are completing a task can improve their responses (Paolacci 2010) – this can also help address the issue of informed consent
  • attention questions can be useful for discouraging or weeding out malicious or lazy participants that may skew the results (Borkin 2013, PARC 2009)
  • bonus payments may encourage better behavior from participants (Kosara 2010) – and may also address the ethical issue of proper compensation
MT provides a place to recruit participants, but recruitment is only one part of successfully conducting user experiments. To create successful user experiments, I recommend starting with "Methods for Evaluating Interactive Information Retrieval Systems with Users" by Diane Kelly.

For researchers starting down the road of user studies, I recommend starting first with Kelly's work and then circling back to the other resources noted here when developing their experiment.

-- Shawn M. Jones

Saturday, August 3, 2019

2019-08-03: Searching Web Archives for Unattributed Deleted Tweets From Politwoops

Tweet URL: https://twitter.com/derekwillis/status/1127234631865118731

On May 11th 2019, Derek Willis, who works at Propublica and also maintains the Politwoops project, tweeted a list of deleted tweet ids found by Politwoops that could not be attributed to any Twitter handle being tracked by Politwoops. This was an opportunity for us to revisit our interest in using web archives to uncover deleted tweets. Although we were unsuccessful in finding any of the deleted tweet ids provided by Politwoops in web archives, we are documenting our process for coming to this conclusion.

Politwoops  

Politwoops is a web service which tracks deleted tweets of elected public officials and candidates running for office in the USA and 55 other countries. The Politwoops USA is supported by Propublica.

Creating Twitter handles list for the 116th Congress 

In a previous post, we discussed the challenges involved in creating a data set of Twitter handles for the members of Congress and provided a data set of Twitter handles for the 116th Congress. A member of Congress can have multiple Twitter accounts, which can be categorized into official, personal, and campaign accounts. We decided to create a data set of official Congressional Twitter accounts rather than personal or campaign accounts because we did not want to track the personal tweets of the members of Congress. For this reason, our data set has a one-to-one mapping between a member of Congress and their Twitter handle, listing all the current 537 members of Congress with their official Twitter handles. However, Politwoops has a one-to-many mapping between a member of Congress and their Twitter handles because it tracks all the Twitter handles for a member of Congress. We expanded our data set of Twitter handles for the 116th Congress by adding the rest of the Twitter handles that Politwoops tracks to those already in our data set. For example, our data set of Twitter handles for the 116th Congress has @RepAOC as the Twitter handle for Rep. Alexandria Ocasio-Cortez while Politwoops lists @AOC and @RepAOC as her Twitter handles.
Figure 1: Screenshot of  Rep. Alexandria Ocasio-Cortez's Politwoops page highlighting the two handles (@AOC, @RepAOC) Politwoops  tracks for her

Creating the President, the Vice-President, and Committee Twitter handles list

Politwoops USA tracks members of Congress, the President, and the Vice-President. Propublica provides the list of Twitter handles being tracked by Politwoops in the data sources list available at the Propublica Congress API. Furthermore, we found that a subset of the committee Twitter handles is present in Politwoops but is not advertised in the data sources list via the Propublica API. With no complete list of the committee Twitter handles being tracked by Politwoops, we used the CSPAN list of committee handles.
Figure 2: CSPAN Twitter Committee List showing the SASC Majority committee's Twitter handle, @SASCMajority
Figure 3: Politwoops returns a 404 for SASC Majority committee's Twitter handle, @SASCMajority
Figure 4: Screenshot of the @HouseVetAffairs committee Twitter handle being tracked by Politwoops

List of Different Approaches Used to Find Deleted Tweets using the Web Archives   

Internet Archive cdx Server API

The Internet Archive cdx Server API can be used to list all the mementos in the index of Internet Archive for a URL or URL prefix. We can broaden our search for a URL with the URL match scope option provided by the cdx Server API. In our study, we have used the URL match scope of "prefix".
The URL http://web.archive.org/cdx/search/cdx?url=https://twitter.com/repaoc&matchType=prefix searches for all the URLs in the Internet Archive with the prefix https://twitter.com/repaoc. Using this approach, we received all the different URL variants that exist in the Internet Archive index for @RepAOC.
Excerpt of the response received from the Internet Archive's cdx server API for @RepAOC
com,twitter)/repaoc 20190108184114 https://twitter.com/RepAOC text/html 200 GBB2ADFZOLTFQAPQACVT2XFVBVSEEHT5 42489
com,twitter)/repaoc 20190109161007 https://twitter.com/RepAOC text/html 200 SLZHJQKN25URYRWQUQI7DW5JZD5M5E6F 43004
com,twitter)/repaoc 20190109200548 https://twitter.com/RepAOC text/html 200 DWGHG6CSHBE7OETXJD3TEINEWKV372DJ 45123
com,twitter)/repaoc 20190120082837 https://twitter.com/repaoc text/html 200 JVHASBSCBHPGKCVR7GBVOYRM4H5KQYBP 53697
com,twitter)/repaoc 20190126051939 https://twitter.com/repaoc text/html 200 YRE4RPA46F7PTQNBQUMHKCLWLL2WUXE2 56420
com,twitter)/repaoc 20190202170000 https://twitter.com/RepAOC text/html 200 6VS73H6XD5T2TVRC4UJXNT2D6FCNZWMJ 55388
com,twitter)/repaoc 20190207211032 https://twitter.com/repaoc text/html 200 NQQI4UJ6TUMHS36JATOY35D7P255MEIA 56378
com,twitter)/repaoc 20190221024247 https://twitter.com/RepAOC text/html 200 K6B3P7IRHIXTZSPXRWUPSBCRZ2HCWBZB 56678
com,twitter)/repaoc 20190223102039 https://twitter.com/RepAOC text/html 200 OO2U6EUXYTGGEE2Q3ARQJ4SI4QGF2CLR 58008
com,twitter)/repaoc 20190223180906 https://twitter.com/RepAOC text/html 200 HC6RCIVTTUV6JU35PA2JZ256E7RXY2MN 56799
com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
com,twitter)/repaoc 20190305195452 https://twitter.com/RepAOC text/html 200 XH646QWCIOJ4KB4LCPQ6P6MMYSTDMNAA 58315
com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59586
com,twitter)/repaoc 20190306232948 https://twitter.com/RepAOC text/html 200 UL2KWN3374FHMP2JFV4TUWODVLEBKZY6 59587
com,twitter)/repaoc 20190307011545 https://twitter.com/RepAOC text/html 200 R5PQUDWVYCZGAH3B4LVSBQXFXZ5MVXSY 59388
com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59430
com,twitter)/repaoc 20190307214043 https://twitter.com/RepAOC text/html 200 GWIJQTMZPFZEJPUT47H2ORDCSF4RP5EX 59431
com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
com,twitter)/repaoc 20190309213407 https://twitter.com/RepAOC text/html 200 WDEQBQN552GO2S6SB4IOKLW7M7WDWPCG 59293
com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59498
com,twitter)/repaoc 20190310215135 https://twitter.com/RepAOC text/html 200 MLSCN7ITZVENNMB6TBLCI6BXCR3PSL4Z 59499
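
As a rough illustration of how such a query can be built and how a response line breaks down into fields, here is a minimal Python sketch. It is not the code used in the study: the helper names cdx_prefix_url and parse_cdx_line are hypothetical, and the field names assume the CDX server's default space-separated column order.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

# Assumed default column order of a plain-text CDX server response.
CDX_FIELDS = ["urlkey", "timestamp", "original", "mimetype",
              "statuscode", "digest", "length"]


def cdx_prefix_url(url):
    """Build a CDX query that matches every URL starting with `url`.

    urlencode percent-encodes the target URL, which is equivalent to
    the unencoded form shown in the text.
    """
    return CDX_ENDPOINT + "?" + urlencode({"url": url, "matchType": "prefix"})


def parse_cdx_line(line):
    """Split one space-separated CDX line into a dict of named fields."""
    return dict(zip(CDX_FIELDS, line.split()))


if __name__ == "__main__":
    print(cdx_prefix_url("https://twitter.com/repaoc"))
    sample = ("com,twitter)/repaoc 20190108184114 https://twitter.com/RepAOC "
              "text/html 200 GBB2ADFZOLTFQAPQACVT2XFVBVSEEHT5 42489")
    print(parse_cdx_line(sample)["original"])
```

Keeping every field as a string avoids guessing at types; the status code and length can be converted with int() where needed.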

Example for a status URL
com,twitter)/repaoc/status/1082706172623376384 20190108201259 http://twitter.com/RepAOC/status/1082706172623376384 unk 301 3I42H3S6NNFQ2MSVX7XZKYAYSCX5QBYJ 447
The tweet id from the status URL was compared with the list of deleted tweet ids provided by Politwoops. Using this approach we did not find any matching tweet ids. 
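
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the regular expression assumes status URLs of the form twitter.com/<handle>/status/<id>, and the deleted-id set in the usage example is a placeholder rather than real Politwoops data.

```python
import re

# Assumed shape of a status URL: twitter.com/<handle>/status/<tweet id>
STATUS_RE = re.compile(r"twitter\.com/[^/]+/status/(\d+)", re.IGNORECASE)


def status_tweet_id(url):
    """Return the tweet id embedded in a status URL, or None for other URLs."""
    match = STATUS_RE.search(url)
    return match.group(1) if match else None


def find_deleted(urls, deleted_ids):
    """Return the sorted tweet ids from `urls` that appear in `deleted_ids`."""
    found = {status_tweet_id(u) for u in urls}
    return sorted(tid for tid in found if tid in deleted_ids)


if __name__ == "__main__":
    urls = [
        "http://twitter.com/RepAOC/status/1082706172623376384",
        "https://twitter.com/RepAOC",
    ]
    deleted_ids = {"0000000000000000000"}  # placeholder, not real data
    print(find_deleted(urls, deleted_ids))
```

Tweet ids are kept as strings so that they compare exactly with ids read from a text file.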

From mementos of the 116th Congress

For this approach, we fetched all the mementos from web archives for the 116th Congress between 2019-01-03 and 2019-05-15 using MemGator, a Memento aggregator service.
For example, we queried multiple web archives for Rep. Karen Bass's Twitter handle, @RepKarenBass, to fetch all the mementos for her Twitter profile page. All the embedded tweets from the memento were parsed and compared with the deleted list of tweet ids from Politwoops. Using this approach we did not find any matching tweet ids. 
Example of a URI-M in CDXJ format

20190201043735 {"uri": "http://web.archive.org/web/20190201043735/https://twitter.com/RepKarenBass", "rel": "memento", "datetime": "Fri, 01 Feb 2019 04:37:35 GMT"}
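
A CDXJ line like the one above is a 14-digit timestamp followed by a JSON object, so it can be read with the standard library alone. The sketch below is illustrative, not the study's code: the function names are hypothetical, and it assumes that any TimeMap header lines do not start with a digit and can be skipped.

```python
import json


def parse_cdxj_line(line):
    """Split a CDXJ line into its timestamp key and JSON payload."""
    timestamp, payload = line.strip().split(maxsplit=1)
    record = json.loads(payload)
    record["timestamp"] = timestamp
    return record


def memento_uris(lines):
    """Collect URI-Ms from the memento entries of a CDXJ TimeMap."""
    uris = []
    for line in lines:
        # Assumption: only memento entries begin with a digit.
        if not line[:1].isdigit():
            continue
        record = parse_cdxj_line(line)
        # "rel" may hold several values, e.g. "first memento".
        if "memento" in record.get("rel", "").split():
            uris.append(record["uri"])
    return uris


if __name__ == "__main__":
    sample = ('20190201043735 {"uri": "http://web.archive.org/web/'
              '20190201043735/https://twitter.com/RepKarenBass", '
              '"rel": "memento", "datetime": "Fri, 01 Feb 2019 04:37:35 GMT"}')
    print(memento_uris([sample]))
```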

Figure 5: Screenshot of the memento for Rep. Karen Bass's Twitter profile page with 20 embedded tweets
Output upon parsing the fetched mementos
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827923375880359937Timestamp: 1486227289|||TweetText: Sometimes the best way to stand up is to sit down. Happy Birthday Rosa Parks. #OurStory #BlackHistoryMonthpic.twitter.com/fjPMeD3RzX
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827593256988860417Timestamp: 1486148583|||TweetText: I urge the Gov't of Cameroon to respect the civil and human rights of all of its citizens. See my full statement: http://bass.house.gov/media-center/press-releases/rep-bass-condemns-intimidation-against-english-speaking-population …
TweetType: RT|||ScreenName: RepKarenBass|||TweetId: 827292997100376064Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
TweetType: OTR|||ScreenName: RepBarbaraLee|||TweetId: 827285964674441216Timestamp: 1486075318|||TweetText: Join me in wishing @HouseGOP happy #GroundhogDay! After spending 7 years looking for a viable #ACA alternative, they still have nothing.pic.twitter.com/miqwtKM06L
TweetType: OT|||ScreenName: RepBarbaraLee|||TweetId: 827201943323938816Timestamp: 1486055286|||TweetText: This month is National Children’s Dental Health Month (NCDHM). This year's slogan is "Choose Tap Water for a Sparkling Smile"pic.twitter.com/gk1cj8oTK9
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196902273929217Timestamp: 1486054084|||TweetText: On the growing list of things I shouldn't have to defend my stance on, add #UCBerkeley, 1 of our nation's most prestigious pub. universities
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196521347166209Timestamp: 1486053993|||TweetText: .@realDonaldTrump: #UCBerkeley developed immunotherapy for cancer!
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196386512871425Timestamp: 1486053961|||TweetText: .@realDonaldTrump: Do you like Vitamin K? Discovered/synthesized at #UCBerkeley
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 827196102554296320Timestamp: 1486053894|||TweetText: .@realDonaldTrump What's your stance on painkillers? Beta-endorphins invented at #UCBerkeleyhttps://twitter.com/realDonaldTrump/status/827112633224544256 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826960463590207488Timestamp: 1485997713|||TweetText: Happy to see Judge Birotte of LA continue the fight towards ending Pres. Trump’s exec. order.http://www.latimes.com/local/lanow/la-me-ln-federal-order-travel-ban-20170201-story.html …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826877783787847681Timestamp: 1485978000|||TweetText: This morning, I was happy to attend @MENTORnational's Capitol Hill Day, where mentors advocate for services for all youth. Thank you!pic.twitter.com/EyVgDSIvuE
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826860675007930368Timestamp: 1485973921|||TweetText: A civil & women's rights activist, Dorothy Height helped black women throughout America succeed. #OurStory #BlackHistoryMonth #NewStamppic.twitter.com/v8wnHFpgMu
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826833993874042880Timestamp: 1485967560|||TweetText: Let's not turn our backs on the latest refugees and potential citizens just because they come from Africa. More: https://bass.house.gov/media-center/press-releases/rep-bass-pens-letter-urging-president-trump-rescind-travel-ban …pic.twitter.com/J9veQNSpJu
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826822912057413633Timestamp: 1485964918|||TweetText: Trump's listening session is w people he knows and should be "listening" to all the time---campaign surrogates, supporters, employees
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826799517295058944Timestamp: 1485959340|||TweetText: 57 years ago, four Black college students sat at a lunch counter and asked for lunch. We will not go back. #OurStory #BlackHistoryMonthpic.twitter.com/ER00yv1q7B
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826606703928078336Timestamp: 1485913370|||TweetText: 7 in 10 Americans do NOT support @POTUS relentless quest to strike down Roe v Wade.  Where does #Gorsuch stand? #SCOTUS
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826579567376637952Timestamp: 1485906900|||TweetText: Proud to stand w/ my Foreign Affairs colleagues and defend dissenting diplomats..http://www.politico.com/story/2017/01/trump-immigration-ban-state-department-dissent-democrats-234433 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826547056235933697Timestamp: 1485899149|||TweetText: Treasury nominee #Mnuchin denied that his company engaged in robo-signing, foreclosing on Americans without proper review #RejectMnuchin
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826474625831890945Timestamp: 1485881880|||TweetText: Few cities on this planet have benefited so handsomely from immigration as LA. Read the @TrumanProject letter: http://ow.ly/oOhF308waXJ
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826426234422829056Timestamp: 1485870343|||TweetText: Today is the day! #GetCoveredhttps://twitter.com/JoaquinCastrotx/status/826416237223755777 …
TweetType: OT|||ScreenName: RepKarenBass|||TweetId: 826275582883270656Timestamp: 1485834425|||TweetText: Pres. Trump has replaced Yates as standing AG for standing up for millions. You can't replace us all.http://www.cbsnews.com/amp/news/trump-fires-acting-attorney-general-sally-yates/?client=safari …
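
Output lines in this format can be read back and checked against a deleted list with a short script. The following is a sketch under assumptions, not the parser used in the study: the function names are hypothetical, the pattern tolerates the absence of a "|||" separator between TweetId and Timestamp as seen in the lines above, and the deleted-id set is a placeholder.

```python
import re

# Field layout inferred from the output lines above; the separator
# between TweetId and Timestamp is treated as optional.
LINE_RE = re.compile(
    r"TweetType: (?P<tweet_type>\w+)\|\|\|"
    r"ScreenName: (?P<screen_name>\w+)\|\|\|"
    r"TweetId: (?P<tweet_id>\d+)(?:\|\|\|)?"
    r"Timestamp: (?P<timestamp>\d+)\|\|\|"
    r"TweetText: (?P<text>.*)",
    re.DOTALL,
)


def parse_tweet_line(line):
    """Return the fields of one output line as a dict, or None if malformed."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def matching_deleted(lines, deleted_ids):
    """Return parsed tweets whose id appears in `deleted_ids`."""
    parsed = filter(None, map(parse_tweet_line, lines))
    return [tweet for tweet in parsed if tweet["tweet_id"] in deleted_ids]


if __name__ == "__main__":
    line = ("TweetType: OT|||ScreenName: RepKarenBass|||"
            "TweetId: 827196521347166209Timestamp: 1486053993|||"
            "TweetText: .@realDonaldTrump: #UCBerkeley developed "
            "immunotherapy for cancer!")
    deleted_ids = {"0000000000000000000"}  # placeholder, not real data
    print(parse_tweet_line(line)["tweet_id"])
    print(matching_deleted([line], deleted_ids))
```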

From mementos of the 115th Congress

For this approach, we reused locally stored TimeMaps and mementos from the 115th Congress, which we collected between 2017-01-20 and 2019-01-03. The list of Twitter handles for the 115th Congress was obtained from the data set of 115th Congress tweet ids released by Social Feed Manager. We requested mementos from the web archives by expanding the URI-R for each Twitter handle with the language and with_replies arguments (thanks to Sawood Alam for suggesting the language variations).
For example, we queried multiple web archives for Doug Jones's Twitter handle, @dougjones, by expanding with the language and with_replies arguments as shown below:

https://twitter.com/dougjones
Twitter supports 47 different language variations and multiple arguments such as with_replies. Upon searching for the URI-R https://twitter.com/dougjones, the web archives return all the mementos for the exact URI-R without any language variations or arguments.

Excerpt of the TimeMap response received for https://twitter.com/dougjones

20110210134919 {"uri": "http://web.archive.org/web/20110210134919/http://twitter.com:80/dougjones", "rel": "first memento", "datetime": "Thu, 10 Feb 2011 13:49:19 GMT"}
20180205201909 {"uri": "http://web.archive.org/web/20180205201909/https://twitter.com/DougJones", "rel": "memento", "datetime": "Mon, 05 Feb 2018 20:19:09 GMT"}
20180306132212 {"uri": "http://wayback.archive-it.org/all/20180306132212/https://twitter.com/DougJones", "rel": "memento", "datetime": "Tue, 06 Mar 2018 13:22:12 GMT"}
20180912165539 {"uri": "http://wayback.archive-it.org/all/20180912165539/https://twitter.com/DougJones", "rel": "memento", "datetime": "Wed, 12 Sep 2018 16:55:39 GMT"}
Upon searching for the URI-R https://twitter.com/dougjones?lang=en, the web archives return all the mementos for the language variation "en".

TimeMap response received for https://twitter.com/dougjones?lang=en

20190424140424 {"uri": "http://web.archive.org/web/20190424140424/https://twitter.com/dougjones?lang=en", "rel": "first memento", "datetime": "Wed, 24 Apr 2019 14:04:24 GMT"}
20190501165834 {"uri": "http://web.archive.org/web/20190501165834/https://twitter.com/dougjones?lang=en", "rel": "memento", "datetime": "Wed, 01 May 2019 16:58:34 GMT"}
20190509164649 {"uri": "http://web.archive.org/web/20190509164649/https://twitter.com/dougjones?lang=en", "rel": "last memento", "datetime": "Thu, 09 May 2019 16:46:49 GMT"}
Many mementos in the web archives have Twitter handle URLs that include the language and with_replies arguments. Therefore, for each Twitter handle we queried the profile URL and the with_replies URL, each on its own and with each of the 47 language variations. In total we created 2 + 2 × 47 = 96 URLs for each Twitter handle.
https://twitter.com/dougjones?lang=en (47 URLs for 47 languages)
Total: 96 URLs for each URI-R
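
The count of 96 follows from two root URLs (the profile and its with_replies page), each queried alone and with 47 language codes. The sketch below generates the variants; it is illustrative only. The function name is hypothetical, the with_replies path and the ?lang= query form are assumptions about the URL shapes, and the language list in the example is a short placeholder rather than Twitter's list of 47.

```python
def handle_variants(handle, languages):
    """Return the profile and with_replies URLs for a handle, each bare
    and with one ?lang=<code> variant per language."""
    base = f"https://twitter.com/{handle}"
    # Assumed path for the "Tweets & replies" view of a profile.
    roots = [base, f"{base}/with_replies"]
    urls = list(roots)
    for root in roots:
        urls.extend(f"{root}?lang={lang}" for lang in languages)
    return urls


if __name__ == "__main__":
    # Placeholder list; the study used 47 language codes.
    for url in handle_variants("dougjones", ["en", "fr", "de"]):
        print(url)
```

With a list of 47 language codes, the function returns 2 + 2 × 47 = 96 URLs per handle.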

Example for different language variation URLs:
...

The embedded tweets parsed from the mementos were compared with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids.
We also had locally stored mementos for the 115th Congress from 2017-01-01 to 2018-06-30. The data set of Twitter handles for this collection was created by taking a Wikipedia page snapshot of the current members of Congress on July 4, 2018 and using the CSPAN Twitter list on members of Congress and Politwoops to get all the Twitter handles. Upon parsing the embedded tweets from the mementos, we compared the parsed tweets with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids. 

From mementos of the President, the Vice-President and the Committee Twitter handles list

For this analysis we fetched all the mementos for the President, the Vice-President, and committee handles between 2019-01-03 and 2019-06-30. Upon fetching the mementos and parsing embedded tweets, we compared the parsed tweets with the deleted list from Politwoops. Using this approach we did not find any matching tweet ids.  

After completing our analysis, we learned how to extract a timestamp from a tweet id. Snowflake is the service Twitter uses to generate unique ids for tweets and for other objects within Twitter such as lists, users, and collections. Snowflake generates unsigned 64-bit integers, which consist of:
  • timestamp - 41 bits (millisecond precision w/ a custom epoch gives us 69 years)
  • configured machine id - 10 bits - gives us up to 1024 machines
  • sequence number - 12 bits - rolls over every 4096 per machine (with protection to avoid rollover in the same ms)
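
Given this bit layout, the timestamp is the part of the id above the low 22 bits (10 machine bits plus 12 sequence bits), added to Snowflake's custom epoch. The sketch below shows the arithmetic; it is not the TweetedAt implementation. The function name is hypothetical, the epoch constant 1288834974657 ms (2010-11-04 UTC) is the commonly documented Twitter epoch, and the approach does not apply to ids issued before Twitter adopted Snowflake.

```python
from datetime import datetime, timezone

# Snowflake's custom epoch in milliseconds since the Unix epoch.
TWITTER_EPOCH_MS = 1288834974657


def decode_snowflake(tweet_id):
    """Split a Snowflake id into timestamp, machine id, and sequence."""
    tweet_id = int(tweet_id)
    # Drop the low 22 bits (10 machine + 12 sequence) to get the
    # milliseconds elapsed since the custom epoch.
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return {
        "timestamp_ms": timestamp_ms,
        "datetime": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "machine_id": (tweet_id >> 12) & 0x3FF,  # next 10 bits
        "sequence": tweet_id & 0xFFF,            # lowest 12 bits
    }


if __name__ == "__main__":
    info = decode_snowflake("1082706172623376384")
    print(info["datetime"].isoformat(), info["machine_id"], info["sequence"])
```

For the status URL example earlier in this post, tweet id 1082706172623376384 decodes to 2019-01-08 18:30:41 UTC, shortly before its capture at 20190108201259. Likewise, tweet id 827923375880359937 decodes to the Unix time 1486227289, which matches the Timestamp field in the parsed output above.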
We have created a web service, TweetedAt, to extract timestamps from deleted tweet ids. Using TweetedAt, we found the timestamps of all the deleted tweet ids provided by Derek Willis from Politwoops. Of the 107 Politwoops deleted tweet ids, only six fell within the 116th Congress time range and nine within the 115th Congress time range.
To summarize, we were unable to find any of the deleted tweet ids provided by Derek Willis from Politwoops. We analyzed the following sources:
  • Mementos for the 116th Congress Twitter handles between 2019-01-03 and 2019-05-15.
  • Twitter handle mementos for the committees, the President and the Vice-President between 2019-01-03 and 2019-06-30. 
  • Mementos for the 115th Congress Twitter handles between 2017-01-03 and 2019-01-03. 
  • The Internet Archive cdx server API responses on the 116th Congress Twitter handles.
There are several possible reasons for being unable to find the deleted tweet ids provided by Derek Willis from Politwoops:
  • 92 out of 107 deleted tweets were outside the date range of our analysis.
  • The mementos in a web archive are indexed by their URI-Rs. When a user changes their Twitter handle, the URI-R for the user's Twitter account also changes. For example, Rep. Nancy Pelosi used the Twitter handle @nancypelosi during the 115th Congress but changed it to @speakerpelosi in the 116th Congress. Querying the web archives with her current handle, @speakerpelosi, returns mementos that begin in the 116th Congress. To get mementos from before the 116th Congress, we need to query the web archives with the earlier handle, @nancypelosi.
  • The data set of Twitter handles for the US Congress used in our analysis has a one-to-one mapping between a seat in Congress and a member of Congress. If a seat has been held by multiple members during a Congress, the data set includes only the current member, so the Twitter handles of former members within the same Congress are lost.
We analyzed the web archives for the 115th and the 116th Congress members, the President, the Vice-President, and committee Twitter handles to find the deleted tweet ids provided to us by Derek Willis from Politwoops. Although we were unable to find any match for the deleted tweet ids, we will continue to investigate as we learn more. We welcome any information that might aid our analysis.

-----
Mohammed Nauman Siddique
(@m_nsiddique)