Summary: a woman hit a bicyclist participating in a race (the cyclist apparently was not seriously injured) and then bragged about it on Twitter. The cyclist was apparently not going to report the event, but her bragging changed his mind and he contacted the police:
@emmaway20 we have had tweets ref an RTC with a bike. We suggest you report it at a police station ASAP if not done already & then dm us
— Norwich Police (@NorwichPoliceUK) May 19, 2013
Interestingly, unlike the Twitpic examples in the previous post, the Instagram images do not have a thumbnail on Topsy (the thumbnails are pulled directly from instagram.com). I won't be posting about all ill-advised tweets that will surely occur in the future, but this example was too good to pass up. I predict the driver has earned a mention on an upcoming "Wait Wait... Don't Tell Me!"
Marc Spaniol, from Max Planck Institute for Informatics, Germany, welcomed the audience in the opening note of the workshop. He emphasized the workshop's goal of building a community of interest in the temporal web.
Omar Alonso, from Microsoft Silicon Valley, was the keynote speaker, with a presentation entitled “Stuff happens continuously: exploring Web contents with temporal information”. Omar divided his presentation into three parts: Time in document collection, Social data, and Exploring the web using time.
In the Time in document collection part, Omar introduced the temporal dimension of documents, starting from the question “What is Time?”. Time may be represented in a normalized format or as a hierarchy. Temporal expressions come in four types: times; durations; sets, which may be explicit (e.g., May 2, 2012) or implicit (e.g., Labor Day); and relative expressions. Temporal expressions can be extracted with approaches such as temporal taggers and Named Entity Recognition (NER) for time, and they can be expressed in the TimeML format. Omar explained that people care about temporality because it describes their landmarks and evolution, such as winning a game for a soccer player or financial quarters for an accountant. By involving the crowd, we can obtain a completely annotated calendar for free by combining all the hot topics of the year.
Then, Omar explained the effect of social media on the concepts of “Temporal and Document”.
Twitter limits the document to 140 characters. Time in Twitter is expressed through: trending topics, e.g., Mother's Day; hashtags, e.g., #tempweb2013; cashtags, which are hashtags that start with $ and mark financial information (e.g., $apple); and group chats, in which people tweet at a specific time to discuss a specific topic.
Time in Facebook is known by the Timeline, photos over time, and the generic events.
Temporally-Aware Signals. User interests may be time sensitive, for example tweeting about recent, seasonal, or ongoing activities.
Community Question Answering (CQA) also has a temporal dimension. CQA helps users answer questions that they can't answer using web search engines. Some answers don't change over time (e.g., what is the distance to the moon?); others are time-sensitive.
Reddit, a sharing platform popular in the US, also has a time dimension. Reddit is popular enough to attract famous people to communicate with the crowd.
Review systems such as Amazon, Yelp, and Foursquare have temporal characteristics, as reviews may change over time.
Time in Wikipedia is tracked by the evolution of edits by users.
After that, Omar moved to the last part of his presentation, the exploration of the web using time. Correlating different sources on the web by time could give us a better understanding of an event, for example the relation between a hashtag and a Wikipedia article. CrowdTiles is an example of combining the Web, Twitter, and Wikipedia as a part of Bing search results.
While this approach works well for popular events, it needs modification to look back at less popular events. Combining different data sources introduces new research questions: how to manage duplicates and near duplicates, what the temporal precedence between them is, how to rank the results by temporal value, and how to evaluate the success of these techniques.
Session 1: Web Archiving
Daniel Gomes, from the Portuguese Web Archive, gave a presentation entitled “A Survey of Web Archive Search Architectures”. Daniel gave an overview of the current search paradigms in web archives. The use cases showed that users demand Google-like search from web archives. Of the web archives surveyed, 89% have URL search, 79% have metadata search, and 67% have fulltext search. These numbers were computed based on publications about the web archives and the authors' experience.
Then, Daniel gave three examples of search systems in web archives. The Portuguese Web Archive provides fulltext search for 1.2B documents using NutchWAX; its index is partitioned by time and by document. EverLast is a P2P architecture in which the tasks (crawling, versioning, and indexing) are distributed between different nodes. The Wayback Machine is a URL search architecture that uses flat sorted files (called CDX) to index the web pages. Daniel proposed a single portal for searching across web archives. The new system faces the challenge that web archive data is spread across different systems and technologies. A prototype was tested on the Portuguese Web Archive and showed good results. The new system requires a new mechanism for ranking search results from different sources, in addition to a user interface that combines these sources.
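To illustrate why flat sorted files (called CDX) are enough for URL search, here is a minimal Python sketch. The three-field line format, the `INDEX` data, and the `lookup` helper are simplifications invented for illustration; real CDX files carry more fields. The point is only that a sorted index lets a lookup jump to the first capture of a URL by binary search.

```python
import bisect

# Hypothetical, simplified index lines: "<url key> <14-digit timestamp> <status>".
# The list is kept sorted, as a CDX-style file would be on disk.
INDEX = sorted([
    "example.com/ 20100101000000 200",
    "example.com/ 20120615120000 301",
    "example.com/about 20110301000000 200",
    "example.org/ 20090505050505 200",
])

def lookup(index, url_key):
    """Return every index line for url_key, located by binary search."""
    prefix = url_key + " "
    # First position whose line is >= the prefix; all matches follow contiguously.
    start = bisect.bisect_left(index, prefix)
    results = []
    for line in index[start:]:
        if not line.startswith(prefix):
            break
        results.append(line)
    return results

for capture in lookup(INDEX, "example.com/"):
    print(capture)
```

Because captures of the same URL sit next to each other and are ordered by timestamp, the same scan also yields the captures in chronological order.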
The next presentation in the session was mine, entitled “Archival HTTP Redirection Retrieval Policies”. In this presentation, we studied URI lookup in the web archive, taking into consideration the HTTP redirection status of the live or archived URI. We proposed two new measurements: Stability, which captures the change of the URI status and location through time; and Reliability, the percentage of mementos per TimeMap that end with a 200 HTTP status. Finally, we proposed two new retrieval policies.
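As a rough illustration of the Reliability measurement described above, the following Python sketch computes the share of mementos in one TimeMap that end with a 200 status. The function name and the sample status list are hypothetical, and the sketch assumes the final status of each memento (after following any redirects) has already been collected.

```python
from typing import Iterable

def reliability(final_statuses: Iterable[int]) -> float:
    """Fraction of mementos in one TimeMap whose final HTTP status is 200."""
    statuses = list(final_statuses)
    if not statuses:
        # An empty TimeMap has no mementos to rely on.
        return 0.0
    ok = sum(1 for status in statuses if status == 200)
    return ok / len(statuses)

# Hypothetical final statuses for the five mementos of one TimeMap.
timemap_statuses = [200, 200, 404, 200, 503]
print(f"{reliability(timemap_statuses):.0%}")
```

Here three of the five mementos end with 200, so the sketch reports 60%.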
Daniel Gomes gave another presentation entitled “Creating a Billion-Scale Searchable Web Archive”, in which he shared their experience building the Portuguese Web Archive. First, they integrated data from three collections, some of which were on CDs. They built tools to convert the saved web files to the ARC format. Then, the Portuguese Web Archive started its own live web crawl in 2007, focusing on Portuguese-speaking domains except .br. They built a Heritrix add-on, called DeDuplicator, that saves 41% of disk space on the weekly crawl and 7% on the daily one, with a total of 26.5 TB/year. The Portuguese Web Archive has enabled fulltext searching, has internationalization support, and has a new graphical design.
Session 2: Identifying and leveraging time information
Julia Kiseleva, from Emory University, presented “Predicting temporal hidden contexts in web sessions”. In her presentation, Julia analyzed web logs as sets of user actions. She aimed to find contexts that help to build more accurate local models. Julia built a user navigation graph and used two partition mechanisms: horizontal partitioning based on context (e.g., geographical position) and vertical partitioning based on the action alphabet (e.g., Ready to buy or Just Browsing). Julia used http://www.mastersportal.eu/ in her experiment. She also suggested using the sitemap to define the set of applicable steps.
Omar Alonso presented “Timelines as Summaries of Popular Scheduled Events”. Omar built a framework with minute level granularity to compare the events during the game with the social media reactions. Omar gave some examples from World Cup and the tweets about the game. The results showed a strong relationship between the game events and the user activities on Twitter.
Session 3: Studies & Experience Sharing
Lucas Miranda presented “Characterizing Video Access Patterns in Mainstream Media Portals”. Lucas studied the video access patterns on the major Brazilian media providers. Lucas showed some figures that summarized their results.
Hideo Joho, from University of Tsukuba, presented “A Survey of Temporal Web Search Experience”. Hideo studied the temporal aspects of web search by surveying 110 people, who answered 18 questions related to their recent search experience. Hideo showed quantitative and qualitative analyses of his results.
We often encounter web services that take a very long time to respond to our HTTP requests. In the case of an eventual network failure, we are forced to issue the same HTTP request again. We frequently consume web services that do not support REST. If they did, we could utilize the full range of HTTP methods while retaining the functionality of our application, even when the external API we utilize in our application changes. We sometimes wish to set up a web service that takes job requests, processes long-running job queues, and notifies the clients individually or in groups. HTTP does not allow multicast or broadcast messaging. HTTP also requires the client to stay connected to the server while the request is being processed.
Introducing HTTP Mailbox - An Asynchronous RESTful HTTP Communication System. In a nutshell, HTTP Mailbox is a mailbox for HTTP messages. Using its RESTful API, anyone can send an HTTP message (request or response) to anyone else independent of the availability, or even the identity of recipient(s). The HTTP Mailbox stores these messages and delivers them on demand. Each HTTP message is encapsulated in the body of another HTTP message and sent to the HTTP Mailbox using a POST method. Similarly, the HTTP Mailbox encapsulates the HTTP message in the body of its response when a GET request is made to retrieve the messages.
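The encapsulation idea can be sketched in a few lines of Python. This is an illustration only, not the reference implementation: the helper names are made up, and sending to the `/hm/<recipient URI>` path with a `message/http` content type is an assumption modeled on the retrieval example later in this post. The inner PATCH message is the one used in that example.

```python
CRLF = "\r\n"

def build_http_message(start_line, headers, body=""):
    """Serialize a start line, headers, and body into raw HTTP message text."""
    head = [start_line] + [f"{name}: {value}" for name, value in headers]
    return CRLF.join(head) + CRLF + CRLF + body

def encapsulate(inner, mailbox_host, recipient_uri):
    """Wrap a raw HTTP message in a POST request addressed to the mailbox.

    Using the /hm/<recipient URI> path for POST is an assumption of this
    sketch; the post only shows that path for GET.
    """
    headers = [
        ("Host", mailbox_host),
        ("Content-Type", "message/http"),
        ("Content-Length", str(len(inner.encode("utf-8")))),
    ]
    return build_http_message(f"POST /hm/{recipient_uri} HTTP/1.1", headers, inner)

# The message the sender actually wants the recipient to receive.
inner = build_http_message(
    "PATCH /tasks/1 HTTP/1.1",
    [("Host", "example.com"),
     ("Content-Type", "text/task-patch"),
     ("Content-Length", "11")],
    "Status=Done",
)
outer = encapsulate(inner, "example.net", "http://example.com/tasks")
print(outer)
```

The whole inner message, start line and headers included, travels as the body of the outer request, which is why the recipient can later replay it with any HTTP method.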
Tunneling HTTP traffic over HTTP was also explored in the Relay HTTP. But the Relay HTTP relays the live HTTP traffic back and forth and does not store HTTP messages. It works like a proxy server to only overcome JavaScript's cross-origin restriction in Ajax requests. The Relay HTTP still requires the client and server along with the relay server to meet in time.
The store-and-forward nature of the HTTP Mailbox is inspired by Linda. We have taken the simplicity of the Linda model and implemented it using HTTP on the scale of the Web. This approach has enabled asynchronous, indirect, time-uncoupled, space-uncoupled, individual, and group communication over HTTP. Time-uncoupling means that the sender and recipient(s) do not need to meet in time to communicate, while space-uncoupling means that the sender and recipient(s) do not need to know each other's identity to communicate. The HTTP Mailbox also enables utilization of the full range of HTTP methods otherwise unavailable to standard clients and servers.
The above figure shows the lifecycle of a typical HTTP message using the HTTP Mailbox in four steps. We will walk through this example to explain how it works. Assume that the client wants to send the following HTTP message to the server at example.com/tasks/1.
Step 3, example.com makes an HTTP GET request to the HTTP Mailbox server to retrieve its messages. To retrieve the most recent message sent to "http://example.com/tasks" a request will look like this:
> GET /hm/http://example.com/tasks HTTP/1.1
> Host: example.net
Step 4, the response from the HTTP Mailbox will contain the most recent message sent to "http://example.com/tasks". The response will also include a "Link" header that will give the URLs to navigate through the chain of messages for that recipient.
< HTTP/1.1 200 OK
< Date: Thu, 20 Dec 2012 02:10:22 GMT
< Link: <http://example.net/hm/id/aebed6e9>; rel="first",
<   <http://example.net/hm/id/5ecb44e0>; rel="last self",
<   <http://example.net/hm/id/85addc19>; rel="previous",
<   <http://example.net/hm/http://example.com/tasks>; rel="current"
< Via: Sent by 127.0.0.1
<   on behalf of http://example.org/alice
<   delivered by http://example.net/
< Content-Type: message/http; msgtype: request
< Content-Length: 108
<
< PATCH /tasks/1 HTTP/1.1
< Host: example.com
< Content-Type: text/task-patch
< Content-Length: 11
<
< Status=Done
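A recipient that wants to walk the chain of messages needs to pull the relation types out of the "Link" header. The Python sketch below does this for the header value shown above. It is a deliberately simplified parser (it only understands `<url>; rel="..."` entries and is not a full Link header parser), and the function name is hypothetical.

```python
import re

def parse_link_header(value):
    """Map each rel token in a Link header value to its target URL."""
    links = {}
    for match in re.finditer(r'<([^>]*)>\s*;\s*rel="([^"]*)"', value):
        url, rels = match.groups()
        # One entry may carry several space-separated relation types.
        for rel in rels.split():
            links[rel] = url
    return links

link_value = (
    '<http://example.net/hm/id/aebed6e9>; rel="first", '
    '<http://example.net/hm/id/5ecb44e0>; rel="last self", '
    '<http://example.net/hm/id/85addc19>; rel="previous", '
    '<http://example.net/hm/http://example.com/tasks>; rel="current"'
)
links = parse_link_header(link_value)
print(links["previous"])
```

Following the "previous" URL repeatedly until a response no longer offers one would visit every stored message for the recipient, newest first.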
A tech report describing the HTTP Mailbox in detail is published on arXiv. A reference implementation of the HTTP Mailbox can be found on GitHub.
We have already used the HTTP Mailbox successfully in the following applications.
Preserve Me! - a distributed web object preservation system that establishes social network among web objects and uses the HTTP Mailbox for its communication needs.
Preserve Me! Viz - a dynamic and interactive network graph visualization tool that gives insight into the Preserve Me! graph and communication.
Where else can we use the HTTP Mailbox?
Warrick - a tool to restore lost websites. It can use the HTTP Mailbox to accept service requests and send status notifications.
Carbon Dating the Web - a tool to find out the age of a resource at a given URL. Each request in the queue usually takes a few minutes to complete. It can utilize the HTTP Mailbox to accept service requests and send the response when ready.
Device notifications - related to software updates, general application messaging.
Any application that needs asynchronous RESTful HTTP messaging.
You're probably thinking "the Library of Congress". And you're right, since 2010 they have been (see the announcements from Twitter and LC). But LC is currently providing access only to researchers, and the scale of the archive makes access challenging (see LC's January 2013 white paper that provides a status update on the project).
To say I think this joint project between LC and Twitter is exciting and important is an understatement; I could go on about the scholarly importance, the cultural and technological record, the phenomena of social media, etc. So I was surprised (but in retrospect, should not have been) when almost immediately afterwards projects like noloc.org surfaced so you could opt out of the archiving of your public tweets.
However, while you might be able to prevent LC from archiving your tweets, companies like Topsy are archiving them, or at least some of them. Topsy is one of my new, favorite sites in part because they archive tweets; not necessarily because archiving them is the right thing to do (tm), but presumably because: 1) it allows them to build interesting services on top of the tweets, and 2) deleting them is probably more work than not deleting them*. Hany SalahEldeen and I began exploring Topsy in the context of his research on temporal intent in social media link sharing.
Although I think they think their primary business model is searching the social web, to me the most interesting services are generating the retweet and link neighborhoods for tweets. For example:
provides all the various tweets that linked to: cottagelabs.com/news/meeting-the-oaipmh-use-case-with-resourcesync. Topsy will promote your status to "influential" or "highly influential", presumably based on a mix of followers and/or retweets (e.g., Clay Shirky is "highly influential" with 274k followers, but Farrah is "influential" with less than 500 followers but presumably many retweets). Tweets about links can be "interesting", and that appears to be based on a tweet having different text from the HTML title of the target link.
Let's look at some examples of how Topsy is archiving at least some tweets. This September 28, 2012 story on BBC.com cited our TPDL 2012 paper, but also started with a nice quote for motivation and context:
On January 28 2011, three days into the fierce protests that would eventually oust the Egyptian president Hosni Mubarak, a Twitter user called Farrah posted a link to a picture that supposedly showed an armed man as he ran on a "rooftop during clashes between police and protesters in Suez". I say supposedly, because both the tweet and the picture it linked to no longer exist. Instead they have been replaced with error messages that claim the message – and its contents – "doesn’t exist".
It's true: although the user "Farrah" still exists, she has deleted many of her tweets from during the Egyptian Revolution. For example, the tweet and the picture linked to in the tweet are 404:
But if we prepend the twitpic URI with "topsy.com/" to get:
Topsy will also archive the tweets marked for deletion from the LC archive. For example, tweets like this from the author of noloc.org no longer exist and presumably were deleted before inclusion in the LC archive:
Note that the above link is a relative offset, so the actual tweets might scroll off that page. This reflects a limitation of the service at least with respect to being an actual archive: it offers only a limited window (at least for the free service) of 100 pages of 10 tweets each. For active accounts this 1000 tweet window will scroll by quickly. For example, the right-wing politics site twitchy.com was giddy when a White House staffer mistook/misspelled "congenital" as "congenial" in this now deleted June 29, 2012 tweet:
But at the time of this writing, 100 pages back only takes you to January 29, 2013 so we can't see if Topsy has archived this tweet.
In my recent presentation at the 2013 IIPC meeting, I mentioned the zombie movie trope of not using the word "zombie" to describe zombies (i.e., no one in a zombie movie has ever heard of zombies). I drew the parallel of not "using the a-word" -- perhaps the best, commercially viable archives don't use the word "archive". I don't believe Topsy markets its services as an "archive", but that is what it is providing (modulo the 100 page limitation as well as not supporting archival protocols like Memento). On the other hand, the word "archive" denotes a certain level of permanency, and who knows if Topsy will survive in the marketplace? This list from tweetsmarter.com has a number of social media companies, many of which are now defunct. If Topsy goes under, most likely its extensive archives will disappear as well. True, most of the material won't be missed, but historically important material, such as Farrah's live tweeting of the Egyptian Revolution, will disappear with it. And since it is not clear how to monetize archives and with actual archives such as WebCite running a donation campaign, we should be reminded that what we perceive as "archives" are really just web sites. So who will archive the archives?
* = I don't have any details about how Topsy is designed, their business relationship with Twitter, or anything of the like. Nor have I paid for a "pro" account or anything like that. All observations are from my position of being outside and looking in.
From April 22--26, Michael Nelson and I attended the International Internet Preservation Consortium (IIPC) General Assembly 2013, which was hosted by the National and University Library of Slovenia in Ljubljana, Slovenia. This year is the ten-year anniversary of the IIPC. The GA this year had the theme "What were the past challenges? and how can we plan the future of IIPC?". Also, this year, Old Dominion University became an official member of the IIPC. The GA was organized into five days.
Day 1: Monday, April 22, 2013
IIPC General Assembly.
Mateja Komel Snoj, the director of the National and University Library Slovenia, and Alenka Kavčič – Čolić, the Head of the Library Research Center at the National and University Library Slovenia, opened the day, welcomed the attendees, and expressed their pleasure at hosting the IIPC GA in Slovenia. Mateja emphasized the importance of digital preservation and the role of the National and University Library Slovenia in preserving the Slovenian digital heritage. Then, the steering committee chair, Cathy Hartman from University of North Texas, started her welcome note by thanking the hosting institution and the planning committee for their effort in organizing the GA this year. Cathy announced the theme for the GA, "How can we plan for the future?", and the nominations for the IIPC outstanding award for 2011-2012. Helen Hockx-Yu from British Library spoke about IIPC @ 10. Helen showed a video from IIPC GA 2011 that discussed the web archiving challenges at that time. Helen identified the keys to IIPC's success as "Collaboration" in common projects (e.g., Heritrix) and "Cost Effectiveness" in using the budget to secure more projects (e.g., WARC ISO standards). She highlighted the importance of web archiving for all fields: computer science, political science, journalism, etc. She ended by reminding the IIPC members to think productively about where they are going.
The second new member was Valério Pereira da Silva from Organisation Information Archivierung (OIA). OIA provides web archiving services (collecting the web, hosting, and providing access) to its partners, including BMW and Siemens. Aaron Bins and Abigail Potter, the former program and communication officers respectively, gave an update on their efforts from 2010-2013. Aaron extended the IIPC programs beyond the technical track. Abigail announced that IIPC has 44 members from around the world. Her efforts included a promotional video, a DPC award nomination, a web archiving use case report, and a LinkedIn group. Mary Pitt, the new program and communication officer, introduced herself to the community and described her background. Finally, Clément Oury, IIPC treasurer, from Bibliothèque nationale de France, gave an update on the prior financial year.
After the break, Hansueli Locher from the Swiss National Library presented how to access e-Helvetica. e-Helvetica mixes existing tools with newly developed interfaces to meet its challenges: web-based integration of different digital materials, scalability, fulltext indexing for different languages, metadata, and access rights. Then came my presentation about ArcLink: Additional API support for Wayback Machines. ArcLink is a complete system to extract, preserve, and access the link structure of the archived web. ArcLink has an XML/RDF interface that could be integrated with a Wayback Machine.
The morning session concluded with members' updates. The British Library announced the new legal deposit legislation to capture digital content in the UK, and the challenge of defining UK websites. Kris Carpenter from the Internet Archive announced the completion of the "End of Term" collaborative project, which preserved public government information at the end of US presidential and congressional terms. They had 5000+ URIs.
After lunch, we had the 10 years of IIPC panel, facilitated by Cathy Hartman, with the panelists Gordon Mohr, Internet Archive; Birgit Henriksen, Royal Library of Denmark; and Thorsteinn Hallgrimsson, National and University Library of Iceland. The panelists recalled the early construction of IIPC in 2003. They agreed on the initially established goals: collaboration between the small countries and the big institutions, and the development of new tools. IIPC members have a broad view of their problems, and they can share problems and solutions. They discussed the main challenges over the 10 years. Initially, the challenge was how to convince the home organizations of IIPC's ideas, then how to promote them to the world. From the harvesting perspective, the evolution of the web and its interfaces made crawling and replaying the web harder. From the access perspective, the challenges were the lack of a defined set of APIs and tools shared between the archives, and the quality of the archived corpus.
In the next session, Cathy Hartman and Gildas Illien provided an overview of the discussion paper that covered the previous and future efforts of the web archives. The discussion paper posed various questions to the community: Does IIPC need any update in its organizational structure? How can communication between the members be increased? What should we access and what should we not? The community members raised interesting issues such as: outreach to other institutions, e.g., UNESCO and Google; making web archiving more attractive to users, e.g., with Personal Digital Archiving tools; and focusing on the issues common to the members. After that, we divided into three groups to discuss these ideas closely; the results of these groups were discussed at the beginning of the second day.
Day 2: Tuesday, April 23, 2013
Workshops and Small Group Meetings - IIPC Members Only.
The second day started with an overview of the outcomes from the discussion groups. The suggestions included: continuing the common tools developments like Heritrix and Wayback, outreach for other organizations, educating the members about the existence of Memento APIs, and taking benefits from the surveys between the institutions.
After the break, two workshops ran in parallel. I attended the quality assurance workshop, which focused on the quality assurance tools that help the digital curator.
Helen Hockx-Yu from BL presented the new update to the QA module in the Web Curator Tool for selective crawls. The update reduces the number of clicks required, which makes the process easier for curators. The tool provides the curator with a thumbnail and a colored indicator of the page quality. The tool also allows curators to enter their feedback about bad pages. The next tool from BL was presented by Andy Jackson. Andy gave a demonstration of using Monitrix, a Heritrix crawl monitoring tool, in a domain crawl. Monitrix gives statistics about the different crawling processes, such as rate of capture, crawling by TLD, redirection, and robots exclusion. Monitrix has detailed information about crawled domains by number of URIs, MIME type, and discovered viruses.
Brenda Reyes, from University of North Texas, gave a presentation about the current state and practices of QA in web archives. Brenda found that there is little public information about QA activities. In her study, she interviewed curators from UNT and the Library of Congress, in addition to emailing the other web archives. Brenda asked people to complete a survey about the quality of the web archive.
The next session was the Memento Aggregator update. Herbert Van de Sompel, from Los Alamos National Lab, announced the new Memento Internet Draft 7.0.
Herbert covered various topics related to Memento, such as the SiteStory transactional archive, Wikipedia's Memento compliance challenge, and IIPC aggregator progress. Herbert explained the two current Memento aggregator implementations, distributed search and metadata harvesting, and explored two novel Memento aggregator architectures: rule-based and context-dependent. Herbert urged the web archives to become Memento compliant, especially by enabling the Wayback Machine's Memento interface.
Michael Nelson, from Old Dominion University, presented various results related to the Memento Aggregator. Michael showed that the best caching policy for the distributed search aggregator is 15 days. Michael also found that the coverage of the archived web depends on the source of the URIs; his research on profiling the web archives illustrated the coverage per TLD and language.
David Rosenthal, from Stanford University, discussed how to make Memento successful. Rosenthal discussed some problems with Memento and how to solve them. First, Memento needs client support, such as the Mementofox plugin, or gateway support, such as the British Library gateway. The second problem is the quality of the archived web pages: Memento should be able to indicate and filter pages based on quality. Rosenthal also discussed the funding and publicity problems of the Memento aggregator.
The open technical discussion about Memento emphasized the usage of Memento beyond the web archive, such as content management systems. Also, the Memento team should change the Memento language from requirements to benefits to make the archive compliant.
The education and training workshop discussed the various education and training opportunities that are provided by IIPC such as: Web Archive Workshop at BnF, PhD sponsorship, and Staff Exchange.
In the evening, we had a tour of Ljubljana, including the National and University Library Slovenia; then we had dinner at the City Castle of Ljubljana.
Day 3: Wednesday, April 24, 2013
Working Group Meetings - IIPC Members.
Day 4: Thursday, April 25, 2013
Scholarly Access to Web Archives: Progress, Requirements and Challenges - OPEN TO PUBLIC.
The fourth day focused on scholarly and research use of the web archives. Janko Klasinc, from National and University Library Slovenia, presented web archiving at the National and University Library of Slovenia. Janko announced that the Slovenian web archive is currently public. The legal deposit law took effect in 2006. Janko explained the challenges of crawling .si domains, and demonstrated the new Wayback Annotator tool, which enables users to create and annotate their own collections of archived pages.
Session 1: Scholarly Use of Web Archives
Niels Brügger, from Aarhus University, gave a presentation entitled "Scholarly Use of Web Archives Initiatives and Fundamental Tools". Brügger gave an overview of three main projects: NetLab, RESAW, and FUTARC. NetLab is a research infrastructure project for studying internet materials, as a part of the Danish Digital Humanities Lab. NetLab has various subprojects: Digital footprints (an API to get data from Facebook), Network analysis of Danish parliamentary elections 2001-2011, and Cross media production and communication. The RESAW project aims to integrate the existing national research infrastructures for studying archived web materials; it includes 40 web archives, research groups, and scholars. RESAW is driven by researchers' questions and extends the web materials to include emails and apps. The FUTARC project aims to develop fundamental tools for studying the web archives.
FUTARC studied corpus creation problems such as: identifying the URLs; completeness, which should be considered during archiving; and versioning challenges, which include inconsistency in the temporal and spatial dimensions.
Sophie Gebeil, from Aix-Marseille-Université, presented "Memories of North African immigration on the web". Sophie showed that the web enables North African immigrants to remember otherwise forgotten events, such as the march for equality in 1983 and the general moves for independence in 1961. The witnesses of these events put their comments on the web. Historians need to understand web archives in order to use them in their research.
Meghan Dougherty, from Loyola University Chicago, presented "Researcher Engagement with Web Archives". Meghan conducted a set of interviews with archivists, technicians, and researchers between 2008 and 2010. Meghan summarized the researchers' obstacles in using the web archives as follows: collaboration between the researchers and the technicians; the need to build customized tools that fit their work; and the need for help in selecting the right tool to meet their needs.
Helen Hockx-Yu, from British Library, presented "Web archives in the eyes of scholars". In 2012, they surveyed 49 researchers divided into two groups: those who already use the Archive for research (26%) and those who have not used the Archive (74%). The study summarized the requirements web archives must fulfill for scholarly use, such as: availability of the data online; access to the collection crawl log, policy, and configuration; and quality of the archived version. Helen found that accessing websites that are no longer available is the primary use case for the web archive, so we can't compete with the live web.
Session 2: Research Tools for Scholarly Use of Web Archives
Ditte Laursen and Bjarne Andersen, from Netarchive.dk, presented their work on automated screen capturing to study the interplay between the real-time internet and live television in cross-media productions. The case study was "The Voice", which was broadcast on TV concurrently with updates and discussions on Facebook and the TV homepage. The Danish archive summarized the challenges as: synchronization between the crawlers, the crawler's ability to capture micro changes, and the limitations of archiving streamed content. The Danish archive developed a tool to record the various streams concurrently; the tool comes with a web interface to manage the recording jobs.
Andrew Jackson, from the British Library, presented "Seeing in the dark: Discovery and data‐mining of restricted web archives". Andy discussed the problem of how to provide access to a restricted archive. The case study was the JISC collection of .uk sites from 1996-2010, where the challenge was to make the data publicly accessible. Andy suggested publishing the data as CC0 open data sets, providing richer APIs, and building analytical access services. From the .uk collection, Andy showed results for geo-indexes, top-level domain linkage, formats, and MIME types. For the .uk collection, the BL provides WAT files and full-text search.
Julien Masanès, from the Internet Memory Foundation, presented "Leveraging social web to propel your archiving campaign". Julien started with the researchers' challenges in using the web archive: the temporal dimension of the data, the metadata about the data, and the format of the data (i.e., researchers can't use WARC files). Internet Memory realized the need for an infrastructure for research. First, they replaced WARCs with BigTable technologies. Second, they built an Extractor module on top of the data store. Finally, the user can build an execution pipeline on top of the Extractor. They have already built applications for near-duplicate multimedia video detection, named entity evolution, and temporal queries.
Then, Julien demonstrated the Archive The Net tool, which finds relevant materials for a special event based on the user's query for archiving. The tool has three stages: Setup, where the user submits a new crawl with a name and description, in addition to target keywords and hashtags; Discovery, where the tool constructs a set of Twitter users who share links about the topic; and Collection, where the tool collects, crawls, and ranks the links to create a campaign for browsing or preservation. You can try the beta version of the tool.
Hugo Huurdeman, from the WebART team, presented "WebART: Facilitating Scholarly Use Of Web Archives". The WebART project aims to combine the development of the web archive with research activity in order to develop tools that help researchers. In addition to the Wayback Machine interface, WebART is supported by WebArtist, a full-text search for a Dutch news aggregator. WebART hosted a winter school at the University of Amsterdam where they gave the tools to researchers for analysis. The school's work resulted in various applications, for example, word frequency, co-word analysis, outlink analysis, geomapping location wire service, and image temporal listing. Hugo summarized the challenges as follows: data quality and quantity, the search system, and the user interface.
The last session was given by Clément Oury, from the National Library of France, about "Building a master's degree on digital archiving". The degree is offered by enssib, the librarian school in France. The colloquium combines theoretical modules, such as legal deposit and heritage, research-driven archiving, and third-party archiving, with practical experience on the selection, harvesting, and access sides.
Day 5: Friday, April 26, 2013
Workshops - OPEN TO PUBLIC.
The last day was open to the public. IIPC ran two workshops: Twittervane, a crowdsourcing technique for capturing seed URLs from Twitter; and Future Web, which discussed the evolution of the Web and the current challenges it poses for archiving in terms of capturing, replaying, and data mining.
In the course of our research we often needed to determine
when a certain web resource was created. In numerous cases, this question is
fairly straightforward to answer by examining the resource itself. Articles
often have publishing datetime stamps, social media contributions have a posting
time, and for other resources you can estimate the creation date by reading the
resource itself. This process is simple when examining a resource manually, but
it is harder to automate when the dataset of resources is large.
To solve this problem we conducted several experiments to
automatically determine when a resource was created. When a resource is
created it often gets indexed by search engines, archived in the public
archives, and shared on social media, thus leaving trails of existence. We
trace those trails of existence and use the earliest appearance across all trails
as a close estimate of the creation date. The timeline below illustrates a
common scenario of the lifetime of a resource.
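The earliest-appearance rule described above can be sketched in a few lines of Python. This is only an illustration, not the code of the published service: the function name, the source labels, and the timestamps are all hypothetical, and we assume each source has already been reduced to a single first-seen time (or None when no trail was found).

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def estimate_creation_date(first_seen: Dict[str, Optional[datetime]]) -> Optional[datetime]:
    """Return the earliest timestamp across all trails, or None if no trail exists."""
    # Ignore sources that returned nothing for this resource.
    found = [ts for ts in first_seen.values() if ts is not None]
    return min(found) if found else None


# Hypothetical first-appearance times for one resource (illustrative values only).
trails = {
    "search_engine": datetime(2013, 3, 4, 9, 30, tzinfo=timezone.utc),
    "web_archive": datetime(2013, 3, 10, 0, 0, tzinfo=timezone.utc),
    "social_media": datetime(2013, 3, 2, 17, 45, tzinfo=timezone.utc),
    "backlinks": None,  # no trail found in this source
}

estimate = estimate_creation_date(trails)
print(estimate.isoformat())
```

Keeping all timestamps timezone-aware (UTC here) avoids comparing values that were reported in different local times by different sources.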
We also examined the existence of a last modified timestamp
in the resource’s header and the feasibility of using it as an estimate of the
creation date. In addition, we examined the resource’s backlinks and estimated
their creation dates, which can be easier to extract and which give us insight
into when the resource itself was created.
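As a rough illustration of the header check, the following Python sketch turns an HTTP Last-Modified value into a UTC datetime. The helper name is hypothetical and this is not taken from the published code; it assumes the response headers are already available as a plain mapping, and it treats a missing or unparsable header as "no estimate".

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Mapping, Optional


def last_modified_estimate(headers: Mapping[str, str]) -> Optional[datetime]:
    """Parse the Last-Modified header (if any) into a UTC datetime."""
    # Header names are case-insensitive, so compare them in lowercase.
    value = None
    for name, header_value in headers.items():
        if name.lower() == "last-modified":
            value = header_value
            break
    if value is None:
        return None
    try:
        parsed = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Malformed date strings are treated as a missing header.
        return None
    if parsed.tzinfo is None:
        # Assumption: a value without a usable zone is interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)
```

Note that many servers report the time a page was last generated rather than when it was first published, so this value is best treated as one more candidate alongside the other sources rather than as the answer on its own.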
In order to test the accuracy of our estimation we collected
1200 resources for which we could manually extract the creation date from
different sources. We tested our model and were able to estimate a creation date
for over 75% of the resources, with 33% matching the exact creation date.
After validating our model we utilized it in building an age
estimation service which, if provided with the resource’s URL, returns a
JSON object of the creation dates from each source (search engines, archives,
social media, backlinks, and others) and the estimated lowest creation date.
You can use the service at: http://cd.cs.odu.edu/cd/<YOUR_URL_HERE>
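A minimal Python sketch of calling the service could look like the following. Only the base address comes from this post; the function names are hypothetical, and we assume the target URL is appended to the base exactly as the pattern above suggests, without extra encoding. Because the layout of the returned JSON object is not spelled out here, the sketch simply decodes it into a dictionary and does not assume any field names.

```python
import json
from urllib.request import urlopen

# Base address taken from the post; everything else is an illustrative assumption.
SERVICE_BASE = "http://cd.cs.odu.edu/cd/"


def carbon_date_request_url(resource_url: str) -> str:
    """Build the request URL by appending the target URL to the service base."""
    return SERVICE_BASE + resource_url


def fetch_estimate(resource_url: str, timeout: float = 60.0) -> dict:
    """Query the service and decode its JSON reply into a dictionary."""
    with urlopen(carbon_date_request_url(resource_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_estimate("http://www.example.com/page") requires network access and a running instance of the service; the generous timeout is a guess, on the assumption that the service has to query several external sources before it can answer.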
We also published the code on GitHub. You can
download it from: https://github.com/HanySalahEldeen/CarbonDate along with the instructions to install it. To
run the service yourself, first register with Bitly and Topsy and get their
corresponding API keys. Second, modify the config file by adding your keys.
Finally, launch server.py on your designated IP and port.
This work was published at the third annual TempWeb workshop at the WWW 2013 conference in Rio de Janeiro, Brazil.
- Hany M. SalahEldeen, Michael L. Nelson, Carbon Dating The Web: Estimating the Age of Web Resources, Proceedings of TempWeb03, WWW 2013. (Also available as a Technical Report http://arxiv.org/abs/1304.5213).
On April 5-6, I was pleased to have the opportunity to meet and network with many successful senior women as well as graduate students from other universities at the CRA-W Graduate Cohort, which was held in Boston, MA. Grad Cohort, which began in 2004, aims to increase the ranks of senior women in computing by building and mentoring nationwide communities of women through their graduate studies. Grad Cohort accepts women students in their first, second, or third year of graduate school in computer science and engineering, and provides sessions for each of the three years. Since I am now in the third year of my computer science Ph.D., I attended the third-year sessions, which I'm going to talk about in the rest of the blog post. The workshop included a mix of formal presentations and informal discussions.
In the first day's afternoon, there was a Poster Session for participants to talk about their research. I presented a poster entitled "Access Patterns for Robots and Humans in Web Archives". The poster contains an analysis of the user access patterns of web archives using the Internet Archive's Wayback Machine access logs, the details of which will be published at JCDL 2013.
After registration, breakfast was provided, followed by the welcome session by Lori Clarke (University of Massachusetts-Amherst), Sandhya Dwarkadas (University of Rochester), and Lori Pollock (University of Delaware). They started with an overview of the CRA-W Grad Cohort Workshop, which is held annually, emphasizing its goals. The main goals of the workshop are to learn the process of doing research, to gain insights into career paths, and to communicate and connect with people. The workshop has three tracks corresponding to the students' first, second, and third years of graduate school in computer science and engineering. They introduced the speakers and thanked the sponsors of the 2013 CRA-W Grad Cohort Program.
The first session, which was titled "Preparing Your Thesis Proposal and Becoming a Ph.D. Candidate", was given by Julia Hirschberg (Columbia University). Julia opened with a nice photo of her cute cats, then began her presentation with advice I admired: "It's fine to have a family and successful career". She emphasized the reasons for writing a Ph.D. proposal and described the proposal process from the brainstorming and planning stage to getting the proposal to the committee. Julia also spoke about presenting the proposal and gave some tips on it. She gave hints for handling hard questions and stressed how important it is to be prepared for them. At the end, I gained good experience from the open discussions about other schools' processes for Ph.D. proposals.
In the second session, since I have to specify a topic for my proposal in the next couple of months, I attended the "Finding a Research Topic" session with the second-year Grad Cohort. The session was given by Carla Brodley (Tufts University). Carla discussed strategies for choosing a topic and addressed how this choice may affect our career plans (e.g., academic research or industrial research). She explained the thesis equation (Advisor + Topic = Dissertation). Before she ended the discussion, she contrasted "what is Ph.D. Research" against "what is not Ph.D. Research". After the session, it was time for lunch and networking with grad students, grouped by similar areas of research. To facilitate this, each table was labeled with a different field of computer science and engineering.
After the lunch, the third session titled "Ph.D. Academic Career Paths: Research, Teaching, and Administration" focused on the different career paths in academia. The session was given by Mary Lou Soffa (University of Virginia), Erin Solovey (MIT), and Tiffani Williams (Texas A&M University). The speakers started with a bio about themselves explaining how they got their jobs. They spoke of their roles of research, teaching, and service, and how they differ in different academic institutions. Furthermore, they discussed the challenges of research, teaching, and service, and the required skills for success in each one.
After the second break, Hillery Hunter (IBM), Kathryn McKinley (Microsoft Research), and Amanda Stent (AT&T) spoke of different personal stories to all Grad Cohorts in a session titled "Strategies for Human-Human Interaction". They focused on strategies for interaction with colleagues and the challenges of being a woman in a computing technology career. The stories were about uncomfortable situations that may arise and how to react. Upon the session's completion, they held a panel session to answer the audience's questions.
After the closing of the 4th session, the conference held a poster session. As mentioned above, I presented my poster titled "Access Patterns for Robots and Humans in Web Archives". Along with my poster, 83 other attendees presented posters from a wide range of computing technology fields. At the end of the day, we had a great time at the reception, which was hosted by Microsoft Research and Google.
Day Two:
Day two started off with breakfast, then a session titled "Ph.D. Non-Academic Career Paths: Industrial Research and Development" by Maryam Kamvar Garrett (Google), Suju Rajan (Yahoo!), and Amanda Stent (AT&T). The focus of the session was on the different career paths and job opportunities in industry for Ph.D. students. They spoke of the challenges, skills, and experiences needed for success in industry careers. They gave tips to help grad students choose a job (e.g., the importance of knowing the number of hours each job demands and whether that suits your life).
After a short break, Natalie Enright Jerger (University of Toronto), Hillery Hunter (IBM), and Erin Solovey (MIT) presented the second session, "Ph.D. Job Search". Each of the speakers gave us an idea of their job timeline and how they got their current jobs, describing the obstacles they faced and how they overcame them. They raised the question "Why should we search for a job now?" and then said you have to remember three things:
You are intelligent
You are an expert in something
You can do it.
They gave tips for getting jobs in academia, such as active collaboration by attending conferences and communicating with people. They spoke of how to prepare for a job search: preparing the application, preparing for the interview, what to do after the interview, and deciding between different offers. They highlighted how important it is to have an external expert in your field who will also have time to write reference letters for you when you apply for a job.
The last session of the second day was "Balancing Graduate School and Personal Life" by A.J. Brush (Microsoft Research). The session addressed different strategies for balancing TA duties, course work, the research program, and personal life. She said that it is important to know when to stop and take a break, and that doing so helps you resume your work with a fresh mind.
At the end, "Wrap-Up and Final Remarks" were given by Lori Clarke, Sandhya Dwarkadas, and Lori Pollock. They thanked the speakers and the sponsors of the 2013 CRA-W Grad Cohort Program. Furthermore, they mentioned that they will participate in Grace Hopper Celebration of Women in Computing (GHC), the largest conference for technical women in the world.
It was great to meet all of these senior women in the computing technology field, in addition to networking with graduate students and exchanging experiences with them about our research areas. At the end of the second day, our friends from MIT took us on a tour of the Computer Science building.