2013-05-13: Temporal Web Workshop 2013 Trip Report
On May 13, Hany SalahEldeen and I attended the third Temporal Web Analytics Workshop (TempWeb 2013), co-located with WWW 2013 in Rio de Janeiro, Brazil.
Marc Spaniol, from the Max Planck Institute for Informatics, Germany, welcomed the audience in his opening remarks. He emphasized the workshop's goal of building a community of interest around the temporal web.
Omar Alonso, from Microsoft Silicon Valley, gave the keynote, entitled “Stuff happens continuously: exploring Web contents with temporal information”. Omar divided his presentation into three parts: time in document collections, social data, and exploring the web using time.
In the first part, time in document collections, Omar introduced the temporal dimension of documents, starting from the question “What is Time?”. Time may be represented in a normalized format or in a hierarchical format. Temporal expressions come in four types: times; durations; sets, which may be explicit (e.g., May 2, 2012) or implicit (e.g., Labor Day); and relative expressions. They can be extracted with approaches such as temporal taggers and Named Entity Recognition (NER) for time, and expressed in the TimeML format. Omar explained that people care about temporality because it describes their landmarks and evolution, such as winning a game for a soccer player or financial quarters for an accountant. By bringing the crowd into the picture, we can get a complete annotated calendar for free by combining all the hot topics of the year.
Then, Omar explained the effect of social media on the concepts of “Temporal and Document”:
- Twitter limits the document to 140 characters. Time in Twitter is supported by: trending topics, e.g., Mother's Day; hashtags, e.g., #tempweb2013; cashtags, which are hashtags that start with $ for financial information (e.g., $apple); and group chats, where people tweet at a specific time to discuss a specific topic.
- Time in Facebook is expressed through the Timeline, photos over time, and generic events.
- Temporally-aware signals. User interests may be time-sensitive, for example tweeting about recent, seasonal, or ongoing activities.
- Community Question Answering (CQA) also has a temporal dimension. CQA helps users answer questions that they can't answer using web search engines. Some answers don't change over time (e.g., what is the distance to the moon?); others are time-sensitive.
- Reddit, a sharing platform popular in the US, also has a time dimension. Reddit is popular enough to attract famous people to communicate with the crowd.
- Review systems such as Amazon, Yelp, and Foursquare have temporal characteristics, as a review may change over time.
- Time in Wikipedia is tracked through the evolution of edits by users.
While this approach works well for popular events, it needs modification to look back at less popular events. Combining different data sources introduces new research questions: how to manage duplicates and near-duplicates, what the temporal precedence between them is, how to rank results by temporal value, and how to evaluate the success of these techniques.
Session 1: Web Archiving
Daniel Gomes, from the Portuguese Web Archive, gave a presentation entitled “A Survey of Web Archive Search Architectures”. Daniel gave an overview of the current search paradigms in web archives. The use cases showed that users demand Google-like search from web archives. The survey found that, of the web archives studied, 89% offer URL search, 79% offer metadata search, and 67% offer full-text search. These numbers were computed from publications about the web archives and from the authors' experience.
Then, Daniel gave three examples of search systems in web archives. The Portuguese Web Archive provides full-text search over 1.2B documents using NutchWAX; its index is partitioned by time and by document. EverLast is a P2P architecture in which the tasks (crawling, versioning, and indexing) are distributed among different nodes. The Wayback Machine is a URL search architecture that uses flat sorted files (called CDX) to index web pages. Daniel proposed a single portal for searching across web archives. The challenge for the new system is that web archive data is spread across different systems and technologies. A prototype was tested on the Portuguese Web Archive and showed good results. The new system requires a new mechanism for ranking search results from different sources, in addition to a user interface that combines these sources.
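To illustrate why a flat sorted file is enough for URL search, here is a minimal sketch of a lookup by binary search. The two-field line layout (`urlkey timestamp`) and the sample entries are simplifying assumptions for illustration; real CDX files carry more fields, and this is not the Wayback Machine's actual code.

```python
from bisect import bisect_left

def lookup(cdx_lines, urlkey):
    """Return all capture timestamps for urlkey from sorted index lines.

    Assumes each line is "<urlkey> <timestamp>" and the list is sorted,
    which is a simplified, hypothetical stand-in for a real CDX file.
    """
    prefix = urlkey + " "
    # Binary search for the first line that could start with the prefix.
    i = bisect_left(cdx_lines, prefix)
    timestamps = []
    # All captures of the same URL are adjacent because the file is sorted.
    while i < len(cdx_lines) and cdx_lines[i].startswith(prefix):
        timestamps.append(cdx_lines[i].split(" ")[1])
        i += 1
    return timestamps

# Hypothetical index entries, sorted as a flat file would be.
index = sorted([
    "example.com/ 20070101000000",
    "example.com/ 20100515120000",
    "example.com/about 20090303030303",
    "example.org/ 20080808080808",
])

print(lookup(index, "example.com/"))
```

Because the lines are sorted, finding a URL costs a logarithmic number of comparisons plus a short scan over its captures, with no database required.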
The next presentation in the session was mine, entitled “Archival HTTP Redirection Retrieval Policies”. In this presentation, we studied URI lookup in web archives, taking into consideration the HTTP redirection status of the live or archived URI. We proposed two new measurements: Stability, which captures the change of the URI's status and location through time; and Reliability, the percentage of mementos in a TimeMap that end with a 200 HTTP status. Finally, we proposed two new retrieval policies.
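As a rough illustration of the Reliability measurement described above, the sketch below computes the share of mementos in one TimeMap whose final HTTP status is 200. The function name, the input shape (a list of final status codes, one per memento), and the choice to return 0.0 for an empty TimeMap are assumptions made for this example; the paper gives the exact definition.

```python
def reliability(final_statuses):
    """Fraction of mementos whose final HTTP status is 200.

    final_statuses: one status code per memento in a TimeMap, taken
    after following any redirects (an assumption of this sketch).
    """
    if not final_statuses:
        return 0.0  # assumed convention for an empty TimeMap
    ok = sum(1 for status in final_statuses if status == 200)
    return ok / len(final_statuses)

# Hypothetical TimeMap with five mementos, three of which end in 200.
timemap_statuses = [200, 200, 404, 200, 503]
print(reliability(timemap_statuses))
```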
Daniel Gomes gave another presentation entitled “Creating a Billion-Scale Searchable Web Archive”. He shared their experience in building the Portuguese Web Archive. First, they integrated data from three collections, some of which were stored on CDs. They built tools to convert the saved web files to the ARC format. Then, the Portuguese Web Archive started its own live web crawl in 2007, focusing on Portuguese-speaking domains except .br. They built a Heritrix add-on, called DeDuplicator, that saves 41% of disk space on weekly crawls and 7% on daily ones, with a total of 26.5 TB/year. The Portuguese Web Archive now offers full-text search, internationalization support, and a new graphical design.
Session 2: Identifying and leveraging time information
Julia Kiseleva, from Emory University, presented “Predicting temporal hidden contexts in web sessions”. In her presentation, Julia analyzed web logs as sets of user actions. She aimed to find contexts that help build more accurate local models. Julia built a user navigation graph and used two partitioning mechanisms: horizontal partitioning based on context (e.g., geographical position) and vertical partitioning based on the action alphabet (e.g., ready to buy or just browsing). Julia used http://www.mastersportal.eu/ in her experiment. She also suggested using the sitemap to define the set of applicable steps.
Hany SalahEldeen, from Old Dominion University, presented “Carbon Dating The Web: Estimating the Age of Web Resources”. Hany estimated the creation date of URIs based on different sources, such as the crawl date from Google, the first observation in the web archives, or the first mention in social media.
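One simple way to combine such sources is to take the earliest date any of them reports. The sketch below does that; note that "earliest wins" is an assumption of this example rather than a description of Hany's actual method, and the source names and dates are hypothetical.

```python
from datetime import datetime

def estimate_creation_date(observations):
    """Return (source, date) for the earliest known observation, or None.

    observations maps a source name to the first date that source saw
    the URI, or None if the source has no record. Picking the earliest
    date is an assumed combination rule for this sketch.
    """
    known = {src: d for src, d in observations.items() if d is not None}
    if not known:
        return None
    source = min(known, key=known.get)
    return source, known[source]

# Hypothetical first-seen dates for a single URI.
observations = {
    "search_engine_crawl": datetime(2012, 3, 14),
    "web_archive_first_memento": datetime(2011, 11, 2),
    "social_media_first_mention": None,
}

print(estimate_creation_date(observations))
```

The estimate is an upper bound on the true creation date: the resource may have existed before any of the sources noticed it.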
Omar Alonso presented “Timelines as Summaries of Popular Scheduled Events”. Omar built a framework with minute-level granularity to compare the events during a game with the social media reactions. Omar gave some examples from the World Cup and the tweets about the games. The results showed a strong relationship between game events and user activity on Twitter.
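Minute-level granularity can be pictured as bucketing tweet timestamps by minute, so that spikes can be lined up against the minute in which a game event happened. The sketch below shows only that bucketing step; the function name and the sample timestamps are hypothetical, and it is not Omar's framework.

```python
from collections import Counter
from datetime import datetime

def tweets_per_minute(timestamps):
    """Count tweets per minute by truncating each timestamp to the minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)

# Hypothetical tweet timestamps around a game event.
tweets = [
    datetime(2010, 7, 11, 20, 30, 5),
    datetime(2010, 7, 11, 20, 30, 40),
    datetime(2010, 7, 11, 20, 31, 10),
]

counts = tweets_per_minute(tweets)
for minute, n in sorted(counts.items()):
    print(minute.strftime("%H:%M"), n)
```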
Session 3: Studies & Experience Sharing
Lucas Miranda presented “Characterizing Video Access Patterns in Mainstream Media Portals”. Lucas studied video access patterns on major Brazilian media providers and showed some figures that summarized the results.
Hideo Joho, from the University of Tsukuba, presented “A Survey of Temporal Web Search Experience”. Hideo studied the temporal aspects of web search by surveying 110 people, who answered 18 questions related to their recent search experience. Hideo presented quantitative and qualitative analyses of his results.
The LAWA project wrote a TempWeb 2013 Roundup report.
----
Ahmed AlSum