Monday, July 22, 2013

2013-07-15: Temporal Intention Relevancy Model (TIRM) Data Set

On the third anniversary of the Haiti earthquake, President Barack Obama held a press conference and discussed the need to keep helping the Haitian community and to invest more in rebuilding the economy. A user who was watching the press conference tweeted about it on the 14th of January and provided a link to the streamed news. A couple of days later, I read this tweet and clicked on the link. Instead of seeing anything related to the press conference, Haiti, or President Obama, I got a stream feed of the Mercedes-Benz Superdome in New Orleans in preparation for the 2013 Super Bowl. It is worth mentioning that at the time of writing this blog post the tweet had actually been deleted, which illustrates that social posts do not always persist over time, as we discussed in our earlier post.

This scenario illustrates the problem we are trying to detect, model, and solve: the inconsistency between what is intended at the time of sharing and what the reader sees at the time of clicking the link in the tweet.
It is evident that resources change, relocate, or even disappear. In some cases this is tolerable, but in others it is not, especially when the shared content is significantly important (e.g., related to a revolution, a protest, or corruption claims).

From these observations we decided to perform experiments to detect and model this "user intention" of the author at the time of tweeting and to measure how accurately it is perceived by the reader at any point in time. In our JCDL 2013 paper, we concluded that the problem of intention is not straightforward, and that modeling it correctly requires mapping the intention problem onto a relevancy and change problem.

Initially, we used Amazon's Mechanical Turk in a direct manner to collect data from workers about intention. Unfortunately, this approach produced very low inter-rater agreement.
After a closer look at the most popular tasks on Mechanical Turk, we found that categorization and classification problems are the most prominent. The questions asked of the workers in such tasks are simpler and require far less explanation.

We introduce the Temporal Intention Relevancy Model, or TIRM, to illustrate the mapping between intention and relevancy. Let's consider a tweet from Pfizer. The tweet has a link that leads to a newsletter page, which is updated with the latest announcements of the company.

At any point in time this page is still relevant to the tweet, thus we can deduce that the intention behind posting this tweet is for the reader to see whatever the current state of the page is. In other words, if the page has changed from its initial state at the time of tweeting and it is still relevant, we can assume the intention is: current state.

Similarly, we notice a different pattern upon inspecting a tweet posted on the day Michael Jackson died and linking to CNN.com. The front page of CNN.com has definitely changed since the time of the tweet, and the content is no longer relevant to the tweet.
Thus, the author's intention was for the reader to see the state of the page at the time of tweeting. In conclusion, if the page has changed and is no longer relevant to the tweet, we can assume that the author's intention is: past state of the resource. In that case, we retrieve that state from the web archives.

In a large number of social posts the resource remains unchanged and is still relevant to the post. In this case we assume that the intended state is the state of the resource at the point in time when the author published the post, but since the resource is unchanged, the current version will do as well.
Finally, when the resource has changed and has never been related to the post, we do not have enough information to decide which intention the author wanted to convey. This scenario happens often in spam posts.
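
The four cases above can be read as a small decision table. The following is only a minimal sketch of that mapping, not the classifier from the paper: the names `Intention`, `infer_intention`, `changed`, `relevant_now`, and `ever_relevant` are hypothetical, and the boolean inputs stand in for change and relevancy judgments that in practice have to be estimated.

```python
from enum import Enum


class Intention(Enum):
    """Hypothetical labels for the outcomes described in the post."""
    CURRENT_STATE = "current state"
    PAST_STATE = "past state"
    EITHER_STATE = "past state, but the current version will do"
    UNDETERMINED = "not enough information"


def infer_intention(changed: bool, relevant_now: bool,
                    ever_relevant: bool = True) -> Intention:
    """Map change and relevancy observations to an assumed intention.

    changed       -- the resource differs from its state at tweet time
    relevant_now  -- the resource's current content is relevant to the post
    ever_relevant -- the resource was relevant to the post at some point
                     (assumed True unless stated otherwise)
    """
    if not ever_relevant:
        # Never related to the post (often spam): cannot decide.
        return Intention.UNDETERMINED
    if relevant_now:
        # Changed but still relevant: the reader should see the current page.
        # Unchanged and relevant: past and current versions coincide.
        return Intention.CURRENT_STATE if changed else Intention.EITHER_STATE
    if changed:
        # Changed and no longer relevant: the archived past state is needed.
        return Intention.PAST_STATE
    # Unchanged yet not relevant is not covered by the post; this sketch
    # treats it as undetermined (an assumption).
    return Intention.UNDETERMINED


if __name__ == "__main__":
    print(infer_intention(changed=True, relevant_now=True).value)   # Pfizer-style newsletter link
    print(infer_intention(changed=True, relevant_now=False).value)  # CNN.com front page link
```

In this reading, the Pfizer newsletter tweet falls into the first branch and the CNN.com tweet into the past-state branch.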


We use Mechanical Turk to collect the training data for our model along with multiple features related to the social post, such as its nature, archivability, social presence, and the resource’s content.


From the resulting dataset we extracted 39 different textual and semantic features that were used to train a classifier implementing the TIRM. We argue that this gold standard dataset will pave the way for future studies based on temporal intention. Currently, we are extending the experiments and refining the features used.

For further details, please refer to the paper:

Hany M. SalahEldeen, Michael L. Nelson. Reading the Correct History? Modeling Temporal Intention in Resource Sharing. Proceedings of the Joint Conference on Digital Libraries JCDL 2013, Indianapolis, Indiana. 2013, also available as a technical report http://arxiv.org/abs/1307.4063

- Hany SalahEldeen
