2018-12-03: Using Wikipedia to build a corpus, classify text, and more

Wikipedia is an online encyclopedia, available in 301 languages and constantly updated by volunteers. It is not only an encyclopedia: it has also been used as an ontology to build corpora, classify entities, cluster documents, annotate documents, recommend documents to users, and more. Below, I review some of the significant publications in these areas.
Using Wikipedia as a corpus:
Wikipedia has been used to create corpora that can be used for text classification or annotation. In “Named entity corpus construction using Wikipedia and DBpedia ontology” (LREC 2014), YoungGyum Hahm et al. created a method that uses Wikipedia, DBpedia, and SPARQL queries to generate a named entity corpus. The method described in this paper can be applied to any language.
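To make the idea concrete, here is a minimal sketch of pulling (entity, label) pairs out of DBpedia with a SPARQL query, which could seed such a corpus. The endpoint, the dbo:Person class, and the query itself are my own illustrative choices, not the exact queries used by Hahm et al.

```python
# A minimal sketch of harvesting named-entity seed data from DBpedia via SPARQL.
# The query and the entity class (dbo:Person) are illustrative choices,
# not the exact setup used in the paper.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbo:  <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?entity ?label WHERE {
        ?entity a dbo:Person ;
                rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 100
""")
endpoint.setReturnFormat(JSON)

results = endpoint.query().convert()
# Each row pairs a DBpedia resource URI with its English label; matching the
# labels back into Wikipedia text would yield annotated named-entity examples.
for row in results["results"]["bindings"]:
    print(row["entity"]["value"], "->", row["label"]["value"])
```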
Fabian Suchanek et al. used Wikipedia, WordNet, and Geonames to create an ontology called YAGO, which contains over 1.7 million entities and 15 million facts. The paper “YAGO: A large ontology from Wikipedia and WordNet” (Web Semantics 2008) describes how this dataset was created.
Using Wikipedia to classify entities:
In the paper “Entity extraction, linking, classification, and tagging for social media: a Wikipedia-based approach” (VLDB Endowment 2013), Abhishek Gattani et al. created a method that accepts text from social media, such as Twitter, extracts important entities, matches each entity to a Wikipedia link, filters and classifies the text, and then creates tags for it. The underlying data source is called a knowledge base (KB); here Wikipedia serves as the KB, and its graph structure is converted into a taxonomy. For example, given the tweet “Obama just gave a speech in Hawaii”, entity extraction selects the two tokens “Obama” and “Hawaii”. Each token is then paired with a Wikipedia link, (Obama, en.wikipedia.org/wiki/Barack_Obama) and (Hawaii, en.wikipedia.org/wiki/Hawaii); this step is called entity linking. Finally, the classification and tags of the tweet are set to “US politics, President Obama, travel, Hawaii, vacation”, which is referred to as social tagging. The full process from tweet to tags takes ten steps, listed below (a simplified code sketch of the first two steps follows the list). The overall architecture is shown in Figure 1.
[Figure 1: Overall architecture of the entity extraction, linking, classification, and tagging system]
  1. Preprocess: detect the language (English), and select nouns and noun phrases
  2. Extract (string, Wikipedia link) pairs: text in the tweet is matched to Wikipedia articles, and each resulting (string, Wikipedia link) pair is called a mention
  3. Filter and score mentions: remove certain pairs and score the rest
  4. Classify and tag tweet: use mentions to classify and tag the tweet
  5. Extract mention features
  6. Filter mentions
  7. Disambiguate: select between topics, e.g. does “apple” refer to the fruit or the technology company?
  8. Score mentions
  9. Classify and tag tweet: use mentions to classify and tag the tweet
  10. Apply editorial rules
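Here is a simplified sketch of steps 1 and 2 only (mention extraction and linking), assuming a naive capitalization heuristic in place of real noun-phrase detection and the public MediaWiki search API for linking; neither is the paper's actual method.

```python
# A simplified sketch of steps 1-2: find candidate mentions in a tweet and pair
# each with a Wikipedia link. The capitalization heuristic and the MediaWiki
# search API are illustrative simplifications, not the paper's method.
import re
import requests

def extract_candidates(tweet):
    # Naive stand-in for noun/noun-phrase detection: capitalized tokens.
    return re.findall(r"\b[A-Z][a-zA-Z]+\b", tweet)

def link_to_wikipedia(term):
    # Query the MediaWiki search API and take the top-ranked article title.
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search",
                "srsearch": term, "format": "json"},
        timeout=10,
    )
    hits = resp.json()["query"]["search"]
    if not hits:
        return None
    title = hits[0]["title"].replace(" ", "_")
    return f"https://en.wikipedia.org/wiki/{title}"

tweet = "Obama just gave a speech in Hawaii"
# Each (string, Wikipedia link) pair is a "mention" in the paper's terminology.
mentions = [(t, link_to_wikipedia(t)) for t in extract_candidates(tweet)]
print(mentions)
```

The remaining steps (filtering, scoring, disambiguation, and editorial rules) would operate on this list of mentions.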
The dataset used in this paper was described in “Building, maintaining, and using knowledge bases: a report from the trenches” (SIGMOD 2013) by Omkar Deshpande et al. In addition to Wikipedia, web and social context were used to tag tweets more accurately. After collecting tweets, they gather web context for each tweet: if the tweet includes a link, its content, title, and other information are extracted. Entity extraction is then performed, followed by linking, classification, and tagging. Next, the tagged tweet is used to build a social context of the user, hashtags, and web domains. This information is saved and used for new tweets that need to be tagged. They also gathered web and social context for each node in the KB and saved it for future use.
Abhik Jana et al. added Wikipedia links to the keywords in scientific abstracts in “WikiM: Metapaths Based Wikification of Scientific Abstracts” (JCDL 2017). This method helps readers decide whether they are interested in reading the full article. The first step is to detect important keywords in the abstract, which they call mentions, using tf-idf. Then a list of candidate Wikipedia links, which they call candidate entries, is selected for each mention. The candidate entries are ranked based on similarity, and the single candidate entry with the highest similarity score is selected for each mention.
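A rough sketch of the first stage (scoring abstract terms by tf-idf and keeping the top-weighted ones as mentions) is shown below; the toy corpus, the top-5 cutoff, and scikit-learn are my own assumptions, and the paper's metapath-based candidate ranking is not reproduced.

```python
# A rough approximation of WikiM's first stage: score abstract terms by tf-idf
# and keep the top-weighted ones as candidate mentions. The toy corpus and the
# top-5 cutoff are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer

abstracts = [
    "We propose a wikification method for scientific abstracts using metapaths.",
    "Deep neural networks improve image classification accuracy.",
    "Graph embeddings capture structural similarity between entities.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(abstracts)

# Rank the terms of the first abstract by their tf-idf weight.
terms = vectorizer.get_feature_names_out()
weights = tfidf[0].toarray().ravel()
top_mentions = sorted(zip(terms, weights), key=lambda x: -x[1])[:5]
print([t for t, w in top_mentions if w > 0])
```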
Using Wikipedia to cluster documents:
Xiaohua Hu et al. used Wikipedia to cluster documents in “Exploiting Wikipedia as External Knowledge for Document Clustering” (KDD 2009). In this work, documents are enriched with Wikipedia concepts and category information: both exact concept matches and related concepts are included, and similar documents are then grouped based on the document content together with the added Wikipedia concept and category information. This method was evaluated on three datasets: TDT2, LA Times, and 20-newsgroups. Several clustering schemes were compared:
  1. Clustering based on word vector
  2. Clustering based on concept vector
  3. Clustering based on category vector
  4. Clustering based on the combination of word vector and concept vector
  5. Clustering based on the combination of word vector and category vector
  6. Clustering based on the combination of concept vector and category vector
  7. Clustering based on the combination of word vector, concept vector, and category vector
They found that on all three datasets, clustering based on word and category vectors (method #5) and clustering based on word, concept, and category vectors (method #7) consistently produced the best results.
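As a hedged sketch of the best-performing setup (method #5: word vectors combined with category vectors), one could concatenate the two representations and cluster with k-means; the toy documents, the assumed category annotations, the equal-weight concatenation, and k-means itself are my assumptions rather than the paper's exact formulation.

```python
# A sketch of method #5: cluster documents on a combination of word vectors and
# Wikipedia-category vectors. The toy data, the equal-weight concatenation, and
# the choice of k-means are illustrative assumptions, not the paper's setup.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The senate passed a new budget bill today.",
    "The president vetoed the spending legislation.",
    "The team won the championship game last night.",
    "The striker scored twice in the final match.",
]
# Wikipedia categories assumed to have been attached to each document upstream.
doc_categories = [
    "Politics Legislatures Budgets",
    "Politics Presidents Legislation",
    "Sports Championships Teams",
    "Sports Association_football Players",
]

word_vecs = TfidfVectorizer().fit_transform(docs).toarray()
cat_vecs = TfidfVectorizer().fit_transform(doc_categories).toarray()

# Concatenate the two representations with equal weight and cluster.
combined = np.hstack([word_vecs, cat_vecs])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(combined)
print(labels)  # the two politics documents and the two sports documents should group together
```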
Using Wikipedia to annotate documents:
Wikipedia has been used to annotate documents, as in the paper “Wikipedia as an Ontology for Describing Documents” (ICWSM 2008) by Zareen Saba Syed et al. Wikipedia text and links were used to identify topics related to terms in a given document. Three methods were tested: using the article text alone, the article text and categories with spreading activation, and the article text and links with spreading activation. However, the accuracy of the approach depends on factors such as whether a Wikipedia page links to non-relevant articles, the presence of links between related concepts, and the extent to which a concept appears in Wikipedia.
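Spreading activation over the Wikipedia link graph can be illustrated with a small hand-built graph: activation starts at the concepts found in a document and propagates along links with a decay factor. The graph, decay value, and iteration count below are illustrative assumptions, not the configuration used in the paper.

```python
# A toy illustration of spreading activation over a hand-built Wikipedia link
# graph. Activation starts at the concepts spotted in a document and spreads to
# linked articles, decaying at each hop; highly activated nodes become topics.
from collections import defaultdict

links = {
    "Barack_Obama": ["President_of_the_United_States", "Hawaii", "Democratic_Party"],
    "Hawaii": ["United_States", "Honolulu"],
    "President_of_the_United_States": ["United_States", "White_House"],
}

def spread_activation(seeds, links, decay=0.5, iterations=2):
    activation = defaultdict(float)
    for s in seeds:
        activation[s] = 1.0
    for _ in range(iterations):
        new_activation = defaultdict(float, activation)
        for node, weight in activation.items():
            for neighbor in links.get(node, []):
                new_activation[neighbor] += weight * decay
        activation = new_activation
    return sorted(activation.items(), key=lambda x: -x[1])

# Concepts found in the document act as seeds.
print(spread_activation(["Barack_Obama", "Hawaii"], links))
```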
Using Wikipedia to create recommendations:
Wiki-Rec uses Wikipedia to create semantically based recommendations. This technique is discussed in the paper “Wiki-Rec: A semantic-based recommendation system using Wikipedia as an ontology” (ISDA 2010) by Ahmed Elgohary et al. The approach predicts terms common to a set of documents. In this work, the user reads and evaluates a document. Using Wikipedia, all the concepts in the document are then annotated and stored, and the user's profile is updated with this new information. By matching the user's profile against the profiles of other users with similar interests, a list of recommended documents is presented to the user. The overall system model is shown in Figure 2.
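The profile-matching step could be sketched as cosine similarity between users' concept-frequency profiles; the toy profiles and the plain cosine measure below are my own assumptions rather than Wiki-Rec's exact matching algorithm.

```python
# A sketch of matching user profiles built from Wikipedia concept annotations:
# each profile counts the concepts in documents the user rated positively, and
# users are compared by cosine similarity. The profiles and the plain cosine
# measure are illustrative assumptions, not Wiki-Rec's exact method.
import math

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[c] * b[c] for c in shared)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Concept -> count profiles, assumed to be built from annotated documents.
user_a = {"Machine_learning": 3, "Wikipedia": 2, "Ontology_(information_science)": 1}
user_b = {"Machine_learning": 2, "Recommender_system": 2, "Wikipedia": 1}
user_c = {"Association_football": 4, "FIFA_World_Cup": 2}

# Documents liked by the most similar user would be recommended.
print(cosine(user_a, user_b), cosine(user_a, user_c))
```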
Using Wikipedia to match ontologies:
Other work, such as “WikiMatch - Using Wikipedia for Ontology Matching” (OM 2012) by Sven Hertling and Heiko Paulheim, used Wikipedia to determine whether two ontologies are similar, even if they are in different languages. In this work, the Wikipedia search engine is used to retrieve articles related to a term, and for each article all of its language links are retrieved. Two concepts are then compared by comparing the titles of their retrieved articles. However, this approach is time-consuming because of the many queries sent to Wikipedia.
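The core idea, comparing two ontology labels by the overlap of the Wikipedia articles retrieved for them, could be sketched as follows; the MediaWiki search API call and the Jaccard measure over result titles are my simplification, not necessarily the paper's exact procedure.

```python
# A simplified sketch of the WikiMatch idea: two ontology labels are considered
# similar when the Wikipedia articles retrieved for them overlap. The MediaWiki
# search API call and the Jaccard overlap of result titles are my own
# simplifications of the approach.
import requests

def wiki_titles(term, limit=10):
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "list": "search", "srsearch": term,
                "srlimit": limit, "format": "json"},
        timeout=10,
    )
    return {hit["title"] for hit in resp.json()["query"]["search"]}

def label_similarity(label_a, label_b):
    a, b = wiki_titles(label_a), wiki_titles(label_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Labels from two ontologies; following Wikipedia's interlanguage links would
# extend this to the cross-lingual case (not shown here).
print(label_similarity("automobile", "car"))
```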
In conclusion, Wikipedia is not only an information source; it has also been used to build corpora, classify entities, cluster documents, annotate documents, create recommendations, and match ontologies.
-Lulwah M. Alkwai
