2023-08-10: Collecting 1.5 Million Multilingual News Stories' URLs

The significance of multilingual datasets for Cross-Language Information Retrieval (CLIR) research cannot be overstated. The lack of publicly available, annotated, multilingual datasets prompted this effort. Moreover, Arabic is one of the languages for which annotated datasets are not widely available. In this post, we present a dataset of Arabic and English news URLs collected from four major news outlets geared towards Arabs, English-speaking Arabs, Arabic speakers around the world, and English speakers interested in Arabic narratives of world news: Aljazeera Arabic, Alarabiya, Aljazeera English, and Arab News. Such a collection can be used to conduct research on web archiving, news similarity, machine translation (MT), and Arabic Named Entity Recognition and Classification (NERC), and to perform systematic comparisons of different approaches.

We examined multiple ways to collect news stories' URLs from multiple news outlets in Arabic and English for CLIR research. After settling on what we found to be the best method, we collected 1.5 million news stories' URLs from four different news outlets and examined link rot on each website. Furthermore, we developed a method to eliminate links that do not lead to a news story. There are other promising ways to collect news stories that we have not tried yet, including web archives and web crawling tools.

Collection methods we tried: 

Below, I discuss the pros and cons of each method based on my experience collecting the dataset.

1. RSS/Atom: Monitoring RSS/Atom feeds and collecting URLs from them

This method requires a simple script that monitors RSS/Atom feeds from news websites that offer them and saves the URLs in a database or on the file system.
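A minimal sketch of such a monitor, assuming the feedparser library and an example feed URL (the polling interval and storage are deliberately simplified):

    # A minimal sketch of an RSS/Atom monitor; the feed URL, interval, and output file are examples.
    import time
    import feedparser  # pip install feedparser

    FEED_URL = "https://www.aljazeera.com/xml/rss/all.xml"  # example feed
    seen = set()

    while True:
        feed = feedparser.parse(FEED_URL)
        for entry in feed.entries:
            if entry.link not in seen:
                seen.add(entry.link)
                with open("urls.txt", "a") as f:
                    f.write(entry.link + "\n")
        time.sleep(60)  # poll every minute for active websites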

Limitations: 

a. RSS/Atom feeds generally do not distribute old news stories, so this method cannot collect historical data (stories published in the past). While some Atom feeds allow retrieving historical entries, it is up to the news website to make them available; aljazeera.com and arabnews.com, for example, do not. This method requires a service/script that monitors RSS feeds (every minute for active news websites) around the clock and for a very long time (tens of years for less active websites) to collect enough URLs to make a useful dataset. The method only collects data from the time the service starts going forward; it cannot collect data from the past unless archived copies of the RSS pages exist in public web archives.

b. RSS is not supported by all news websites, and in some cases the RSS feed is abandoned or not functional. For example, most RSS feeds on BBC Arabic, including the Middle East feed, have not worked for about 10 years; a list of all RSS feeds from BBC Arabic is available on their site.

c. Not all stories are included in the RSS feed. While the most popular news stories usually make it into the feed when the website offers one, less important stories are often left out.

d. Multiple RSS feeds need to be monitored for each major news website, since most large news outlets publish several feeds (one for world news, one for politics, one for sports, and so on).

2. Twitter: Extracting links from tweets

Limitations: 

In addition to the limitations of Twitter's API, since Elon Musk bought Twitter, the platform ended free access to its API and launched paid plans for using it. I calculated the cost and time needed to collect a dataset comparable in size to the dataset I have collected. Assuming Twitter's data is complete and that every tweet collected has a link to a news story with no duplicates, the dataset would have cost $15,000 and would have taken 13 years to collect using their Basic plan, or $25,000 and 5 months using the Pro plan. I was able to collect the dataset in less than 4 months free of charge. Furthermore, there are limitations specific to news channels' accounts, including:

a. While it is increasingly common for news websites to tweet their most recent stories along with a link to the story, not every story gets a tweet.

b. Multiple Twitter accounts have to be monitored for each news channel, since most large news channels have several official accounts (one for live coverage, one for world news, one for sports, and so on). A sketch of how links could be extracted from tweets through the API follows this list.
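For completeness, here is a rough sketch of extracting story links from a news account's tweets, assuming a paid API plan, the tweepy library, and an example account handle:

    # A sketch of extracting expanded URLs from a news account's recent tweets.
    # BEARER_TOKEN requires a paid API plan; the handle is only an example, and
    # recent search only covers roughly the last 7 days.
    import tweepy  # pip install tweepy

    client = tweepy.Client(bearer_token="BEARER_TOKEN")
    resp = client.search_recent_tweets(
        query="from:AJEnglish -is:retweet",
        tweet_fields=["entities"],
        max_results=100,
    )
    links = set()
    for tweet in resp.data or []:
        for url in (tweet.entities or {}).get("urls", []):
            links.add(url["expanded_url"])
    print(len(links), "links collected")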

3. Web Crawling/Scraping

Limitations: 

a. A dedicated web scraper must be written for each website for this method to work effectively.

b. This method is prone to breaking when a website's structure changes. In such cases, the scraper must be updated to accommodate the changes to the news outlet's website.

c. It takes a long time to collect the data, and the chance of being blocked with Too Many Requests (429) or Service Unavailable (503) responses is high (one way to back off when this happens is sketched below).
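As illustration of the backoff issue, here is a minimal sketch of a polite fetch loop that retries on 429/503 responses (delays and headers are arbitrary examples):

    # A sketch of fetching pages politely, backing off on 429/503 responses.
    import time
    import requests

    def polite_get(url, max_retries=5, base_delay=5):
        for attempt in range(max_retries):
            resp = requests.get(url, headers={"User-Agent": "research-crawler"}, timeout=30)
            if resp.status_code in (429, 503):
                # honor Retry-After if it is given in seconds, otherwise back off exponentially
                retry_after = resp.headers.get("Retry-After", "")
                delay = int(retry_after) if retry_after.isdigit() else base_delay * (2 ** attempt)
                time.sleep(delay)
                continue
            return resp
        return None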

4. GDELT

Limitations: 

a. While GDELT offers a massive collection of multilingual news datasets, it fetches from a predefined set of news sources (every 15 minutes). Researchers looking to study specific news outlets may or may not find them among the sources GDELT covers.

b. For researchers who want to limit their work to a few news outlets, a few languages, etc., GDELT's massive datasets require more resources and computational power to consume, clean, and manipulate. Our method provides more flexibility: the selection of news sources and languages is left up to the researcher, and the resources needed to consume the data are far smaller.

c. News outlets change their websites' content management systems every few years to adopt new technologies and scale up; in the process, they often change their URL patterns and redirect the old URLs to the new ones. GDELT does not provide the new URLs. It is easy to write a script that follows the redirects to find the final destination (a sketch of such a step appears at the end of this section). However, for datasets of GDELT's size, this is another time-consuming step that requires even more resources and is likely to get the script blocked with 429 (Too Many Requests) or 503 (Service Unavailable) responses if not performed with caution. Nevertheless, I took that step and followed the redirects for all GDELT URLs for one day's worth of data (2013-04-01).

d. Although large, GDELT is far from complete; this is especially true for historical data. The GDELT project does not offer stories' URLs for news published before 2013-04-01, while our dataset has stories' URLs from 1999-01-09 to 2022-09-09. Moreover, I downloaded the GDELT 1.0 Global Knowledge Graph (GKG), which begins on 2013-04-01, and found that most of the stories' URLs we collected for 2013-04-01 do not exist in GDELT's GKG. These are the only URLs from aljazeera.com and arabnews.com in GDELT's GKG for 2013-04-01 (21 stories):

http://www.aljazeera.com/news/asia/2013/04/2013412275825670.html
http://www.aljazeera.com/news/asia-pacific/2013/04/20134212236684885.html
http://www.aljazeera.com/news/asia-pacific/2013/04/2013414483571148.html
http://www.aljazeera.com/news/middleeast/2013/04/201341224611981664.html
http://www.aljazeera.com/news/middleeast/2013/04/2013422481638781.html
http://www.aljazeera.com/news/africa/2013/04/201341215956231181.html
http://www.arabnews.com/news/446774
http://www.arabnews.com/news/446759
http://www.aljazeera.com/news/asia/2013/04/201341115459305164.html
http://www.aljazeera.com/news/middleeast/2013/04/2013418149731751.html
http://www.aljazeera.com/news/middleeast/2013/04/201341195914246178.html
http://www.aljazeera.com/news/asia-pacific/2013/04/2013411080656491.html
http://www.aljazeera.com/news/africa/2013/04/20134119386947435.html
http://www.aljazeera.com/news/asia-pacific/2013/04/20134154936420663.html
http://www.aljazeera.com/news/asia-pacific/2013/04/20134116214243786.html
http://www.aljazeera.com/news/americas/2013/04/201341883613809.html
http://www.aljazeera.com/news/asia-pacific/2013/04/201341105214824622.html
http://www.aljazeera.com/news/middleeast/2013/04/201341154620669520.html
http://www.aljazeera.com/news/middleeast/2013/04/20134116945897855.html
http://www.aljazeera.com/news/middleeast/2013/04/2013411947295350.html
http://www.aljazeera.com/news/africa/2013/04/201341124357441209.html

I only studied these two outlets in GDELT because they are the ones I have a collection from to compare against GDELT's collection. I left other outlets out of the comparison for that day because I have not collected stories from them, and I also excluded alarabiya.net and aljazeera.net because GDELT did not start collecting stories in languages other than English until 2015-02-19. In other words, our dataset has at least 16 years (1999-2015) worth of data that GDELT, by design, does not offer at all.

As far as data completeness goes, per news outlet, our dataset is much more complete than GDELT's. The following table shows the number of URLs for one day's worth of data (2013-04-01) in GDELT's GKG and in our dataset for aljazeera.com and arabnews.com.

News Outlet             GDELT    Our dataset    Shared    GDELT only    Our dataset only
Total URLs collected       21            128        18             3                 107
Aljazeera English          19             44        18             1                  25
Arab News                   2             84         0             2                  82

In all, while GDELT is a much larger dataset with far more NLP capabilities, our dataset offers more up-to-date URLs, is more manageable, and is 36 times more complete, per outlet, for the data collected on 2013-04-01.
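As mentioned in limitation (c) above, resolving GDELT's old URLs by following redirects can be scripted; a rough sketch (rate limiting and error handling kept minimal) might look like this:

    # A sketch of resolving old URLs to their final destinations by following redirects.
    import time
    import requests

    def resolve(url):
        try:
            # HEAD keeps the transfer small; allow_redirects follows the whole chain
            # (some servers mishandle HEAD, in which case a GET is a fallback)
            resp = requests.head(url, allow_redirects=True, timeout=30)
            return resp.url, resp.status_code
        except requests.RequestException:
            return None, None

    for old_url in ["http://www.aljazeera.com/news/middleeast/2013/04/2013422481638781.html"]:
        final_url, status = resolve(old_url)
        print(old_url, "->", final_url, status)
        time.sleep(1)  # be gentle to avoid 429/503 responses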

5. Sitemaps

Limitations:

I wrote scripts customized for each news outlet's website. While the performance is much higher than that of ultimate-sitemap-parser 0.5 and the data does not need to be reorganized by date, a separate custom script is needed for each website, depending on the structure of its sitemap. It is also not guaranteed that a news website offers an organized sitemap (a sitemap of sitemaps based on publication date), and the script must be updated, and the data collection redone, whenever the sitemap structure changes. To overcome this, I decided to use a library to collect links from the news outlets' sitemaps. Using the library proved more effective since it works on any news outlet's website without customization; however, the performance is not as high as that of custom parsers, and it can be time consuming for large, multi-level sitemaps. It is still possible to fetch all links for the largest news website in a few hours.
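A minimal sketch of the library-based approach, assuming the ultimate-sitemap-parser package (the homepage URL is only an example):

    # A sketch of collecting all sitemap URLs for one news website with ultimate-sitemap-parser.
    from usp.tree import sitemap_tree_for_homepage  # pip install ultimate-sitemap-parser

    tree = sitemap_tree_for_homepage("https://www.arabnews.com/")  # discovers and walks nested sitemaps
    urls = {page.url for page in tree.all_pages()}                 # deduplicate collected links
    print(len(urls), "URLs collected")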

After collecting all news stories' URLs from all sitemaps for the selected news outlets, the stories need to be grouped by date to avoid comparing two stories that report events happening on different dates. Grouping stories by date also allows researchers to limit their research to a subset of stories based on publication date if needed. This step is time consuming since every story has to be downloaded, and the library used to find the dates is not 100% reliable since the published date of a story cannot always be determined.
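One way this grouping could be done, using the htmldate library as an example date extractor (not necessarily the one used for this dataset):

    # A sketch of grouping story URLs by publication date; htmldate is one possible
    # extractor and cannot always determine a date.
    from collections import defaultdict
    from htmldate import find_date  # pip install htmldate

    def group_by_date(urls):
        groups = defaultdict(list)
        for url in urls:
            date = find_date(url)  # downloads the page when given a URL; returns 'YYYY-MM-DD' or None
            groups[date or "unknown"].append(url)
        return groups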

While collecting the URLs to build this dataset, I verified that each story is still accessible on the live web, which eliminates problems related to link rot (something that has to be done separately when using GDELT).

Collection methods we have not tried yet:

1. Web Crawling tools: A wide range of free and commercial web crawling tools is available on the web. I have not tested any of them yet.

2. Web Archives: Web archiving is probably the most popular topic among WSDL group members at ODU. It is one of my favorite research topics; however, a quick run of MemGator on a subset of the data showed that a good portion of the collected URIs are not archived on the web (a sketch of such a check appears after the examples below).

Examples:

https://www.arabnews.com/node/400848
https://www.arabnews.com/node/400821
https://www.arabnews.com/node/400855
https://www.arabnews.com/node/400810
https://www.arabnews.com/node/400834
https://www.arabnews.com/node/400730
https://www.arabnews.com/node/400765
https://www.arabnews.com/node/400840
https://www.arabnews.com/node/400844
https://www.arabnews.com/node/400731
https://www.arabnews.com/node/400798
https://www.arabnews.com/node/400843
https://www.aljazeera.com/program/the-stream/2011/12/12/has-the-arab-spring-taken-an-islamist-turn
https://www.aljazeera.com/news/2011/12/12/peru-cabinet-shuffle-brings-crackdown-fears
https://www.aljazeera.com/sports/2011/12/12/poland-boss-smuda-asks-for-fans-support
https://www.aljazeera.net/videos/2011/12/12/شركة-سيارات-أجرة-مخصصة-للنساء-فقط-أحمد
https://www.aljazeera.net/videos/2011/12/12/كتاب-شعارات-وهتافات-الثورة-المصرية
https://www.aljazeera.net/videos/2011/12/12/مخاوف-من-انزلاق-االيمن-في-دوامة-العنف
https://www.aljazeera.net/videos/2011/12/12/طائفة-اليزيدية-في-كردستان-العراق
https://www.aljazeera.net/videos/2011/12/12/المسلمون-في-بريطانيا-و-عام-على
https://www.aljazeera.net/videos/2011/12/12/إختتام-أعمال-الهيئات-المالية-العربية
https://www.aljazeera.net/videos/2011/12/12/النظرة-الإسرائيلية-لاتفاق-السلام-مع
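A rough sketch of how such a check could be scripted against a MemGator aggregator, assuming the public instance at memgator.cs.odu.edu:

    # A sketch of checking whether URIs have mementos via a MemGator TimeMap endpoint.
    # MemGator typically responds 404 when no mementos exist for a URI-R.
    import requests

    MEMGATOR = "https://memgator.cs.odu.edu/timemap/link/"

    def is_archived(uri):
        resp = requests.get(MEMGATOR + uri, timeout=120)
        return resp.status_code == 200  # 200 with a TimeMap means at least one memento exists

    print(is_archived("https://www.arabnews.com/node/400848"))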

One other avenue that could be taken after collecting the dataset is pushing the unarchived URIs to one of the web archives like The Internet Archive, but that's another blog post for another effort.

Results:

Using our method allowed us to find out which URIs are no longer available on the live web so they could be eliminated from our dataset. Furthermore, we were able to identify URIs that most likely do not lead to a news story (category pages, home pages, about us, authors, contact us, etc.). In this dataset, we treat non-textual news stories, such as stories published in video format and recorded shows, as news stories; it is possible to eliminate them by setting a threshold on the number of words or characters a page must contain to count as a news story (a sketch of such a filter follows the table below). The table below shows our results. In terms of data quality, Aljazeera English scored the highest at 0.999, indicating that nearly all of the URIs collected from its sitemap link to news stories, while Arab News scored the lowest at 0.809.

News Outlet          Sitemap links    Links not returning 200    Non-news links    News stories    News links / Sitemap links
Alarabiya                  136,881                          29               581         136,271                         0.996
Aljazeera Arabic           816,892                         939               294         815,659                         0.998
Aljazeera English          277,525                          14               161         277,350                         0.999
Arab News                  307,179                          28            58,634         248,517                         0.809
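As noted above, non-textual stories could be filtered out with a word-count threshold; a sketch of such a filter (the threshold and text extraction method are arbitrary examples) could look like this:

    # A sketch of filtering pages by visible word count; the 100-word threshold is arbitrary.
    import requests
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    def is_textual_story(url, min_words=100):
        resp = requests.get(url, timeout=30)
        if resp.status_code != 200:
            return False
        text = BeautifulSoup(resp.text, "html.parser").get_text(separator=" ")
        return len(text.split()) >= min_words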

The dataset is available on GitHub.
