Wednesday, September 13, 2017

2017-09-13: Pagination Considered Harmful to Archiving



Figure 1 - 2016 U.S. News Global Rankings Main Page as Shown on Oct 30, 2015


Figure 2 - 2016 U.S. News Global Rankings Main Page With Pagination Scheme as Shown on Oct 30, 2015
https://web.archive.org/web/20151030092546/https://www.usnews.com/education/best-global-universities/rankings

While gathering data for our work in measuring the correlation of university rankings by reputation and by Twitter followers (McCoy et al., 2017), we discovered that many of the web pages which comprised the complete ranking list for U.S. News in a given year were not available in the Internet Archive. In fact, 21 of 75 pages (or 28%)  had never been archived at all. "... what is part of and what is not part of an Internet resource remains an open question" according to research concerning Web archiving mechanisms conducted by Poursadar and Shipman (2017).  Over 2,000 participants in their study were presented with various types of web content (e.g., multi-page stories, reviews, single page writings) and surveyed regarding their expectation for later access to additional content that was linked from or appeared on the main page.  Specifically, they investigated (1) how relationships between page content affect expectations and (2) how perceptions of content value relate to internet resources. In other words, if I save the main page as a resource, what else should I expect to be saved along with it?

I experienced this paradox first hand when I attempted to locate an historical entry from the 2016 edition of the U.S. News Best Global University Rankings.  As shown in Figure 1, October 30, 2015 is a particular date of interest because on the day prior, a revision of the original ranking for the University at Buffalo-SUNY was reported. The university's ranking was revised due to incorrect data related to the number of PhD awards. A re-calculation of the ranking metrics resulted in a change of the university's ranking position from a tie at No. 344 to a tie at No. 181.


Figure 3 - Summary of U.S. News
https://web.archive.org/web/*/https://www.usnews.com/education/best-global-universities/rankings



 

Figure 4 - Capture of U.S. News Revision for Buffalo-SUNY
https://web.archive.org/web/20160314033028/http://www.usnews.com/education/best-global-universities/rankings?page=19

A search of the Internet Archive, Figure 3, shows the U.S. News web site was saved 669 times between October 28, 2014 and September 3, 2017. We should first note that regardless of the ranking year you choose to locate via a web search, U.S. News reuses the same URL from year to year. Therefore, an inquiry against the live web will always direct you to their most recent publication. As of September 3, 2017, the redirect would be to the 2017 edition of their ranking list. Next, as shown in Figure 2, the 2016 U.S. News ranking list consisted of 750 universities presented in groups of 10 spread across 75 web pages. Therefore, the revised entry for the University at Buffalo-SUNY at rank No. 181 should appear on page 19, Figure 4.

Page No.
Captures
Start Date
End Date
1
669
10/28/2014
09/03/2017
2
434
10/28/2014
08/15/2017
3
171
10/28/2014
07/17/2017
4
43
10/28/2014
01/13/2017
5
37
10/28/2014
01/13/2017
6
7
10/28/2014
06/29/2017
7
4
09/29/2015
01/13/2017
8
1
01/12/2017
01/12/2017
9
2
01/28/2016
01/12/2017
10
1
01/12/2017
01/12/2017
11
2
01/12/2017
02/22/2017
12
1
02/22/2017
02/22/2017
13
1
02/22/2017
02/22/2017
14
2
02/22/2017
03/30/2017
15
2
03/12/2017
03/12/2017
16
2
03/30/2017
06/30/2017
17
2
06/20/2015
07/16/2017
18
4
06/19/2015
07/16/2017
19
3
06/18/2015
07/16/2017
Table 1 - Page Captures of U.S. News (Abbreviated Listing)

While I could readily locate the main page of the 2016 list as it appeared on October 30, 2015, I noted that subsequent pages were archived with diminishing frequency and over a much shorter period of time. We see in Table 1, after the first three pages, there can be a significant variance in the frequency with which the remaining site pages are crawled. And, as was noted earlier, more than a quarter (28%) of the ranking list cannot be reconstructed at all. Ainsworth and Nelson examined the degree of temporal drift that can occur during the display of sparsely archived pages using the Sliding Target policy allowed by the web archive user interface (UI); namely many years in just a few clicks. Since a substantial portion of the U.S. News ranking list is missing, it is very likely the web browsing experience will result in a hybrid list of universities that encompasses different ranking years as the user follows the page links.


Figure 5 - Frequency of Page Captures

Ultimately, we found page 19 had been captured three times during the specified time frame. However, the page containing the revised ranking that was of interest, Figure 4, was not available in the archive until March 14, 2016; almost five months after the ranking list had been updated. Further, in Figure 5, we note heavy activity for the first few and last few pages of the ranking list which may occur because, as shown on Figure 2, these links are presented prominently on page 1. The remaining pages 3 through 5 must be discovered manually by clicking on the next page. We note, in Figure 5, here the sporadic capture scheme for these intermediate pages.

Current web designs which feature pagination create a frustrating experience for the user when subsequent pages are omitted in the archive. It was my expectation that all pages associated with the ranking list would be saved in order to maintain the integrity of the complete listing of universities as they appeared on the publication date. My intuition is consistent with Poursadar and Shipman, who among their other conclusions, noted that navigational distance from the primary page can affect perceptions regarding what is considered to be viable content that should be preserved.  However, for multi-page articles, nearly 80% of the participants in their study considered linked information in the later pages as part of the resource. This perception was especially profound "when the content of the main and connected pages are part of a larger composition or set of information" as in perhaps a ranking list.


Overall, the findings of Poursadar and Shipman along with our personal observations indicate that archiving systems require an alternative methodology or domain rules that recognize when content spread across multiple pages represent a single collection or a composite resource that should be preserved in its entirety. From a design perspective, we can only wonder why there isn't a "view all" link on multi-page content such as the U.S. News ranking list. This feature might present a way to circumvent paginated design schemes so the Internet Archive can obtain a complete view of a particular web site; especially if the "view all" link is located on the first few pages which appear to be crawled most often. On the other hand, the use of pagination might also represent a conscious choice by the web designer or site owner as a way to limit page scraping even though people can still find a way to do so. Ultimately, the collateral damage associated with this type of design scheme is an uneven distribution in the archive; resulting in an incomplete archival record.

Sources:

Scott G. Ainsworth and Michael L. Nelson. "Evaluating sliding and sticky target policies by measuring temporal drift in acyclic walks through a web archive." International Journal on Digital Libraries 16: 129-144. DOI: 10.1007/s00799-014-0120-4

Corren G. McCoy, Michael L. Nelson, Michele C. Weigle, "University Twitter Engagement: Using Twitter Followers to Rank Universities." 2017. Technical Report. arXiv:1708.05790.

Faryaneh Poursardar and Frank Shipman, "What Is Part of That Resource?User Expectations for Personal Archiving.", Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries, 2017.
-- Corren (@correnmccoy)


No comments:

Post a Comment