Wednesday, November 22, 2017

2017-11-22: Deploying the Memento-Damage Service





Many web services such as archive.isArchive-ItInternet Archive, and UK Web Archive have provided archived web pages or mementos for us to use. Nowadays, the web archivists have shifted their focus from how to make a good archive to measuring how well the archive preserved the page. It raises a question about how to objectively measure the damage of a memento that can correctly emulate user (human) perception.

Related to this, Justin Brunelle devised a prototype for measuring the impact of missing embedded resources (the damage) on a web page. Brunelle, in his IJDL paper (and the earlier JCDL version), describes that the quality of a memento depends on the availability of its resources. The straight percentage of missing resources in a memento is not always a good indicator of how "damaged" it is. For example, one page could be missing several small icons whose absence users never even notice, and a second page could be missing a single embedded video (e.g., a Youtube page). Even though the first page is missing more resources, intuitively the second page is more damaged and less useful for users. The damage value ranges from 0 to 1, where a damage of 1 means the web page lost all of its embedded resources. Figure 1 gives an illustration of how this prototype works.

Figure 1. The overview flowchart of Memento Damage
Although this prototype has been successfully proven to be capable of measuring the damage, it is not user ready. Thus, we implemented a web service, called Memento-Damage, based on the prototype.

Analyzing the Challenges

Reducing the Calculation Time

As previously explained, the basic notion of damage calculation is mirroring human perception of a memento. Thus, we analyze the screenshot of the web page as a representation of how the page looks in the user's eyes. This screenshot analysis takes the most time of the entire damage calculation process.

The initial prototype is built on top of the Perl programming language and used PerlMagick to analyze the screenshot. This library dumps the color scheme (RGB) of each pixel in the screenshot into a file. This output file will then be loaded by the prototype for further analysis. Dumping and reading the pixel colors of the screenshot take a significant amount of time and it is repeated according to the number of stylesheets the web page has. Therefore, if a web page has 5 stylesheets, the analysis will be repeated 5 times even though it uses the same screenshot as the basis.

Simplifying the Installation and Making It Distributable
Before running the prototype, users are required to install all dependencies manually. The list of dependencies is not provided. Users must discover it themselves by identifying the error that appears during the execution. Furthermore, we need to ‘package’ and deploy this prototype into a ready-to-use and distributable tool that can be used widely in various communities. How? By providing 4 different ways of using the service:
Solving Other Technical Issues
Several technical issues that needed to be solved included:
  1. Handling redirection (status_code  = 301, 302, or 307).
  2. Providing some insights and information.
    The user will not only get the final damage value but also will be informed about the detail of the process that happened during the crawling and calculation process as well as the components that make up the value of the final damage. If an error happened, the error info will also be provided.  
  3. Dealing with overlapping resources and iframes. 

Measuring the Damage

Crawling the Resources
When a user inputs a URI-M into the Memento-Damage service, the tool will check the content-type of the URI-M and crawl all resources. The properties of the resources, such as size and position of an image, will be written into a log file. Figure 2 summarizes the crawling process conducted in Memento-Damage. Along with this process, a screenshot of the website will also be created.

Figure 2. The crawling process in Memento-Damage

Calculating the Damage
After crawling the resources, Memento-Damage will start calculating the damage by reading the log files that are previously generated (Figure 3). Memento-Damage will first read the network log and examine the status_code of each resource. If a URI is redirected (status_code = 301 or 302), it will chase down the final URI by following the URI in the header location as depicted in Figure 4. Each resource will be processed in accordance with its type (image, css, javascript, text, iframe) to obtain its actual and potential damage value. Then, the total damage is computed using the formula:
\begin{equation}\label{eq:tot_dmg}T_D = \frac{A_D}{P_D}\end{equation}
where:
     $ T_D = Total Damage \\
        A_D = Actual Damage \\
        P_D = Potential Damage $

The formula above can be further elaborated into:
$$ T_D = \frac{A_{D_i} + A_{D_c} + A_{D_j} + A_{D_m} + A_{D_t} + A_{D_f}}{P_{D_i} + P_{D_c} + P_{D_j} + P_{D_m} + P_{D_t} + P_{D_f}} $$
where each subscript notation represent image (i), css (c), javascript (j), multimedia (m), text, and iframe (f), respectively. 
For image analysis, we use Pillow, a python image library that has better and faster performance than PerlMagick. Pillow can read pixels in an image without dumping it into a file to speed up the analysis process. Furthermore, we modify the algorithm so that we only need to run the analysis script once for all stylesheets.
Figure 3. The calculation process in Memento-Damage

Figure 4. Chasing down a redirected URI

Dealing with Overlapping Resources

Figure 5. Example of a memento that contains overlapping resources (accessed on March 30th, 2017)
URIs with overlapping resources such the one illustrated in Figure 5 need to be treated differently to prevent the damage value from being double counted. To solve this problem, we created a concept of rectangle (Rectangle = xmin, ymin, xmax, ymax). We perceive the overlapping resources as rectangles and calculate the size of the intersection area. The size of one of the overlapped resource will be reduced by the intersection size, while the other overlapped resource will get the whole full size. Figure 6 and Listing 1 give the illustration of the rectangle concept.
Figure 6. Intersection concept for overlapping resources in an URI
def rectangle_intersection_area(a, b):
    dx = min(a.xmax, b.xmax) - max(a.xmin, b.xmin)
    dy = min(a.ymax, b.ymax) - max(a.ymin, b.ymin)
    if (dx >= 0) and (dy >= 0): 
        return dx * dy
Listing 1. Measuring image rectangle in Memento-Damage

Dealing with Iframes

Dealing with iframe is quite tricky and requires some customization. First, by default, crawling process cannot access content inside of iframe using native javascript or JQuery selector due to a cross-domain problem. This problem becomes more complicated when this iframe is nested in another iframe(s). Therefore, we need to find a way to switch from main frame to the iframe. To handle this problem, we utilize the API provided by PhantomJS that facilitates switching from one iframe to another. Second, the location properties of the resources inside of iframe are calculated relative to that particular iframe position, not to the main frame position. It could lead to a wrong damage calculation. Thus, for a resource located inside an iframe, its position must be computed in a nested calculation by taking into account the position of its parent frame(s)

Using Memento-Damage

The Web Service

a. Website
The Memento-Damage website gives the easiest way to use the Memento-Damage tool. However, since it runs on a resource-limited server provided by ODU, it is not recommended for calculating damage a large number of requests. Figure 7 shows a brief preview of the website.
Figure 7. The calculation result from Memento-Damage

b. REST API
The REST API service is part of the web service which facilitates damage calculation from any HTTP Client (e.g. web browser, CURL, etc) and gives output in JSON format. This makes it possible for the user to do further analysis with the resulting output. Using REST API, a user can create a script and calculate damage on a few number of URIs (e.g. 5).
The default REST API usage for memento damage is:
http://memento-damage.cs.odu.edu/api/damage/[the input URI-M]
Listing 2 and Listing 3 show examples of using Memento-Damage REST API with CURL and Python.
curl http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci
Listing 2. Using Memento-Damage REST API with Curl
import requests
resp = requests.get('http://memento-damage.cs.odu.edu/api/damage/http://odu.edu/compsci')
print resp.json()
Listing 3. Using Memento-Damage REST API as embedded code in Python

Local Service

a. Docker Version
The Memento-Damage docker image uses Ubuntu-LXDE for the desktop environment. A fixed desktop environment is used to avoid an inconsistency issue of the damage value of the same URI run by different machines with different operating systems. We found that PhantomJS, the headless browser that is used for generating the screenshot, rendered the web page in accordance with the machine's desktop environment. Hence, the same URI could have a slightly different screenshot, and thus different damage values when run on different machines (Figure 8). 


Figure 8. Screenshot of https://web.archive.org/web/19990125094845/http://www.dot.state.al.us taken by PhantomJS run on 2 machines with different OS.
To start using the Docker version of Memento-Damage, the user can follow these  steps: 
  1. Pull the docker image:
    docker pull erikaris/memento-damage
    
  2. Run the docker image:
    docker run -it -p :80 --name  
    
    Example:
    docker run -i -t -p 8080:80 --name memdamage erikaris/memento-damage:latest
    
    After this step is completed, we now have the Memento-Damage web service running on
    http://localhost:8080/
    
  3. Run memento-damage as a CLI using the docker exec command:
    docker exec -it <container name> memento-damage <URI>
    Example:
    docker exec -it memdamage memento-damage http://odu.edu/compsci
    
    If the user wants to work from the inside of the Docker's terminal, use this following command:
    docker exec -it <container name> bash
    Example:
    docker exec -it memdamage bash 
  4. Start exploring the Memento-Damage using various options (Figure 9) that can be obtained by typing
    docker exec -it memdamage memento-damage --help
    or if the user is already inside the Docker's container, just simply type:
    memento-damage --help
~$ docker exec -it memdamage memento-damage --help
Usage: memento-damage [options] <URI>

Options:
  -h, --help            show this help message and exit
  -o OUTPUT_DIR, --output-dir=OUTPUT_DIR
                        output directory (optional)
  -O, --overwrite       overwrite existing output directory
  -m MODE, --mode=MODE  output mode: "simple" or "json" [default: simple]
  -d DEBUG, --debug=DEBUG
                        debug mode: "simple" or "complete" [default: none]
  -L, --redirect        follow url redirection

Figure 9. CLI options provided by Memento-Damage

Figure 10 depicts an output generated by CLI-version Memento-Damage using complete debug mode (option -d complete).
Figure 10. CLI-version Memento-Damage output using option -d complete
Further details about using Docker to run Memento-Damage is available on http://memento-damage.cs.odu.edu/help/.

b. Library
The library version offers functionality (web service and CLI) that is relatively similar to that of the Docker version. It is aimed at the people who already have all the dependencies (PhantomJS 2.xx and Python 2.7) installed on their machine and do not want to bother installing Docker. The latest library version can be downloaded from GitHub.
Start using the library by following these steps:
  1. Install the library using the command:
    sudo pip install web-memento-damage-X.x.tar.gz
  2. Run the Memento-Damage as a web service:     memento-damage-server
  3. Run the Memento-Damage via CLI:   memento-damage <URI>
  4. Explore available options by using option --help:
    memento-damage-server --help                     (for the web service)
    or 
    memento-damage --help                                  (for CLI)

Testing on a Large Number of URIs

To prove our claim that Memento-Damage can handle a large number of URIs, we conducted a test on 108,511 URI-Ms using a testing script written in Python. The testing used the Docker version of Memento-Damage that was run on a machine with specification: Intel(R) Xeon(R) CPU E5-2660 v2 @2.20GHz, Memory 4 GiB. The testing result is shown below.

Summary of Data
=================================================
Total URI-M: 108511
Number of URI-M are successfully processed: 80580
Number of URI-M are failed to process: 27931
Number of URI-M has not processed yet: 0

With a dataset of 108,511 input URI-Ms tested, we found 80,580 URI-Ms were successfully processed while the rest, 27,931 URI-Ms, failed to process. The processing failure on those 27,931 URI-Ms happened because of the concurrent limitation access from Internet Archive. On average, 1 URI-M needs 32.5 seconds processing time. This is 110 times faster than the prototype version, which takes an average of 1 hour to process 1 URI-M. In some cases, the prototype version even took almost 2 hours to process 1 URI-M. 

From those successfully processed URI-Ms, we managed to create some visualizations to help us better understand the result as can be seen below. The first graph (Figure 11) shows the number of average missing embedded resources per memento per year according to the damage value (Dm) and the missing resources (Mm). The most interesting case appeared in 2011 where the Dm value was significantly higher than Mm. It means, although on the average the URI-Ms in 2011 only lost 4% of their resources, these losses actually caused 4 (four) times higher damages than the number showed by Mm. On the other hand, in 2008, 2010, 2013, and 2017, Dm value is lower than Mm, which implies those missing resources are less important.
Figure 11. The average embedded resources missed per memento per year
Figure 12. Comparison of All Resources vs Missing Resources

The second graph (Figure 12) shows the number of total resources in each URI-Ms and its missing resources.  The x-axis represents each URI-M sorted in descending order by the number of resources, while the y-axis represents the number of resources owned by each URI-M. This graph gives us an insight that almost every URI-M lost at least one of its embedded resources. 

Summary

In this research, we have improved the calculation method for measuring the damage on a memento (URI-M) based on the prototype from the previous version. The improvements include reducing calculation time, fixing various bugs, handling redirection and a new type of resources. We successfully developed the Memento-Damage into a comprehensive tool that has the ability to show the detail of every resource that contributes to the damage. Furthermore, it also provides several approaches for utilizing the tool such as python library and the Docker version. The testing result shows that Memento-Damage works faster compared to the prototype version and can handle a larger number of mementos. Table 1 summarizes the improvements that we made on Memento-Damage compared to the initial prototype. 

No
Subject
Prototype
Memento Damage
1.Programming LanguageJavascript + PerlJavascript + Python
2.InterfaceCLICLI
Website
REST API
3.DistributionSource CodeSource Code
Python library
Docker
4.OutputPlain TextPlain Text
JSON
5.Processing timeVery slowFast
6.Includes IFrameNAAvailable
7.Redirection HandlingNAAvailable
8.Resolve OverlapNAAvailable
9.Blacklisted URIsOnly has 1 blacklisted URI which is added manuallyAdd several new blacklisted URIs. Blacklisted URIs are identified based on a certain pattern.
10.Batch executionNot supportedSupported
11.DOM selector capabilityonly support simple selection querysupport complex selection query
12.Input filteringNAOnly process an input of HTML format
Table 1. Improvement on Memento-Damage compared to the initial prototype

Help Us to Improve

This tool still needs a lot of improvements to increase its functionality and provide a better user experience. We really hope and strongly encourage everyone, especially people who work in web archiving field, to try this tool and give us feedback. Please kindly read the FAQ and HELP before starting using the Memento-Damage. Help us to improve and tell us what we can do better by posting any bugs, errors, issues, or difficulties that you find in this tool on our GitHub

- Erika Siregar -

No comments:

Post a Comment