2023-05-25: Generative Archive Restoration

 

Rise of the Machines!

Machine Learning just cannot seem to keep itself out of news cycles. OpenAI's generative dialogue agent ChatGPT, built on the third generation of its GPT language models, had tech giants all around scrambling, in quite a circus, to push out their own versions. Google's size and stagnation in recent years had it seeing red and bringing Sergey Brin back into the fold to aid in rushing out its own chatbot, Bard. Microsoft, comparatively, has been humming along for a while now with its own research in the AI agent space, but various headlines hint that its past and present efforts might not be paying off as well as it was hoping. You.com is a relatively new search engine leveraging machine learning in its own chat assistant, YouChat, and other services, attempting to push the frontiers of search through multi-modal queries and integrated artificial intelligence enhancements. These efforts are all seeking to shake up how we seek and retrieve information on the web, hoping to provide information to the masses through the proxy of a Turing Test instead of hyperlink rabbit holes.

While many seek to transform the future of information retrieval and synthesis, what of the reverse direction? Similar to how 3D printing and digital technologies are being employed to help recreate ancient artifacts and architecture destroyed by ISIS in Iraq and Syria (related: Bassel Khartabil), could we use the technologies of our present to recreate the digital past by utilizing AI to synthesize web resources that were not archived?

Image credit: Joseph Eid (https://www.theguardian.com/world/ng-interactive/2016/apr/08/palmyra-after-islamic-state-isis-visual-guide)

Much of the work I have been doing with WS-DL examines broken web pages, particularly archived ones. Damage to a web page can stem from its structural content, such as stylesheets and JavaScript code, from the text and multimedia content displayed to visitors, and from content imported from other locations on the Web. Measuring actual and perceived damage to a web page is not as straightforward as it might seem, though. Different groups of individuals have different focuses and opinions on what constitutes "damage" and which elements of a web page are meaningful to their particular needs. Visual qualities play a big part in our perception of what is or is not damaged when looking at a web page: a page with a missing stylesheet might be rendered in such a jumbled manner that elements are obscured or illegible and the page is deemed broken. But to a robot crawling the web for hyperlinks, a digital librarian simply seeking academic sources, or a blind surfer utilizing a screen reader, the fact that your hyperlinks should have a yellow-to-magenta color transition on hover doesn't matter at all, as long as the text elements are accessible.

Figure 1. Damage approximations
Archives with missing images are an interesting case I have been working on lately. If we take a look at an archived web page with a missing image resource, we can deduce that the page has suffered some amount of damage, but how much is that image contributing? Without knowing what the image depicted or its intended dimensions, deriving this value can be quite complicated. For instance, there is no guarantee that an image will have an accompanying width or height attribute, and once we account for dynamically-sized layouts, we can only assume a maximum upper bound based on the size of the image's parent container. We could also attempt to derive semantic importance from neighboring content, should any exist (Fig. 1). There are numerous other considerations involved in calculating web page damage, but what if we could not only measure the damage of missing web archive content but also retroactively repair or lessen the extent of the perceived damage?
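As a rough illustration of that sizing heuristic, the sketch below checks an img element for explicit width/height attributes and falls back to a sized parent container, then to an assumed default bound. The helper names, the pixel-only parsing, and the 300x150 default are my own placeholders for illustration, not part of any WS-DL tooling.

```python
# A minimal sketch, assuming we only have the archived HTML to work with.
# Helper names and the default bound below are hypothetical placeholders.
from bs4 import BeautifulSoup

DEFAULT_BOUND = (300, 150)  # assumed fallback when no sizing info survives


def _px(value):
    """Parse a plain pixel value; ignore percentages and other units."""
    return int(value) if str(value).isdigit() else None


def missing_image_bounds(html, src_substring):
    """Estimate an upper bound (width, height) for a missing image."""
    soup = BeautifulSoup(html, "html.parser")
    for img in soup.find_all("img"):
        if src_substring not in (img.get("src") or ""):
            continue
        # Case 1: explicit attributes give us the intended size directly.
        w, h = _px(img.get("width")), _px(img.get("height"))
        if w and h:
            return w, h
        # Case 2: borrow declared dimensions from a sized parent container.
        parent = img.find_parent(attrs={"width": True, "height": True})
        if parent is not None:
            w, h = _px(parent["width"]), _px(parent["height"])
            if w and h:
                return w, h
        # Case 3: no usable sizing information; fall back to an assumed bound.
        return DEFAULT_BOUND
    return None  # image not found in this snapshot


# Usage (hypothetical archived page and image name):
# missing_image_bounds(archived_html, "flood_hydrograph.gif")
```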

I first heard of something similar to this technique (though in a more insidious context) in a documentary about China's Internet giant Tencent. It detailed how Tencent and a company called Mirriad were seeking to utilize machine learning to retroactively inject advertisements into films and other video formats. Business Insider penned an article that brings up the relevant crossover point of temporal violations resulting from this practice being applied to older content, such as films or television shows. It might be one thing to simply go through a film and replace every Pepsi logo on a soda can with a Coca-Cola logo. Mirriad doesn't stop there, though, going further by creating and inserting generated ads into scenes that never originally had them (Fig. 2, below). In 2018, The Guardian quoted Mirriad executive Mark Popkiewicz as stating, “It is all about putting in images that were never there,” describing how the technology paves the way for advertisers to extend their reach into the stories and plots of media, beyond simple product placement. In our case, we do not seek to add additional information or insert non-original media content into a damaged archive but instead, more nobly, attempt to restore what is missing or has been lost to time.

     
Figure 2. T-Mobile advertisement digitally added to a film scene (https://www.businessinsider.com/new-ads-products-insert-old-movies-content-mirriad-2021-5)

Intelligently Integrated Images

For the task of generating a similar image from a set of input images, ideally we would utilize a Generative Adversarial Network (GAN). This technique is not very feasible for our task, though, as it would require us to collect and store a plethora of image data to train the neural network. GANs are also typically specialized toward a particular subject, such as images of cats. The images we need to generate could be anything, so our generator must be capable of outputting a broad range of images. In the near future, I hope research into data science and web archiving might enable the utilization of massive datasets, such as that from Common Crawl, to broaden and enhance the capabilities of these generative media technologies. In the meantime, this is where the current crop of dialogue agents might be able to shine!

ChatGPT, being the most publicized AI agent as of late, might be the first agent we reach for, but ChatGPT will not generate images for us (at least not without some trickery). Many agents I looked at are also only interactive within a web page interface, are relatively expensive, or do not have some form of library or API to programmatically interact with. Midjourney is an AI agent that allows for image inputs but must be interacted with through the chat application Discord, which disqualifies it for our purposes. DeepAI is closer to our needs, capable of generating images and usable via cURL, but it does have an API cost to take into account. One service that shows some promise is CLIP-as-service from Jina.AI. CLIP (Contrastive Language-Image Pre-Training) can be integrated into a Python code base, ships with a pre-built image dataset, and can additionally attempt to analyze image content. This is very useful for drawing information from neighboring images that might be missing annotations or annotated incorrectly, have no alt-text, or have a URL lacking any descriptors we could extract.

If there are no other images we can draw information from, we might be able to synthesize a prompt from the text of the page itself (a rough sketch of this idea follows below). Ideally, this text would come from the most localized content near a missing image's intended location on the page, with the scope widened as necessary until a meaningful amount of information can be gathered. In the worst case, we could utilize all of the text on a page (filtering out any junk text, of course) to derive phrases or keywords and then use those to create a prompt. Even if the generated images are not entirely accurate, if they can be close enough in approximation, they might provide a more visually appealing experience than missing-image placeholders littered about an archived page. There are, of course, caveats to this approach: some pages might have a gallery of images that aren't too important individually, but there are also plenty of cases where the specific content of an image is vitally important.
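As a minimal sketch of that fallback, the snippet below pulls the broken image's alt text plus nearby captions, headings, and paragraphs, and folds them into a text prompt. The tag choices, the three-element lookahead, and the word cap are assumptions for illustration, not part of CLIP-as-service or any fixed method.

```python
# A minimal sketch of synthesizing a prompt from page text.
# The tag list, lookahead limit, and 25-word cap are assumptions.
from bs4 import BeautifulSoup


def synthesize_prompt(html, broken_src, max_words=25):
    """Build a rough text prompt for an image missing from an archived page."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.find("img", src=broken_src)
    if img is None:
        return None
    clues = []
    # Most localized clue first: the image's own alt text, if any.
    if img.get("alt"):
        clues.append(img["alt"])
    # Widen the scope: captions, headings, and paragraphs following the image.
    for sibling in img.find_all_next(["figcaption", "h1", "h2", "h3", "p"], limit=3):
        clues.append(sibling.get_text(" ", strip=True))
    words = " ".join(clues).split()[:max_words]
    if not words:
        return None
    return "Generate an image depicting " + " ".join(words)
```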

In the screenshot below (Fig. 3) we can see an archived USGS page with two missing GIFs that would normally be a primary focus of the page. These missing images do include size attributes and even some accompanying text, but it is not clear from the markup alone whether they are static or animated GIFs. At this point in time, machine learning is only beginning to make inroads into animation and video, and I have not found an agent capable of taking the title and data located in the table below the missing image and generating a meaningful static or animated data visualization on the fly.


Because CLIP can work with a broad array of topics, we aren't locked into one specific pool of images (such as dogs) and can accommodate the diverse subjects of multiple archived pages. After initial testing, though, CLIP's biggest weakness turns out to be its pre-trained image set. This dataset is immensely helpful to have as a built-in feature for testing and development, but with only a little over 12,000 images it is far from capable of robust and accurate image generation compared to the billions-plus image sets, such as LAION-5B, used by Google's Imagen and other generative machine learning heavy hitters. Acquiring large image datasets doesn't come without massive time and investment, or without ethical shortcuts and controversy. Much like Amazon helping itself to your home internet unless you explicitly opt your Amazon Echo spy speaker out of its Sidewalk program, some of the larger contributors to the LAION-5B dataset have drawn scrutiny for placing the burden of opting out of their dragnet scraping practices on web administrators.

Image Generation

CLIP-as-service

For preliminary testing, I primarily looked at an archived GitHub page containing various sets of images, many of which contain at least one missing image, as seen below in Figure 4. Alt text can be helpful in determining context, but without applied style rules or explicit height and width attributes it can be difficult to discern the size or importance of what is missing. In these cases, it is also helpful to look at the context provided by related or neighboring images on the page.

Figure 4. Archived GitHub page containing multiple image clusters with missing images... "Enhance!"


Here, I was trying to see whether I could even get applicable results from such generative prompts and how "in the ballpark" the results could get. I spun up a local CLIP instance and used the remaining images from the clusters as programmatic prompt inputs. Using the pre-trained image set from CLIP, I obtained the following results from the remaining images of three out of the four image sets (Fig. 5). For each image set, I tested CLIP with purely image inputs as well as with the images accompanied by a brief, appropriately constructed text prompt, such as "Generate a similar image of the Eiffel Tower using the provided images", with CLIP set to return 3 image results per prompt. In Figures 5 and 6 below, I have provided the input images and the respective output images retrieved from a locally run CLIP instance.
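For reference, a test along these lines might look like the minimal sketch below, assuming a local CLIP-as-service instance listening on grpc://0.0.0.0:51000. The candidate-image list and the cosine-similarity ranking are my own stand-ins for the built-in dataset lookup, not the exact procedure used for the figures here.

```python
# Minimal sketch: score candidate images against a cluster of surviving
# images plus an optional text hint, using a local CLIP-as-service server.
# The file paths and candidate pool below are illustrative placeholders.
import numpy as np
from clip_client import Client

client = Client("grpc://0.0.0.0:51000")  # assumed local CLIP-as-service endpoint

remaining = ["eiffel_1.jpg", "eiffel_2.jpg", "eiffel_3.jpg"]   # surviving cluster images
prompt = "a photo of the Eiffel Tower"                          # optional text hint
candidates = ["dataset/img_0001.jpg", "dataset/img_0002.jpg"]   # stand-in candidate pool

# Encode everything into CLIP's shared image/text embedding space.
query_vecs = client.encode(remaining + [prompt])
cand_vecs = client.encode(candidates)

# Average the query embeddings and rank candidates by cosine similarity.
query = query_vecs.mean(axis=0)
query /= np.linalg.norm(query)
cand_vecs = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
scores = cand_vecs @ query

for path, score in sorted(zip(candidates, scores), key=lambda x: -x[1])[:3]:
    print(f"{score:.3f}  {path}")
```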

Figure 5: Poor and mediocre results from pre-trained CLIP image set

In the Eiffel Tower image set, using only the remaining images as input, CLIP returned a single non-Eiffel-Tower-like monument and two irrelevant images. When including a prompt containing relevant keywords, I actually received even less relevant results. I repeated this using the other image clusters on the page, specifically children playing in piles of leaves, baseball game highlights, and the Notre Dame cathedral. The children and baseball clusters also produced mediocre results: the children's cluster returned images without any leaves, and the most relevant baseball image was a closeup of Derek Jeter giving an enthusiastic shout! Pictured below in Figure 6, the cluster for the Notre Dame cathedral, while still not particularly viable, did end up being a little more accurate.

Figure 6. Input images and output results!

Here, we at least received an actual image of a cathedral from both the image-only and text+image prompts, albeit an interior shot. I am not sure whether the brick building in the latter result is a church, school building, or large mansion, but the Eye of Sauron definitely does not belong in the results!

During writing, I found Microsoft's Visual ChatGPT (now renamed TaskMatrix) and was able to test it as well using the same image clusters. This time around I got surprisingly better results! I looked at one of the image clusters containing various depictions of people surfing or hanging out on a beach with surfboards. One of the images from this cluster was held back at random as an expectation of what we should get back, while the rest were passed to TaskMatrix to generate a new, similar image. In Figure 7 below, we can see that the result still does not realistically line up with our expectations given the input images. Most of the input images had black bars along the top and bottom, which were interpreted as having some level of importance and included in the final image, albeit with some apparent creative freedom taken in regard to their coloring. We can also see that the returned image is heavily derived from the first provided image, keeping the watermark in the bottom left, the background, and a modified version of the surfboard. This test highlights some of the logic and shortcomings behind TaskMatrix: it does not actually generate new images from scratch but rather attempts to modify and replace details of the provided images. The same can be noted in the TaskMatrix-generated images of a leaf pile and the Eiffel Tower (Figs. 8 and 9). Here, it seems that TaskMatrix took a bite out of the front of the surfboard and then 'hallucinated' the details it could not properly reason through. Most notably, it tried to replace the surfer with what appears to be an abstract cartoon or color splotch in a crouched/standing pose similar to those in the other provided images. You can still see a bit of where it tried to hide the surfer's legs but had difficulty with the details on the left of the image, replacing the surfer's ankles and the surfboard's leash with a green and brown blob of color.

Figure 7. TaskMatrix image generation test using existing images
 

In Figure 8, we can see the TaskMatrix prompt and the associated inputs and results, where TaskMatrix provided much better output for both the baseball and Eiffel Tower image clusters. The generated baseball image did not have any players on the field, but it did at least display a passable image of a baseball field and stadium. The Eiffel Tower image probably came out the best of the four sets, and it also looks the least tweaked and the most purely synthesized composite of all the tested clusters.
 
Figure 8. TaskMatrix generated images for a baseball game and the Eiffel Tower

Unfortunately, TaskMatrix still had trouble with the image clusters of children playing in leaf piles and the Notre Dame cathedral. For the children's cluster, it generated only a pile of leaves with no children playing (Fig. 9), though this generated image is a much closer approximation of the input images than those provided by CLIP. Surprisingly, and opposite to CLIP, TaskMatrix gave its least relevant result for the Notre Dame cathedral image cluster, generating an image that is more "grayscale painting" than "realistic photograph" (Fig. 9).
 
Figure 9. TaskMatrix generated images for children playing in leaves and the Notre Dame cathedral
 
By and large, TaskMatrix proves to be a much more accurate and robust system for our purposes, but this comes at the cost of using OpenAI's paid API service for the prompt calculations. In total, these four tests cost only thirty-nine cents, but replicating this at scale would be an astronomical cost burden. The other downside to utilizing TaskMatrix (and GPT agents in general) is that their primary access through a web portal presents a programmatic barrier. TaskMatrix uses a service called Gradio to handle its web prompt, which luckily provides a programmatic API (Fig. 10). This workaround is great, but it definitely feels a bit hacky compared to using a direct API.

Figure 10. Where there's a will (or an API), there's a way!
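For the curious, driving a Gradio-hosted agent such as TaskMatrix programmatically might look something like the sketch below. This assumes a locally hosted instance at http://localhost:7860, and the fn_index and argument layout are placeholders to be checked against whatever endpoints the app actually exposes; it is not a reproduction of the exact calls used here.

```python
# Minimal sketch of programmatic access to a Gradio-hosted agent such as
# TaskMatrix. The URL, fn_index, and prompt below are assumptions; inspect
# the running app (e.g., via client.view_api()) to find the real endpoints.
from gradio_client import Client

client = Client("http://localhost:7860/")  # assumed local TaskMatrix instance

# Print the endpoints the app actually exposes before calling anything.
client.view_api()

# Hypothetical call: ask the agent to synthesize a similar image for the
# missing slot, based on the surviving cluster images uploaded beforehand.
result = client.predict(
    "Generate a new image similar to the uploaded surfing photos.",
    fn_index=0,  # placeholder endpoint index
)
print(result)
```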

The novelty of such tasks is interesting, but their efficacy and overall utility have yet to be determined, especially given the tumultuous and ever-improving results from the collective AI agent community. Many agents also have the capability to render images in a particular artist's style. This might be a useful way to watermark or highlight that these are generated images that were not originally part of the archived page, without breaking immersion as much as traditional web archive replay interfaces do. This might even be integrated more natively into a browser eventually, carrying on work from Abigail Mabe on a Memento-Aware Chromium browser. Further work by Harvard's Library Innovation Lab is looking at generating entire web archives using ChatGPT. Given an AI's potential ability to learn base64, one day an AI might be able to faithfully generate entire archives without the need for any image inputs at all. With web archives being a crucial tool for research, politics, and understanding a more "authentic" past, the ability to fabricate digital history is a bit terrifying in an already post-truth era. This also begs the question of how we will train future artificial intelligence systems when we no longer have any reliable ground truth on which to base the accuracy of our information. The world is watching the evolution of AI agents with great interest; we don't know exactly where they might take us, but I am hoping they lead not only to a better future but also to a more complete past, so that we may better know ourselves.

- David Calano