2024-01-16: ALICE - AI Leveraged Information Capture and Exploration

Organizational presentations often carry a wealth of data: images, charts, and text that tell a compelling story or present findings. Programs and projects routinely rely on meetings both to address problems as they arise and to disseminate information about our work. However, the information contained in the slide decks associated with these meetings and presentations tends to get lost after the fact. One cause of this loss may be that slides capture only a sparse record of what is actually said during these meetings.

In this blog post, I introduce the ALICE (AI Leveraged Information Capture and Exploration) system, my proposed solution for capturing and managing knowledge and data from sources that, historically, have not been comprehensively archived. The primary challenge addressed by ALICE is the prevalent issue of information loss in presentation slide decks following their related meetings or presentations. The system is designed to methodically extract text and visual elements from slides, employing a large language model (LLM) such as OpenAI's GPT-4 to convert this information into structured, machine-readable formats. The aim is not only to preserve critical data but also to enhance it with comprehensive abstracts, relevant search terms, and a structured JSON for Linked Data (JSON-LD) representation for effective integration into knowledge graphs. This post will explore the intricacies of ALICE and its potential to redefine the management and interpretation of presentation data within NASA. First, I'll detail the proposed ALICE system, particularly the integration of LLMs with a Knowledge Graph (KG) to enhance the LLM's reasoning over unstructured data. Then, I'll discuss the technical aspects of how this system operates, including the text and image extraction scripts, front-end user interface, server environment, and LLM integration.

While humans can visually interpret slide content, converting that information into structured, machine-readable formats presents a challenge — as does autonomously expanding on the sparse content of the slides themselves. Essentially, we want a way to effortlessly provide a slide deck and have a system "fill in the blanks," producing a verifiably accurate description of what was presented relative to the slide deck in question. The mere existence of PowerPoint Karaoke points to the need for this capability.

Furthermore, presentation slide decks tend not to be archived. We routinely send the slides out in an email to some list of presentation participants after the fact, but we rarely commit them to an easily indexed and searchable archive location. Previous work has been done on aligning presentations with scholarly publications, such as SlideSeer (JCDL '07 proceedings) by Dr. Min-Yen Kan. SlideSeer is a handy tool for researchers and academics who often share their work in two ways: as written papers and as slideshow presentations. SlideSeer discovers scholarly papers that match the content of slides in a slide deck and presents them together. This way, you get to see all the information in one place, whether it's something new or something repeated in both formats. ALICE differs in that, at least initially, it processes slides for any kind of communication by reasoning over their content and indexing it into a knowledge graph. This captures not only scholarly communication but also presentations and meeting slides for various project and program meetings.

Finally, for many projects, a slide deck is the final deliverable. The deck is the product. There is no accompanying paper, so we have no source representing a more detailed, more complete record. In many cases, these slide decks contain information from numerous data sources rather than summarizing a single source, such as a slide deck for a presentation on a research paper. For example, the majority of slide decks in the Military Industrial Powerpoint Complex archive contain information from various briefings and information sessions that have no singular source.

In this blog post, we'll explore how you can efficiently scrape both text and visual elements from a presentation slide deck and then harness the capabilities of a large language model to derive meaningful abstracts, extended abstracts, search terms, hashtags, and even create a structured JSON-LD representation for each slide suitable for integration into knowledge graph software. I have proposed the ALICE system to be developed for use at NASA as part of our digital transformation and knowledge management efforts. 

The Proposed ALICE System

Knowledge Graph

The main purpose of this proposed system is to marry LLMs with a structural representation of knowledge as a cooperative architecture between two models — the LLM and a Knowledge Graph (KG). LLMs excel at reasoning over unstructured data. However, they could benefit from leveraging a greater understanding of structured data. Combining these two structures could benefit both and create a system that can both be queried and verified for alignment between the user's prompt and the LLM's output. Chaudri et al. define a knowledge graph as "a directed labeled graph in which domain-specific meanings are associated with nodes and edges. A node could represent any real-world entity, for example, people, company, computer, etc. An edge label captures the relationship of interest between the two nodes, for example, a friendship relationship between two people, a customer relationship between a company and person, or a network connection between two computers, etc."

A knowledge graph created using entity and relation extraction (fig 3 in Chaudri et al.)

A knowledge graph created using computer vision techniques (fig 4 in Chaudri et al.)


Figures 3 and 4 in Chaudri et al. hint at using an LLM's understanding of natural language or semantic scene understanding in images to generate the entity-relation-entity structure that is needed to build out a KG. Using the KG to improve the LLM, however, is a less clear endeavor. In Unifying Large Language Models and Knowledge Graphs: A Roadmap, Pan et al. explain that "KGs can enhance LLMs by providing external knowledge for inference and interpretability. Meanwhile, KGs are difficult to construct and evolve by nature, which challenges the existing methods in KGs to generate new facts and represent unseen knowledge. Therefore, it is complementary to unify LLMs and KGs together and simultaneously leverage their advantages." They go on to describe three different frameworks for integrating LLMs with KGs:

  1. KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or to enhance understanding of the knowledge learned by LLMs
  2. LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering
  3. Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge
My work with ALICE seeks to explore building the third framework, Synergized LLMs + KGs. I believe that by building this kind of framework we can scrutinize an LLM's output against its knowledge graph while also enhancing the LLM by providing greater context to both a user's prompts and the responses it generates. Additionally, I will explore how providing structural knowledge to an LLM affects its tendency to hallucinate facts, hopefully alleviating that problem in some way.

To achieve this, a Knowledge Graph software module accesses JSON-LD data to create visual, interconnected graphs representing the information. Users can then interact with this Knowledge Graph through the front-end interface, searching for specific information nodes or traversing the relationships established between various data points.
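
To make that concrete, here is a minimal sketch of the ingestion step, assuming Python with rdflib 6+ (which bundles a JSON-LD parser). The file name presentation.jsonld and the SPARQL query are illustrative placeholders, not the actual ALICE module.

    # Hypothetical ingestion sketch for the Knowledge Graph module.
    from rdflib import Graph

    kg = Graph()
    # Load the LLM-generated JSON-LD; each statement becomes a
    # (subject, predicate, object) edge in the graph.
    kg.parse("presentation.jsonld", format="json-ld")

    # Example structured query the front end could run: "who presented what?"
    results = kg.query("""
        PREFIX schema: <http://schema.org/>
        SELECT ?event ?person WHERE {
            ?event a schema:Event ;
                   schema:performer ?p .
            ?p schema:name ?person .
        }
    """)
    for row in results:
        print(row.event, row.person)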

Front-end user interface

The front end is a web-based application where users can upload slide decks. Once uploaded, the files are sent to the server via a secure API.
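
As a rough illustration of that upload path, here is a bare-bones endpoint sketch assuming FastAPI (with python-multipart installed for form parsing); the route, storage location, and omitted authentication are placeholders, not the actual ALICE API.

    # Hypothetical upload endpoint sketch (not the actual ALICE API).
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI()

    @app.post("/api/upload")
    async def upload_deck(deck: UploadFile = File(...)):
        # Persist the uploaded .pptx so the extraction script can process it.
        contents = await deck.read()
        with open(f"uploads/{deck.filename}", "wb") as f:
            f.write(contents)
        return {"status": "queued", "filename": deck.filename}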

Server environment

On the server, a Python script uses the python-pptx library to extract images, charts, and text data from the uploaded presentation.

LLM integration

The raw extracted data is then passed on to an LLM, such as OpenAI's GPT-4, via API calls that perform the abstracting, extension, and term-generation processes. The output of the LLM, consisting of the abstract, extended abstract, search terms, hashtags, and a JSON-LD KG description, is stored in a database system.
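
For a sense of what that step looks like in code, here is a minimal sketch using the openai Python package (v1+); the model name, prompt wording, and file path are placeholders for whatever the deployed system actually uses.

    # Minimal sketch of one LLM call (abstract generation); the extended
    # abstract, hashtags, and JSON-LD outputs would use analogous prompts.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("output/text_output.txt") as f:
        slide_text = f.read()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You summarize presentation slide decks."},
            {"role": "user", "content": (
                "Please generate an abstract for the following text which was "
                "scraped from a presentation slide deck:\n\n" + slide_text
            )},
        ],
    )
    print(response.choices[0].message.content)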

Breaking Down the Code

At the heart of our solution lies the ppt-scraper.py script, which leverages the python-pptx library to traverse the slides, extracting the text and any images and charts on each one. I'll step through examples of each step using a presentation I gave at AIAA SciTech 2021. This presentation was retrieved from the NASA Technical Reports Server (NTRS), which will also serve as the source for the dataset I will use to fine-tune my own LLM, a topic I'll cover in my next blog post.

  • Text Extraction - extract_text_from_pptx: This function delves into each slide and grabs its text content. It outputs a list of strings, each prefixed with the slide number for easy referencing. (A minimal sketch of both extraction functions appears after this list.)

    Slide 1: James Ecker, Benjamin Kelley, Danette Allen

    AIAA SciTech, January 2021

    Slide 1: Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions

    Slide 2: Computer Vision During In-Space Assembly

    Slide 2: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 2: 2

    Slide 2: January 2021

    Slide 2: SciTech 2021

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 4: Computer Vision During In-Space Assembly

    Slide 4: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 4: 4

    Slide 4: January 2021

    Slide 4: SciTech 2021

    Slide 4: High Degree of Variation Requires More Information

    Slide 4: More Sensors = More Information = More Energy Use & More Mass

    Slide 5: Mitigating the Constraints

    Slide 5: 5

    Slide 5: January 2021

    Slide 5: SciTech 2021

    Slide 5: High Degree of Variation Requires More Information

    Slide 5: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6: Mitigating the Constraints

    Slide 6: 6

    Slide 6: January 2021

    Slide 6: SciTech 2021

    Slide 6: High Degree of Variation Requires More Information

    Slide 6: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6:

    Slide 6: Maximize

    Slide 7: Mitigating the Constraints

    Slide 7: 7

    Slide 7: January 2021

    Slide 7: SciTech 2021

    Slide 7: High Degree of Variation Requires More Information

    Slide 7: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 7:

    Slide 7:

    Slide 7:

    Slide 7: Maximize

    Slide 7: Minimize

    Slide 8: Mitigating the Constraints

    Slide 8: 8

    Slide 8: January 2021

    Slide 8: SciTech 2021

    Slide 8: High Degree of Variation Requires More Information

    Slide 8: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 8:

    Slide 8:

    Slide 8:

    Slide 8: Maximize

    Slide 8: Minimize

    Slide 8: Predict 3D Mesh from Single View

    Slide 9: Mitigating the Constraints

    Slide 9: 9

    Slide 9: January 2021

    Slide 9: SciTech 2021

    Slide 9: High Degree of Variation Requires More Information

    Slide 9: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 9:

    Slide 9:

    Slide 9:

    Slide 9: Maximize

    Slide 9: Minimize

    Slide 9: Single Camera

    Slide 9: Predict 3D Mesh from Single View

    Slide 10: Related Work

    Slide 10: [1] He et al – Mask R-CNN

    [2] Gkioxari et al – Mesh R-CNN

    [3] Sonawani et al - Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks

    [4] Pal et al - 3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation

    [5] Valsesia et al - Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks

    [6] Ramasinghe et al - Spectral-GANS for High Resolution 3D Point Cloud Generation

    Slide 10: 10

    Slide 10: January 2021

    Slide 10: SciTech 2021

    Slide 11: Synthesizing Data

    Slide 11: 3D model of objects projected over 3D background in Blender

    Can be extended to full simulation environments

    ROS, Gazebo, Mujoco, etc

    Variations in observations

    Orientation of camera and light source

    Relative orientation between objects

    Number of objects in scene

    Background

    Sim to Reality Problem

    Domain Randomization

    Slide 11: 11

    Slide 11: January 2021

    Slide 11: SciTech 2021

    Slide 12: Metadata for Training Mask R-CNN

    Slide 12: 12

    Slide 12: January 2021

    Slide 12: SciTech 2021

    Slide 13: Metadata for Training Mesh R-CNN

    Slide 13: 13

    Slide 13: January 2021

    Slide 13: SciTech 2021

    Slide 14: Building the Dataset

    Slide 14: Generate a parent pool of data

    20,000 image/metadata pairs

    For each sample generated

    Extract/Calculate ground truth to build metadata

    Configure metadata to conform to model

    Sample training set from parent pool

    Sample –n (default: 1500) instances from parent pool randomly

    Split into training and validation sets

    --training-split (default:0.75) / 1 - --training-split (default: 1 – 0.75 = 0.25)

    Merge all training/validation set metadata into one JSON, respectively

    Slide 14: January 2021

    Slide 14: SciTech 2021

    Slide 14: 14

    Slide 15: Mask Prediction – Mask R-CNN

    Slide 15: Backbone

    Resnet-50-FPN

    Region Proposal Network

    Applies a sliding window over a convolutional feature map to generate proposed bounding boxes for likely objects

    Proposed regions are aligned to the feature map and sent to fully connected layer to classify a bounding box (regressor) and the object itself (soft max)

    Mask Prediction

    Generate a binary mask for pixels in proposed region using the aligned features

    Slide 15: January 2021

    Slide 15: SciTech 2021

    Slide 15: 15

    Slide 15: The Mask R-CNN Architecture [1]

    Slide 16: Mask Prediction – Mask R-CNN Advantages

    Slide 16: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 16: January 2021

    Slide 16: SciTech 2021

    Slide 16: 16

    Slide 16: The Mask R-CNN Architecture [1]

    Slide 17: Mask Prediction – Mask R-CNN Advantages

    Slide 17: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 17: January 2021

    Slide 17: SciTech 2021

    Slide 17: 17

    Slide 17: Masks provide a measure of visual explainability

    Slide 17: The Mask R-CNN Architecture [1]

    Slide 18: Mask Prediction – Mask R-CNN Performance

    Slide 18: 98% instance segmentation accuracy

    object

    bounding box

    mask

    Slide 18: January 2021

    Slide 18: SciTech 2021

    Slide 18: 18

    Slide 19: Mesh Prediction

    Slide 19: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Mesh Refinement

    Slide 19: January 2021

    Slide 19: SciTech 2021

    Slide 19: 19

    Slide 19: The Mesh R-CNN Architecture [2]

    Slide 20: Mesh Prediction

    Slide 20: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    Slide 20: January 2021

    Slide 20: SciTech 2021

    Slide 20: 20

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20: The Mesh R-CNN Architecture Voxel Branch[2]

    Slide 21: Mesh Prediction

    Slide 21: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    2 passes

    Vertex alignment

    Graph convolution

    Vertex refinement

    Slide 21: January 2021

    Slide 21: SciTech 2021

    Slide 21: 21

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21: The Mesh R-CNN Architecture Mesh Refinement Branch[2]

    Slide 22: Metrics

    Slide 22: January 2021

    Slide 22: SciTech 2021

    Slide 22: 22

    Slide 22:

    Slide 22:

    Slide 22:

    Slide 23: Results

    Slide 23:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Mesh R-CNN trained on Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 23: January 2021

    Slide 23: SciTech 2021

    Slide 23: 23

    Slide 23:

    Slide 24: Results

    Slide 24:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 24: January 2021

    Slide 24: SciTech 2021

    Slide 24: 24

    Slide 24:

    Slide 24: 2 Quadro 6000 RTX GPUs

    Slide 24: 8 Tesla V100 GPUS

    Slide 24: Requires hyperparameter tuning specific to hardware configuration

    Slide 25: Conclusion

    Slide 25: Generated a synthetic dataset capable of training state of the art 2D mask and 3D mesh prediction models

    Can train each model end-to-end from no data to trained model

    Future work

    Hyperparameter tuning

    Further domain randomization

    Randomize object’s rendered skin

    Extending Mesh R-CNN to use a Generative Adversarial Network to generate point clouds instead of voxel model

    Higher resolution 3D mesh prediction

    Slide 25: January 2021

    Slide 25: SciTech 2021

    Slide 25: 25

    Slide 25:

    Slide 26: References

    Slide 26: He, K., Gkioxari, G., Dollár, P., and Girshick, R., “Mask R-CNN,”2017 IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.322.

    Gkioxari, G., Johnson, J., and Malik, J., “Mesh R-CNN,”2019 IEEE/CVF International Conference on Computer Vision(ICCV), 2019, pp. 9784–9794. https://doi.org/10.1109/ICCV.2019.00988.

    Sonawani, S. D., Alimo, R., Detry, R., Jeong, D., Hess, A., and Amor, H. B., “Assistive Relative Pose Estimation for On- orbitAssembly using Convolutional Neural Networks,”ArXiv, Vol. abs/2001.10673, 2020.

    Pal, B., Khaiyum, S., and Kumaraswamy, Y. S., “3D point cloud generation from 2D depth camera images using successivetriangulation,”2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017, pp. 129–133.https://doi.org/10.1109/ICIMIA.2017.7975586.

    Valsesia, D., Fracastoro, G., and Magli, E., “Learning Localized Representations of Point Clouds with Graph- ConvolutionalGenerative Adversarial Networks,”IEEE Transactions on Multimedia, 2019.

    Ramasinghe, S., Khan, S. H., Barnes, N., and Gould, S., “Spectral-GANs for High-Resolution 3D Point-cloud Generation,”CoRR, Vol. abs/1912.01800, 2019. URL http://arxiv.org/abs/1912.01800.

    Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., “Domain Randomization for Transferring Deep NeuralNetworks from Simulation to the Real World,”CoRR, Vol. abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.

    He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,”CoRR, Vol. abs/1512.03385, 2015.URL http://arxiv.org/abs/1512.03385.

    Slide 26: January 2021

    Slide 26: SciTech 2021

    Slide 26: 26

    Slide 26:

  • Image and Chart Extraction - extract_images_and_charts_from_pptx: This function extracts images and charts from each slide and differentiates between visual elements like pictures and charts. It also drills down into group shapes, ensuring that nested content isn't missed (see the sketch after this list). Here I have ChatGPT reasoning over output/slide_3/image_6.jpg:
    Slide 3

    What is going on in this image?

    image_6

    The image shows a graphical representation or simulation of a space environment where two robotic elements appear to be engaging in an in-space assembly task. There's a satellite or part of a spacecraft on the left side, which has an extended arm or appendage, possibly for manipulation or docking purposes. On the right, there's another component that seems to be floating freely or in the process of being captured or released. Below these elements, there is a view of Earth's clouds, suggesting that this scene is set in low Earth orbit. This scenario is likely a visualization or a simulated mock-up used for the planning and testing of robotic missions in space, particularly for the assembly of structures or the servicing of satellites. It may be related to computer vision systems designed to handle the challenges of space operations, such as those discussed in the presentation slides you provided earlier.

    And here it is doing the same thing over output/slide_18/image_30.jpg:
    Slide 18

    What is going on in this image?

    image_30

    The image seems to be a screenshot from a computer vision system or a simulation tool related to in-space robotic operations, similar in context to the content described in the presentation slides you shared earlier. It appears to show an annotated view of a robotic interaction with different parts during an assembly or maintenance task in space.

    Two objects are highlighted with bounding boxes and labeled with confidence scores—“nut 0.999” and “pole 0.999”—indicating that the computer vision system has identified these components with a very high degree of confidence. The black sphere may be a part of the simulation for spatial reasoning or could represent an object to be manipulated.

    The satellite or spacecraft part with solar panels is possibly a part of the simulated environment, providing context for the assembly task. The confidence scores suggest that machine learning models, such as those mentioned in the presentation (e.g., Mask R-CNN or Mesh R-CNN), are likely being used to identify and possibly predict the position of these components during the robotic mission. This type of technology is crucial for autonomous or semi-autonomous robotic systems in space, where precise identification and manipulation of various objects are necessary.

    The next version of ppt-scraper.py will have the option to inject the LLM's generated description of each image into the text description to enrich our text prompts. For example:

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 3: Image_8: The image shows a graphical representation or simulation of a space environment where two robotic elements appear to be engaging in an in-space assembly task. There's a satellite or part of a spacecraft on the left side, which has an extended arm or appendage, possibly for manipulation or docking purposes. On the right, there's another component that seems to be floating freely or in the process of being captured or released. Below these elements, there is a view of Earth's clouds, suggesting that this scene is set in low Earth orbit. This scenario is likely a visualization or a simulated mock-up used for the planning and testing of robotic missions in space, particularly for the assembly of structures or the servicing of satellites. It may be related to computer vision systems designed to handle the challenges of space operations, such as those discussed in the presentation slides you provided earlier.

  • After these scripts are run, the output directory will have a structure similar to this:
    .
    ├── images
    │   ├── slide_1
    │   │   └── image_0.png
    │   ├── slide_2
    │   │   ├── image_1.png
    │   │   ├── image_2.png
    │   │   ├── image_3.png
    │   │   └── image_4.png
    │   ├── slide_3
    │   │   ├── image_5.png
    │   │   ├── image_6.png
    │   │   ├── image_7.png
    │   │   └── image_8.png
    │   ├── slide_4
    │   │   ├── image_9.png
    │   │   ├── image_10.png
    │   │   ├── image_11.png
    │   │   └── image_12.png
    │   ├── slide_5
    │   │   └── image_13.png
    │   ├── slide_6
    │   │   └── image_14.png
    │   ├── slide_7
    │   │   └── image_15.png
    │   ├── slide_8
    │   │   └── image_16.png
    │   ├── slide_9
    │   │   └── image_17.png
    │   ├── slide_10
    │   │   └── image_18.png
    │   ├── slide_11
    │   │   ├── image_19.png
    │   │   └── image_20.png
    │   ├── slide_12
    │   │   ├── image_21.png
    │   │   └── image_22.png
    │   ├── slide_13
    │   │   ├── image_23.png
    │   │   └── image_24.png
    │   ├── slide_15
    │   │   └── image_25.png
    │   ├── slide_16
    │   │   └── image_26.png
    │   ├── slide_17
    │   │   └── image_27.png
    │   ├── slide_18
    │   │   ├── image_28.png
    │   │   ├── image_29.png
    │   │   ├── image_30.png
    │   │   └── image_31.png
    │   ├── slide_19
    │   │   └── image_32.png
    │   ├── slide_20
    │   │   └── image_33.png
    │   ├── slide_21
    │   │   ├── image_34.png
    │   │   └── image_35.png
    │   ├── slide_22
    │   │   ├── image_36.png
    │   │   ├── image_37.png
    │   │   └── image_38.png
    │   └── slide_23
    │       ├── image_39.png
    │       └── image_40.png
    └── text_output.txt
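
Below is a minimal sketch of the two extraction passes described above, assuming the deck lives at presentation.pptx and output is written under ./output/. The function names mirror the ones discussed, but the bodies are illustrative rather than the exact code in ppt-scraper.py (chart extraction, for instance, is omitted here).

    # Illustrative sketch of the extraction functions, not the exact ppt-scraper.py code.
    import os
    from pptx import Presentation
    from pptx.enum.shapes import MSO_SHAPE_TYPE

    def extract_text_from_pptx(path):
        """Return a list of strings, each prefixed with its slide number."""
        lines = []
        for idx, slide in enumerate(Presentation(path).slides, start=1):
            for shape in slide.shapes:
                if shape.has_text_frame and shape.text_frame.text.strip():
                    lines.append(f"Slide {idx}: {shape.text_frame.text}")
        return lines

    def extract_images_and_charts_from_pptx(path, out_dir="output/images"):
        """Walk each slide, recursing into group shapes, and save any pictures."""
        count = 0

        def walk(shapes, slide_dir):
            nonlocal count
            for shape in shapes:
                if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
                    walk(shape.shapes, slide_dir)  # nested content inside groups
                elif shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                    os.makedirs(slide_dir, exist_ok=True)
                    fname = os.path.join(slide_dir, f"image_{count}.{shape.image.ext}")
                    with open(fname, "wb") as f:
                        f.write(shape.image.blob)
                    count += 1

        for idx, slide in enumerate(Presentation(path).slides, start=1):
            walk(slide.shapes, os.path.join(out_dir, f"slide_{idx}"))

    if __name__ == "__main__":
        print("\n".join(extract_text_from_pptx("presentation.pptx")))
        extract_images_and_charts_from_pptx("presentation.pptx")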

From Raw Data to Knowledge

Once the data extraction is complete, the real magic begins. Leveraging a powerful AI language model, like OpenAI's GPT models, we can generate:

  • Abstracts: Using the extracted text, the model can provide a concise summary capturing the essence of the presentation.

    Please generate an abstract for the following text which was scraped from a presentation slide deck:

    Slide 1: James Ecker, Benjamin Kelley, Danette Allen

    AIAA SciTech, January 2021

    Slide 1: Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions

    Slide 2: Computer Vision During In-Space Assembly

    Slide 2: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 2: 2

    Slide 2: January 2021

    Slide 2: SciTech 2021

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 4: Computer Vision During In-Space Assembly

    Slide 4: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 4: 4

    Slide 4: January 2021

    Slide 4: SciTech 2021

    Slide 4: High Degree of Variation Requires More Information

    Slide 4: More Sensors = More Information = More Energy Use & More Mass

    Slide 5: Mitigating the Constraints

    Slide 5: 5

    Slide 5: January 2021

    Slide 5: SciTech 2021

    Slide 5: High Degree of Variation Requires More Information

    Slide 5: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6: Mitigating the Constraints

    Slide 6: 6

    Slide 6: January 2021

    Slide 6: SciTech 2021

    Slide 6: High Degree of Variation Requires More Information

    Slide 6: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6:

    Slide 6: Maximize

    Slide 7: Mitigating the Constraints

    Slide 7: 7

    Slide 7: January 2021

    Slide 7: SciTech 2021

    Slide 7: High Degree of Variation Requires More Information

    Slide 7: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 7:

    Slide 7:

    Slide 7:

    Slide 7: Maximize

    Slide 7: Minimize

    Slide 8: Mitigating the Constraints

    Slide 8: 8

    Slide 8: January 2021

    Slide 8: SciTech 2021

    Slide 8: High Degree of Variation Requires More Information

    Slide 8: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 8:

    Slide 8:

    Slide 8:

    Slide 8: Maximize

    Slide 8: Minimize

    Slide 8: Predict 3D Mesh from Single View

    Slide 9: Mitigating the Constraints

    Slide 9: 9

    Slide 9: January 2021

    Slide 9: SciTech 2021

    Slide 9: High Degree of Variation Requires More Information

    Slide 9: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 9:

    Slide 9:

    Slide 9:

    Slide 9: Maximize

    Slide 9: Minimize

    Slide 9: Single Camera

    Slide 9: Predict 3D Mesh from Single View

    Slide 10: Related Work

    Slide 10: [1] He et al – Mask R-CNN

    [2] Gkioxari et al – Mesh R-CNN

    [3] Sonawani et al - Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks

    [4] Pal et al - 3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation

    [5] Valsesia et al - Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks

    [6] Ramasinghe et al - Spectral-GANS for High Resolution 3D Point Cloud Generation

    Slide 10: 10

    Slide 10: January 2021

    Slide 10: SciTech 2021

    Slide 11: Synthesizing Data

    Slide 11: 3D model of objects projected over 3D background in Blender

    Can be extended to full simulation environments

    ROS, Gazebo, Mujoco, etc

    Variations in observations

    Orientation of camera and light source

    Relative orientation between objects

    Number of objects in scene

    Background

    Sim to Reality Problem

    Domain Randomization

    Slide 11: 11

    Slide 11: January 2021

    Slide 11: SciTech 2021

    Slide 12: Metadata for Training Mask R-CNN

    Slide 12: 12

    Slide 12: January 2021

    Slide 12: SciTech 2021

    Slide 13: Metadata for Training Mesh R-CNN

    Slide 13: 13

    Slide 13: January 2021

    Slide 13: SciTech 2021

    Slide 14: Building the Dataset

    Slide 14: Generate a parent pool of data

    20,000 image/metadata pairs

    For each sample generated

    Extract/Calculate ground truth to build metadata

    Configure metadata to conform to model

    Sample training set from parent pool

    Sample –n (default: 1500) instances from parent pool randomly

    Split into training and validation sets

    --training-split (default:0.75) / 1 - --training-split (default: 1 – 0.75 = 0.25)

    Merge all training/validation set metadata into one JSON, respectively

    Slide 14: January 2021

    Slide 14: SciTech 2021

    Slide 14: 14

    Slide 15: Mask Prediction – Mask R-CNN

    Slide 15: Backbone

    Resnet-50-FPN

    Region Proposal Network

    Applies a sliding window over a convolutional feature map to generate proposed bounding boxes for likely objects

    Proposed regions are aligned to the feature map and sent to fully connected layer to classify a bounding box (regressor) and the object itself (soft max)

    Mask Prediction

    Generate a binary mask for pixels in proposed region using the aligned features

    Slide 15: January 2021

    Slide 15: SciTech 2021

    Slide 15: 15

    Slide 15: The Mask R-CNN Architecture [1]

    Slide 16: Mask Prediction – Mask R-CNN Advantages

    Slide 16: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 16: January 2021

    Slide 16: SciTech 2021

    Slide 16: 16

    Slide 16: The Mask R-CNN Architecture [1]

    Slide 17: Mask Prediction – Mask R-CNN Advantages

    Slide 17: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 17: January 2021

    Slide 17: SciTech 2021

    Slide 17: 17

    Slide 17: Masks provide a measure of visual explainability

    Slide 17: The Mask R-CNN Architecture [1]

    Slide 18: Mask Prediction – Mask R-CNN Performance

    Slide 18: 98% instance segmentation accuracy

    object

    bounding box

    mask

    Slide 18: January 2021

    Slide 18: SciTech 2021

    Slide 18: 18

    Slide 19: Mesh Prediction

    Slide 19: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Mesh Refinement

    Slide 19: January 2021

    Slide 19: SciTech 2021

    Slide 19: 19

    Slide 19: The Mesh R-CNN Architecture [2]

    Slide 20: Mesh Prediction

    Slide 20: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    Slide 20: January 2021

    Slide 20: SciTech 2021

    Slide 20: 20

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20: The Mesh R-CNN Architecture Voxel Branch[2]

    Slide 21: Mesh Prediction

    Slide 21: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    2 passes

    Vertex alignment

    Graph convolution

    Vertex refinement

    Slide 21: January 2021

    Slide 21: SciTech 2021

    Slide 21: 21

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21: The Mesh R-CNN Architecture Mesh Refinement Branch[2]

    Slide 22: Metrics

    Slide 22: January 2021

    Slide 22: SciTech 2021

    Slide 22: 22

    Slide 22:

    Slide 22:

    Slide 22:

    Slide 23: Results

    Slide 23:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Mesh R-CNN trained on Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 23: January 2021

    Slide 23: SciTech 2021

    Slide 23: 23

    Slide 23:

    Slide 24: Results

    Slide 24:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 24: January 2021

    Slide 24: SciTech 2021

    Slide 24: 24

    Slide 24:

    Slide 24: 2 Quadro 6000 RTX GPUs

    Slide 24: 8 Tesla V100 GPUS

    Slide 24: Requires hyperparameter tuning specific to hardware configuration

    Slide 25: Conclusion

    Slide 25: Generated a synthetic dataset capable of training state of the art 2D mask and 3D mesh prediction models

    Can train each model end-to-end from no data to trained model

    Future work

    Hyperparameter tuning

    Further domain randomization

    Randomize object’s rendered skin

    Extending Mesh R-CNN to use a Generative Adversarial Network to generate point clouds instead of voxel model

    Higher resolution 3D mesh prediction

    Slide 25: January 2021

    Slide 25: SciTech 2021

    Slide 25: 25

    Slide 25:

    Slide 26: References

    Slide 26: He, K., Gkioxari, G., Dollár, P., and Girshick, R., “Mask R-CNN,”2017 IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.322.

    Gkioxari, G., Johnson, J., and Malik, J., “Mesh R-CNN,”2019 IEEE/CVF International Conference on Computer Vision(ICCV), 2019, pp. 9784–9794. https://doi.org/10.1109/ICCV.2019.00988.

    Sonawani, S. D., Alimo, R., Detry, R., Jeong, D., Hess, A., and Amor, H. B., “Assistive Relative Pose Estimation for On- orbitAssembly using Convolutional Neural Networks,”ArXiv, Vol. abs/2001.10673, 2020.

    Pal, B., Khaiyum, S., and Kumaraswamy, Y. S., “3D point cloud generation from 2D depth camera images using successivetriangulation,”2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017, pp. 129–133.https://doi.org/10.1109/ICIMIA.2017.7975586.

    Valsesia, D., Fracastoro, G., and Magli, E., “Learning Localized Representations of Point Clouds with Graph- ConvolutionalGenerative Adversarial Networks,”IEEE Transactions on Multimedia, 2019.

    Ramasinghe, S., Khan, S. H., Barnes, N., and Gould, S., “Spectral-GANs for High-Resolution 3D Point-cloud Generation,”CoRR, Vol. abs/1912.01800, 2019. URL http://arxiv.org/abs/1912.01800.

    Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., “Domain Randomization for Transferring Deep NeuralNetworks from Simulation to the Real World,”CoRR, Vol. abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.

    He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,”CoRR, Vol. abs/1512.03385, 2015.URL http://arxiv.org/abs/1512.03385.

    Slide 26: January 2021

    Slide 26: SciTech 2021

    Slide 26: 26

    Slide 26:

    This presentation, given by James Ecker, Benjamin Kelley, and Danette Allen at the AIAA SciTech in January 2021, explores the challenges and advancements in computer vision for in-space assembly through multi-agent robotic missions. It specifically delves into the generation of synthetic data for 3D mesh prediction and spatial reasoning, highlighting the difficulties encountered due to factors like illumination, angle, orientation, and movement, and the constraints of energy and mass.

    The presentation outlines how computer vision faces a high degree of variation during in-space assembly, necessitating more sensors and information, which in turn increases energy use and mass. It proposes strategies for mitigating these constraints, such as maximizing efficiency while minimizing energy use and mass, and emphasizes the importance of predicting 3D mesh from a single view.

    Significant attention is given to the development of synthetic datasets capable of training state-of-the-art 2D mask and 3D mesh prediction models. This includes using Blender for 3D model projection, domain randomization, and training models like Mask R-CNN and Mesh R-CNN. The presentation also reviews related work in the field and compares the performance of their methods with existing approaches.

    In conclusion, the presentation underscores the importance of hyperparameter tuning, further domain randomization, and the potential of extending Mesh R-CNN with Generative Adversarial Networks for enhanced 3D mesh prediction. This research contributes significantly to the field of computer vision in space robotics, offering insights into tackling the unique challenges posed by the space environment.

  • Extended Abstracts: Need a more detailed summary? No problem! The model can be instructed to generate a longer, more detailed abstract, providing deeper insights into the presentation.

    Please generate an extended abstract for the following text which was scraped from a presentation slide deck by expanding upon the topics presented based on your knowledge of them

    Slide 1: James Ecker, Benjamin Kelley, Danette Allen

    AIAA SciTech, January 2021

    Slide 1: Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions

    Slide 2: Computer Vision During In-Space Assembly

    Slide 2: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 2: 2

    Slide 2: January 2021

    Slide 2: SciTech 2021

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 4: Computer Vision During In-Space Assembly

    Slide 4: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 4: 4

    Slide 4: January 2021

    Slide 4: SciTech 2021

    Slide 4: High Degree of Variation Requires More Information

    Slide 4: More Sensors = More Information = More Energy Use & More Mass

    Slide 5: Mitigating the Constraints

    Slide 5: 5

    Slide 5: January 2021

    Slide 5: SciTech 2021

    Slide 5: High Degree of Variation Requires More Information

    Slide 5: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6: Mitigating the Constraints

    Slide 6: 6

    Slide 6: January 2021

    Slide 6: SciTech 2021

    Slide 6: High Degree of Variation Requires More Information

    Slide 6: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6:

    Slide 6: Maximize

    Slide 7: Mitigating the Constraints

    Slide 7: 7

    Slide 7: January 2021

    Slide 7: SciTech 2021

    Slide 7: High Degree of Variation Requires More Information

    Slide 7: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 7:

    Slide 7:

    Slide 7:

    Slide 7: Maximize

    Slide 7: Minimize

    Slide 8: Mitigating the Constraints

    Slide 8: 8

    Slide 8: January 2021

    Slide 8: SciTech 2021

    Slide 8: High Degree of Variation Requires More Information

    Slide 8: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 8:

    Slide 8:

    Slide 8:

    Slide 8: Maximize

    Slide 8: Minimize

    Slide 8: Predict 3D Mesh from Single View

    Slide 9: Mitigating the Constraints

    Slide 9: 9

    Slide 9: January 2021

    Slide 9: SciTech 2021

    Slide 9: High Degree of Variation Requires More Information

    Slide 9: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 9:

    Slide 9:

    Slide 9:

    Slide 9: Maximize

    Slide 9: Minimize

    Slide 9: Single Camera

    Slide 9: Predict 3D Mesh from Single View

    Slide 10: Related Work

    Slide 10: [1] He et al – Mask R-CNN

    [2] Gkioxari et al – Mesh R-CNN

    [3] Sonawani et al - Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks

    [4] Pal et al - 3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation

    [5] Valsesia et al - Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks

    [6] Ramasinghe et al - Spectral-GANS for High Resolution 3D Point Cloud Generation

    Slide 10: 10

    Slide 10: January 2021

    Slide 10: SciTech 2021

    Slide 11: Synthesizing Data

    Slide 11: 3D model of objects projected over 3D background in Blender

    Can be extended to full simulation environments

    ROS, Gazebo, Mujoco, etc

    Variations in observations

    Orientation of camera and light source

    Relative orientation between objects

    Number of objects in scene

    Background

    Sim to Reality Problem

    Domain Randomization

    Slide 11: 11

    Slide 11: January 2021

    Slide 11: SciTech 2021

    Slide 12: Metadata for Training Mask R-CNN

    Slide 12: 12

    Slide 12: January 2021

    Slide 12: SciTech 2021

    Slide 13: Metadata for Training Mesh R-CNN

    Slide 13: 13

    Slide 13: January 2021

    Slide 13: SciTech 2021

    Slide 14: Building the Dataset

    Slide 14: Generate a parent pool of data

    20,000 image/metadata pairs

    For each sample generated

    Extract/Calculate ground truth to build metadata

    Configure metadata to conform to model

    Sample training set from parent pool

    Sample –n (default: 1500) instances from parent pool randomly

    Split into training and validation sets

    --training-split (default:0.75) / 1 - --training-split (default: 1 – 0.75 = 0.25)

    Merge all training/validation set metadata into one JSON, respectively

    Slide 14: January 2021

    Slide 14: SciTech 2021

    Slide 14: 14

    Slide 15: Mask Prediction – Mask R-CNN

    Slide 15: Backbone

    Resnet-50-FPN

    Region Proposal Network

    Applies a sliding window over a convolutional feature map to generate proposed bounding boxes for likely objects

    Proposed regions are aligned to the feature map and sent to fully connected layer to classify a bounding box (regressor) and the object itself (soft max)

    Mask Prediction

    Generate a binary mask for pixels in proposed region using the aligned features

    Slide 15: January 2021

    Slide 15: SciTech 2021

    Slide 15: 15

    Slide 15: The Mask R-CNN Architecture [1]

    Slide 16: Mask Prediction – Mask R-CNN Advantages

    Slide 16: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 16: January 2021

    Slide 16: SciTech 2021

    Slide 16: 16

    Slide 16: The Mask R-CNN Architecture [1]

    Slide 17: Mask Prediction – Mask R-CNN Advantages

    Slide 17: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 17: January 2021

    Slide 17: SciTech 2021

    Slide 17: 17

    Slide 17: Masks provide a measure of visual explainability

    Slide 17: The Mask R-CNN Architecture [1]

    Slide 18: Mask Prediction – Mask R-CNN Performance

    Slide 18: 98% instance segmentation accuracy

    object

    bounding box

    mask

    Slide 18: January 2021

    Slide 18: SciTech 2021

    Slide 18: 18

    Slide 19: Mesh Prediction

    Slide 19: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Mesh Refinement

    Slide 19: January 2021

    Slide 19: SciTech 2021

    Slide 19: 19

    Slide 19: The Mesh R-CNN Architecture [2]

    Slide 20: Mesh Prediction

    Slide 20: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    Slide 20: January 2021

    Slide 20: SciTech 2021

    Slide 20: 20

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20: The Mesh R-CNN Architecture Voxel Branch[2]

    Slide 21: Mesh Prediction

    Slide 21: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    2 passes

    Vertex alignment

    Graph convolution

    Vertex refinement

    Slide 21: January 2021

    Slide 21: SciTech 2021

    Slide 21: 21

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21: The Mesh R-CNN Architecture Mesh Refinement Branch[2]

    Slide 22: Metrics

    Slide 22: January 2021

    Slide 22: SciTech 2021

    Slide 22: 22

    Slide 22:

    Slide 22:

    Slide 22:

    Slide 23: Results

    Slide 23:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Mesh R-CNN trained on Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 23: January 2021

    Slide 23: SciTech 2021

    Slide 23: 23

    Slide 23:

    Slide 24: Results

    Slide 24:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 24: January 2021

    Slide 24: SciTech 2021

    Slide 24: 24

    Slide 24:

    Slide 24: 2 Quadro 6000 RTX GPUs

    Slide 24: 8 Tesla V100 GPUS

    Slide 24: Requires hyperparameter tuning specific to hardware configuration

    Slide 25: Conclusion

    Slide 25: Generated a synthetic dataset capable of training state of the art 2D mask and 3D mesh prediction models

    Can train each model end-to-end from no data to trained model

    Future work

    Hyperparameter tuning

    Further domain randomization

    Randomize object’s rendered skin

    Extending Mesh R-CNN to use a Generative Adversarial Network to generate point clouds instead of voxel model

    Higher resolution 3D mesh prediction

    Slide 25: January 2021

    Slide 25: SciTech 2021

    Slide 25: 25

    Slide 25:

    Slide 26: References

    Slide 26: He, K., Gkioxari, G., Dollár, P., and Girshick, R., “Mask R-CNN,”2017 IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.322.

    Gkioxari, G., Johnson, J., and Malik, J., “Mesh R-CNN,”2019 IEEE/CVF International Conference on Computer Vision(ICCV), 2019, pp. 9784–9794. https://doi.org/10.1109/ICCV.2019.00988.

    Sonawani, S. D., Alimo, R., Detry, R., Jeong, D., Hess, A., and Amor, H. B., “Assistive Relative Pose Estimation for On- orbitAssembly using Convolutional Neural Networks,”ArXiv, Vol. abs/2001.10673, 2020.

    Pal, B., Khaiyum, S., and Kumaraswamy, Y. S., “3D point cloud generation from 2D depth camera images using successivetriangulation,”2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017, pp. 129–133.https://doi.org/10.1109/ICIMIA.2017.7975586.

    Valsesia, D., Fracastoro, G., and Magli, E., “Learning Localized Representations of Point Clouds with Graph- ConvolutionalGenerative Adversarial Networks,”IEEE Transactions on Multimedia, 2019.

    Ramasinghe, S., Khan, S. H., Barnes, N., and Gould, S., “Spectral-GANs for High-Resolution 3D Point-cloud Generation,”CoRR, Vol. abs/1912.01800, 2019. URL http://arxiv.org/abs/1912.01800.

    Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., “Domain Randomization for Transferring Deep NeuralNetworks from Simulation to the Real World,”CoRR, Vol. abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.

    He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,”CoRR, Vol. abs/1512.03385, 2015.URL http://arxiv.org/abs/1512.03385.

    Slide 26: January 2021

    Slide 26: SciTech 2021

    Slide 26: 26

    Slide 26:

    In the realm of autonomous space operations, the work presented by James Ecker, Benjamin Kelley, and Danette Allen at AIAA SciTech 2021 addresses a critical component: the creation and utilization of synthetic data for 3D mesh prediction and spatial reasoning in multi-agent robotic missions. The context for this research is the challenging environment of in-space assembly where conventional computer vision systems encounter unique difficulties such as variable illumination, complex angles, orientation, and movement, alongside the constraints of energy usage and mass.

    The presentation delves into the intricate balance required in space robotics: the need for extensive sensor data to manage the high variability in the operational environment, which paradoxically increases the energy and mass burden of the spacecraft. To mitigate these challenges, the team emphasizes the need to maximize information extraction while minimizing the additional resource expenditure. A notable innovation presented is the prediction of 3D meshes from a single camera view, reducing the sensor load.

    The research builds upon foundational work in computer vision, notably Mask R-CNN and Mesh R-CNN, which provide frameworks for object detection and 3D reconstruction. These models benefit from transfer learning, which allows for the initialization of network weights from pre-trained models, significantly reducing both training time and generalization error. The presentation also discusses the performance of Mask R-CNN, citing impressive instance segmentation accuracy rates.

    Advancing further, the team outlines their process for synthesizing data, utilizing tools like Blender and simulation environments such as ROS, Gazebo, and Mujoco. This process accounts for varying observational conditions, such as camera and light source orientation, and the complexity of the scene. The technique of domain randomization is highlighted as a means to bridge the sim-to-reality gap, ensuring that the synthetic data covers a broad spectrum of real-world scenarios.

    The dataset creation is a rigorous process, generating a large pool of image and metadata pairs, from which training and validation sets are sampled. This metadata is crucial for training the Mask R-CNN and Mesh R-CNN models, enabling them to understand and predict the complex spatial arrangements found in space operations.

    In terms of model architecture, the presentation notes the use of Resnet-50-FPN as the backbone for the Mask R-CNN, which uses a region proposal network and mask prediction to effectively identify and delineate objects within images.

    Looking at the results, the team evaluates their models using metrics such as the Chamfer distance and the F1 score, comparing their custom synthetic data approach to existing methods. They acknowledge the necessity for hyperparameter tuning specific to hardware configurations, which included powerful GPUs like the Quadro 6000 RTX and Tesla V100.

    The conclusion of the presentation points to the success in generating a synthetic dataset capable of training advanced 2D mask and 3D mesh prediction models. Future work is set to focus on further improving these models through hyperparameter tuning, enhanced domain randomization, and the exploration of Generative Adversarial Networks for point cloud generation, aiming for higher resolution 3D mesh prediction.

    This extended abstract encapsulates the efforts of the team in pushing the boundaries of synthetic data generation for the advancement of autonomous robotic missions in space, highlighting both the achievements and the roadmap for future developments in this cutting-edge field.

  • Hashtags: Want to index or socially share your content? The model can generate relevant hashtags based on slide content, aiding in searchability and social media visibility.

    Please provide any hashtags that you would associate with this presentation

    Based on the content of the presentation, the following hashtags could be associated with it:

    #SyntheticData #3DMeshPrediction #SpatialReasoning #Robotics #ComputerVision #InSpaceAssembly #AIinSpace #MultiAgentSystems #MachineLearning
    #DeepLearning #AIAASciTech #RoboticMissions #SpaceTechnology #MaskRCNN #MeshRCNN #Simulation #ArtificialIntelligence #GANs (Generative Adversarial Networks)
    #PointCloudGeneration #SpaceInnovation #AerospaceEngineering #AutonomousSystems #TransferLearning #DataScience #FutureOfSpaceExploration

  • JSON-LD: JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web scale. JSON-LD is an ideal data format for programming environments, REST Web services, and unstructured databases such as Apache CouchDB and MongoDB, and it is also well suited to feeding knowledge graphs. It can be used in semantic web applications, enhancing the interoperability and understandability of your presentation data in software ecosystems (a minimal ingestion sketch follows after this list). Here is the JSON-LD object generated for this presentation:
    {
      "@context": "http://schema.org",
      "@type": "Event",
      "name": "Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions",
      "startDate": "2021-01",
      "location": {
        "@type": "Place",
        "name": "AIAA SciTech Forum",
        "address": {
          "@type": "PostalAddress",
          "addressLocality": "Location of conference (if known)",
          "addressRegion": "Region of conference (if known)",
          "addressCountry": "Country of conference (if known)"
        }
      },
      "sponsor": {
        "@type": "Organization",
        "name": "American Institute of Aeronautics and Astronautics (AIAA)"
      },
      "performer": [
        { "@type": "Person", "name": "James Ecker" },
        { "@type": "Person", "name": "Benjamin Kelley" },
        { "@type": "Person", "name": "Danette Allen" }
      ],
      "workFeatured": [
        {
          "@type": "CreativeWork",
          "name": "Computer Vision During In-Space Assembly",
          "about": "Difficulties in space operations such as illumination, angle, orientation, movement, energy, and mass constraints."
        },
        {
          "@type": "CreativeWork",
          "name": "Mask R-CNN and Mesh R-CNN in Space Technology",
          "about": "Techniques for instance segmentation and 3D mesh prediction in space robotics."
        },
        {
          "@type": "CreativeWork",
          "name": "Synthesizing Data for Space Robotics",
          "about": "Using tools like Blender, ROS, Gazebo, and Mujoco for creating synthetic datasets for computer vision systems."
        }
      ],
      "about": {
        "@type": "Thing",
        "name": "Computer Vision and Robotics in Space",
        "description": "The presentation covers topics such as synthetic data generation, computer vision challenges in space, mitigation of resource constraints, and advancements in 3D mesh prediction models."
      },
      "keywords": [
        "Synthetic Data",
        "3D Mesh Prediction",
        "Spatial Reasoning",
        "Robotics",
        "Computer Vision",
        "In-Space Assembly",
        "AI",
        "Machine Learning",
        "Deep Learning",
        "GANs",
        "PointCloud Generation"
      ],
      "citation": [
        {
          "@type": "CreativeWork",
          "name": "Mask R-CNN",
          "author": [
            { "@type": "Person", "name": "Kaiming He" },
            { "@type": "Person", "name": "Georgia Gkioxari" },
            { "@type": "Person", "name": "Piotr Dollár" },
            { "@type": "Person", "name": "Ross Girshick" }
          ],
          "datePublished": "2017",
          "url": "https://doi.org/10.1109/ICCV.2017.322"
        },
        {
          "@type": "CreativeWork",
          "name": "Mesh R-CNN",
          "author": [
            { "@type": "Person", "name": "Georgia Gkioxari" },
            { "@type": "Person", "name": "Jitendra Malik" }
          ],
          "datePublished": "2019",
          "url": "https://doi.org/10.1109/ICCV.2019.00988"
        },
        {
          "@type": "CreativeWork",
          "name": "Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks",
          "author": { "@type": "Person", "name": "S. D. Sonawani et al." },
          "datePublished": "2020",
          "url": "http://arxiv.org/abs/2001.10673"
        },
        {
          "@type": "CreativeWork",
          "name": "3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation",
          "author": { "@type": "Person", "name": "B. Pal et al." },
          "datePublished": "2017",
          "url": "https://doi.org/10.1109/ICIMIA.2017.7975586"
        },
        {
          "@type": "CreativeWork",
          "name": "Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks",
          "author": { "@type": "Person", "name": "D. Valsesia et al." },
          "datePublished": "2019",
          "url": "https://ieeexplore.ieee.org/document/8642330"
        },
        {
          "@type": "CreativeWork",
          "name": "Spectral-GANs for High Resolution 3D Point Cloud Generation",
          "author": { "@type": "Person", "name": "S. Ramasinghe et al." },
          "datePublished": "2019",
          "url": "http://arxiv.org/abs/1912.01800"
        }
      ]
    }
  • Of course, we can train our own large language model bootstrapped from an open-source model such as Llama or Llama 2. We want the model to be multi-modal, accepting not just text but also images as input, so that we can extract as much information as possible from each slide. This would be especially useful if we can train a model that accurately interprets charts and graphs with respect to their local context.
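
As a minimal sketch of the knowledge-graph ingestion step mentioned in the JSON-LD item above, the snippet below loads a generated JSON-LD document into an RDF graph using the rdflib library (which bundles JSON-LD support in version 6 and later). The file name is a hypothetical example; ALICE's actual graph store and ingestion pipeline may differ.

    # Minimal sketch: load an ALICE-generated JSON-LD document into an RDF graph.
    # Assumes rdflib >= 6.0, which includes the JSON-LD parser.
    # "presentation.jsonld" is a hypothetical file holding the object shown above.
    from rdflib import Graph

    graph = Graph()
    graph.parse("presentation.jsonld", format="json-ld")

    # Each JSON-LD property becomes a subject-predicate-object triple, so the
    # result can be merged into a larger knowledge graph or queried with SPARQL.
    for subject, predicate, obj in graph:
        print(subject, predicate, obj)
    print(f"{len(graph)} triples extracted")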

Future Work

Slide decks are currently unlikely to be submitted to NTRS, thus going unarchived. As such, initially focusing on them represents the greatest value added to our organizational knowledge stores. There's no reason, however, that we couldn't extend this kind of metadata generation to other types of documents and files. As long as they can be indexed into the knowledge graph along the same axes as our initial slide deck targets, we can effectively curate project and program data to create a comprehensive view of our research at any given time. Doing so would require the ability to ingest these different documents, so we would need to expand our ingest scripts or turn our initial ppt-scraper.py script into a general scraper.
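
To make that extension concrete, here is a minimal sketch of what a generalized ingest dispatcher could look like, routing each file to a format-specific extractor by extension. The function names and the PDF/Word extractors are hypothetical placeholders rather than existing ALICE code; only the .pptx path corresponds to what ppt-scraper.py does today.

    # Hypothetical sketch of a generalized ingest dispatcher. Only the .pptx path
    # reflects the existing ppt-scraper.py behavior; the other extractors are
    # placeholders for future document types.
    from pathlib import Path

    def extract_pptx(path: Path) -> dict:
        """Existing slide-deck path: extract text, shapes, and images per slide."""
        ...

    def extract_pdf(path: Path) -> dict:
        """Placeholder: extract text and embedded images from a PDF report."""
        ...

    def extract_docx(path: Path) -> dict:
        """Placeholder: extract text and figures from a Word document."""
        ...

    EXTRACTORS = {
        ".pptx": extract_pptx,
        ".pdf": extract_pdf,
        ".docx": extract_docx,
    }

    def ingest(path: str) -> dict:
        """Route a document to the right extractor so every format is indexed
        along the same axes (text, images, metadata) in the knowledge graph."""
        p = Path(path)
        try:
            extractor = EXTRACTORS[p.suffix.lower()]
        except KeyError:
            raise ValueError(f"Unsupported document type: {p.suffix}")
        return extractor(p)

With this shape, ingest("briefing.pptx") would return the extracted content in a common structure, and supporting a new format would only require registering another extractor.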

Furthermore, scraping shapes and images separately presents a challenge for slides where images are annotated with text and various shape objects. For example, consider this slide from Overview of UAVs for Ozone Monitoring (Adcock et al., 2020):

This kind of slide would benefit from reasoning at the slide level. I am currently working on a solution: exploring an LLM that can reason over an image of each slide as a whole, which would greatly simplify the prompting workflow.
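
As a rough illustration of slide-level reasoning, the sketch below assumes each slide has already been rendered to a PNG image (for example via a headless export step that is not yet part of ALICE) and sends that image, plus a prompt, to a vision-capable model through the OpenAI API. The model name, prompt wording, and file layout are assumptions for illustration only.

    # Minimal sketch: slide-level reasoning over rendered slide images with a
    # multimodal model. Assumes slides have already been exported as PNG files
    # (e.g., slides/slide_01.png) and that the OpenAI client has an API key set.
    import base64
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()

    def describe_slide(image_path: Path) -> str:
        b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",  # any vision-capable model would do
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this slide, including any annotated images, "
                             "charts, and the relationships the annotations imply."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    for png in sorted(Path("slides").glob("slide_*.png")):
        print(describe_slide(png))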

As discussed earlier, metadata generation would greatly benefit from training a custom LLM bootstrapped off of Llama/Llama 2. This initial design uses GPT-4 to generate the prompted metadata, and we suffer from occasional hallucinations that result in inaccurate summaries of the given text and images. Narrowing the model's training data to align with NASA domain-specific data should alleviate these inaccuracies (a fine-tuning sketch follows the example below). For example:

Slide 3

What is going on in this image?

image_8

The image appears to be a still image from a sci-fi film during a moment of space exploration, or a scene involving a futuristic spacecraft approaching or orbiting a mysterious celestial body, like a planet or an artificial structure. The spherical object might be a destination, such as a new planet being explored or an alien artifact being investigated.

The detailed structure behind the sphere could be part of a larger spaceship, a space station, or a piece of technology relevant to the plot of the film. It might be engaging in activities such as scanning the sphere, preparing to dock with it, or observing it from a safe distance.

Given the genre, this could be a pivotal scene where the characters are about to uncover something significant, whether it's a technological breakthrough, first contact with alien life, or the discovery of a new world or dimension. The darkness of space and the isolated positioning of the structures suggest a sense of the unknown and the vastness of the universe, common themes in sci-fi narratives.

An apparent loss of context has occurred: ChatGPT is attributing the image to a sci-fi film and describing it within that context. If this information were processed into ALICE's knowledge graph, it would inject inaccurate relation edges and, possibly, erroneous nodes. Of course, there is always a possibility of sci-fi references in NTRS, so this could be an edge case that warrants investigation.
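
To make the domain-alignment idea above concrete, here is a minimal sketch of parameter-efficient fine-tuning of a Llama 2 checkpoint on NASA domain text (for example, NTRS abstracts) using Hugging Face Transformers and PEFT/LoRA. The corpus path, model size, and hyperparameters are illustrative assumptions, not settings ALICE currently uses, and access to the Llama 2 weights requires accepting Meta's license.

    # Hypothetical sketch: LoRA fine-tuning of Llama 2 on domain-specific text
    # (e.g., NTRS abstracts) to reduce out-of-domain hallucinations.
    # Corpus path and hyperparameters are illustrative only.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "meta-llama/Llama-2-7b-hf"  # requires license acceptance on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Attach low-rank adapters instead of updating all base-model parameters.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))

    # Assumes a plain-text corpus of domain documents, one record per line.
    dataset = load_dataset("text", data_files={"train": "ntrs_corpus.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="llama2-nasa-lora",
                               per_device_train_batch_size=1,
                               gradient_accumulation_steps=16,
                               num_train_epochs=1,
                               learning_rate=2e-4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()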

Conclusion

In conclusion, this blog post has provided a comprehensive overview of the ALICE system. We explored the challenges of preserving information in presentation slide decks and the innovative solution offered by ALICE. This system efficiently extracts text and visual elements from slides and employs a large language model, such as OpenAI's GPT-4, to transform this data into structured, machine-readable formats and to interpolate the expanded semantic context of the presentation. We delved into the integration of LLMs with Knowledge Graphs, highlighting the potential synergy between these technologies in enhancing data interpretation and retrieval.

The post also outlined the technical specifics of ALICE, including its front-end interface for uploading slide decks, server environment for data processing, and the crucial role of LLMs in generating abstracts, search terms, and JSON-LD for Knowledge Graph integration. We discussed the key functionalities of our 'ppt-scraper.py' script in extracting diverse data types from presentations and how this technology can be further evolved.

In summary, ALICE represents a significant leap forward in managing and utilizing the wealth of data hidden in presentation slide decks, promising to enhance knowledge preservation and accessibility at NASA and potentially beyond.

You can follow the development of this project at my GitHub repo.

- Jim Ecker


Sources

  • Min-Yen Kan. 2007. SlideSeer: a digital library of aligned document and presentation pairs. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (JCDL '07). Association for Computing Machinery, New York, NY, USA, 81–90. https://doi.org/10.1145/1255175.1255192
  • Chaudhri, V. K., Baru, C., Chittar, N., Dong, X. L., Genesereth, M., Hendler, J., Kalyanpur, A., Lenat, D., Sequeda, J., Vrandečić, D., and Wang, K. 2022. "Knowledge graphs: Introduction, history, and perspectives." AI Magazine 43: 17–29. https://doi.org/10.1002/aaai.12033
  • Pan, Shirui, et al. "Unifying Large Language Models and Knowledge Graphs: A Roadmap." arXiv preprint arXiv:2306.08302 (2023).
  • Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
  • Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
