2024-01-16: ALICE - AI Leveraged Information Capture and Exploration

Organizational presentations often carry a wealth of data: images, charts, and text that tell a compelling story or present findings. Programs and projects routinely rely on meetings both to address problems as they arise and to disseminate information about our work. However, the information contained in the slide decks associated with these meetings and presentations tends to get lost after the fact. One cause of this loss may be that slides capture only a sparse record of what is actually said during these meetings.

In this blog post, I introduce the ALICE (AI Leveraged Information Capture and Exploration) system, my proposed solution for capturing and managing knowledge and data from sources that, historically, have not been comprehensively archived. The primary challenge addressed by ALICE is the prevalent issue of information loss in presentation slide decks following their related meetings or presentations. The system is designed to methodically extract text and visual elements from slides, employing a large language model (LLM) such as OpenAI's GPT-4 to convert this information into structured, machine-readable formats. The aim is not only to preserve critical data but also to enhance it with comprehensive abstracts, relevant search terms, and a structured JSON for Linked Data (JSON-LD) representation for effective integration into knowledge graphs. This post will explore the intricacies of ALICE and its potential to redefine the management and interpretation of presentation data within NASA. First, I'll detail the proposed ALICE system, particularly the integration of LLMs with a Knowledge Graph (KG) to enhance the LLM's reasoning over unstructured data. Then, I'll discuss the technical aspects of how this system operates, including the text and image extraction scripts, front-end user interface, server environment, and LLM integration.

While humans can visually interpret slide content, converting that information into structured, machine-readable formats presents a challenge — as does autonomously expanding on the sparse content of the slides themselves. Essentially, we want a way to effortlessly provide a slide deck and have a system "fill in the blanks," producing a verifiably accurate description of what was presented relative to the slide deck in question. The mere existence of PowerPoint Karaoke points to the need for this capability.

Furthermore, presentation slide decks tend not to be archived. We routinely send the slides out in an email to some list of presentation participants after the fact, but we rarely commit them to an easily indexed and searchable archive location. Previous work has been done on aligning presentations with scholarly publications, such as SlideSeer (JCDL '07 proceedings) by Dr. Min-Yen Kan. SlideSeer is a handy tool for researchers and academics who often share their work in two ways: as written papers and as slideshow presentations. SlideSeer discovers scholarly papers that match the content of slides in a slide deck and presents them together. This way, you get to see all the information in one place, whether it's something new or something repeated in both formats. ALICE differs in that, at least initially, it processes slides for any kind of communication by reasoning over their content and indexing it into a knowledge graph. This captures not only scholarly communication but also presentations and meeting slides for various project and program meetings.

Finally, for many projects, a slide deck is the final deliverable. The deck is the product. There is no accompanying paper, so we have no source representing a more detailed, more complete record. In many cases, these slide decks contain information from numerous data sources rather than summarizing a single source, such as a slide deck for a presentation on a research paper. For example, the majority of slide decks in the Military Industrial Powerpoint Complex archive contain information from various briefings and information sessions that have no singular source.

In this blog post, we'll explore how you can efficiently scrape both text and visual elements from a presentation slide deck and then harness the capabilities of a large language model to derive meaningful abstracts, extended abstracts, search terms, hashtags, and even create a structured JSON-LD representation for each slide suitable for integration into knowledge graph software. I have proposed the ALICE system to be developed for use at NASA as part of our digital transformation and knowledge management efforts. 

The Proposed ALICE System

Knowledge Graph

The main purpose of this proposed system is to marry LLMs with a structural representation of knowledge as a cooperative architecture between two models — the LLM and a Knowledge Graph (KG). LLMs excel at reasoning over unstructured data. However, they could benefit from leveraging a greater understanding of structured data. Combining these two structures could benefit both and create a system that can both be queried and verified for alignment between the user's prompt and the LLM's output. Chaudri et al. define a knowledge graph as "a directed labeled graph in which domain-specific meanings are associated with nodes and edges. A node could represent any real-world entity, for example, people, company, computer, etc. An edge label captures the relationship of interest between the two nodes, for example, a friendship relationship between two people, a customer relationship between a company and person, or a network connection between two computers, etc."

A knowledge graph created using entity and relation extraction (fig 3 in Chaudri et al.)

A knowledge graph created using computer vision techniques (fig 4 in Chaudri et al.)


Figures 3 and 4 in Chaudri et al. hint at using an LLM's understanding of natural language or semantic scene understanding in images to generate the entity-relation-entity structure that is needed to build out a KG. Using the KG to improve the LLM, however, is a less clear endeavor. In Unifying Large Language Models and Knowledge Graphs: A Roadmap, Pan et al. explain that "KGs can enhance LLMs by providing external knowledge for inference and interpretability. Meanwhile, KGs are difficult to construct and evolve by nature, which challenges the existing methods in KGs to generate new facts and represent unseen knowledge. Therefore, it is complementary to unify LLMs and KGs together and simultaneously leverage their advantages." They go on to describe three different frameworks for integrating LLMs with KGs:

  1. KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or to enhance understanding of the knowledge learned by LLMs
  2. LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering
  3. Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge
My work with ALICE seeks to explore building the third framework, Synergized LLMs + KGs. I believe that by building this kind of framework we can scrutinize an LLM's output against its knowledge graph while also enhancing the LLM by providing greater context to both a user's prompts and the responses it generates. Additionally, I will explore how providing structural knowledge to an LLM affects its tendency to hallucinate facts, hopefully alleviating that problem in some way.

To achieve this, a Knowledge Graph software module accesses JSON-LD data to create visual, interconnected graphs representing the information. Users can then interact with this Knowledge Graph through the front-end interface, searching for specific information nodes or traversing the relationships established between various data points.
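
To make that concrete, here is a minimal sketch of the ingestion step, assuming Python with rdflib 6+ (which bundles a JSON-LD parser). The file name presentation.jsonld and the SPARQL query are illustrative placeholders, not the actual ALICE module.

    # Hypothetical ingestion sketch for the Knowledge Graph module.
    from rdflib import Graph

    kg = Graph()
    # Load the LLM-generated JSON-LD; each statement becomes a
    # (subject, predicate, object) edge in the graph.
    kg.parse("presentation.jsonld", format="json-ld")

    # Example structured query the front end could run: "who presented what?"
    results = kg.query("""
        PREFIX schema: <http://schema.org/>
        SELECT ?event ?person WHERE {
            ?event a schema:Event ;
                   schema:performer ?p .
            ?p schema:name ?person .
        }
    """)
    for row in results:
        print(row.event, row.person)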

Front-end user interface

The front end is a web-based application where users can upload slide decks. Once uploaded, the files are sent to the server via a secure API.
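
As a rough illustration of that upload path, here is a bare-bones endpoint sketch assuming FastAPI (with python-multipart installed for form parsing); the route, storage location, and omitted authentication are placeholders, not the actual ALICE API.

    # Hypothetical upload endpoint sketch (not the actual ALICE API).
    from fastapi import FastAPI, File, UploadFile

    app = FastAPI()

    @app.post("/api/upload")
    async def upload_deck(deck: UploadFile = File(...)):
        # Persist the uploaded .pptx so the extraction script can process it.
        contents = await deck.read()
        with open(f"uploads/{deck.filename}", "wb") as f:
            f.write(contents)
        return {"status": "queued", "filename": deck.filename}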

Server environment

On the server, a Python script uses the python-pptx library to extract images, charts, and text data from the uploaded presentation.

LLM integration

The raw extracted data is then passed on to an LLM, such as OpenAI's GPT-4, via API calls that perform the abstracting, extension, and term-generation processes. The output of the LLM, consisting of the abstract, extended abstract, search terms, hashtags, and a JSON-LD KG description, is stored in a database system.
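
For a sense of what that step looks like in code, here is a minimal sketch using the openai Python package (v1+); the model name, prompt wording, and file path are placeholders for whatever the deployed system actually uses.

    # Minimal sketch of one LLM call (abstract generation); the extended
    # abstract, hashtags, and JSON-LD outputs would use analogous prompts.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    with open("output/text_output.txt") as f:
        slide_text = f.read()

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You summarize presentation slide decks."},
            {"role": "user", "content": (
                "Please generate an abstract for the following text which was "
                "scraped from a presentation slide deck:\n\n" + slide_text
            )},
        ],
    )
    print(response.choices[0].message.content)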

Breaking Down the Code

At the heart of our solution lies the ppt-scraper.py script, which leverages the python-pptx library to traverse the slides, extracting the text and any images and charts on each one. I'll step through examples of each step using a presentation I gave at AIAA SciTech 2021. This presentation was retrieved from the NASA Technical Reports Server (NTRS), which will also serve as the source for the dataset I will use to fine-tune my own LLM, a topic I'll cover in my next blog post.

  • Text Extraction - extract_text_from_pptx: This function delves into each slide and grabs its text content. It outputs a list of strings, each prefixed with the slide number for easy referencing. (A minimal sketch of both extraction functions appears after this list.)

    Slide 1: James Ecker, Benjamin Kelley, Danette Allen

    AIAA SciTech, January 2021

    Slide 1: Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions

    Slide 2: Computer Vision During In-Space Assembly

    Slide 2: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 2: 2

    Slide 2: January 2021

    Slide 2: SciTech 2021

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 4: Computer Vision During In-Space Assembly

    Slide 4: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 4: 4

    Slide 4: January 2021

    Slide 4: SciTech 2021

    Slide 4: High Degree of Variation Requires More Information

    Slide 4: More Sensors = More Information = More Energy Use & More Mass

    Slide 5: Mitigating the Constraints

    Slide 5: 5

    Slide 5: January 2021

    Slide 5: SciTech 2021

    Slide 5: High Degree of Variation Requires More Information

    Slide 5: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6: Mitigating the Constraints

    Slide 6: 6

    Slide 6: January 2021

    Slide 6: SciTech 2021

    Slide 6: High Degree of Variation Requires More Information

    Slide 6: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6:

    Slide 6: Maximize

    Slide 7: Mitigating the Constraints

    Slide 7: 7

    Slide 7: January 2021

    Slide 7: SciTech 2021

    Slide 7: High Degree of Variation Requires More Information

    Slide 7: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 7:

    Slide 7:

    Slide 7:

    Slide 7: Maximize

    Slide 7: Minimize

    Slide 8: Mitigating the Constraints

    Slide 8: 8

    Slide 8: January 2021

    Slide 8: SciTech 2021

    Slide 8: High Degree of Variation Requires More Information

    Slide 8: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 8:

    Slide 8:

    Slide 8:

    Slide 8: Maximize

    Slide 8: Minimize

    Slide 8: Predict 3D Mesh from Single View

    Slide 9: Mitigating the Constraints

    Slide 9: 9

    Slide 9: January 2021

    Slide 9: SciTech 2021

    Slide 9: High Degree of Variation Requires More Information

    Slide 9: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 9:

    Slide 9:

    Slide 9:

    Slide 9: Maximize

    Slide 9: Minimize

    Slide 9: Single Camera

    Slide 9: Predict 3D Mesh from Single View

    Slide 10: Related Work

    Slide 10: [1] He et al – Mask R-CNN

    [2] Gkioxari et al – Mesh R-CNN

    [3] Sonawani et al - Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks

    [4] Pal et al - 3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation

    [5] Valsesia et al - Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks

    [6] Ramasinghe et al - Spectral-GANS for High Resolution 3D Point Cloud Generation

    Slide 10: 10

    Slide 10: January 2021

    Slide 10: SciTech 2021

    Slide 11: Synthesizing Data

    Slide 11: 3D model of objects projected over 3D background in Blender

    Can be extended to full simulation environments

    ROS, Gazebo, Mujoco, etc

    Variations in observations

    Orientation of camera and light source

    Relative orientation between objects

    Number of objects in scene

    Background

    Sim to Reality Problem

    Domain Randomization

    Slide 11: 11

    Slide 11: January 2021

    Slide 11: SciTech 2021

    Slide 12: Metadata for Training Mask R-CNN

    Slide 12: 12

    Slide 12: January 2021

    Slide 12: SciTech 2021

    Slide 13: Metadata for Training Mesh R-CNN

    Slide 13: 13

    Slide 13: January 2021

    Slide 13: SciTech 2021

    Slide 14: Building the Dataset

    Slide 14: Generate a parent pool of data

    20,000 image/metadata pairs

    For each sample generated

    Extract/Calculate ground truth to build metadata

    Configure metadata to conform to model

    Sample training set from parent pool

    Sample –n (default: 1500) instances from parent pool randomly

    Split into training and validation sets

    --training-split (default:0.75) / 1 - --training-split (default: 1 – 0.75 = 0.25)

    Merge all training/validation set metadata into one JSON, respectively

    Slide 14: January 2021

    Slide 14: SciTech 2021

    Slide 14: 14

    Slide 15: Mask Prediction – Mask R-CNN

    Slide 15: Backbone

    Resnet-50-FPN

    Region Proposal Network

    Applies a sliding window over a convolutional feature map to generate proposed bounding boxes for likely objects

    Proposed regions are aligned to the feature map and sent to fully connected layer to classify a bounding box (regressor) and the object itself (soft max)

    Mask Prediction

    Generate a binary mask for pixels in proposed region using the aligned features

    Slide 15: January 2021

    Slide 15: SciTech 2021

    Slide 15: 15

    Slide 15: The Mask R-CNN Architecture [1]

    Slide 16: Mask Prediction – Mask R-CNN Advantages

    Slide 16: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 16: January 2021

    Slide 16: SciTech 2021

    Slide 16: 16

    Slide 16: The Mask R-CNN Architecture [1]

    Slide 17: Mask Prediction – Mask R-CNN Advantages

    Slide 17: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 17: January 2021

    Slide 17: SciTech 2021

    Slide 17: 17

    Slide 17: Masks provide a measure of visual explainability

    Slide 17: The Mask R-CNN Architecture [1]

    Slide 18: Mask Prediction – Mask R-CNN Performance

    Slide 18: 98% instance segmentation accuracy

    object

    bounding box

    mask

    Slide 18: January 2021

    Slide 18: SciTech 2021

    Slide 18: 18

    Slide 19: Mesh Prediction

    Slide 19: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Mesh Refinement

    Slide 19: January 2021

    Slide 19: SciTech 2021

    Slide 19: 19

    Slide 19: The Mesh R-CNN Architecture [2]

    Slide 20: Mesh Prediction

    Slide 20: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    Slide 20: January 2021

    Slide 20: SciTech 2021

    Slide 20: 20

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20: The Mesh R-CNN Architecture Voxel Branch[2]

    Slide 21: Mesh Prediction

    Slide 21: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    2 passes

    Vertex alignment

    Graph convolution

    Vertex refinement

    Slide 21: January 2021

    Slide 21: SciTech 2021

    Slide 21: 21

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21: The Mesh R-CNN Architecture Mesh Refinement Branch[2]

    Slide 22: Metrics

    Slide 22: January 2021

    Slide 22: SciTech 2021

    Slide 22: 22

    Slide 22:

    Slide 22:

    Slide 22:

    Slide 23: Results

    Slide 23:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Mesh R-CNN trained on Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 23: January 2021

    Slide 23: SciTech 2021

    Slide 23: 23

    Slide 23:

    Slide 24: Results

    Slide 24:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 24: January 2021

    Slide 24: SciTech 2021

    Slide 24: 24

    Slide 24:

    Slide 24: 2 Quadro 6000 RTX GPUs

    Slide 24: 8 Tesla V100 GPUS

    Slide 24: Requires hyperparameter tuning specific to hardware configuration

    Slide 25: Conclusion

    Slide 25: Generated a synthetic dataset capable of training state of the art 2D mask and 3D mesh prediction models

    Can train each model end-to-end from no data to trained model

    Future work

    Hyperparameter tuning

    Further domain randomization

    Randomize object’s rendered skin

    Extending Mesh R-CNN to use a Generative Adversarial Network to generate point clouds instead of voxel model

    Higher resolution 3D mesh prediction

    Slide 25: January 2021

    Slide 25: SciTech 2021

    Slide 25: 25

    Slide 25:

    Slide 26: References

    Slide 26: He, K., Gkioxari, G., Dollár, P., and Girshick, R., “Mask R-CNN,”2017 IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.322.

    Gkioxari, G., Johnson, J., and Malik, J., “Mesh R-CNN,”2019 IEEE/CVF International Conference on Computer Vision(ICCV), 2019, pp. 9784–9794. https://doi.org/10.1109/ICCV.2019.00988.

    Sonawani, S. D., Alimo, R., Detry, R., Jeong, D., Hess, A., and Amor, H. B., “Assistive Relative Pose Estimation for On- orbitAssembly using Convolutional Neural Networks,”ArXiv, Vol. abs/2001.10673, 2020.

    Pal, B., Khaiyum, S., and Kumaraswamy, Y. S., “3D point cloud generation from 2D depth camera images using successivetriangulation,”2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017, pp. 129–133.https://doi.org/10.1109/ICIMIA.2017.7975586.

    Valsesia, D., Fracastoro, G., and Magli, E., “Learning Localized Representations of Point Clouds with Graph- ConvolutionalGenerative Adversarial Networks,”IEEE Transactions on Multimedia, 2019.

    Ramasinghe, S., Khan, S. H., Barnes, N., and Gould, S., “Spectral-GANs for High-Resolution 3D Point-cloud Generation,”CoRR, Vol. abs/1912.01800, 2019. URL http://arxiv.org/abs/1912.01800.

    Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., “Domain Randomization for Transferring Deep NeuralNetworks from Simulation to the Real World,”CoRR, Vol. abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.

    He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,”CoRR, Vol. abs/1512.03385, 2015.URL http://arxiv.org/abs/1512.03385.

    Slide 26: January 2021

    Slide 26: SciTech 2021

    Slide 26: 26

    Slide 26:

  • Image and Chart Extraction - extract_images_and_charts_from_pptx: This function extracts images and charts from each slide and differentiates between visual elements like pictures and charts. It also drills down into group shapes, ensuring that nested content isn't missed (see the sketch after this list). Here I have ChatGPT reasoning over output/slide_3/image_6.jpg:
    Slide 3

    What is going on in this image?

    image_6

    The image shows a graphical representation or simulation of a space environment where two robotic elements appear to be engaging in an in-space assembly task. There's a satellite or part of a spacecraft on the left side, which has an extended arm or appendage, possibly for manipulation or docking purposes. On the right, there's another component that seems to be floating freely or in the process of being captured or released. Below these elements, there is a view of Earth's clouds, suggesting that this scene is set in low Earth orbit. This scenario is likely a visualization or a simulated mock-up used for the planning and testing of robotic missions in space, particularly for the assembly of structures or the servicing of satellites. It may be related to computer vision systems designed to handle the challenges of space operations, such as those discussed in the presentation slides you provided earlier.

    And here it is doing the same thing over output/slide_18/image_30.jpg:
    Slide 18

    What is going on in this image?

    image_30

    The image seems to be a screenshot from a computer vision system or a simulation tool related to in-space robotic operations, similar in context to the content described in the presentation slides you shared earlier. It appears to show an annotated view of a robotic interaction with different parts during an assembly or maintenance task in space.

    Two objects are highlighted with bounding boxes and labeled with confidence scores—“nut 0.999” and “pole 0.999”—indicating that the computer vision system has identified these components with a very high degree of confidence. The black sphere may be a part of the simulation for spatial reasoning or could represent an object to be manipulated.

    The satellite or spacecraft part with solar panels is possibly a part of the simulated environment, providing context for the assembly task. The confidence scores suggest that machine learning models, such as those mentioned in the presentation (e.g., Mask R-CNN or Mesh R-CNN), are likely being used to identify and possibly predict the position of these components during the robotic mission. This type of technology is crucial for autonomous or semi-autonomous robotic systems in space, where precise identification and manipulation of various objects are necessary.

    The next version of ppt-scraper.py will have the option to inject the LLM's generated description of each image into the text description to enrich our text prompts. For example:

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 3: Image_8: The image shows a graphical representation or simulation of a space environment where two robotic elements appear to be engaging in an in-space assembly task. There's a satellite or part of a spacecraft on the left side, which has an extended arm or appendage, possibly for manipulation or docking purposes. On the right, there's another component that seems to be floating freely or in the process of being captured or released. Below these elements, there is a view of Earth's clouds, suggesting that this scene is set in low Earth orbit. This scenario is likely a visualization or a simulated mock-up used for the planning and testing of robotic missions in space, particularly for the assembly of structures or the servicing of satellites. It may be related to computer vision systems designed to handle the challenges of space operations, such as those discussed in the presentation slides you provided earlier.

  • After these scripts are run, the output directory will have a structure similar to this:
    .
    ├── images
    │   ├── slide_1
    │   │   └── image_0.png
    │   ├── slide_2
    │   │   ├── image_1.png
    │   │   ├── image_2.png
    │   │   ├── image_3.png
    │   │   └── image_4.png
    │   ├── slide_3
    │   │   ├── image_5.png
    │   │   ├── image_6.png
    │   │   ├── image_7.png
    │   │   └── image_8.png
    │   ├── slide_4
    │   │   ├── image_9.png
    │   │   ├── image_10.png
    │   │   ├── image_11.png
    │   │   └── image_12.png
    │   ├── slide_5
    │   │   └── image_13.png
    │   ├── slide_6
    │   │   └── image_14.png
    │   ├── slide_7
    │   │   └── image_15.png
    │   ├── slide_8
    │   │   └── image_16.png
    │   ├── slide_9
    │   │   └── image_17.png
    │   ├── slide_10
    │   │   └── image_18.png
    │   ├── slide_11
    │   │   ├── image_19.png
    │   │   └── image_20.png
    │   ├── slide_12
    │   │   ├── image_21.png
    │   │   └── image_22.png
    │   ├── slide_13
    │   │   ├── image_23.png
    │   │   └── image_24.png
    │   ├── slide_15
    │   │   └── image_25.png
    │   ├── slide_16
    │   │   └── image_26.png
    │   ├── slide_17
    │   │   └── image_27.png
    │   ├── slide_18
    │   │   ├── image_28.png
    │   │   ├── image_29.png
    │   │   ├── image_30.png
    │   │   └── image_31.png
    │   ├── slide_19
    │   │   └── image_32.png
    │   ├── slide_20
    │   │   └── image_33.png
    │   ├── slide_21
    │   │   ├── image_34.png
    │   │   └── image_35.png
    │   ├── slide_22
    │   │   ├── image_36.png
    │   │   ├── image_37.png
    │   │   └── image_38.png
    │   └── slide_23
    │       ├── image_39.png
    │       └── image_40.png
    └── text_output.txt
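
Below is a minimal sketch of the two extraction passes described above, assuming the deck lives at presentation.pptx and output is written under ./output/. The function names mirror the ones discussed, but the bodies are illustrative rather than the exact code in ppt-scraper.py (chart extraction, for instance, is omitted here).

    # Illustrative sketch of the extraction functions, not the exact ppt-scraper.py code.
    import os
    from pptx import Presentation
    from pptx.enum.shapes import MSO_SHAPE_TYPE

    def extract_text_from_pptx(path):
        """Return a list of strings, each prefixed with its slide number."""
        lines = []
        for idx, slide in enumerate(Presentation(path).slides, start=1):
            for shape in slide.shapes:
                if shape.has_text_frame and shape.text_frame.text.strip():
                    lines.append(f"Slide {idx}: {shape.text_frame.text}")
        return lines

    def extract_images_and_charts_from_pptx(path, out_dir="output/images"):
        """Walk each slide, recursing into group shapes, and save any pictures."""
        count = 0

        def walk(shapes, slide_dir):
            nonlocal count
            for shape in shapes:
                if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
                    walk(shape.shapes, slide_dir)  # nested content inside groups
                elif shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
                    os.makedirs(slide_dir, exist_ok=True)
                    fname = os.path.join(slide_dir, f"image_{count}.{shape.image.ext}")
                    with open(fname, "wb") as f:
                        f.write(shape.image.blob)
                    count += 1

        for idx, slide in enumerate(Presentation(path).slides, start=1):
            walk(slide.shapes, os.path.join(out_dir, f"slide_{idx}"))

    if __name__ == "__main__":
        print("\n".join(extract_text_from_pptx("presentation.pptx")))
        extract_images_and_charts_from_pptx("presentation.pptx")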

From Raw Data to Knowledge

Once the data extraction is complete, the real magic begins. Leveraging a powerful AI language model, like OpenAI's GPT models, we can generate:

  • Abstracts: Using the extracted text, the model can provide a concise summary capturing the essence of the presentation.

    Please generate an abstract for the following text which was scraped from a presentation slide deck:

    Slide 1: James Ecker, Benjamin Kelley, Danette Allen

    AIAA SciTech, January 2021

    Slide 1: Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions

    Slide 2: Computer Vision During In-Space Assembly

    Slide 2: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 2: 2

    Slide 2: January 2021

    Slide 2: SciTech 2021

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 4: Computer Vision During In-Space Assembly

    Slide 4: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 4: 4

    Slide 4: January 2021

    Slide 4: SciTech 2021

    Slide 4: High Degree of Variation Requires More Information

    Slide 4: More Sensors = More Information = More Energy Use & More Mass

    Slide 5: Mitigating the Constraints

    Slide 5: 5

    Slide 5: January 2021

    Slide 5: SciTech 2021

    Slide 5: High Degree of Variation Requires More Information

    Slide 5: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6: Mitigating the Constraints

    Slide 6: 6

    Slide 6: January 2021

    Slide 6: SciTech 2021

    Slide 6: High Degree of Variation Requires More Information

    Slide 6: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6:

    Slide 6: Maximize

    Slide 7: Mitigating the Constraints

    Slide 7: 7

    Slide 7: January 2021

    Slide 7: SciTech 2021

    Slide 7: High Degree of Variation Requires More Information

    Slide 7: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 7:

    Slide 7:

    Slide 7:

    Slide 7: Maximize

    Slide 7: Minimize

    Slide 8: Mitigating the Constraints

    Slide 8: 8

    Slide 8: January 2021

    Slide 8: SciTech 2021

    Slide 8: High Degree of Variation Requires More Information

    Slide 8: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 8:

    Slide 8:

    Slide 8:

    Slide 8: Maximize

    Slide 8: Minimize

    Slide 8: Predict 3D Mesh from Single View

    Slide 9: Mitigating the Constraints

    Slide 9: 9

    Slide 9: January 2021

    Slide 9: SciTech 2021

    Slide 9: High Degree of Variation Requires More Information

    Slide 9: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 9:

    Slide 9:

    Slide 9:

    Slide 9: Maximize

    Slide 9: Minimize

    Slide 9: Single Camera

    Slide 9: Predict 3D Mesh from Single View

    Slide 10: Related Work

    Slide 10: [1] He et al – Mask R-CNN

    [2] Gkioxari et al – Mesh R-CNN

    [3] Sonawani et al - Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks

    [4] Pal et al - 3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation

    [5] Valsesia et al - Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks

    [6] Ramasinghe et al - Spectral-GANS for High Resolution 3D Point Cloud Generation

    Slide 10: 10

    Slide 10: January 2021

    Slide 10: SciTech 2021

    Slide 11: Synthesizing Data

    Slide 11: 3D model of objects projected over 3D background in Blender

    Can be extended to full simulation environments

    ROS, Gazebo, Mujoco, etc

    Variations in observations

    Orientation of camera and light source

    Relative orientation between objects

    Number of objects in scene

    Background

    Sim to Reality Problem

    Domain Randomization

    Slide 11: 11

    Slide 11: January 2021

    Slide 11: SciTech 2021

    Slide 12: Metadata for Training Mask R-CNN

    Slide 12: 12

    Slide 12: January 2021

    Slide 12: SciTech 2021

    Slide 13: Metadata for Training Mesh R-CNN

    Slide 13: 13

    Slide 13: January 2021

    Slide 13: SciTech 2021

    Slide 14: Building the Dataset

    Slide 14: Generate a parent pool of data

    20,000 image/metadata pairs

    For each sample generated

    Extract/Calculate ground truth to build metadata

    Configure metadata to conform to model

    Sample training set from parent pool

    Sample –n (default: 1500) instances from parent pool randomly

    Split into training and validation sets

    --training-split (default:0.75) / 1 - --training-split (default: 1 – 0.75 = 0.25)

    Merge all training/validation set metadata into one JSON, respectively

    Slide 14: January 2021

    Slide 14: SciTech 2021

    Slide 14: 14

    Slide 15: Mask Prediction – Mask R-CNN

    Slide 15: Backbone

    Resnet-50-FPN

    Region Proposal Network

    Applies a sliding window over a convolutional feature map to generate proposed bounding boxes for likely objects

    Proposed regions are aligned to the feature map and sent to fully connected layer to classify a bounding box (regressor) and the object itself (soft max)

    Mask Prediction

    Generate a binary mask for pixels in proposed region using the aligned features

    Slide 15: January 2021

    Slide 15: SciTech 2021

    Slide 15: 15

    Slide 15: The Mask R-CNN Architecture [1]

    Slide 16: Mask Prediction – Mask R-CNN Advantages

    Slide 16: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 16: January 2021

    Slide 16: SciTech 2021

    Slide 16: 16

    Slide 16: The Mask R-CNN Architecture [1]

    Slide 17: Mask Prediction – Mask R-CNN Advantages

    Slide 17: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 17: January 2021

    Slide 17: SciTech 2021

    Slide 17: 17

    Slide 17: Masks provide a measure of visual explainability

    Slide 17: The Mask R-CNN Architecture [1]

    Slide 18: Mask Prediction – Mask R-CNN Performance

    Slide 18: 98% instance segmentation accuracy

    object

    bounding box

    mask

    Slide 18: January 2021

    Slide 18: SciTech 2021

    Slide 18: 18

    Slide 19: Mesh Prediction

    Slide 19: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Mesh Refinement

    Slide 19: January 2021

    Slide 19: SciTech 2021

    Slide 19: 19

    Slide 19: The Mesh R-CNN Architecture [2]

    Slide 20: Mesh Prediction

    Slide 20: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    Slide 20: January 2021

    Slide 20: SciTech 2021

    Slide 20: 20

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20: The Mesh R-CNN Architecture Voxel Branch[2]

    Slide 21: Mesh Prediction

    Slide 21: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    2 passes

    Vertex alignment

    Graph convolution

    Vertex refinement

    Slide 21: January 2021

    Slide 21: SciTech 2021

    Slide 21: 21

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21: The Mesh R-CNN Architecture Mesh Refinement Branch[2]

    Slide 22: Metrics

    Slide 22: January 2021

    Slide 22: SciTech 2021

    Slide 22: 22

    Slide 22:

    Slide 22:

    Slide 22:

    Slide 23: Results

    Slide 23:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Mesh R-CNN trained on Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 23: January 2021

    Slide 23: SciTech 2021

    Slide 23: 23

    Slide 23:

    Slide 24: Results

    Slide 24:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 24: January 2021

    Slide 24: SciTech 2021

    Slide 24: 24

    Slide 24:

    Slide 24: 2 Quadro 6000 RTX GPUs

    Slide 24: 8 Tesla V100 GPUS

    Slide 24: Requires hyperparameter tuning specific to hardware configuration

    Slide 25: Conclusion

    Slide 25: Generated a synthetic dataset capable of training state of the art 2D mask and 3D mesh prediction models

    Can train each model end-to-end from no data to trained model

    Future work

    Hyperparameter tuning

    Further domain randomization

    Randomize object’s rendered skin

    Extending Mesh R-CNN to use a Generative Adversarial Network to generate point clouds instead of voxel model

    Higher resolution 3D mesh prediction

    Slide 25: January 2021

    Slide 25: SciTech 2021

    Slide 25: 25

    Slide 25:

    Slide 26: References

    Slide 26: He, K., Gkioxari, G., Dollár, P., and Girshick, R., “Mask R-CNN,”2017 IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.322.

    Gkioxari, G., Johnson, J., and Malik, J., “Mesh R-CNN,”2019 IEEE/CVF International Conference on Computer Vision(ICCV), 2019, pp. 9784–9794. https://doi.org/10.1109/ICCV.2019.00988.

    Sonawani, S. D., Alimo, R., Detry, R., Jeong, D., Hess, A., and Amor, H. B., “Assistive Relative Pose Estimation for On- orbitAssembly using Convolutional Neural Networks,”ArXiv, Vol. abs/2001.10673, 2020.

    Pal, B., Khaiyum, S., and Kumaraswamy, Y. S., “3D point cloud generation from 2D depth camera images using successivetriangulation,”2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017, pp. 129–133.https://doi.org/10.1109/ICIMIA.2017.7975586.

    Valsesia, D., Fracastoro, G., and Magli, E., “Learning Localized Representations of Point Clouds with Graph- ConvolutionalGenerative Adversarial Networks,”IEEE Transactions on Multimedia, 2019.

    Ramasinghe, S., Khan, S. H., Barnes, N., and Gould, S., “Spectral-GANs for High-Resolution 3D Point-cloud Generation,”CoRR, Vol. abs/1912.01800, 2019. URL http://arxiv.org/abs/1912.01800.

    Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., “Domain Randomization for Transferring Deep NeuralNetworks from Simulation to the Real World,”CoRR, Vol. abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.

    He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,”CoRR, Vol. abs/1512.03385, 2015.URL http://arxiv.org/abs/1512.03385.

    Slide 26: January 2021

    Slide 26: SciTech 2021

    Slide 26: 26

    Slide 26:

    This presentation, given by James Ecker, Benjamin Kelley, and Danette Allen at the AIAA SciTech in January 2021, explores the challenges and advancements in computer vision for in-space assembly through multi-agent robotic missions. It specifically delves into the generation of synthetic data for 3D mesh prediction and spatial reasoning, highlighting the difficulties encountered due to factors like illumination, angle, orientation, and movement, and the constraints of energy and mass.

    The presentation outlines how computer vision faces a high degree of variation during in-space assembly, necessitating more sensors and information, which in turn increases energy use and mass. It proposes strategies for mitigating these constraints, such as maximizing efficiency while minimizing energy use and mass, and emphasizes the importance of predicting 3D mesh from a single view.

    Significant attention is given to the development of synthetic datasets capable of training state-of-the-art 2D mask and 3D mesh prediction models. This includes using Blender for 3D model projection, domain randomization, and training models like Mask R-CNN and Mesh R-CNN. The presentation also reviews related work in the field and compares the performance of their methods with existing approaches.

    In conclusion, the presentation underscores the importance of hyperparameter tuning, further domain randomization, and the potential of extending Mesh R-CNN with Generative Adversarial Networks for enhanced 3D mesh prediction. This research contributes significantly to the field of computer vision in space robotics, offering insights into tackling the unique challenges posed by the space environment.

  • Extended Abstracts: Need a more detailed summary? No problem! The model can be instructed to generate a longer, more detailed abstract, providing deeper insights into the presentation.

    Please generate an extended abstract for the following text which was scraped from a presentation slide deck by expanding upon the topics presented based on your knowledge of them

    Slide 1: James Ecker, Benjamin Kelley, Danette Allen

    AIAA SciTech, January 2021

    Slide 1: Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions

    Slide 2: Computer Vision During In-Space Assembly

    Slide 2: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 2: 2

    Slide 2: January 2021

    Slide 2: SciTech 2021

    Slide 3: Computer Vision During In-Space Assembly

    Slide 3: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 3: 3

    Slide 3: January 2021

    Slide 3: SciTech 2021

    Slide 3: High Degree of Variation Requires More Information

    Slide 4: Computer Vision During In-Space Assembly

    Slide 4: Difficulties

    Illumination

    Angle

    Orientation

    Movement

    Constraints

    Energy

    Mass

    Slide 4: 4

    Slide 4: January 2021

    Slide 4: SciTech 2021

    Slide 4: High Degree of Variation Requires More Information

    Slide 4: More Sensors = More Information = More Energy Use & More Mass

    Slide 5: Mitigating the Constraints

    Slide 5: 5

    Slide 5: January 2021

    Slide 5: SciTech 2021

    Slide 5: High Degree of Variation Requires More Information

    Slide 5: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6: Mitigating the Constraints

    Slide 6: 6

    Slide 6: January 2021

    Slide 6: SciTech 2021

    Slide 6: High Degree of Variation Requires More Information

    Slide 6: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 6:

    Slide 6: Maximize

    Slide 7: Mitigating the Constraints

    Slide 7: 7

    Slide 7: January 2021

    Slide 7: SciTech 2021

    Slide 7: High Degree of Variation Requires More Information

    Slide 7: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 7:

    Slide 7:

    Slide 7:

    Slide 7: Maximize

    Slide 7: Minimize

    Slide 8: Mitigating the Constraints

    Slide 8: 8

    Slide 8: January 2021

    Slide 8: SciTech 2021

    Slide 8: High Degree of Variation Requires More Information

    Slide 8: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 8:

    Slide 8:

    Slide 8:

    Slide 8: Maximize

    Slide 8: Minimize

    Slide 8: Predict 3D Mesh from Single View

    Slide 9: Mitigating the Constraints

    Slide 9: 9

    Slide 9: January 2021

    Slide 9: SciTech 2021

    Slide 9: High Degree of Variation Requires More Information

    Slide 9: More Sensors =

    More Information =

    More Energy Use & More Mass

    Slide 9:

    Slide 9:

    Slide 9:

    Slide 9: Maximize

    Slide 9: Minimize

    Slide 9: Single Camera

    Slide 9: Predict 3D Mesh from Single View

    Slide 10: Related Work

    Slide 10: [1] He et al – Mask R-CNN

    [2] Gkioxari et al – Mesh R-CNN

    [3] Sonawani et al - Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks

    [4] Pal et al - 3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation

    [5] Valsesia et al - Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks

    [6] Ramasinghe et al - Spectral-GANS for High Resolution 3D Point Cloud Generation

    Slide 10: 10

    Slide 10: January 2021

    Slide 10: SciTech 2021

    Slide 11: Synthesizing Data

    Slide 11: 3D model of objects projected over 3D background in Blender

    Can be extended to full simulation environments

    ROS, Gazebo, Mujoco, etc

    Variations in observations

    Orientation of camera and light source

    Relative orientation between objects

    Number of objects in scene

    Background

    Sim to Reality Problem

    Domain Randomization

    Slide 11: 11

    Slide 11: January 2021

    Slide 11: SciTech 2021

    Slide 12: Metadata for Training Mask R-CNN

    Slide 12: 12

    Slide 12: January 2021

    Slide 12: SciTech 2021

    Slide 13: Metadata for Training Mesh R-CNN

    Slide 13: 13

    Slide 13: January 2021

    Slide 13: SciTech 2021

    Slide 14: Building the Dataset

    Slide 14: Generate a parent pool of data

    20,000 image/metadata pairs

    For each sample generated

    Extract/Calculate ground truth to build metadata

    Configure metadata to conform to model

    Sample training set from parent pool

    Sample –n (default: 1500) instances from parent pool randomly

    Split into training and validation sets

    --training-split (default:0.75) / 1 - --training-split (default: 1 – 0.75 = 0.25)

    Merge all training/validation set metadata into one JSON, respectively

    Slide 14: January 2021

    Slide 14: SciTech 2021

    Slide 14: 14

    Slide 15: Mask Prediction – Mask R-CNN

    Slide 15: Backbone

    Resnet-50-FPN

    Region Proposal Network

    Applies a sliding window over a convolutional feature map to generate proposed bounding boxes for likely objects

    Proposed regions are aligned to the feature map and sent to fully connected layer to classify a bounding box (regressor) and the object itself (soft max)

    Mask Prediction

    Generate a binary mask for pixels in proposed region using the aligned features

    Slide 15: January 2021

    Slide 15: SciTech 2021

    Slide 15: 15

    Slide 15: The Mask R-CNN Architecture [1]

    Slide 16: Mask Prediction – Mask R-CNN Advantages

    Slide 16: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 16: January 2021

    Slide 16: SciTech 2021

    Slide 16: 16

    Slide 16: The Mask R-CNN Architecture [1]

    Slide 17: Mask Prediction – Mask R-CNN Advantages

    Slide 17: Transfer Learning

    Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)

    Lowers training time and generalization error

    Region of Interest Alignment

    Each region of interest is fed into a fixed-size input fully connected (FC) layer

    Need to account for all pixels in ROI while conforming to fixed input size of FC

    Bilinear Interpolation instead of Quantization

    Slide 17: January 2021

    Slide 17: SciTech 2021

    Slide 17: 17

    Slide 17: Masks provide a measure of visual explainability

    Slide 17: The Mask R-CNN Architecture [1]

    Slide 18: Mask Prediction – Mask R-CNN Performance

    Slide 18: 98% instance segmentation accuracy

    object

    bounding box

    mask

    Slide 18: January 2021

    Slide 18: SciTech 2021

    Slide 18: 18

    Slide 19: Mesh Prediction

    Slide 19: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Mesh Refinement

    Slide 19: January 2021

    Slide 19: SciTech 2021

    Slide 19: 19

    Slide 19: The Mesh R-CNN Architecture [2]

    Slide 20: Mesh Prediction

    Slide 20: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    Slide 20: January 2021

    Slide 20: SciTech 2021

    Slide 20: 20

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20:

    Slide 20: The Mesh R-CNN Architecture Voxel Branch[2]

    Slide 21: Mesh Prediction

    Slide 21: Extends Mask R-CNN

    Mesh Predictor

    Voxel Prediction

    Predicts a voxel occupancy grid

    Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel

    Mesh Refinement

    2 passes

    Vertex alignment

    Graph convolution

    Vertex refinement

    Slide 21: January 2021

    Slide 21: SciTech 2021

    Slide 21: 21

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21:

    Slide 21: The Mesh R-CNN Architecture Mesh Refinement Branch[2]

    Slide 22: Metrics

    Slide 22: January 2021

    Slide 22: SciTech 2021

    Slide 22: 22

    Slide 22:

    Slide 22:

    Slide 22:

    Slide 23: Results

    Slide 23:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Mesh R-CNN trained on Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 23: January 2021

    Slide 23: SciTech 2021

    Slide 23: 23

    Slide 23:

    Slide 24: Results

    Slide 24:

    Ours (Custom Synthetic Data)

    Chamfer (lower is better)

    0.621

    F1 (higher is better)

    47.51

    Theirs (Pix3D)

    Chamfer (lower is better)

    0.306

    F1 (higher is better)

    74.84

    Slide 24: January 2021

    Slide 24: SciTech 2021

    Slide 24: 24

    Slide 24:

    Slide 24: 2 Quadro 6000 RTX GPUs

    Slide 24: 8 Tesla V100 GPUS

    Slide 24: Requires hyperparameter tuning specific to hardware configuration

    Slide 25: Conclusion

    Slide 25: Generated a synthetic dataset capable of training state of the art 2D mask and 3D mesh prediction models

    Can train each model end-to-end from no data to trained model

    Future work

    Hyperparameter tuning

    Further domain randomization

    Randomize object’s rendered skin

    Extending Mesh R-CNN to use a Generative Adversarial Network to generate point clouds instead of voxel model

    Higher resolution 3D mesh prediction

    Slide 25: January 2021

    Slide 25: SciTech 2021

    Slide 25: 25

    Slide 25:

    Slide 26: References

    Slide 26: He, K., Gkioxari, G., Dollár, P., and Girshick, R., “Mask R-CNN,”2017 IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.322.

    Gkioxari, G., Johnson, J., and Malik, J., “Mesh R-CNN,”2019 IEEE/CVF International Conference on Computer Vision(ICCV), 2019, pp. 9784–9794. https://doi.org/10.1109/ICCV.2019.00988.

    Sonawani, S. D., Alimo, R., Detry, R., Jeong, D., Hess, A., and Amor, H. B., “Assistive Relative Pose Estimation for On- orbitAssembly using Convolutional Neural Networks,”ArXiv, Vol. abs/2001.10673, 2020.

    Pal, B., Khaiyum, S., and Kumaraswamy, Y. S., “3D point cloud generation from 2D depth camera images using successivetriangulation,”2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017, pp. 129–133.https://doi.org/10.1109/ICIMIA.2017.7975586.

    Valsesia, D., Fracastoro, G., and Magli, E., “Learning Localized Representations of Point Clouds with Graph- ConvolutionalGenerative Adversarial Networks,”IEEE Transactions on Multimedia, 2019.

    Ramasinghe, S., Khan, S. H., Barnes, N., and Gould, S., “Spectral-GANs for High-Resolution 3D Point-cloud Generation,”CoRR, Vol. abs/1912.01800, 2019. URL http://arxiv.org/abs/1912.01800.

    Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., “Domain Randomization for Transferring Deep NeuralNetworks from Simulation to the Real World,”CoRR, Vol. abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.

    He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,”CoRR, Vol. abs/1512.03385, 2015.URL http://arxiv.org/abs/1512.03385.

    Slide 26: January 2021

    Slide 26: SciTech 2021

    Slide 26: 26

    Slide 26:

    In the realm of autonomous space operations, the work presented by James Ecker, Benjamin Kelley, and Danette Allen at AIAA SciTech 2021 addresses a critical component: the creation and utilization of synthetic data for 3D mesh prediction and spatial reasoning in multi-agent robotic missions. The context for this research is the challenging environment of in-space assembly where conventional computer vision systems encounter unique difficulties such as variable illumination, complex angles, orientation, and movement, alongside the constraints of energy usage and mass.

    The presentation delves into the intricate balance required in space robotics: the need for extensive sensor data to manage the high variability in the operational environment, which paradoxically increases the energy and mass burden of the spacecraft. To mitigate these challenges, the team emphasizes the need to maximize information extraction while minimizing the additional resource expenditure. A notable innovation presented is the prediction of 3D meshes from a single camera view, reducing the sensor load.

    The research builds upon foundational work in computer vision, notably Mask R-CNN and Mesh R-CNN, which provide frameworks for object detection and 3D reconstruction. These models benefit from transfer learning, which allows for the initialization of network weights from pre-trained models, significantly reducing both training time and generalization error. The presentation also discusses the performance of Mask R-CNN, citing impressive instance segmentation accuracy rates.

    Advancing further, the team outlines their process for synthesizing data, utilizing tools like Blender and simulation environments such as ROS, Gazebo, and Mujoco. This process accounts for varying observational conditions, such as camera and light source orientation, and the complexity of the scene. The technique of domain randomization is highlighted as a means to bridge the sim-to-reality gap, ensuring that the synthetic data covers a broad spectrum of real-world scenarios.

    The dataset creation is a rigorous process, generating a large pool of image and metadata pairs, from which training and validation sets are sampled. This metadata is crucial for training the Mask R-CNN and Mesh R-CNN models, enabling them to understand and predict the complex spatial arrangements found in space operations.

    In terms of model architecture, the presentation notes the use of Resnet-50-FPN as the backbone for the Mask R-CNN, which uses a region proposal network and mask prediction to effectively identify and delineate objects within images.

    Looking at the results, the team evaluates their models using metrics such as the Chamfer distance and the F1 score, comparing their custom synthetic data approach to existing methods. They acknowledge the necessity for hyperparameter tuning specific to hardware configurations, which included powerful GPUs like the Quadro 6000 RTX and Tesla V100.

    The conclusion of the presentation points to the success in generating a synthetic dataset capable of training advanced 2D mask and 3D mesh prediction models. Future work is set to focus on further improving these models through hyperparameter tuning, enhanced domain randomization, and the exploration of Generative Adversarial Networks for point cloud generation, aiming for higher resolution 3D mesh prediction.

    This extended abstract encapsulates the efforts of the team in pushing the boundaries of synthetic data generation for the advancement of autonomous robotic missions in space, highlighting both the achievements and the roadmap for future developments in this cutting-edge field.

  • Hashtags: Want to index or socially share your content? The model can generate relevant hashtags based on slide content, aiding in searchability and social media visibility.

    Please provide any hashtags that you would associate with this presentation

    Based on the content of the presentation, the following hashtags could be associated with it:

    #SyntheticData #3DMeshPrediction #SpatialReasoning #Robotics #ComputerVision #InSpaceAssembly #AIinSpace #MultiAgentSystems #MachineLearning
    #DeepLearning #AIAASciTech #RoboticMissions #SpaceTechnology #MaskRCNN #MeshRCNN #Simulation #ArtificialIntelligence #GANs (Generative Adversarial Networks)
    #PointCloudGeneration #SpaceInnovation #AerospaceEngineering #AutonomousSystems #TransferLearning #DataScience #FutureOfSpaceExploration

  • JSON-LD: JSON-LD is a lightweight Linked Data format. It is easy for humans to read and write. It is based on the already successful JSON format and provides a way to help JSON data interoperate at Web scale. JSON-LD is an ideal data format for programming environments, REST Web services, and unstructured databases such as Apache CouchDB and MongoDB, and it is also well suited to feeding knowledge graphs. It can be used in semantic web applications, enhancing the interoperability and understandability of your presentation data in software ecosystems (a minimal ingestion sketch follows after this list). Here is the JSON-LD object generated for this presentation:
    {
      "@context": "http://schema.org",
      "@type": "Event",
      "name": "Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions",
      "startDate": "2021-01",
      "location": {
        "@type": "Place",
        "name": "AIAA SciTech Forum",
        "address": {
          "@type": "PostalAddress",
          "addressLocality": "Location of conference (if known)",
          "addressRegion": "Region of conference (if known)",
          "addressCountry": "Country of conference (if known)"
        }
      },
      "sponsor": {
        "@type": "Organization",
        "name": "American Institute of Aeronautics and Astronautics (AIAA)"
      },
      "performer": [
        { "@type": "Person", "name": "James Ecker" },
        { "@type": "Person", "name": "Benjamin Kelley" },
        { "@type": "Person", "name": "Danette Allen" }
      ],
      "workFeatured": [
        {
          "@type": "CreativeWork",
          "name": "Computer Vision During In-Space Assembly",
          "about": "Difficulties in space operations such as illumination, angle, orientation, movement, energy, and mass constraints."
        },
        {
          "@type": "CreativeWork",
          "name": "Mask R-CNN and Mesh R-CNN in Space Technology",
          "about": "Techniques for instance segmentation and 3D mesh prediction in space robotics."
        },
        {
          "@type": "CreativeWork",
          "name": "Synthesizing Data for Space Robotics",
          "about": "Using tools like Blender, ROS, Gazebo, and Mujoco for creating synthetic datasets for computer vision systems."
        }
      ],
      "about": {
        "@type": "Thing",
        "name": "Computer Vision and Robotics in Space",
        "description": "The presentation covers topics such as synthetic data generation, computer vision challenges in space, mitigation of resource constraints, and advancements in 3D mesh prediction models."
      },
      "keywords": [
        "Synthetic Data",
        "3D Mesh Prediction",
        "Spatial Reasoning",
        "Robotics",
        "Computer Vision",
        "In-Space Assembly",
        "AI",
        "Machine Learning",
        "Deep Learning",
        "GANs",
        "PointCloud Generation"
      ],
      "citation": [
        {
          "@type": "CreativeWork",
          "name": "Mask R-CNN",
          "author": [
            { "@type": "Person", "name": "Kaiming He" },
            { "@type": "Person", "name": "Georgia Gkioxari" },
            { "@type": "Person", "name": "Piotr Dollár" },
            { "@type": "Person", "name": "Ross Girshick" }
          ],
          "datePublished": "2017",
          "url": "https://doi.org/10.1109/ICCV.2017.322"
        },
        {
          "@type": "CreativeWork",
          "name": "Mesh R-CNN",
          "author": [
            { "@type": "Person", "name": "Georgia Gkioxari" },
            { "@type": "Person", "name": "Jitendra Malik" }
          ],
          "datePublished": "2019",
          "url": "https://doi.org/10.1109/ICCV.2019.00988"
        },
        {
          "@type": "CreativeWork",
          "name": "Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks",
          "author": { "@type": "Person", "name": "S. D. Sonawani et al." },
          "datePublished": "2020",
          "url": "http://arxiv.org/abs/2001.10673"
        },
        {
          "@type": "CreativeWork",
          "name": "3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation",
          "author": { "@type": "Person", "name": "B. Pal et al." },
          "datePublished": "2017",
          "url": "https://doi.org/10.1109/ICIMIA.2017.7975586"
        },
        {
          "@type": "CreativeWork",
          "name": "Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks",
          "author": { "@type": "Person", "name": "D. Valsesia et al." },
          "datePublished": "2019",
          "url": "https://ieeexplore.ieee.org/document/8642330"
        },
        {
          "@type": "CreativeWork",
          "name": "Spectral-GANs for High Resolution 3D Point Cloud Generation",
          "author": { "@type": "Person", "name": "S. Ramasinghe et al." },
          "datePublished": "2019",
          "url": "http://arxiv.org/abs/1912.01800"
        }
      ]
    }
  • Of course, we can train our own large language model bootstrapped from an open-source model such as Llama or Llama 2. We want the model to be multi-modal, accepting not just text but also images as input, so that we can extract as much information as possible from each slide. This would be especially useful if we can train a model that accurately interprets charts and graphs with respect to their local context.
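
As a minimal sketch of the knowledge-graph ingestion step mentioned in the JSON-LD item above, the snippet below loads a generated JSON-LD document into an RDF graph using the rdflib library (which bundles JSON-LD support in version 6 and later). The file name is a hypothetical example; ALICE's actual graph store and ingestion pipeline may differ.

    # Minimal sketch: load an ALICE-generated JSON-LD document into an RDF graph.
    # Assumes rdflib >= 6.0, which includes the JSON-LD parser.
    # "presentation.jsonld" is a hypothetical file holding the object shown above.
    from rdflib import Graph

    graph = Graph()
    graph.parse("presentation.jsonld", format="json-ld")

    # Each JSON-LD property becomes a subject-predicate-object triple, so the
    # result can be merged into a larger knowledge graph or queried with SPARQL.
    for subject, predicate, obj in graph:
        print(subject, predicate, obj)
    print(f"{len(graph)} triples extracted")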

Future Work

Slide decks are currently unlikely to be submitted to NTRS, thus going unarchived. As such, initially focusing on them represents the greatest value added to our organizational knowledge stores. There's no reason, however, that we couldn't extend this kind of metadata generation to other types of documents and files. As long as they can be indexed into the knowledge graph along the same axes as our initial slide deck targets, we can effectively curate project and program data to create a comprehensive view of our research at any given time. Doing so would require the ability to ingest these different documents, so we would need to expand our ingest scripts or turn our initial ppt-scraper.py script into a general scraper.
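
To make that extension concrete, here is a minimal sketch of what a generalized ingest dispatcher could look like, routing each file to a format-specific extractor by extension. The function names and the PDF/Word extractors are hypothetical placeholders rather than existing ALICE code; only the .pptx path corresponds to what ppt-scraper.py does today.

    # Hypothetical sketch of a generalized ingest dispatcher. Only the .pptx path
    # reflects the existing ppt-scraper.py behavior; the other extractors are
    # placeholders for future document types.
    from pathlib import Path

    def extract_pptx(path: Path) -> dict:
        """Existing slide-deck path: extract text, shapes, and images per slide."""
        ...

    def extract_pdf(path: Path) -> dict:
        """Placeholder: extract text and embedded images from a PDF report."""
        ...

    def extract_docx(path: Path) -> dict:
        """Placeholder: extract text and figures from a Word document."""
        ...

    EXTRACTORS = {
        ".pptx": extract_pptx,
        ".pdf": extract_pdf,
        ".docx": extract_docx,
    }

    def ingest(path: str) -> dict:
        """Route a document to the right extractor so every format is indexed
        along the same axes (text, images, metadata) in the knowledge graph."""
        p = Path(path)
        try:
            extractor = EXTRACTORS[p.suffix.lower()]
        except KeyError:
            raise ValueError(f"Unsupported document type: {p.suffix}")
        return extractor(p)

With this shape, ingest("briefing.pptx") would return the extracted content in a common structure, and supporting a new format would only require registering another extractor.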

Furthermore, scraping shapes and images separately presents a challenge for slides where images are annotated with text and various shape objects. For example, consider this slide from Overview of UAVs for Ozone Monitoring (Adcock et al., 2020):

This kind of slide would benefit from reasoning at the slide level. I am currently working on a solution: exploring an LLM that can reason over an image of each slide as a whole, which would greatly simplify the prompting workflow.
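
As a rough illustration of slide-level reasoning, the sketch below assumes each slide has already been rendered to a PNG image (for example via a headless export step that is not yet part of ALICE) and sends that image, plus a prompt, to a vision-capable model through the OpenAI API. The model name, prompt wording, and file layout are assumptions for illustration only.

    # Minimal sketch: slide-level reasoning over rendered slide images with a
    # multimodal model. Assumes slides have already been exported as PNG files
    # (e.g., slides/slide_01.png) and that the OpenAI client has an API key set.
    import base64
    from pathlib import Path
    from openai import OpenAI

    client = OpenAI()

    def describe_slide(image_path: Path) -> str:
        b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",  # any vision-capable model would do
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Describe this slide, including any annotated images, "
                             "charts, and the relationships the annotations imply."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content

    for png in sorted(Path("slides").glob("slide_*.png")):
        print(describe_slide(png))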

As discussed earlier, metadata generation would greatly benefit from training a custom LLM bootstrapped off of Llama/Llama 2. This initial design uses GPT-4 to generate the prompted metadata, and we suffer from occasional hallucinations that result in inaccurate summaries of the given text and images. Narrowing the model's training data to align with NASA domain-specific data should alleviate these inaccuracies (a fine-tuning sketch follows the example below). For example:

Slide 3

What is going on in this image?

image_8

The image appears to be a still image from a sci-fi film during a moment of space exploration, or a scene involving a futuristic spacecraft approaching or orbiting a mysterious celestial body, like a planet or an artificial structure. The spherical object might be a destination, such as a new planet being explored or an alien artifact being investigated.

The detailed structure behind the sphere could be part of a larger spaceship, a space station, or a piece of technology relevant to the plot of the film. It might be engaging in activities such as scanning the sphere, preparing to dock with it, or observing it from a safe distance.

Given the genre, this could be a pivotal scene where the characters are about to uncover something significant, whether it's a technological breakthrough, first contact with alien life, or the discovery of a new world or dimension. The darkness of space and the isolated positioning of the structures suggest a sense of the unknown and the vastness of the universe, common themes in sci-fi narratives.

An apparent loss of context has occurred: ChatGPT is attributing the image to a sci-fi film and describing it within that context. If this information were processed into ALICE's knowledge graph, it would inject inaccurate relation edges and, possibly, erroneous nodes. Of course, there is always a possibility of sci-fi references in NTRS, so this could be an edge case that warrants investigation.
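
To make the domain-alignment idea above concrete, here is a minimal sketch of parameter-efficient fine-tuning of a Llama 2 checkpoint on NASA domain text (for example, NTRS abstracts) using Hugging Face Transformers and PEFT/LoRA. The corpus path, model size, and hyperparameters are illustrative assumptions, not settings ALICE currently uses, and access to the Llama 2 weights requires accepting Meta's license.

    # Hypothetical sketch: LoRA fine-tuning of Llama 2 on domain-specific text
    # (e.g., NTRS abstracts) to reduce out-of-domain hallucinations.
    # Corpus path and hyperparameters are illustrative only.
    from datasets import load_dataset
    from peft import LoraConfig, get_peft_model
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "meta-llama/Llama-2-7b-hf"  # requires license acceptance on Hugging Face
    tokenizer = AutoTokenizer.from_pretrained(base)
    tokenizer.pad_token = tokenizer.eos_token
    model = AutoModelForCausalLM.from_pretrained(base)

    # Attach low-rank adapters instead of updating all base-model parameters.
    model = get_peft_model(model, LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM"))

    # Assumes a plain-text corpus of domain documents, one record per line.
    dataset = load_dataset("text", data_files={"train": "ntrs_corpus.txt"})["train"]
    dataset = dataset.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
        batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="llama2-nasa-lora",
                               per_device_train_batch_size=1,
                               gradient_accumulation_steps=16,
                               num_train_epochs=1,
                               learning_rate=2e-4),
        train_dataset=dataset,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()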

Conclusion

In conclusion, this blog post has provided a comprehensive overview of the ALICE system. We explored the challenges of preserving information in presentation slide decks and the innovative solution offered by ALICE. This system efficiently extracts text and visual elements from slides and employs a large language model, such as OpenAI's GPT-4, to transform this data into structured, machine-readable formats and to interpolate the expanded semantic context of the presentation. We delved into the integration of LLMs with Knowledge Graphs, highlighting the potential synergy between these technologies in enhancing data interpretation and retrieval.

The post also outlined the technical specifics of ALICE, including its front-end interface for uploading slide decks, server environment for data processing, and the crucial role of LLMs in generating abstracts, search terms, and JSON-LD for Knowledge Graph integration. We discussed the key functionalities of our 'ppt-scraper.py' script in extracting diverse data types from presentations and how this technology can be further evolved.

In summary, ALICE represents a significant leap forward in managing and utilizing the wealth of data hidden in presentation slide decks, promising to enhance knowledge preservation and accessibility at NASA and potentially beyond.

You can follow the development of this project at my GitHub repo.

- Jim Ecker


Sources

  • Min-Yen Kan. 2007. SlideSeer: a digital library of aligned document and presentation pairs. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (JCDL '07). Association for Computing Machinery, New York, NY, USA, 81–90. https://doi.org/10.1145/1255175.1255192
  • Chaudhri, V. K., Baru, C., Chittar, N., Dong, X. L., Genesereth, M., Hendler, J., Kalyanpur, A., Lenat, D., Sequeda, J., Vrandečić, D., and Wang, K. 2022. "Knowledge graphs: Introduction, history, and perspectives." AI Magazine 43: 17–29. https://doi.org/10.1002/aaai.12033
  • Pan, Shirui, et al. "Unifying Large Language Models and Knowledge Graphs: A Roadmap." arXiv preprint arXiv:2306.08302 (2023).
  • Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
  • Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).
