2024-01-16: ALICE - AI Leveraged Information Capture and Exploration
Organizational presentations often carry a wealth of data: images, charts, and text that tell a compelling story or present findings. Programs and projects routinely rely on meetings both to address problems as they arise and to disseminate information about our work. However, the information contained in the slide decks associated with these meetings and presentations tends to get lost after the fact. One cause of this information loss may be that slides capture only a sparse record of what is actually said during these meetings.
In this blog post, I introduce the ALICE (AI Leveraged Information Capture and Exploration) system, my proposed solution for capturing and managing knowledge and data from sources that, historically, have not been comprehensively archived. The primary challenge addressed by ALICE is the prevalent issue of information loss in presentation slide decks following their related meetings or presentations. The system is designed to methodically extract text and visual elements from slides, employing a large language model such as OpenAI's GPT-4 to convert this information into structured, machine-readable formats. The aim is not only to preserve critical data but also to enhance it with comprehensive abstracts, relevant search terms, and a structured JSON for Linked Data (JSON-LD) representation for effective integration into knowledge graphs. This post explores the intricacies of ALICE and its potential to redefine the management and interpretation of presentation data within NASA. First, I'll outline the proposed ALICE system. Then, I'll dive into its specifics, particularly the integration of LLMs with a Knowledge Graph (KG) to enhance the LLM's reasoning over unstructured data, and discuss the technical aspects of how the system operates: the text and image extraction scripts, the front-end user interface, the server environment, and the LLM integration.
While humans can visually interpret slide content, converting that information into structured, machine-readable formats presents a challenge, as does autonomously expanding on the sparse information the slides contain. Essentially, we want a way to effortlessly provide a slide deck and have a system "fill in the blanks," producing a verifiably accurate description of what was presented relative to the slide deck in question. The mere existence of PowerPoint Karaoke points to the need for this capability.
Furthermore, presentation slide decks tend not to be archived. We routinely email the slides after the fact to a list of presentation participants, but we rarely commit them to an easily indexed and searchable archive. Previous work has been done on aligning presentations with scholarly publications, such as SlideSeer (JCDL '07 proceedings) by Dr. Min-Yen Kan. SlideSeer is a handy tool for researchers and academics who often share their work in two ways: as written papers and as slideshow presentations. It discovers scholarly papers that match the content of slides in a slide deck and presents them together, so you see all the information in one place, whether it appears in only one format or is repeated in both. ALICE differs in that, at least initially, it processes slides for any kind of communication, reasoning over their content and indexing it into a knowledge graph. This captures not only scholarly communication but also presentations and meeting slides for various project and program meetings.
Finally, for many projects, a slide deck is the final deliverable. The slides are the product. There is no accompanying paper, so we have no source representing a more detailed, more complete record. In many cases, these slide decks draw on numerous data sources rather than summarizing a single source, such as a slide deck presenting a research paper. For example, the majority of slide decks in the Military Industrial Powerpoint Complex archive contain information from various briefings and information sessions with no singular source.
In this blog post, we'll explore how you can efficiently scrape both text and visual elements from a presentation slide deck and then harness the capabilities of a large language model to derive meaningful abstracts, extended abstracts, search terms, hashtags, and even a structured JSON-LD representation suitable for integration into knowledge graph software. I have proposed the ALICE system for development and use at NASA as part of our digital transformation and knowledge management efforts.
The Proposed ALICE System
Knowledge Graph
[Figure: A knowledge graph created using entity and relation extraction (Fig. 3 in Chaudhri et al.)]

[Figure: A knowledge graph created using computer vision techniques (Fig. 4 in Chaudhri et al.)]

Pan et al. describe three general frameworks for combining LLMs and KGs:
- KG-enhanced LLMs, which incorporate KGs during the pre-training and inference phases of LLMs, or to enhance understanding of the knowledge learned by LLMs
- LLM-augmented KGs, that leverage LLMs for different KG tasks such as embedding, completion, construction, graph-to-text generation, and question answering
- Synergized LLMs + KGs, in which LLMs and KGs play equal roles and work in a mutually beneficial way to enhance both LLMs and KGs for bidirectional reasoning driven by both data and knowledge
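As a small, hedged illustration of the KG side of this pairing, the sketch below shows how a JSON-LD document like the ones ALICE generates could be loaded into an RDF graph using the rdflib Python library. This is my own illustrative example rather than ALICE's actual knowledge graph tooling, and the file name presentation.jsonld is hypothetical.

```python
from rdflib import Graph

# Hypothetical file: a JSON-LD object generated by the LLM and saved to disk.
with open("presentation.jsonld") as f:
    doc = f.read()

g = Graph()
g.parse(data=doc, format="json-ld")  # rdflib >= 6.0 bundles a JSON-LD parser

# The JSON-LD is now a set of RDF triples we can query or merge into a larger graph.
for subj, pred, obj in g:
    if str(pred).endswith("name"):
        print(obj)  # e.g. the presentation title, performers, and cited works
```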
Front-end user interface
Server environment
LLM integration
Breaking Down the Code
At the heart of our solution lies the ppt-scraper.py script, which leverages the python-pptx library to traverse the slides, extracting the text and any images and charts on each one. I'll walk through each step using a presentation I gave at AIAA SciTech 2021. This presentation was retrieved from the NASA Technical Reports Server (NTRS), which will also serve as the source for the dataset I will use to fine-tune my own LLM, a topic for my next blog post.
- Text Extraction - extract_text_from_pptx: This function delves into each slide and grabs text content. It outputs a list of strings, each prefixed with the slide number for easy referencing.
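As a rough sketch of what such a function might look like (assuming only the python-pptx Presentation API; the actual ppt-scraper.py implementation may differ):

```python
from pptx import Presentation

def extract_text_from_pptx(path):
    """Return a list of strings, one per text-bearing shape, prefixed with the slide number."""
    lines = []
    prs = Presentation(path)
    for slide_number, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            # Only shapes with a text frame carry text; skip empty frames.
            if shape.has_text_frame and shape.text_frame.text.strip():
                lines.append(f"Slide {slide_number}: {shape.text_frame.text}")
    return lines
```

Running this over the SciTech deck produces output like the listing below.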
Slide 1: James Ecker, Benjamin Kelley, Danette Allen
AIAA SciTech, January 2021
Slide 1: Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions
Slide 2: Computer Vision During In-Space Assembly
Slide 2: Difficulties
Illumination
Angle
Orientation
Movement
Constraints
Energy
Mass
Slide 2: 2
Slide 2: January 2021
Slide 2: SciTech 2021
Slide 3: Computer Vision During In-Space Assembly
Slide 3: Difficulties
Illumination
Angle
Orientation
Movement
Constraints
Energy
Mass
Slide 3: 3
Slide 3: January 2021
Slide 3: SciTech 2021
Slide 3: High Degree of Variation Requires More Information
Slide 4: Computer Vision During In-Space Assembly
Slide 4: Difficulties
Illumination
Angle
Orientation
Movement
Constraints
Energy
Mass
Slide 4: 4
Slide 4: January 2021
Slide 4: SciTech 2021
Slide 4: High Degree of Variation Requires More Information
Slide 4: More Sensors = More Information = More Energy Use & More Mass
Slide 5: Mitigating the Constraints
Slide 5: 5
Slide 5: January 2021
Slide 5: SciTech 2021
Slide 5: High Degree of Variation Requires More Information
Slide 5: More Sensors =
More Information =
More Energy Use & More Mass
Slide 6: Mitigating the Constraints
Slide 6: 6
Slide 6: January 2021
Slide 6: SciTech 2021
Slide 6: High Degree of Variation Requires More Information
Slide 6: More Sensors =
More Information =
More Energy Use & More Mass
Slide 6:
Slide 6: Maximize
Slide 7: Mitigating the Constraints
Slide 7: 7
Slide 7: January 2021
Slide 7: SciTech 2021
Slide 7: High Degree of Variation Requires More Information
Slide 7: More Sensors =
More Information =
More Energy Use & More Mass
Slide 7:
Slide 7:
Slide 7:
Slide 7: Maximize
Slide 7: Minimize
Slide 8: Mitigating the Constraints
Slide 8: 8
Slide 8: January 2021
Slide 8: SciTech 2021
Slide 8: High Degree of Variation Requires More Information
Slide 8: More Sensors =
More Information =
More Energy Use & More Mass
Slide 8:
Slide 8:
Slide 8:
Slide 8: Maximize
Slide 8: Minimize
Slide 8: Predict 3D Mesh from Single View
Slide 9: Mitigating the Constraints
Slide 9: 9
Slide 9: January 2021
Slide 9: SciTech 2021
Slide 9: High Degree of Variation Requires More Information
Slide 9: More Sensors =
More Information =
More Energy Use & More Mass
Slide 9:
Slide 9:
Slide 9:
Slide 9: Maximize
Slide 9: Minimize
Slide 9: Single Camera
Slide 9: Predict 3D Mesh from Single View
Slide 10: Related Work
Slide 10: [1] He et al – Mask R-CNN
[2] Gkioxari et al – Mesh R-CNN
[3] Sonawani et al - Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks
[4] Pal et al - 3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation
[5] Valsesia et al - Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks
[6] Ramasinghe et al - Spectral-GANS for High Resolution 3D Point Cloud Generation
Slide 10: 10
Slide 10: January 2021
Slide 10: SciTech 2021
Slide 11: Synthesizing Data
Slide 11: 3D model of objects projected over 3D background in Blender
Can be extended to full simulation environments
ROS, Gazebo, Mujoco, etc
Variations in observations
Orientation of camera and light source
Relative orientation between objects
Number of objects in scene
Background
Sim to Reality Problem
Domain Randomization
Slide 11: 11
Slide 11: January 2021
Slide 11: SciTech 2021
Slide 12: Metadata for Training Mask R-CNN
Slide 12: 12
Slide 12: January 2021
Slide 12: SciTech 2021
Slide 13: Metadata for Training Mesh R-CNN
Slide 13: 13
Slide 13: January 2021
Slide 13: SciTech 2021
Slide 14: Building the Dataset
Slide 14: Generate a parent pool of data
20,000 image/metadata pairs
For each sample generated
Extract/Calculate ground truth to build metadata
Configure metadata to conform to model
Sample training set from parent pool
Sample –n (default: 1500) instances from parent pool randomly
Split into training and validation sets
--training-split (default:0.75) / 1 - --training-split (default: 1 – 0.75 = 0.25)
Merge all training/validation set metadata into one JSON, respectively
Slide 14: January 2021
Slide 14: SciTech 2021
Slide 14: 14
Slide 15: Mask Prediction – Mask R-CNN
Slide 15: Backbone
Resnet-50-FPN
Region Proposal Network
Applies a sliding window over a convolutional feature map to generate proposed bounding boxes for likely objects
Proposed regions are aligned to the feature map and sent to fully connected layer to classify a bounding box (regressor) and the object itself (soft max)
Mask Prediction
Generate a binary mask for pixels in proposed region using the aligned features
Slide 15: January 2021
Slide 15: SciTech 2021
Slide 15: 15
Slide 15: The Mask R-CNN Architecture [1]
Slide 16: Mask Prediction – Mask R-CNN Advantages
Slide 16: Transfer Learning
Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)
Lowers training time and generalization error
Region of Interest Alignment
Each region of interest is fed into a fixed-size input fully connected (FC) layer
Need to account for all pixels in ROI while conforming to fixed input size of FC
Bilinear Interpolation instead of Quantization
Slide 16: January 2021
Slide 16: SciTech 2021
Slide 16: 16
Slide 16: The Mask R-CNN Architecture [1]
Slide 17: Mask Prediction – Mask R-CNN Advantages
Slide 17: Transfer Learning
Use a pretrained network (Resnet) to initialize weights instead of training from scratch (random initial weights)
Lowers training time and generalization error
Region of Interest Alignment
Each region of interest is fed into a fixed-size input fully connected (FC) layer
Need to account for all pixels in ROI while conforming to fixed input size of FC
Bilinear Interpolation instead of Quantization
Slide 17: January 2021
Slide 17: SciTech 2021
Slide 17: 17
Slide 17: Masks provide a measure of visual explainability
Slide 17: The Mask R-CNN Architecture [1]
Slide 18: Mask Prediction – Mask R-CNN Performance
Slide 18: 98% instance segmentation accuracy
object
bounding box
mask
Slide 18: January 2021
Slide 18: SciTech 2021
Slide 18: 18
Slide 19: Mesh Prediction
Slide 19: Extends Mask R-CNN
Mesh Predictor
Voxel Prediction
Mesh Refinement
Slide 19: January 2021
Slide 19: SciTech 2021
Slide 19: 19
Slide 19: The Mesh R-CNN Architecture [2]
Slide 20: Mesh Prediction
Slide 20: Extends Mask R-CNN
Mesh Predictor
Voxel Prediction
Predicts a voxel occupancy grid
Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel
Mesh Refinement
Slide 20: January 2021
Slide 20: SciTech 2021
Slide 20: 20
Slide 20:
Slide 20:
Slide 20:
Slide 20:
Slide 20:
Slide 20: The Mesh R-CNN Architecture Voxel Branch[2]
Slide 21: Mesh Prediction
Slide 21: Extends Mask R-CNN
Mesh Predictor
Voxel Prediction
Predicts a voxel occupancy grid
Cubify function binarizes voxel occupancy probabilities according to a threshold and generates a cuboid triangular mesh for each likely voxel
Mesh Refinement
2 passes
Vertex alignment
Graph convolution
Vertex refinement
Slide 21: January 2021
Slide 21: SciTech 2021
Slide 21: 21
Slide 21:
Slide 21:
Slide 21:
Slide 21:
Slide 21:
Slide 21: The Mesh R-CNN Architecture Mesh Refinement Branch[2]
Slide 22: Metrics
Slide 22: January 2021
Slide 22: SciTech 2021
Slide 22: 22
Slide 22:
Slide 22:
Slide 22:
Slide 23: Results
Slide 23:
Ours (Custom Synthetic Data)
Chamfer (lower is better)
0.621
F1 (higher is better)
47.51
Theirs (Mesh R-CNN trained on Pix3D)
Chamfer (lower is better)
0.306
F1 (higher is better)
74.84
Slide 23: January 2021
Slide 23: SciTech 2021
Slide 23: 23
Slide 23:
Slide 24: Results
Slide 24:
Ours (Custom Synthetic Data)
Chamfer (lower is better)
0.621
F1 (higher is better)
47.51
Theirs (Pix3D)
Chamfer (lower is better)
0.306
F1 (higher is better)
74.84
Slide 24: January 2021
Slide 24: SciTech 2021
Slide 24: 24
Slide 24:
Slide 24: 2 Quadro 6000 RTX GPUs
Slide 24: 8 Tesla V100 GPUS
Slide 24: Requires hyperparameter tuning specific to hardware configuration
Slide 25: Conclusion
Slide 25: Generated a synthetic dataset capable of training state of the art 2D mask and 3D mesh prediction models
Can train each model end-to-end from no data to trained model
Future work
Hyperparameter tuning
Further domain randomization
Randomize object’s rendered skin
Extending Mesh R-CNN to use a Generative Adversarial Network to generate point clouds instead of voxel model
Higher resolution 3D mesh prediction
Slide 25: January 2021
Slide 25: SciTech 2021
Slide 25: 25
Slide 25:
Slide 26: References
Slide 26: He, K., Gkioxari, G., Dollár, P., and Girshick, R., “Mask R-CNN,”2017 IEEE International Conference on Computer Vision(ICCV), 2017, pp. 2980–2988. https://doi.org/10.1109/ICCV.2017.322.
Gkioxari, G., Johnson, J., and Malik, J., “Mesh R-CNN,”2019 IEEE/CVF International Conference on Computer Vision(ICCV), 2019, pp. 9784–9794. https://doi.org/10.1109/ICCV.2019.00988.
Sonawani, S. D., Alimo, R., Detry, R., Jeong, D., Hess, A., and Amor, H. B., “Assistive Relative Pose Estimation for On- orbitAssembly using Convolutional Neural Networks,”ArXiv, Vol. abs/2001.10673, 2020.
Pal, B., Khaiyum, S., and Kumaraswamy, Y. S., “3D point cloud generation from 2D depth camera images using successivetriangulation,”2017 International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2017, pp. 129–133.https://doi.org/10.1109/ICIMIA.2017.7975586.
Valsesia, D., Fracastoro, G., and Magli, E., “Learning Localized Representations of Point Clouds with Graph- ConvolutionalGenerative Adversarial Networks,”IEEE Transactions on Multimedia, 2019.
Ramasinghe, S., Khan, S. H., Barnes, N., and Gould, S., “Spectral-GANs for High-Resolution 3D Point-cloud Generation,”CoRR, Vol. abs/1912.01800, 2019. URL http://arxiv.org/abs/1912.01800.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., and Abbeel, P., “Domain Randomization for Transferring Deep NeuralNetworks from Simulation to the Real World,”CoRR, Vol. abs/1703.06907, 2017. URL http://arxiv.org/abs/1703.06907.
He, K., Zhang, X., Ren, S., and Sun, J., “Deep Residual Learning for Image Recognition,”CoRR, Vol. abs/1512.03385, 2015.URL http://arxiv.org/abs/1512.03385.
Slide 26: January 2021
Slide 26: SciTech 2021
Slide 26: 26
Slide 26:
- Image and Chart Extraction - extract_images_and_charts_from_pptx: This function extracts images and charts from each slide, differentiating between visual elements such as pictures and charts. It also drills down into group shapes, ensuring that nested content is not missed (a sketch of this traversal appears after the image descriptions below). Here I have ChatGPT reasoning over output/slide_3/image_6.jpg:
[Image: output/slide_3/image_6.jpg (Slide 3)]

The image shows a graphical representation or simulation of a space environment where two robotic elements appear to be engaging in an in-space assembly task. There's a satellite or part of a spacecraft on the left side, which has an extended arm or appendage, possibly for manipulation or docking purposes. On the right, there's another component that seems to be floating freely or in the process of being captured or released. Below these elements, there is a view of Earth's clouds, suggesting that this scene is set in low Earth orbit. This scenario is likely a visualization or a simulated mock-up used for the planning and testing of robotic missions in space, particularly for the assembly of structures or the servicing of satellites. It may be related to computer vision systems designed to handle the challenges of space operations, such as those discussed in the presentation slides you provided earlier.
- And here it is doing the same thing over output/slide_18/image_30.jpg:
[Image: output/slide_18/image_30.jpg (Slide 18)]
The image seems to be a screenshot from a computer vision system or a simulation tool related to in-space robotic operations, similar in context to the content described in the presentation slides you shared earlier. It appears to show an annotated view of a robotic interaction with different parts during an assembly or maintenance task in space.
Two objects are highlighted with bounding boxes and labeled with confidence scores—“nut 0.999” and “pole 0.999”—indicating that the computer vision system has identified these components with a very high degree of confidence. The black sphere may be a part of the simulation for spatial reasoning or could represent an object to be manipulated.
The satellite or spacecraft part with solar panels is possibly a part of the simulated environment, providing context for the assembly task. The confidence scores suggest that machine learning models, such as those mentioned in the presentation (e.g., Mask R-CNN or Mesh R-CNN), are likely being used to identify and possibly predict the position of these components during the robotic mission. This type of technology is crucial for autonomous or semi-autonomous robotic systems in space, where precise identification and manipulation of various objects are necessary.
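For reference, here is a minimal sketch of how extract_images_and_charts_from_pptx might walk shapes, recurse into group shapes, and save pictures to per-slide folders. It assumes the python-pptx shape-type enum and omits chart handling for brevity; it is illustrative rather than the actual ALICE code, so names and paths may differ from the real script.

```python
import os
from pptx import Presentation
from pptx.enum.shapes import MSO_SHAPE_TYPE

def extract_images_and_charts_from_pptx(path, out_dir="output"):
    """Save every picture on every slide (including those nested in group shapes)."""
    prs = Presentation(path)
    counter = 0

    def save_pictures(shape, slide_dir):
        nonlocal counter
        if shape.shape_type == MSO_SHAPE_TYPE.GROUP:
            for member in shape.shapes:      # recurse so nested content is not missed
                save_pictures(member, slide_dir)
        elif shape.shape_type == MSO_SHAPE_TYPE.PICTURE:
            os.makedirs(slide_dir, exist_ok=True)
            image = shape.image              # python-pptx Image part
            out_path = os.path.join(slide_dir, f"image_{counter}.{image.ext}")
            with open(out_path, "wb") as f:
                f.write(image.blob)          # raw image bytes
            counter += 1

    for slide_number, slide in enumerate(prs.slides, start=1):
        for shape in slide.shapes:
            save_pictures(shape, os.path.join(out_dir, f"slide_{slide_number}"))
```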
- The next version of ppt-scraper.py will have the option to inject the LLM's generated description of each image into the extracted slide text to enrich our text prompts; an example is shown after the directory listing below.
- After these scripts are run, the output directory will have a structure similar to this:
```
.
├── images
│   ├── slide_1
│   │   └── image_0.png
│   ├── slide_2
│   │   ├── image_1.png
│   │   ├── image_2.png
│   │   ├── image_3.png
│   │   └── image_4.png
│   ├── slide_3
│   │   ├── image_5.png
│   │   ├── image_6.png
│   │   ├── image_7.png
│   │   └── image_8.png
│   ├── slide_4
│   │   ├── image_9.png
│   │   ├── image_10.png
│   │   ├── image_11.png
│   │   └── image_12.png
│   ├── slide_5
│   │   └── image_13.png
│   ├── slide_6
│   │   └── image_14.png
│   ├── slide_7
│   │   └── image_15.png
│   ├── slide_8
│   │   └── image_16.png
│   ├── slide_9
│   │   └── image_17.png
│   ├── slide_10
│   │   └── image_18.png
│   ├── slide_11
│   │   ├── image_19.png
│   │   └── image_20.png
│   ├── slide_12
│   │   ├── image_21.png
│   │   └── image_22.png
│   ├── slide_13
│   │   ├── image_23.png
│   │   └── image_24.png
│   ├── slide_15
│   │   └── image_25.png
│   ├── slide_16
│   │   └── image_26.png
│   ├── slide_17
│   │   └── image_27.png
│   ├── slide_18
│   │   ├── image_28.png
│   │   ├── image_29.png
│   │   ├── image_30.png
│   │   └── image_31.png
│   ├── slide_19
│   │   └── image_32.png
│   ├── slide_20
│   │   └── image_33.png
│   ├── slide_21
│   │   ├── image_34.png
│   │   └── image_35.png
│   ├── slide_22
│   │   ├── image_36.png
│   │   ├── image_37.png
│   │   └── image_38.png
│   └── slide_23
│       ├── image_39.png
│       └── image_40.png
└── text_output.txt
```

For example, here is the enriched text for Slide 3 with the generated image description injected:
Slide 3: Computer Vision During In-Space Assembly
Slide 3: Difficulties
Slide 3: 3
Slide 3: January 2021
Slide 3: SciTech 2021
Slide 3: High Degree of Variation Requires More Information
Slide 3: Image_8: The image shows a graphical representation or simulation of a space environment where two robotic elements appear to be engaging in an in-space assembly task. There's a satellite or part of a spacecraft on the left side, which has an extended arm or appendage, possibly for manipulation or docking purposes. On the right, there's another component that seems to be floating freely or in the process of being captured or released. Below these elements, there is a view of Earth's clouds, suggesting that this scene is set in low Earth orbit. This scenario is likely a visualization or a simulated mock-up used for the planning and testing of robotic missions in space, particularly for the assembly of structures or the servicing of satellites. It may be related to computer vision systems designed to handle the challenges of space operations, such as those discussed in the presentation slides you provided earlier.
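One way that injection might be wired up is sketched below. It is a hedged illustration under my own assumptions: the function name and the mapping of slide numbers to (image name, description) pairs are hypothetical, not the actual next-version API.

```python
def enrich_text_with_descriptions(text_lines, image_descriptions):
    """Insert LLM-generated image descriptions after the last text line of their slide.

    text_lines: output of extract_text_from_pptx, e.g. "Slide 3: Difficulties".
    image_descriptions: dict mapping slide number -> list of (image_name, description).
    """
    enriched = []
    current = None
    for line in text_lines:
        # Continuation lines (no "Slide N:" prefix) stay with the current slide.
        number = int(line.split(":", 1)[0].split()[1]) if line.startswith("Slide ") else current
        if current is not None and number != current:
            # We just left a slide: append its image descriptions before moving on.
            for name, desc in image_descriptions.get(current, []):
                enriched.append(f"Slide {current}: {name}: {desc}")
        enriched.append(line)
        current = number
    if current is not None:
        for name, desc in image_descriptions.get(current, []):
            enriched.append(f"Slide {current}: {name}: {desc}")
    return enriched
```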
From Raw Data to Knowledge
Once the data extraction is complete, the real magic begins. Leveraging a powerful AI language model, like OpenAI's GPT models, we can generate the following (a minimal sketch of how these calls might be scripted appears after the examples in this list):
- Abstracts: Using the extracted text, the model can provide a concise summary capturing the essence of the presentation.
Please generate an abstract for the following text which was scraped from a presentation slide deck:
[The full scraped slide text shown above was included in the prompt here. The model returned the following abstract:]
This presentation, given by James Ecker, Benjamin Kelley, and Danette Allen at the AIAA SciTech in January 2021, explores the challenges and advancements in computer vision for in-space assembly through multi-agent robotic missions. It specifically delves into the generation of synthetic data for 3D mesh prediction and spatial reasoning, highlighting the difficulties encountered due to factors like illumination, angle, orientation, and movement, and the constraints of energy and mass.
The presentation outlines how computer vision faces a high degree of variation during in-space assembly, necessitating more sensors and information, which in turn increases energy use and mass. It proposes strategies for mitigating these constraints, such as maximizing efficiency while minimizing energy use and mass, and emphasizes the importance of predicting 3D mesh from a single view.
Significant attention is given to the development of synthetic datasets capable of training state-of-the-art 2D mask and 3D mesh prediction models. This includes using Blender for 3D model projection, domain randomization, and training models like Mask R-CNN and Mesh R-CNN. The presentation also reviews related work in the field and compares the performance of their methods with existing approaches.
In conclusion, the presentation underscores the importance of hyperparameter tuning, further domain randomization, and the potential of extending Mesh R-CNN with Generative Adversarial Networks for enhanced 3D mesh prediction. This research contributes significantly to the field of computer vision in space robotics, offering insights into tackling the unique challenges posed by the space environment.
- Extended Abstracts: Need a more detailed summary? No problem! The model can be instructed to generate a longer, more detailed abstract, providing deeper insights into the presentation.
Please generate an extended abstract for the following text which was scraped from a presentation slide deck by expanding upon the topics presented based on your knowledge of them
[The full scraped slide text shown above was included in the prompt here. The model returned the following extended abstract:]
In the realm of autonomous space operations, the work presented by James Ecker, Benjamin Kelley, and Danette Allen at AIAA SciTech 2021 addresses a critical component: the creation and utilization of synthetic data for 3D mesh prediction and spatial reasoning in multi-agent robotic missions. The context for this research is the challenging environment of in-space assembly where conventional computer vision systems encounter unique difficulties such as variable illumination, complex angles, orientation, and movement, alongside the constraints of energy usage and mass.
The presentation delves into the intricate balance required in space robotics: the need for extensive sensor data to manage the high variability in the operational environment, which paradoxically increases the energy and mass burden of the spacecraft. To mitigate these challenges, the team emphasizes the need to maximize information extraction while minimizing the additional resource expenditure. A notable innovation presented is the prediction of 3D meshes from a single camera view, reducing the sensor load.
The research builds upon foundational work in computer vision, notably Mask R-CNN and Mesh R-CNN, which provide frameworks for object detection and 3D reconstruction. These models benefit from transfer learning, which allows for the initialization of network weights from pre-trained models, significantly reducing both training time and generalization error. The presentation also discusses the performance of Mask R-CNN, citing impressive instance segmentation accuracy rates.
Advancing further, the team outlines their process for synthesizing data, utilizing tools like Blender and simulation environments such as ROS, Gazebo, and Mujoco. This process accounts for varying observational conditions, such as camera and light source orientation, and the complexity of the scene. The technique of domain randomization is highlighted as a means to bridge the sim-to-reality gap, ensuring that the synthetic data covers a broad spectrum of real-world scenarios.
The dataset creation is a rigorous process, generating a large pool of image and metadata pairs, from which training and validation sets are sampled. This metadata is crucial for training the Mask R-CNN and Mesh R-CNN models, enabling them to understand and predict the complex spatial arrangements found in space operations.
In terms of model architecture, the presentation notes the use of Resnet-50-FPN as the backbone for the Mask R-CNN, which uses a region proposal network and mask prediction to effectively identify and delineate objects within images.
Looking at the results, the team evaluates their models using metrics such as the Chamfer distance and the F1 score, comparing their custom synthetic data approach to existing methods. They acknowledge the necessity for hyperparameter tuning specific to hardware configurations, which included powerful GPUs like the Quadro 6000 RTX and Tesla V100.
The conclusion of the presentation points to the success in generating a synthetic dataset capable of training advanced 2D mask and 3D mesh prediction models. Future work is set to focus on further improving these models through hyperparameter tuning, enhanced domain randomization, and the exploration of Generative Adversarial Networks for point cloud generation, aiming for higher resolution 3D mesh prediction.
This extended abstract encapsulates the efforts of the team in pushing the boundaries of synthetic data generation for the advancement of autonomous robotic missions in space, highlighting both the achievements and the roadmap for future developments in this cutting-edge field.
- Hashtags: Want to index or socially share your content? The model can generate relevant hashtags based on slide content, aiding in searchability and social media visibility.
- JSON-LD: JSON-LD is a lightweight Linked Data format that is easy for humans to read and write. It builds on the already successful JSON format and provides a way to help JSON data interoperate at Web scale, making it an ideal data format for programming environments, REST Web services, and unstructured databases such as Apache CouchDB and MongoDB. It is also well suited for feeding into knowledge graphs and can be used in semantic web applications, enhancing the interoperability and understandability of your presentation data in software ecosystems. Here is the JSON-LD object generated for this presentation:
{ "@context": "http://schema.org", "@type": "Event", "name": "Synthetic Data Generation for 3D Mesh Prediction and Spatial Reasoning During Multi-Agent Robotic Missions", "startDate": "2021-01", "location": { "@type": "Place", "name": "AIAA SciTech Forum", "address": { "@type": "PostalAddress", "addressLocality": "Location of conference (if known)", "addressRegion": "Region of conference (if known)", "addressCountry": "Country of conference (if known)" } }, "sponsor": { "@type": "Organization", "name": "American Institute of Aeronautics and Astronautics (AIAA)" }, "performer": [ { "@type": "Person", "name": "James Ecker" }, { "@type": "Person", "name": "Benjamin Kelley" }, { "@type": "Person", "name": "Danette Allen" } ], "workFeatured": [ { "@type": "CreativeWork", "name": "Computer Vision During In-Space Assembly", "about": "Difficulties in space operations such as illumination, angle, orientation, movement, energy, and mass constraints." }, { "@type": "CreativeWork", "name": "Mask R-CNN and Mesh R-CNN in Space Technology", "about": "Techniques for instance segmentation and 3D mesh prediction in space robotics." }, { "@type": "CreativeWork", "name": "Synthesizing Data for Space Robotics", "about": "Using tools like Blender, ROS, Gazebo, and Mujoco for creating synthetic datasets for computer vision systems." } ], "about": { "@type": "Thing", "name": "Computer Vision and Robotics in Space", "description": "The presentation covers topics such as synthetic data generation, computer vision challenges in space, mitigation of resource constraints, and advancements in 3D mesh prediction models." }, "keywords": [ "Synthetic Data", "3D Mesh Prediction", "Spatial Reasoning", "Robotics", "Computer Vision", "In-Space Assembly", "AI", "Machine Learning", "Deep Learning", "GANs", "PointCloud Generation" ], "citation": [ { "@type": "CreativeWork", "name": "Mask R-CNN", "author": [ { "@type": "Person", "name": "Kaiming He" }, { "@type": "Person", "name": "Georgia Gkioxari" }, { "@type": "Person", "name": "Piotr Dollár" }, { "@type": "Person", "name": "Ross Girshick" } ], "datePublished": "2017", "url": "https://doi.org/10.1109/ICCV.2017.322" }, { "@type": "CreativeWork", "name": "Mesh R-CNN", "author": [ { "@type": "Person", "name": "Georgia Gkioxari" }, { "@type": "Person", "name": "Jitendra Malik" } ], "datePublished": "2019", "url": "https://doi.org/10.1109/ICCV.2019.00988" }, { "@type": "CreativeWork", "name": "Assistive Relative Pose Estimation for On-orbit Assembly using Convolutional Neural Networks", "author": { "@type": "Person", "name": "S. D. Sonawani et al." }, "datePublished": "2020", "url": "http://arxiv.org/abs/2001.10673" }, { "@type": "CreativeWork", "name": "3D Point Cloud Generation from 2D Depth Camera Images using Successive Triangulation", "author": { "@type": "Person", "name": "B. Pal et al." }, "datePublished": "2017", "url": "https://doi.org/10.1109/ICIMIA.2017.7975586" }, { "@type": "CreativeWork", "name": "Learning Localized Representations of Point Clouds with Graph-Convolutional Generative Adversarial Networks", "author": { "@type": "Person", "name": "D. Valsesia et al." }, "datePublished": "2019", "url": "https://ieeexplore.ieee.org/document/8642330" }, { "@type": "CreativeWork", "name": "Spectral-GANs for High Resolution 3D Point Cloud Generation", "author": { "@type": "Person", "name": "S. Ramasinghe et al." 
}, "datePublished": "2019", "url": "http://arxiv.org/abs/1912.01800" } ] } - Of course, we can train our own large language model bootstrapped from an open-source model such as Llama and Llama 2. We want the model to be multi-modal, able to have not just text as input but also images so that we can extract as much information as possible from each slide. This is especially useful in the case that we can train a model that can accurately interpret charts and graphs with respect to its local context.
Future Work
[Image: Slide 3]
The image appears to be a still image from a sci-fi film during moment of space exploration or a scene involving a futuristic spacecraft approaching or orbiting a mysterious celestial body, like a planet or an artificial structure. The spherical object might be a destination, such as a new planet being explored or an alien artifact being investigated.
The detailed structure behind the sphere could be part of a larger spaceship, a space station, or a piece of technology relevant to the plot of the film. It might be engaging in activities such as scanning the sphere, preparing to dock with it, or observing it from a safe distance.
Given the genre, this could be a pivotal scene where the characters are about to uncover something significant, whether it's a technological breakthrough, first contact with alien life, or the discovery of a new world or dimension. The darkness of space and the isolated positioning of the structures suggest a sense of the unknown and the vastness of the universe, common themes in sci-fi narratives.
Conclusion
In conclusion, this blog post has provided a comprehensive overview of the ALICE system. We explored the challenges of preserving information in presentation slide decks and the solution offered by ALICE. The system efficiently extracts text and visual elements from slides and employs a large language model, such as OpenAI's GPT-4, to transform this data into structured, machine-readable formats and to infer the expanded semantic context of the presentation. We delved into the integration of LLMs with Knowledge Graphs, highlighting the potential synergy between these technologies in enhancing data interpretation and retrieval.
The post also outlined the technical specifics of ALICE, including its front-end interface for uploading slide decks, server environment for data processing, and the crucial role of LLMs in generating abstracts, search terms, and JSON-LD for Knowledge Graph integration. We discussed the key functionalities of our 'ppt-scraper.py' script in extracting diverse data types from presentations and how this technology can be further evolved.
In summary, ALICE represents a significant leap forward in managing and utilizing the wealth of data hidden in presentation slide decks, promising to enhance knowledge preservation and accessibility at NASA and potentially beyond.
You can follow the development of this project at my GitHub repo.
- Jim Ecker
Sources
- Min-Yen Kan. 2007. SlideSeer: a digital library of aligned document and presentation pairs. In Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (JCDL '07). Association for Computing Machinery, New York, NY, USA, 81–90. https://doi.org/10.1145/1255175.1255192
- Chaudhri, V. K., Baru, C., Chittar, N., Dong, X. L., Genesereth, M., Hendler, J., Kalyanpur, A., Lenat, D., Sequeda, J., Vrandečić, D., and Wang, K. 2022. “Knowledge graphs: Introduction, history, and perspectives.” AI Magazine 43: 17–29. https://doi.org/10.1002/aaai.12033
- Pan, Shirui, et al. "Unifying Large Language Models and Knowledge Graphs: A Roadmap." arXiv preprint arXiv:2306.08302 (2023).
- Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." arXiv preprint arXiv:2302.13971 (2023).
- Touvron, Hugo, et al. "Llama 2: Open foundation and fine-tuned chat models." arXiv preprint arXiv:2307.09288 (2023).