2023-08-11: Paper Summary: "Mastering Diverse Domains through World Models"


Figure 2 Hafner et al.: The authors consider four visual domains: robot locomotion and manipulation tasks, Atari games with 2D graphics, DMLab, and Minecraft. DreamerV3 succeeds in all of these diverse domains, demonstrating its ability to handle both spatial and temporal reasoning challenges. 

In my last post, Paper Summary: "Beyond Classifiers: Remote Sensing Change Detection with Metric Learning" by Zhang et al., I reviewed methods to detect discrete changes in temporal visual data. But what if we're concerned with the fidelity of simulated or generative data versus the real world? In my work at NASA, I study machine learning methods for training autonomous systems in simulation. One of the biggest problems with this research direction is the simulation-to-reality gap: training in simulation can result in relatively high uncertainty because the simulated representation of the environment differs from the real-world environment the autonomous system will operate in. Think of it as the difference between a video game set in a real-world location and actually being there in real life. We can think about this uncertainty as a type of change in the data. Quantifying this uncertainty, and understanding its origin, is a key hurdle to overcome in the pursuit of training autonomous systems without human-controlled data. One research area aimed at these problems is World Models. In this post, I review the recently released state-of-the-art method for learning world models from researchers at DeepMind and the University of Toronto: Mastering Diverse Domains through World Models by Hafner et al. (arXiv preprint, Jan 10, 2023). The paper came to my attention because it presents the first algorithm to collect a diamond in Minecraft without any human data.

Minecraft is a sandbox video game developed and published by Mojang, which was later acquired by Microsoft. The game was officially released in 2011 and has become one of the most popular video games in history, notable for its distinctive pixelated, block-style graphics. Minecraft lets players explore a procedurally generated 3D world made of different types of blocks, each representing a material such as dirt, stone, ore, tree trunks, water, or lava. Players can mine these blocks and place them elsewhere, enabling them to build structures and craft tools. The game also includes various creatures, known as mobs, including villagers, animals, and monsters. Minecraft offers different game modes, such as survival mode, where players must gather resources, build, and maintain their health, and creative mode, where players have unlimited resources and the ability to fly. Minecraft's emphasis on creativity, exploration, and survival has made it a beloved game among a diverse audience worldwide.

Training an autonomous model to play Minecraft can pose several challenges due to the complexity and open-ended nature of the game. Here are a few reasons:
  • Infinite Possibilities: Minecraft is a sandbox game that presents an almost infinite number of possible states and actions. This immense state-action space can be challenging for a model to learn effectively and efficiently.
  • Long-term Planning: Effective play often requires long-term planning and strategic decision-making. For instance, to build a complex structure, a player needs to gather resources, create tools, and possibly manage survival aspects, all while dealing with potential threats. These activities are interconnected and require foresight that may be challenging for an AI model to grasp.
  • Sparse and Delayed Rewards: Many tasks in Minecraft don't provide immediate feedback or rewards, which can make it difficult for an AI model to learn which actions are beneficial. For example, the benefits of mining certain resources or building a structure may not become apparent until much later in the game.
  • Procedurally Generated Environments: Each Minecraft world is procedurally generated and unique, which means that a model trained on one world might not perform well in another. This dynamic environment requires a model to adapt and generalize well, adding another layer of complexity to the task.
  • Multi-modal Input: The game involves various forms of input, including visual input (the 3D world), textual input (the in-game chat), and auditory input (sounds from the environment or mobs). Processing these inputs and combining them cohesively to make decisions can be a challenging task for an AI model.
  • Creative Aspects: A significant part of Minecraft's appeal lies in its creative aspects, like building structures and creating art. Creativity is a complex cognitive function that's difficult for machines to replicate, which can limit an AI's ability to fully engage with the game's possibilities.
These factors contribute to the complexity of training an AI to play Minecraft and highlight the fact that mastering such a game involves more than just learning to optimize a score or follow a set of rules.

In recent years, the field of reinforcement learning has made significant progress in enabling computers to solve individual tasks through interaction, surpassing human capabilities in certain games like Go and Dota 2. However, applying these algorithms to new domains, such as video games or robotics tasks, often requires expert knowledge and computational resources for algorithm tuning, which can hinder scaling to large models that are expensive to train. Additionally, different domains present unique learning challenges that require specialized algorithms.

To address these issues, Hafner et al. have developed DreamerV3, a general and scalable algorithm that outperforms specialized algorithms by mastering a wide range of domains with fixed hyperparameters. DreamerV3 learns a world model from experience, which allows for rich perception and imagination training. This algorithm consists of three neural networks: the world model, which predicts future outcomes of potential actions; the critic, which judges the value of each situation; and the actor, which learns to reach valuable situations. To facilitate learning across domains with fixed hyperparameters, the researchers employed signal magnitude transformations and robust normalization techniques.

The research team also investigated the scaling behavior of DreamerV3 to provide practical guidelines for solving new challenges. They found that increasing the model size of DreamerV3 monotonically improved both its final performance and data efficiency. The team demonstrated the effectiveness of DreamerV3 by applying it to the popular video game Minecraft, which has become a focal point of reinforcement learning research. They showed that DreamerV3 was the first algorithm to collect diamonds in Minecraft from scratch, without resorting to human expert data or manually-crafted curricula. This achievement is particularly noteworthy because of the sparse rewards, exploration difficulty, and long time horizons in the procedurally generated open-world environment of Minecraft. 

The paper outlines four essential contributions:
  • DreamerV3 is a versatile algorithm that can proficiently learn a diverse set of domains with fixed hyperparameters, thereby making reinforcement learning more accessible. 
  • The study showcases DreamerV3's scalability, demonstrating how larger model sizes lead to consistent enhancements in final performance and data efficiency.
  • The study carries out a thorough assessment of DreamerV3, demonstrating its superiority over specialized algorithms across various domains. The researchers also provide the training curves of all methods for comparison purposes.
  • The researchers discover that DreamerV3 achieves a significant milestone in artificial intelligence by collecting diamonds in Minecraft without the need for human data or curricula. This accomplishment highlights the algorithm's ability to tackle long-standing challenges in the field. 

DreamerV3

The DreamerV3 algorithm includes three neural networks: the world model, the critic, and the actor. These networks are trained together without sharing gradients, using past experiences. To perform well in different domains, the networks need to be able to handle varying signal strengths and balance their objectives effectively. This is a challenging task because the researchers aim to learn across different domains with fixed hyperparameters, instead of focusing on similar tasks within the same domain. In this section, a simple method for predicting quantities of unknown orders of magnitude is explained. Then, the world model, critic, and actor are introduced, along with their learning objectives. The researchers found that combining KL balancing and free bits enabled the world model to learn without tuning. Additionally, scaling down large returns without increasing small returns allowed for a fixed policy entropy regularizer.
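The "simple method" here is the symlog transformation used throughout the paper: targets are squashed before prediction and un-squashed afterwards, so the networks never see values with wildly different magnitudes. A minimal sketch (my own numpy illustration, not the authors' code):

```python
import numpy as np

def symlog(x):
    # Squash values symmetrically: roughly linear near zero, logarithmic for
    # large magnitudes, so targets of unknown scale stay in a compact range.
    return np.sign(x) * np.log1p(np.abs(x))

def symexp(x):
    # Inverse of symlog; maps network predictions back to the original scale.
    return np.sign(x) * np.expm1(np.abs(x))

# Targets spanning several orders of magnitude map to a small, trainable range.
targets = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])
print(symlog(targets))           # ~[-6.91, -0.69, 0.0, 0.69, 6.91]
print(symexp(symlog(targets)))   # recovers the original targets
```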

Figure 3 Hafner et al.: During training, DreamerV3's world model encodes sensory inputs into discrete representations z_t, which are predicted by a sequence model with recurrent state h_t given the actions a_t. Reconstructing the original sensory inputs from z_t provides the learning signal that shapes the representations. The actor and critic then learn from trajectories of these abstract representations predicted by the world model. 


World Model Learning

The world model in the DreamerV3 algorithm is designed to learn efficient representations of sensory inputs through a process called autoencoding. This allows the model to predict future representations and rewards based on potential actions, which enables planning. The world model is implemented as a Recurrent State-Space Model (RSSM), as illustrated in Figure 3. The encoder first maps sensory inputs x_t to stochastic representations z_t, which are then predicted by a sequence model with a recurrent state h_t given past actions a_t−1. The model state is formed by concatenating h_t and z_t, and from this state, the model predicts rewards r_t and episode continuation flags c_t ∈ {0, 1}. Additionally, the inputs are reconstructed to ensure informative representations:
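(The equations below are my transcription of the RSSM components following the paper's notation; hats mark predicted quantities.)

```latex
\begin{aligned}
\text{Sequence model:}     &\quad h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1}) \\
\text{Encoder:}            &\quad z_t \sim q_\phi(z_t \mid h_t, x_t) \\
\text{Dynamics predictor:} &\quad \hat{z}_t \sim p_\phi(\hat{z}_t \mid h_t) \\
\text{Reward predictor:}   &\quad \hat{r}_t \sim p_\phi(\hat{r}_t \mid h_t, z_t) \\
\text{Continue predictor:} &\quad \hat{c}_t \sim p_\phi(\hat{c}_t \mid h_t, z_t) \\
\text{Decoder:}            &\quad \hat{x}_t \sim p_\phi(\hat{x}_t \mid h_t, z_t)
\end{aligned}
```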


Figure 5 Hafner et al.: Multi-step video predictions are generated in both DMLab and Control Suite using a model that can predict 45 steps into the future without access to intermediate images, given 5 frames of context input and the action sequence. The model is able to learn an understanding of the 3D structure of the two environments.

The top image shows the results for DMLab and the bottom image shows the results for Control Suite. 


Figure 5 shows a visualization of long-term video predictions generated by the world model. To process visual inputs, convolutional neural networks (CNNs) are used in the encoder and decoder, while low-dimensional inputs are processed using multi-layer perceptrons (MLPs). The dynamics, reward, and continue predictors are also MLPs. To obtain representations, a vector of softmax distributions is sampled, and straight-through gradients are taken through the sampling step. The world model parameters φ are optimized end-to-end to minimize the prediction loss L_pred, the dynamics loss L_dyn, and the representation loss L_rep, with corresponding loss weights β_pred = 1, β_dyn = 0.5, and β_rep = 0.1, given a batch of input sequences x_1:T, actions a_1:T, rewards r_1:T, and continuation flags c_1:T. 
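As a rough sketch of how these three terms combine (the per-term losses are passed in as plain numbers here; the stop-gradient KL balancing and the 1-nat free-bits clipping follow my reading of the paper):

```python
# Fixed loss weights reported in the paper.
BETA_PRED, BETA_DYN, BETA_REP = 1.0, 0.5, 0.1
FREE_BITS = 1.0  # KL terms are clipped from below (free bits) so they cannot dominate

def world_model_loss(pred_loss, dyn_kl, rep_kl):
    """Combine the prediction, dynamics, and representation losses.

    pred_loss: reconstruction + reward + continue prediction losses
    dyn_kl:    KL(stop_grad(posterior) || prior)  -- trains the sequence model
    rep_kl:    KL(posterior || stop_grad(prior))  -- trains the encoder
    """
    dyn_loss = max(FREE_BITS, dyn_kl)
    rep_loss = max(FREE_BITS, rep_kl)
    return BETA_PRED * pred_loss + BETA_DYN * dyn_loss + BETA_REP * rep_loss

# Example: 2.3 + 0.5 * max(1.0, 0.4) + 0.1 * max(1.0, 0.4) = 2.9
print(world_model_loss(pred_loss=2.3, dyn_kl=0.4, rep_kl=0.4))
```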



Actor-Critic Learning

DreamerV3's agent behavior is learned from abstract trajectories predicted by the world model, using the actor and critic neural networks. The actor network samples actions during interaction without planning ahead. Both networks operate on model states consisting of the recurrent state and the discrete representation learned by the world model. The actor network maximizes the expected return, while the critic network predicts the return of each state; to account for rewards beyond the prediction horizon T = 16, bootstrapped λ-returns are computed.
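A minimal sketch of how those bootstrapped λ-returns might be computed along an imagined trajectory (variable names are illustrative; the discount and λ values are my reading of the paper's hyperparameters):

```python
def lambda_returns(rewards, continues, values, gamma=0.997, lam=0.95):
    # rewards, continues, values: per-step predictions from the world model and
    # critic along an imagined trajectory of length T (here T = 16).
    T = len(rewards)
    returns = [0.0] * T
    returns[-1] = values[-1]  # bootstrap the final step from the critic
    # Work backwards: R_t = r_t + gamma * c_t * ((1 - lam) * v_{t+1} + lam * R_{t+1})
    for t in reversed(range(T - 1)):
        returns[t] = rewards[t] + gamma * continues[t] * (
            (1 - lam) * values[t + 1] + lam * returns[t + 1]
        )
    return returns

# Toy example: a single sparse reward near the end of a 16-step imagined rollout.
rs = [0.0] * 14 + [1.0, 0.0]
cs = [1.0] * 16
vs = [0.5] * 16
print(lambda_returns(rs, cs, vs)[:3])
```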

Benchmarks

  • Proprio Control Suite comprises 18 continuous control tasks with low-dimensional inputs and a budget of 500K environment steps. The tasks include robot locomotion and manipulation.
  • Visual Control Suite is similar to the Proprio Control Suite, except that it consists of 20 continuous control tasks with only high-dimensional images as inputs and increases the environment step budget to 1 million steps.
  • Atari 100k comprises 26 Atari games with a budget of 400K environment steps, equivalent to 100K steps after action repeat, or 2 hours of real-time gameplay.
  • Atari 200M is similar to Atari 100k, except that it includes 55 Atari games and increases the environment step budget to 200 million steps.
  • BSuite consists of 23 environments, each with multiple configurations, totaling 468 configurations. The benchmark is designed to evaluate the agent's performance in various aspects, including credit assignment, robustness to reward scale and stochasticity, memory, generalization, and exploration.
  • Crafter is a procedurally generated survival game with top-down graphics and discrete actions. Its purpose is to test various agent abilities, such as wide and deep exploration, long-term reasoning and credit assignment, and generalization.
  • DMLab contains 3D environments that require an agent to learn spatial and temporal reasoning to perform well. 
  • Minecraft is a widely popular 3D crafting and survival game. Acquiring diamonds in Minecraft has been a difficult problem for AI: each episode takes place in a new procedurally generated 3D world, and the player must reach 12 milestones with sparse rewards by gathering resources and crafting tools. Because a stochastic policy that repeatedly samples different actions can fail to break blocks and thereby stall progress, the authors follow prior work and increase the block-breaking speed. 

Results

On a range of benchmarks, DreamerV3 shows significant improvements over previous state-of-the-art algorithms. In the open-world game Minecraft, collecting diamonds under sparse rewards has been a long-standing problem in artificial intelligence. Applied to this task with its default hyperparameters, DreamerV3 is the first algorithm to collect diamonds in Minecraft without the use of human data. Across other benchmarks, such as the Proprio Control Suite, Visual Control Suite, Atari 100K, BSuite, Crafter, and DMLab, DreamerV3 also outperforms previous state-of-the-art algorithms. Notably, on DMLab, DreamerV3 matches and even exceeds the performance of the scalable IMPALA agent in only 50 million environment steps, a data-efficiency gain of over 13,000%. Overall, DreamerV3 proves to be a highly effective and efficient algorithm for a broad range of tasks.

Figure 1 Hafner et al.: DreamerV3, which uses identical hyperparameters across different domains, surpasses specialized model-free and model-based algorithms in a variety of benchmarks and data-efficiency regimes. Additionally, DreamerV3 is the first algorithm that can learn to obtain diamonds in Minecraft, a popular video game, without any human data or domain-specific heuristics - a long-standing challenge in the field of artificial intelligence. 

DreamerV3 achieves this generality by addressing varying signal magnitudes and instabilities in all of its components. It shows state-of-the-art performance across seven benchmarks, including continuous control from states and images, BSuite, and Crafter. It also succeeds in 3D environments that require spatial and temporal reasoning, outperforming IMPALA on DMLab tasks using 130 times fewer interactions, and it is the first algorithm to collect diamonds in Minecraft end-to-end from sparse rewards. However, the paper also acknowledges the limitations of DreamerV3, such as its inability to collect diamonds in every episode and the need for future work on larger models trained on multiple tasks across overlapping domains.

References

Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. arXiv E-Prints, arXiv:2301.04104. doi:10.48550/arXiv.2301.04104

Beattie, C., Leibo, J. Z., Teplyashin, D., Ward, T., Wainwright, M., Küttler, H., Lefrancq, A., Green, S., Valdes, V., Sadik, A., Schrittwieser, J., Anderson, K., York, S., Cant, M., Cain, A., Bolton, A., Gaffney, S., King, H., Hassabis, D., Legg, S., & Petersen, S. (2016). DeepMind Lab. arXiv E-Prints, arXiv:1612.03801. doi:10.48550/arXiv.1612.03801

Hafner, D. (2021). Benchmarking the Spectrum of Agent Capabilities. arXiv E-Prints, arXiv:2109.06780. doi:10.48550/arXiv.2109.06780

Osband, I., Doron, Y., Hessel, M., Aslanides, J., Sezener, E., Saraiva, A., McKinney, K., Lattimore, T., Szepesvari, C., Singh, S., Van Roy, B., Sutton, R., Silver, D., & Van Hasselt, H. (2019). Behaviour Suite for Reinforcement Learning. arXiv E-Prints, arXiv:1908.03568. doi:10.48550/arXiv.1908.03568

Kaiser, L., Babaeizadeh, M., Milos, P., Osinski, B., Campbell, R. H., Czechowski, K., Erhan, D., Finn, C., Kozakowski, P., Levine, S., Mohiuddin, A., Sepassi, R., Tucker, G., & Michalewski, H. (2019). Model-Based Reinforcement Learning for Atari. arXiv E-Prints, arXiv:1903.00374. doi:10.48550/arXiv.1903.00374

Tassa, Y., Doron, Y., Muldal, A., Erez, T., Li, Y., de Las Casas, D., Budden, D., Abdolmaleki, A., Merel, J., Lefrancq, A., Lillicrap, T., & Riedmiller, M. (2018). DeepMind Control Suite. arXiv E-Prints, arXiv:1801.00690. doi:10.48550/arXiv.1801.00690

- Jim Ecker 
