Training AI to Play Pokemon with Reinforcement Learning

Peter Whidden
8 Oct 2023 · 33:52

TLDR: This video explores the journey of an AI trained with reinforcement learning to play Pokémon Red. Starting from random button presses, the AI learns through trial and error, eventually catching and evolving Pokémon, and defeating gym leaders. It even exploits the game's RNG and develops strategies like a human player. The video discusses the AI's successes and failures, drawing parallels to human psychology, and provides technical insights into the training process and how one can run the program.

Takeaways

  • 🤖 The AI starts with no knowledge and learns by playing Pokémon Red, using reinforcement learning to optimize its actions.
  • 🎮 It explores the game world by pressing random buttons, eventually learning to catch and evolve Pokémon.
  • 🔍 The AI exploits the game's random number generator: because outcomes are deterministic with respect to player input, a fixed opening button sequence reliably catches a Pokémon on the first try.
  • 📈 It experiences failures that are relatable to human learning processes, offering insights into our own behaviors.
  • 🧠 The development of the AI is detailed, explaining how it learns strategies through trial and error, without explicit instructions.
  • 📊 A gentle curriculum of rewards guides the AI towards complex objectives, starting with map exploration.
  • 🔑 The AI becomes fixated on areas with animations, highlighting the importance of setting the right reward thresholds to avoid distractions.
  • 🏆 It learns to battle, catch Pokémon, and evolve them, but struggles with battles that don't offer exploration rewards.
  • 🔄 The AI's behavior changes over time, learning from rare negative experiences, such as losing a Pokémon at a Pokémon Center.
  • 🌟 After extensive training, the AI defeats a gym leader, demonstrating the effectiveness of reinforcement learning.
  • 📚 The video concludes with technical details on how the AI was trained, the challenges faced, and potential improvements for future projects.

Q & A

  • What is the main focus of the video?

    -The video focuses on training an AI to play the game Pokémon Red using reinforcement learning.

  • How does the AI gain capabilities in the game?

    -The AI gains capabilities by learning from its experiences over five years of simulated game time.

  • What is the significance of the AI exploiting the game's random number generator?

    -Exploiting the game's random number generator is significant as it shows the AI's advanced learning capabilities and its ability to understand and manipulate game mechanics.

  • How does reinforcement learning work in the context of this AI?

    -Reinforcement learning works by optimizing the AI's choices based on high-level feedback without explicitly telling it which buttons to press.

  • What is the purpose of the 'gentle curriculum of rewards' mentioned in the script?

    -The 'gentle curriculum of rewards' guides the AI towards learning objectives by rewarding it for reaching new locations and encouraging curiosity.

  • Why does the AI become fixated on a particular area of Pallet Town?

    -The AI becomes fixated on a particular area of Pallet Town because the animations there, such as water, grass, and NPCs, trigger the novelty reward many times.

  • How does the AI handle battles in the early stages of training?

    -In the early stages, the AI tends to run away from battles as there's not much exploration reward to be gained during them.

  • What is the AI's reaction to Pokémon evolutions initially?

    -Initially, the AI tends to cancel the evolutions before eventually deciding they are beneficial.

  • How does the AI learn to use different moves in battles?

    -The AI learns to use different moves when its default move is depleted, forcing it to switch to an alternative move.

  • What is the AI's strategy for dealing with the trainer battles in the game?

    -The AI learns to challenge gym leaders and eventually succeeds by using alternative moves and relying on its evolved Pokémon.

  • How does the AI's behavior change over the course of its training?

    -Over the course of its training, the AI improves its navigation, starts visiting Pokémon Centers, and becomes more strategic in battles, eventually defeating Brock.

Outlines

00:00

🤖 Introduction to AI's Exploration of Pokémon

The video begins by introducing an AI playing Pokémon Red, starting with no knowledge and learning through experience over five years of simulated game time. It uses reinforcement learning to optimize its gameplay based on rewards rather than explicit instructions. The AI's development is analyzed, and its strategies are discussed. The video promises to delve into technical details and provide instructions for downloading and running the program. The AI's learning process starts with random button presses, and a curriculum of rewards is created to guide its learning towards difficult objectives. The AI explores the map and is rewarded for discovering new screens, encouraging curiosity. However, it becomes distracted by animations, leading to a paradox where curiosity can lead to both discovery and distraction.
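The exploration reward and the animation problem described above can be sketched in a few lines. This is a minimal illustration, not the video's actual code: the `NoveltyTracker` class, the flat-list "screens", and the threshold values are all assumptions made for the example. A frame counts as new only if it differs from every stored frame in more than `threshold` pixels.

```python
class NoveltyTracker:
    """Hypothetical helper: rewards frames that differ enough from all frames seen so far."""

    def __init__(self, threshold):
        self.threshold = threshold  # minimum number of differing pixels to count as new
        self.seen = []              # frames that have already earned a reward

    def reward(self, frame):
        for old in self.seen:
            diff = sum(1 for a, b in zip(frame, old) if a != b)
            if diff <= self.threshold:
                return 0.0          # too similar to something already seen
        self.seen.append(frame)
        return 1.0


def total_reward(frames, threshold):
    tracker = NoveltyTracker(threshold)
    return sum(tracker.reward(f) for f in frames)


# Toy 8-pixel "screens": one scene with a small 2-pixel animation, then a new area.
base = [0, 0, 0, 0, 0, 0, 0, 0]
anim_a = [1, 1, 0, 0, 0, 0, 0, 0]
anim_b = [0, 0, 1, 1, 0, 0, 0, 0]
new_area = [5, 5, 5, 5, 5, 5, 0, 0]
frames = [base, anim_a, anim_b, new_area]

low = total_reward(frames, threshold=1)   # every animation frame is rewarded
high = total_reward(frames, threshold=4)  # only the first scene and the new area are rewarded
```

With the low threshold, each animation frame earns a reward, which is the kind of signal that can keep an agent standing in one place. With the higher threshold, only the genuinely different screen counts.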

05:00

🎮 AI's Progression and Learning from Failures

The AI starts to explore Route 1 and reaches Viridian City, but it avoids battles due to lack of exploration rewards. Additional rewards are introduced based on Pokémon levels to encourage battling. The AI learns to catch, level up, and evolve Pokémon. It faces challenges when a Pokémon's move is depleted but eventually learns to switch moves. At version 60, it enters Viridian Forest and wins its first trainer battle. However, it still charges into unwinnable battles and doesn't heal at Pokémon Centers, leading to resets. The reward function is adjusted to discourage losing battles, but this results in the AI stalling indefinitely. An unexpected discovery is made where visiting a Pokémon Center leads to a significant reward deduction due to a deposited Pokémon, causing the AI to avoid it in future games.

10:02

🏥 AI's Avoidance of Pokémon Centers and Gym Battles

The AI avoids Pokémon Centers after a single negative experience of losing a Pokémon, demonstrating how a single event can significantly impact behavior. The reward function is modified to only give rewards when levels increase, leading the AI to start visiting Pokémon Centers. It eventually challenges and defeats Brock in the gym, a significant achievement. The AI's behavior sometimes misaligns with its intended purpose, such as buying Magikarp for quick level-ups, mirroring human behavior when objectives become misaligned in changed environments. The AI explores Mt. Moon, learns to run away with Magikarp, and gets stuck in visually uniform areas, highlighting the need for improved reward functions.
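The level-reward fix described above can be illustrated with a small sketch. The function names and party levels below are hypothetical, and the video's actual reward code may be structured differently. The point is only that a reward tied to the change in total level goes sharply negative when a Pokémon is deposited, while an increase-only version does not.

```python
def naive_level_reward(prev_levels, curr_levels):
    """Reward equals the change in total party level, so it can go negative."""
    return sum(curr_levels) - sum(prev_levels)


def increase_only_reward(prev_levels, curr_levels):
    """Reward only increases in total party level; decreases are ignored."""
    return max(0, sum(curr_levels) - sum(prev_levels))


party_before = [13, 5]  # hypothetical party levels before using the PC
party_after = [5]       # the level-13 Pokémon has been deposited

naive = naive_level_reward(party_before, party_after)    # large penalty
fixed = increase_only_reward(party_before, party_after)  # no penalty
level_up = increase_only_reward([5], [6])                # normal level-up still rewarded
```

Under the naive version, one accidental deposit produces a penalty far larger than any single level-up reward, which is consistent with the AI learning to avoid the building entirely.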

15:19

🗺️ AI's Map Navigation and Behavioral Patterns

The AI develops a preference for counterclockwise movement on the map, which helps it navigate with limited memory. It gets stuck in areas with uniform visuals due to lack of exploration rewards. The video analyzes the AI's navigation and how it changes over training iterations. It starts every game with a specific button press sequence, catching a Pokémon reliably on the first try, showing the deterministic nature of the game with respect to player input. The video reflects on the AI's relatable experiences and the stories that emerge from its gameplay.
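The point about determinism can be made concrete with a toy model. The sketch below is not the Game Boy's real random number generator: `step_rng`, the constants, and the button encoding are assumptions chosen only to show that when the random state depends solely on the starting state and the inputs, replaying the same button sequence always gives the same "random" outcome.

```python
def step_rng(state, button):
    """Toy linear congruential generator; each button press advances the state."""
    return (state * 1103515245 + 12345 + button) % (2 ** 31)


def play(buttons, start_state=42):
    """Replay a button sequence from a fixed start state and return a 'catch roll'."""
    state = start_state
    for b in buttons:
        state = step_rng(state, b)
    return state % 256  # hypothetical roll that would be compared against a catch rate


opening = [0, 3, 3, 1, 2, 0]            # hypothetical fixed opening sequence
roll_first = play(opening)
roll_second = play(opening)             # identical inputs from the same start state
roll_other = play([0, 3, 3, 1, 2, 1])   # a single different button press
```

Here `roll_first` and `roll_second` are always equal, while changing even one input changes the roll. An agent that stumbles on a sequence with a favorable roll can therefore repeat it every game.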

20:20

🛠️ Technical Details and Training Strategies

The video discusses the technical aspects of training the AI, including the reinforcement learning algorithm Proximal Policy Optimization. It highlights the challenges of machine learning, particularly the indirect nature of improving model behavior. Strategies for efficient experimentation are provided, such as simplifying the problem, iterating quickly, and managing resources. The AI's interaction with the environment, the design of the reward function, and the use of visualizations are explored. The video also suggests future improvements like transfer learning, environment models, and hierarchical RL.
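For readers unfamiliar with Proximal Policy Optimization, its core idea is a clipped objective that limits how far one update can move the policy. The sketch below shows that objective for a single sample. It is a simplified illustration, not the training code used in the video, and the `epsilon=0.2` default is an assumption (a commonly used value).

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """PPO clipped objective for one sample.

    ratio:     new_policy_prob / old_policy_prob for the action taken
    advantage: how much better the action was than expected
    """
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action whose probability already rose by 50%: the gain is capped.
capped_gain = clipped_surrogate(ratio=1.5, advantage=2.0)

# A bad action whose probability already fell by 50%: the objective is capped too.
capped_loss = clipped_surrogate(ratio=0.5, advantage=-1.0)

# No policy change yet: the objective is just the advantage.
unchanged = clipped_surrogate(ratio=1.0, advantage=2.0)
```

Once the ratio leaves the range `[1 - epsilon, 1 + epsilon]` in the direction that would improve the objective, the clipped term stops rewarding further movement, which keeps individual updates small.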

25:22

💻 Running the AI Program and Future Improvements

The video concludes with instructions on how to run the AI program, including downloading the repository, obtaining a legal copy of Pokémon Red, and setting up the environment. It discusses the use of pre-trained models and the potential for training from scratch. The video also considers future improvements to the process, such as transfer learning, environment models, and hierarchical RL. The presenter encourages questions and support for the project.

Keywords

💡Reinforcement Learning

Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize some type of reward. In the context of the video, the AI is trained to play Pokémon Red through RL, where it starts with no knowledge and learns by receiving feedback on its actions. The AI explores the game world, catches Pokémon, and battles gym leaders by learning from its successes and failures.
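The agent and environment loop behind this definition can be sketched with a toy example. `LineWorld` is a hypothetical stand-in for the game, not an emulator interface, and no learning happens here. The code only shows the cycle of observing, acting, and receiving a reward, with a random-button policy compared against a fixed one.

```python
import random


class LineWorld:
    """Hypothetical toy environment: a corridor of cells; reaching a new cell gives reward."""

    def __init__(self, length=10):
        self.length = length
        self.reset()

    def reset(self):
        self.pos = 0
        self.visited = {0}
        return self.pos

    def step(self, action):
        # action 0 moves left, action 1 moves right; walls clamp the position
        move = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + move))
        reward = 0.0 if self.pos in self.visited else 1.0
        self.visited.add(self.pos)
        return self.pos, reward


def run_episode(env, policy, steps=50):
    """Run one episode and return the total reward collected."""
    obs = env.reset()
    total = 0.0
    for _ in range(steps):
        action = policy(obs)
        obs, reward = env.step(action)
        total += reward
    return total


rng = random.Random(0)
random_policy = lambda obs: rng.choice([0, 1])  # "random button presses"
always_right = lambda obs: 1                    # a policy that explores efficiently

random_return = run_episode(LineWorld(), random_policy)
right_return = run_episode(LineWorld(), always_right)
```

A learning algorithm would adjust the policy so that its return moves from something like `random_return` toward `right_return`, using only the reward signal as feedback.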

💡Curriculum of Rewards

A curriculum of rewards is a gradual learning approach where the AI is guided towards complex objectives by starting with simpler tasks and progressively increasing the difficulty. The video describes how the AI learns to explore new areas in Pokémon by rewarding it for discovering unique screens, which encourages curiosity and drives the AI to seek out novelty.

💡Exploration

Exploration in the video refers to the AI's ability to discover new areas in the game world. The AI is rewarded for reaching new locations, which motivates it to investigate different parts of the map. This concept is crucial as it drives the AI's progress and learning, allowing it to navigate through the game effectively.

💡Novelty

Novelty is a concept used to describe new or unfamiliar experiences. In the script, the AI is designed to seek out novelty by rewarding it for encountering new screens in the game. However, the AI becomes distracted by animations that falsely signal novelty, leading to a paradox where it gets stuck in one area.

💡Paradox

A paradox is a situation that involves conflicting conclusions or statements that leads to a logical contradiction. The video mentions a paradox where the AI's curiosity, which is intended to drive exploration, instead leads it to become distracted by animations in one area, thus hindering its progress.

💡Evolution

In the context of the video, evolution refers mainly to the Pokémon game mechanic in which a Pokémon changes into a stronger form. The AI initially cancels evolutions before eventually allowing them. The term also loosely describes how the AI's own behavior develops through RL, from pressing random buttons to catching and evolving Pokémon.

💡Gym Leader

A gym leader in Pokémon is a trainer that the player must defeat to progress in the game. The video highlights the AI's challenge of battling and defeating a gym leader, which is a significant milestone in its learning process, demonstrating its ability to strategize and adapt.

💡Random Number Generator

The random number generator (RNG) is a component of the game that determines random events. Because the game is deterministic with respect to player input, the AI learns a fixed opening button sequence that reliably catches a Pokémon on the first try. This exploitation of the RNG emerges from trial and error rather than from explicit instruction.

💡Proxy Objectives

Proxy objectives are goals that indirectly contribute to the main objective. In the video, the AI's proxy objective is to increase its Pokémon's levels by buying Magikarp, which doesn't directly help in battles but provides an easy way to level up. This is an example of how the AI learns to game the system to achieve its goals.

💡Misaligned Objectives

Misaligned objectives occur when the AI's learned behavior diverges from its intended purpose. The video gives an example of the AI buying Magikarp repeatedly, which increases levels but is not the intended strategy for gameplay. This misalignment is a common issue in RL where the AI finds loopholes to maximize rewards.

💡Visualization

Visualization in the video refers to the graphical representation of the AI's behavior and learning process. By mapping the AI's movements and decisions, the creator can analyze and understand the AI's strategies and areas where it gets stuck, providing insights into its learning process.

Highlights

AI starts with no knowledge and learns by playing Pokémon Red.

AI explores the game world and learns from its experiences over 5 years of simulated game time.

AI evolves strategies by exploiting the game's random number generator.

AI's failures are relatable to human experiences, offering insights into our own behavior.

AI uses reinforcement learning to optimize its choices without explicit instructions.

AI is rewarded for exploring new locations, encouraging curiosity.

AI gets distracted by animations, highlighting the challenge of seeking novelty.

Raising the novelty threshold helps AI focus on more significant exploration.

AI learns to battle, catch, and evolve Pokémon over time.

AI initially avoids battles, reflecting a need for additional rewards to encourage fighting.

AI's reluctance to visit Pokémon Centers leads to losses and resets.

A single negative experience teaches AI to avoid Pokémon Centers.

Modifying the reward function to only give rewards for level increases fixes the issue.

AI eventually defeats a gym leader after many iterations and learning from mistakes.

AI's behavior with Magikarp shows how it can become misaligned from its intended purpose.

AI develops a preference for counterclockwise movement, possibly aiding navigation.

Visualizations of AI's behavior over training iterations reveal changes in strategy.

AI learns to catch Pokémon reliably by taking advantage of the game's deterministic nature.

The project's cost for cloud resources was around $1,000 for all experiments.

Proximal policy optimization is the reinforcement learning algorithm used.

Transfer learning could be applied to reinforcement learning in the future.

Learning environment models and hierarchical RL are potential improvements for the future.

Instructions on how to run the AI on your own computer are provided.