Can you beat our RL agent?


Maarten Fish

Machine Learning Engineer

NOTE: The game mentioned in this post cannot be played on mobile, please open this page on a desktop/laptop.

Learning Reinforcement Learning

This post recaps some of the insights from my internship at Faktion: I’ll explain how I went from studying basic Reinforcement Learning (RL) agents to training state-of-the-art versions in Unity-based environments like the one you just played. Sutton & Barto’s Reinforcement Learning: An Introduction was my guide as I went from RL zero to hero. On this journey, I tamed a variety of agents wandering the gridworlds and gained a deeper understanding of the mathematical constructs that allow agent-environment interactions to produce intelligent behavior.

A day in the life of an Agent

Enter the gridworld: our agent finds itself at the start of an episode, a single round of the game it is about to play. The agent is presented with a start location, an X and Y value, and given the choice to step in one of four directions: right, down, left, or up. Having never seen this environment before, it randomly decides to move right. At a small cost (i.e., a negative reward), the action is completed and our agent moves to a new location, where it is once again presented with the same movement options. This time, tragically, bad luck leads to its downfall. Taking another step right puts the agent in a terminal state, which incurs a heavy penalty and resets the episode.

Like waking up from a bad dream, the agent again finds itself at the start location. Remembering the penalty it received, it has a bad feeling about going right and decides to head in the opposite direction. Staying on this course, acting mostly in favor of the left direction, it eventually reaches the finish to the west. This triggers a terminal state with a nice reward: our agent has made it. Over the course of the following episodes, the agent optimizes its route by balancing exploitation, the re-use of the current best path, with exploration, the decision to leave the comfort of what is known in order to gain a better understanding of the environment. After finding the shortest route, our happy agent has only a minimal debt to pay off from the steps it took and can keep most of its hard-earned finishing prize.
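
In code, that whole day in the life boils down to a small interaction loop. The sketch below is a minimal illustration, assuming a hypothetical gridworld-style environment with reset() and step() methods and the reward scheme described above; it is not the exact setup used for the Labyrinth.

import random

ACTIONS = ["right", "down", "left", "up"]

def run_episode(env, policy, max_steps=1000):
    # The agent wakes up at the start location (an x, y pair).
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                   # pick one of the four directions
        state, reward, done = env.step(action)   # small step cost, big reward or penalty at terminal states
        total_reward += reward
        if done:                                 # finish or trap: the episode resets
            break
    return total_reward

# The very first policy is pure luck: act completely at random.
def random_policy(state):
    return random.choice(ACTIONS)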

Setting the stage

The environment is the reality in which an agent can look for a solution. In our case the environment is grid-based with some basic rules: four directions in which to move (the action space) and some simple ‘physics’, such as a finite world with borders, impassable walls and pitfalls (traps), all of which dictate the flow of an episode depending on how the agent decides to act within it. It’s in the interaction between these two components, an actor on a stage, that interesting behavior takes form: stepping through time, a dance of loss, minimizing shame while maximizing gain. This brings us to the reward signal, the catalyst of intelligence.

In essence, the reward signal is the feedback given to the agent for the actions it performs. The balance between positive and negative rewards is fragile and will sometimes influence agent behavior in counterintuitive ways. While designing an environment you need to keep this in mind in order to set your agent up for success. The illustration above demonstrates how the implementation of a step cost signals the agent to get out without setting a predefined goal! Once the agent has found the exit through random movement, it will automatically try to shorten its route in order to minimize the penalty accumulated during the episode. This follows from the core nature of the agent, which is to optimize its interaction with its surroundings.
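
As a rough sketch of what such a reward signal can look like (the numbers and helper names here are assumptions for illustration, not the values used in the project):

STEP_COST = -1.0        # every move costs a little, so shorter routes are better
FINISH_REWARD = 100.0   # reaching the exit ends the episode on a high note
TRAP_PENALTY = -100.0   # falling into a pit ends it badly

def reward_for(new_state, is_trap, is_finish):
    if is_trap:
        return TRAP_PENALTY
    if is_finish:
        return FINISH_REWARD
    return STEP_COST    # the step cost alone already nudges the agent towards the exit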

An actor on stage, so how does it work?

The strategy developed by the agent to maximize the reward accumulated during an episode is called the policy. When the agent enters an environment for the first time, this policy will be far from optimal and lead to seemingly random behavior. This is because the agent starts without any knowledge of which states are good (like the finish), which ones are bad (like the deadly terminal states), and which actions lead to states that bring it closer to either of those. By trickling the reward an agent received at the very end of each episode back down over the path of states it visited and actions it took, the agent is able to estimate state-action values. These values tell the agent what eventual reward it can expect when taking a certain action in a particular state. In our gridworld, for each state, the agent will have four state-action values or total reward estimates, one for each action. A straightforward policy is then to simply pick the action with the highest expected reward. Over time, these reward estimates (state-action values) become increasingly accurate, and the agent develops optimized behavior.
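
A minimal sketch of this idea, assuming the state-action values live in a simple table keyed by (state, action) pairs:

from collections import defaultdict

ACTIONS = ["right", "down", "left", "up"]

# Q[(state, action)] estimates the total reward we expect after taking `action` in `state`.
# Unvisited pairs default to 0.0 until experience trickles down to them.
Q = defaultdict(float)

def greedy_policy(state):
    # Simply pick the action with the highest expected reward for this state.
    return max(ACTIONS, key=lambda a: Q[(state, a)])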

One issue with this policy is that it’s too conservative. Once the agent finds any route to the finish it’ll just keep on repeating it even when it involves a detour. To avoid getting stuck in such local optima we can force the agent to explore. In practice, we alter the policy slightly by introducing a small probability that at any given step the agent will select an action at random instead of the one with the highest expected reward. This is sufficient to push the agent off the beaten track into unknown areas of the state space. 
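
Building on the table and greedy_policy from the sketch above, the exploration tweak takes only a few lines; the 10% exploration rate is an arbitrary example value:

import random

EPSILON = 0.1  # probability of ignoring the estimates and acting at random

def epsilon_greedy_policy(state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)   # explore: step off the beaten track
    return greedy_policy(state)         # exploit: current best guess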

State-action values can be stored in a simple lookup table (i.e. classic Q-learning) but can also be predicted by models when state and action spaces become too vast to manage in a table. This comes with the benefit that models can generalize expected reward across similar states while simple lookup tables don’t have that power. Following the policy then involves feeding state-action pairs into the model and selecting the most promising action based on the output.  
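
A hypothetical version of the same policy with a small neural network instead of a table might look like this (reusing the ACTIONS list from the earlier sketch; the architecture and sizes are arbitrary choices, shown only to illustrate the swap):

import torch
import torch.nn as nn

q_net = nn.Sequential(
    nn.Linear(2, 32),    # input: the (x, y) state
    nn.ReLU(),
    nn.Linear(32, 4),    # output: one expected-reward estimate per action
)

def greedy_policy_nn(state):
    with torch.no_grad():
        values = q_net(torch.tensor(state, dtype=torch.float32))
    return ACTIONS[int(values.argmax())]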

The pitfalls of solving a Labyrinth’s reward signal

Looking for a challenge where I could apply state-of-the-art RL to an enlightening and fun use case, my mind immediately jumped to childhood memories of playing a wooden labyrinth: an ingenious game in which you maneuver a small metal ball by rotating the playing field along two axes, using gravity as a device to move through a tricky maze while avoiding any pitfalls on the way to the finish. This game seemed appropriate, as I assumed it would be yet another variant within the family of gridworld environments. I had tested my agent on complicated mazes before, so I went in with confidence. In hindsight, however, the ball-drop mechanic introduced novel, interesting behavior and challenges. Let’s take a closer look.

To start off simple, let’s first introduce the agent to a basic gridworld maze. There is a starting position, an abundance of impassable walls, and somewhere on the grid the agent must find the exit, a terminal state that ends the episode. Even with all rewards set to zero, our agent will find the exit eventually if we give it a strong wanderlust: a 30% chance to pick a random action. However, once the exit is found there is no difference in reward between taking 10 steps and 1000, so the wandering agent won’t feel the need to find the shortest path. Let’s change our reward signal to improve this. This time around, we’re going to tax the agent for every step it takes. Doing so incentivizes the agent to end the episode as soon as possible.

Next, we’ll up the stakes by adding traps to our environment. These are terminal states with a negative reward that represent the holes in the maze. This is where the agent starts to show interesting behavior, as it’s now confronted with conflicting signals. The step cost is pushing it to keep the episodes short, but exploration only seems to lead to the bad experience of encountering the holes. It’s like being set on fire and forced to run through a maze full of deadly traps. Every time the agent tries to rush its old route it is faced with disappointment as it falls down hole after hole. All these heavy penalties flood the gridworld with negative reward signals that have trickled down from state to state. Warning! Penalty ahead! Our agent sees no way out and in dramatic fashion resorts to the extreme: suicide. The world it has explored seems to end in a big negative experience anyway, and since moving comes at a cost of its own, the optimal thing to do is to end the episode as quickly as possible. Yikes! Our agent has opted out. How can we give it the courage to keep pushing for that happy ending after experiencing so much disappointment?

Let’s remove the step cost from the environment to try and save the agent. No longer being rushed, you’d expect the agent to take its sweet time finding a safe route, right? Nope. Instead of venturing out into the dangerous environment, the agent now lingers in the safety of its familiar starting position until the maximum number of steps in an episode is reached, the end of time so to speak. We’ve given the agent the will to keep going but not the courage to explore new things in life. This is becoming weirdly philosophical, right?

Our agent seems to be in dire need of a little push, so let’s try to code that into the reward signal. This time around we’ll wield both the stick and the carrot. We re-introduce the step cost, but now also provide a small positive reward when an action results in exploring a previously unseen state. As a result, the agent initially prefers the starting area but is now lured out of its comfort zone, since lingering there will melt its hard-earned reward away. Our agent has suddenly conquered the fear of imminent failure. The prospect of being rewarded for exploration has effectively put the overall danger signal present in the labyrinth to shame: the rush of reward from exploration is finally larger than the penalty of failure. In due time, all the agent’s exploration, alongside the mistakes made, will trickle back down to the start, where the agent now has access to an ever more accurate world model of the labyrinth, giving it more confidence with each passing episode. The enduring exploration inevitably leads the agent to the labyrinth’s finish. From that point onwards, the urge to minimize the step cost drives the agent towards the shortest path. The maze has been solved.
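
Pulling the whole progression together, the final reward signal could be sketched roughly like this; the specific values and the visited-state bookkeeping are illustrative assumptions rather than the project’s exact implementation:

STEP_COST = -1.0
TRAP_PENALTY = -100.0
FINISH_REWARD = 100.0
EXPLORATION_BONUS = 2.0

visited = set()   # states the agent has already seen

def shaped_reward(new_state, is_trap, is_finish):
    r = STEP_COST                       # the stick: keep the pressure to finish quickly
    if new_state not in visited:
        visited.add(new_state)
        r += EXPLORATION_BONUS          # the carrot: lure the agent out of its comfort zone
    if is_trap:
        r += TRAP_PENALTY               # falling into a hole still hurts
    if is_finish:
        r += FINISH_REWARD              # the happy ending
    return r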

From grid-based Maze to physics enabled Labyrinth

Once the agent completed this watered-down, 2D grid-based version of the wooden labyrinth, I wanted something more authentic. How would the same reward signal fare in an environment affected by simulated real-world physics? What would happen if gravity and momentum affected the ball’s movement? We swap our agent’s discrete action space, the limited choice of one out of four directions, for a continuous action space. Rather than having a static choice, the agent may now adjust the rotation of the playing field along two axes, affecting the ball’s speed and direction. The simple ‘physics’ that we implemented in the 2D version also get an upgrade: impassable walls, as well as the playing field itself, become physical objects, and dropping the ball gets a more literal implementation. The agent’s observation space also has to adapt to these changes. Previously, the current location was sufficient to define the state; this time around, new crucial factors come into play, like the ball’s velocity and the current rotation of the board along both axes. Adding these values to the agent’s input allows it to sense the information required to navigate the Labyrinth.
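
In Gym-style notation, the upgraded spaces might look like the sketch below; the exact contents and ranges of the observation vector are my assumptions, not the actual Labyrinth specification:

import numpy as np
from gymnasium import spaces

# Continuous actions: the board's target tilt along its two axes, normalized to [-1, 1].
action_space = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)

# Observations: ball position (x, z), ball velocity (x, z), current board tilt (x, z).
observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(6,), dtype=np.float32)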

Treading the footsteps of our agent

Finally, we've arrived at the Labyrinth. The game you played at the top of this page is the exact environment our agent was tasked to solve. Were you able to match the agent’s performance?

Isn’t it almost magical that our agent completed this challenge even though it had no visual feedback whatsoever and relied completely on blind touch and action-reaction? This is made possible by a well-designed reward signal alongside the state’s Markov property: given the present, the future does not depend on the past. In other words, all the information relevant for decision making is contained within the current situation and the observations experienced now.
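
The textbook way to write this down (a standard formulation, not something specific to this project) is that the next state and reward depend only on the current state and action, not on the full history:

P(S_{t+1} = s', R_{t+1} = r \mid S_t, A_t) = P(S_{t+1} = s', R_{t+1} = r \mid S_0, A_0, R_1, \ldots, S_t, A_t)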

In our gridworld, actions selected based on their expected yield move the agent from one state to the next, and the reward associated with that transition updates the state-action value. This decision is made on the spot, by evaluating the given observations. When an agent is deciding which action is best for the current situation, it goes through the possible scenarios and, based on past experiences, picks the most promising course of action. Planning a way out that also yields the most rewarding state transitions effectively perfects the policy.
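
As a concrete example of such an on-the-spot update, the classic tabular Q-learning rule (one of the methods from my earlier notebooks, not necessarily the update used by the final Labyrinth agent) nudges a state-action value towards the reward just received plus the best estimate available from the next state:

Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]

where \alpha is the learning rate and \gamma discounts future rewards.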

Optimizing your digital twin

Digital twins and RL agents have become a real passion of my colleagues and myself here at Faktion. We would gladly take a closer look at your agents to see where we can improve the reward signaling, or help you build a brand-new, tailor-made environment. This leads to higher performance overall as well as faster and more efficient training, but most of all to intelligent business optimization: bootstrapping your digital twin with the state of the art in reinforcement learning. Don’t hesitate to reach out if you want to find out what Faktion can do for you.

Additional reading sources

To the reader whose thirst for knowledge has not yet been quenched, the following section provides some additional reading material: a deeper dive into the technical aspects crucial for building environments and properly dealing with the agents that inhabit them.

Our first stop is Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto. This beautiful and rich textbook is available online for free and serves as the starting point for anyone interested in the wonderful world of RL.

From my older notes on tabular methods, I highly recommend scrolling through these introductory notebooks on reinforcement learning, as well as the original labyrinth project:

- Temporal Difference (TD): How the error between new and old return estimates improves state-action values over time.

- Planning and Learning with Tabular Methods: Illustrates some of the improvements made to the vanilla RL algorithms.

- The original 2D Labyrinth: An early notebook from my internship.

Game development is my all-time favourite pastime, providing me with a fun and engaging platform to improve my coding skills. Over the years I’ve acquired some handy tools when it comes to crafting the perfect environments for optimized learning; aren’t these the agent’s playground, after all? Unity has been my go-to engine for some years now, especially with its recent push into the machine learning field through the ML-Agents toolkit and its specialized tools for synthetic data generation (a topic for another blog post). Unity’s ML-Agents offers all the tools required to build any RL environment, similar to OpenAI’s Gym, and additionally offers a wide variety of agent brains (various learning algorithms).

For the ‘brains’, however, I opted for a more scalable option: Ray’s RLlib. This dense package, filled to the brim with state-of-the-art algorithms, is open source and truly feels like opening the RL equivalent of Pandora’s box! Taking a look at their extensive collection of implemented algorithms (well documented too!) quickly reveals their hard work and dedication to democratizing intelligence. RLlib conveniently offers a handy listener tool that hooks directly into the Unity editor for easy training.
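
To give a rough idea of what training looks like on the Python side, here is a minimal sketch; the exact API differs between Ray versions, and the environment name is only a stand-in, not the actual Unity Labyrinth build:

from ray.rllib.algorithms.ppo import PPOConfig

# Configure PPO on a placeholder environment; the Unity-backed environment would be registered here instead.
config = (
    PPOConfig()
    .environment("CartPole-v1")
    .training(train_batch_size=4000)
)
algo = config.build()

for _ in range(10):
    result = algo.train()                       # one training iteration
    print(result.get("episode_reward_mean"))    # metric key names can vary between Ray versions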

My algorithm-of-choice award goes to Proximal Policy Optimization (PPO). In a nutshell, this robust improvement on Actor-Critic (AC) methods allows for multiple passes over the same batch, a pooled collection of experiences that can be built by multiple workers simultaneously. The AC algorithm is in turn an improvement on Policy Gradient (PG) methods, a deep learning approach to policy optimization. The Actor part of AC is in charge of the policy, whereas the Critic provides valuable feedback on the actions taken, comparable to the state-action value.
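
For reference, the heart of PPO is its clipped surrogate objective (as defined in the original PPO paper), which is what makes those repeated passes over the same batch safe: the probability ratio between the new and old policy is clipped so that a single batch cannot push the policy too far:

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \right) \right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\mathrm{old}}(a_t \mid s_t)}

where \hat{A}_t is the advantage estimate supplied by the Critic.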

