In this scenario, the hero (agent) navigates a grid world with the aim of reaching a chest. The hero can perform four actions: moving up, down, left, or right. Before training the agent, you can modify the map: left-click to place trees and right-click to remove them. The positions of both the hero and the chest can also be changed by dragging and dropping. You can also edit the map whilst training is paused.

Reinforcement learning

This agent shows how Dyna-Q agents differ from the typical Q-Learning agents I have made so far. Unlike conventional Q-Learning agents, Dyna-Q agents have the ability to plan.

Planning

Q-Learning is typically model-free, meaning the agent doesn't require any prior knowledge about the environment to learn and succeed. Planning agents, on the other hand, do need a model: they need to understand which actions lead to which states. An agent can be given a model by a human prior to training, or it can build its own model of the world whilst it interacts with the environment. To build its own model, the agent must remember which action it took in which state, along with the reward and resulting new state it received for taking that action. This memorization can be as straightforward as:

History[state, action] = [new_state, reward]

History[Tile_12, Action_Right] = [Tile_13, -1]
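
As a rough sketch of this kind of memorization in Python (the tile and action names are just illustrative placeholders, not identifiers from the game's code), the model can simply be a dictionary keyed by state-action pairs:

# The model maps (state, action) pairs to the outcome observed in the environment.
model = {}

def remember(state, action, new_state, reward):
    """Record what happened after taking this action in this state."""
    model[(state, action)] = (new_state, reward)

# The hero steps right from tile 12, lands on tile 13, and receives a reward of -1.
remember("Tile_12", "Action_Right", "Tile_13", -1)

# Later, the agent can look this transition up without touching the environment.
new_state, reward = model[("Tile_12", "Action_Right")]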

This stored experience is called a "model". The agent can now refer to this model to see what happens when it takes an action in a certain state. There is usually some form of downtime between actions, like in chess when the agent is waiting for the opponent to make a move. During this time, the agent uses the model to "relive" some actions it took, further solidifying the value of those actions. It's a bit like us thinking back on moves we made previously and hammering home a mistake so we remember not to make it again. In the game above you can choose the number of moments to "relive": the agent takes an action, then plans for the selected number of steps before acting again. This planning can help the agent find the chest much quicker than a typical Q-Learning agent.
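
A minimal sketch of those planning steps, assuming the model dictionary above, a tabular Q dictionary, and placeholder values for the learning rate and discount factor, could look like this:

import random

def plan(Q, model, actions, n_steps, alpha=0.1, gamma=0.95):
    """Replay n remembered transitions to update Q without touching the environment."""
    for _ in range(n_steps):
        # Pick a previously experienced (state, action) pair at random.
        state, action = random.choice(list(model.keys()))
        new_state, reward = model[(state, action)]
        # Standard Q-Learning update, fed from memory instead of the real world.
        best_next = max(Q.get((new_state, a), 0.0) for a in actions)
        old = Q.get((state, action), 0.0)
        Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

The number of planning steps you choose in the game is essentially how many times a loop like this runs between real actions.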

Why does planning speed things up?

Q-Learning agents assign a value to every possible action in every state. This value should tell you how much reward you will get if you take this action in this state and follow the same policy for the rest of the episode. After performing an action, the agent updates the value of that specific action using the sum of the reward received (value of now) and the value of the best possible action in the next state (value of the future). Including the value of the next state is called bootstrapping. In the initial episode, all these values stay at zero, since the rewards are zero and the bootstrapped values are also zero, until the agent reaches the state just before the chest. The action taken in this state earns a +1 reward, updating its value.
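
A hedged sketch of that update, using the same placeholder Q dictionary and made-up values for the learning rate and discount factor:

def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    """One Q-Learning update: reward now plus the bootstrapped value of the best next action."""
    old = Q.get((state, action), 0.0)
    best_next = max(Q.get((next_state, a), 0.0) for a in actions)  # value of the future (bootstrapping)
    Q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

This is the same update that the planning sketch earlier replays from memory.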

In the second episode, the very last state is updated again, but now the second-to-last state also updates, thanks to it bootstrapping from the value of the last state. With every new episode, these values slowly propagate backwards. Planning agents operate differently. In the first episode, just like Q-Learning agents, only the last state before the chest gets updated. But after this update, the agent can rethink those states and actions using its model, which lets that +1 reward propagate backwards much quicker. The screenshot in the sidebar shows this, with 'n' being the number of planning steps.

To see an example of this, run the simulation with no planning. You will find that after 10 episodes it is still struggling to make sense of the environment. If you restart the simulation with planning steps set to 50, you will see that in the first episode the agent is just as lost as the first time round. Once the agent has found the chest, though, you will notice it gets significantly quicker at finding an optimal path. When ready, turn off exploration to see which path the agent thinks is optimal.
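
Turning off exploration just means the agent always picks the action with the highest learned value in its current state. As a sketch, reusing the placeholder Q dictionary and tile/action names from above (not the game's actual code):

def greedy_action(Q, state, actions):
    """Pick the action with the highest learned value for this state (no exploration)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

# Illustrative Q values for one tile; in the game these would be learned.
Q = {("Tile_12", "Action_Right"): 0.8, ("Tile_12", "Action_Up"): 0.2}
actions = ["Action_Up", "Action_Down", "Action_Left", "Action_Right"]
print(greedy_action(Q, "Tile_12", actions))  # -> Action_Right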
