About

Tower of Hanoi is a mathematical puzzle that involves moving a stack of disks from one peg to another, following two rules: you can only move one disk at a time, and you can't place a larger disk on top of a smaller one. Your goal is to move the entire tower from the first peg to the middle peg.

The Reinforcement Learning Agent

If you click the "Agent solve" button, you will see a reinforcement learning agent complete the game for you. The agent learned how to play using the SARSA algorithm, a technique I have used and described in my previous environments. Basically, the agent starts out exploring the game by taking random actions. It observes the rewards it gets from these random actions and uses them to improve its decision making, favouring actions that lead to more reward. In this game, the "states" are simply where the disks are on the pegs at any moment. Each state is represented as a list of 18 integers (6 for each peg), with a number indicating disk size (0-5) and -1 signifying an empty slot. There are 6 actions the agent can take (move a disk from peg 0 to peg 1, from peg 0 to peg 2, and so on), though not all are available at the same time, since some would violate the rules (placing a big disk onto a smaller one). The only reward the agent received was +100 for completing the tower on the middle peg. This sounds like a simple reward, but it came with its own challenges.
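
To make that concrete, here is a minimal Python sketch of how the state encoding, the legal-action check, and a tabular SARSA update might look. This is not the game's actual code: the names and hyperparameters are my own, and I'm assuming pegs are stored bottom to top with smaller numbers meaning smaller disks.

    import random

    NUM_DISKS = 6
    NUM_PEGS = 3
    EMPTY = -1

    def encode_state(pegs):
        """Flatten the three pegs into a tuple of 18 integers: each peg lists
        disk sizes 0-5 from bottom to top, padded with -1 for empty slots."""
        state = []
        for peg in pegs:
            state.extend(peg + [EMPTY] * (NUM_DISKS - len(peg)))
        return tuple(state)

    # The 6 possible actions: (source peg, destination peg)
    ACTIONS = [(a, b) for a in range(NUM_PEGS) for b in range(NUM_PEGS) if a != b]

    def legal_actions(pegs):
        """An action is legal if the source peg is non-empty and its top disk
        is smaller than the destination peg's top disk (or the peg is empty)."""
        legal = []
        for i, (src, dst) in enumerate(ACTIONS):
            if pegs[src] and (not pegs[dst] or pegs[src][-1] < pegs[dst][-1]):
                legal.append(i)
        return legal

    # Tabular SARSA: Q(s, a) <- Q(s, a) + alpha * (r + gamma * Q(s', a') - Q(s, a))
    Q = {}                                  # (state, action index) -> value
    alpha, gamma, epsilon = 0.1, 0.99, 0.1  # illustrative hyperparameters

    def choose_action(state, pegs):
        """Epsilon-greedy choice over the legal actions only."""
        options = legal_actions(pegs)
        if random.random() < epsilon:
            return random.choice(options)
        return max(options, key=lambda a: Q.get((state, a), 0.0))

    def sarsa_update(s, a, r, s_next, a_next):
        q = Q.get((s, a), 0.0)
        q_next = Q.get((s_next, a_next), 0.0)
        Q[(s, a)] = q + alpha * (r + gamma * q_next - q)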

The problem with sparse rewards

Imagine a tiny maze where the only goal for the agent is to escape, earning a +100 reward for doing so. Initially, the agent knows nothing about the maze. It has to fumble around, taking random actions, until it lucks out and finds the exit. Once it does, it starts learning. It then has to repeat the fumbling around until it lucks its way to the end again. Since the maze is small, it won't take long for the agent to figure out which moves lead to that +100 reward. Now imagine the maze was massive, with many dead ends: would the agent ever make it to the end? An agent cannot learn without rewards; if it never finds its way to the reward, it will never learn which actions are good and which are bad.

This is what can happen with the Tower of Hanoi. The number of random actions the agent had to take to luck its way to completion was far too high for me. It's not impossible to learn Tower of Hanoi this way, but it can be very slow. Sometimes it would take more than 20,000 actions before lucking its way to completion. By that point, it has tried nearly every possible move. How does it know which moves actually helped it win? It doesn't. So, even after it gets that first win, it has to keep fumbling in the dark many more times before it gets the hang of things. In an even more complicated environment, chances are the agent will never find a reward at all. This is the problem of sparse rewards.

One solution is to introduce incremental rewards throughout the game. For example, you could offer a smaller reward for correctly placing each disk, rather than only rewarding the final goal. This technique is called Reward Shaping, but it can be hard to design the right rewards for the agent to learn from. Also, if you add rewards for certain milestones, the agent might focus on achieving those milestones instead of finding its own strategy. Ideally, you want to give the agent a simple, clear goal, like completing the game, so it can find its own strategy and its own optimal way of doing things. If the game is too complex, though, the agent might never stumble upon that win and thus never learn. To help fix this, I used a technique called Reverse Curriculum Learning.
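
For illustration only, here is a sketch of the difference, using the same peg representation as the earlier sketch. The game itself uses just the sparse +100 on completion; the shaped_reward function and its +10-per-disk value are hypothetical, showing what the shaping described above could look like.

    def sparse_reward(pegs):
        """+100 only when all 6 disks sit on the middle peg, otherwise nothing."""
        return 100 if len(pegs[1]) == 6 else 0

    def shaped_reward(pegs):
        """Hypothetical shaping: +10 for each disk already correctly stacked on
        the middle peg (largest at the bottom), +100 on full completion."""
        correct = 0
        for depth, disk in enumerate(pegs[1]):
            if disk == 5 - depth:   # disk 5 at the bottom, then 4, and so on
                correct += 1
            else:
                break
        return 100 if correct == 6 else 10 * correct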

Reverse Curriculum Learning

Typically, you want a simple and clear goal for the agent, and you want to make sure the agent has the ability to learn it. The problem is that some environments are too complex, and the agent might never find the goal through random actions. Reverse Curriculum Learning helps fix this by making the agent start backwards. Normally, the agent starts in the default state with all disks on the left peg and learns from there. In Reverse Curriculum Learning we flip it, starting the agent in a state that is very close to completion. The agent begins with all disks on the middle peg apart from the single smallest disk. It then spends 1500 episodes learning how to put the small disk on top of the tower to win the game. 1500 episodes might seem like overkill, and it was. Starting backwards ensures the agent can find the reward and learn which states and actions are beneficial when there is only one disk left. Once it has fully learnt the best way to put one disk onto the tower, we push its starting position further back: the agent now starts with four completed disks in the middle and two that need sorting. We do that for another 1500 episodes and then repeat, until it is finally learning to sort all the disks onto the middle peg. By the time the agent gets back to starting in the default state, it already knows the best actions to take regardless of how many disks are completed and how many need sorting.
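
Here is a rough sketch of how that schedule could be wired up, assuming the same peg representation as the earlier sketches. The 1500 episodes per stage comes from the text above; where the unsorted disks start (I put them back on the left peg) and the structure of the loop are my own guesses, not the game's actual code.

    NUM_DISKS = 6
    EPISODES_PER_STAGE = 1500

    def curriculum_start_state(disks_completed):
        """Starting position with the largest `disks_completed` disks already
        stacked on the middle peg and the rest back on the left peg (assumed)."""
        to_sort = NUM_DISKS - disks_completed
        left = list(range(to_sort - 1, -1, -1))               # remaining disks, biggest of them at the bottom
        middle = list(range(NUM_DISKS - 1, to_sort - 1, -1))  # completed stack, disk 5 at the bottom
        return [left, middle, []]                             # [left peg, middle peg, right peg]

    # Stage 1 starts one disk from the goal; each later stage starts one disk
    # further back, until the final stage is the normal start state.
    for disks_completed in range(NUM_DISKS - 1, -1, -1):      # 5, 4, ..., 0
        for episode in range(EPISODES_PER_STAGE):
            pegs = curriculum_start_state(disks_completed)
            ...  # run one SARSA episode from this starting position (see the earlier sketch)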

Further reading on this topic can be found here: https://bair.berkeley.edu/blog/2017/12/20/reverse-curriculum/

Problems / Curiosities

  • At one point, my agent seemed to be in a halfway state between learning and not learning. It had mastered organizing the first three disks, but its actions beyond that appeared random. The challenge in such situations is pinpointing the reason for the learning plateau. Is it due to parameter settings, the choice of algorithm, the implementation of Reverse Curriculum Learning, or perhaps the need for extended training? The list of potential issues is extensive. After spending a little while debugging, I discovered a bug in my code: instead of consistently choosing the best action, the agent sometimes selected actions that broke the game rules. Once this bug was fixed, the agent's performance improved. The issue wasn't with the agent's learning process but rather a coding error that hindered its progress.
  • The agent managed to find the optimal policy: it takes 63 moves from start to completion, which is the minimum possible (2^6 - 1) for the Tower of Hanoi with 6 disks on 3 pegs.
