About

Face off against a reinforcement learning agent in this classic Flash game. Click on your bow and pull back to aim, then release to fire. The first to hit their opponent three times wins.

The Agent

The agent sees its environment through a state, which includes the current wind conditions, the player's position, and the bow's current angle and power. It can choose from five actions: increasing or decreasing the angle, increasing or decreasing the power, and firing. Its decisions are based entirely on that state. It was trained over 30,000 episodes using a technique known as Semi-Gradient SARSA.

At first I really struggled to finish this environment and ended up switching to work on my cart pole environment instead; I just couldn't figure out how to get the agent to a 100% hit rate. Then I read about a method called 'Optimistic Initial Values' in Sutton and Barto's RL book and thought I'd give it a try.
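For anyone curious, a one-step Semi-Gradient SARSA update with linear function approximation looks roughly like the sketch below. The feature encoding, step size and discount factor are illustrative placeholders, not the exact values used in the game.

```python
import numpy as np

N_FEATURES = 8   # assumed: wind, player position, angle and power encoded as a vector
N_ACTIONS = 5    # angle up/down, power up/down, fire
ALPHA = 0.1      # step size (assumed)
GAMMA = 0.99     # discount factor (assumed)

w = np.zeros((N_ACTIONS, N_FEATURES))   # one linear weight vector per action

def features(state):
    """Assume the state is already encoded as a length-N_FEATURES vector."""
    return np.asarray(state, dtype=float)

def q_hat(state, action):
    """Linear estimate of the action value."""
    return w[action] @ features(state)

def sarsa_update(state, action, reward, next_state, next_action, done):
    """One-step semi-gradient SARSA: w += alpha * TD error * gradient of q_hat."""
    target = reward if done else reward + GAMMA * q_hat(next_state, next_action)
    td_error = target - q_hat(state, action)
    w[action] += ALPHA * td_error * features(state)  # gradient of a linear q_hat is just the features
```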

Exploration

A crucial concept in reinforcement learning is exploration. You can't learn anything without exploring your environment and your actions. It's about balancing two things: taking the 'greedy' action, the best action currently known, and trying new, exploratory actions. Exploratory actions might lead to poor results, but they can also uncover better strategies. It's similar to trying new dishes at a takeaway: sticking to your usual order is safe and satisfying, but venturing into new menu items might lead to disappointment or, excitingly, to discovering a fancier dish that is better than your previous favourite.

Balancing exploration and exploitation in reinforcement learning is tricky, and I've talked about a couple of techniques before. One approach is to set a constant exploration rate, like 10%, where the agent tries something new 10% of the time and sticks to the best-known action 90% of the time. Another method I've used is epsilon-decay, where you start with 100% exploration and gradually reduce it to almost zero. This time around I went with Optimistic Initial Values, and it worked really well.
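To make those first two schemes concrete, here's roughly what they look like in code. The epsilon values and decay rate are just example numbers, not the ones used in the game.

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Constant exploration: explore 10% of the time, exploit 90% of the time.
action = epsilon_greedy([0.0, -1.2, 3.4, -0.5, 0.8], epsilon=0.1)

# Epsilon-decay: start fully exploratory and decay towards (almost) zero.
epsilon = 1.0
for episode in range(30_000):
    epsilon = max(0.01, epsilon * 0.9995)   # multiplicative decay each episode
    # ... run the episode, selecting actions with epsilon_greedy(q, epsilon)
```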

Optimistic Initial Values

In this environment, each state, together with an action that can be taken in that state (a state-action pair), is assigned a value. This value indicates how good it is to take that action in that state. To pick the best action, you simply compare the values of all possible actions in the current state and choose the one with the highest value.

The agent receives two types of rewards: a constant -0.1 for every action, which discourages it from lingering or repeatedly increasing and decreasing the power, and a variable negative reward based on how close the arrow lands to the player when fired. The reward is always negative unless the arrow hits the player, which earns the agent +500.

Optimistic Initial Values (OIV) means initialising these state-action values higher than their true worth at the start of learning. Since all rewards are negative, I set every state-action value to 0 at the start, and I make the agent do no exploring whatsoever: it only ever chooses the best-valued action. Because the rewards are negative, the best-looking actions are either ones that hit the player (+500) or ones it has never tried before (still valued at 0). This effectively makes the agent sweep through the action space, trying each action at least once, until it pinpoints the actions that actually hit the player.
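As a rough sketch of that mechanism (using a simple value table for illustration, rather than the function approximation the agent actually uses):

```python
from collections import defaultdict

N_ACTIONS = 5
q_table = defaultdict(lambda: [0.0] * N_ACTIONS)  # optimistic start: every value begins at 0

def greedy_action(state):
    """Purely greedy: no explicit exploration rate at all."""
    values = q_table[state]
    return max(range(N_ACTIONS), key=lambda a: values[a])

# Because every reward is negative (except the +500 for a hit), any action the
# agent has tried once drops below 0, while untried actions stay at their
# optimistic value of 0. The greedy policy therefore keeps cycling through
# untried actions until each has been sampled, then settles on the actions
# whose values stay high, i.e. the ones that actually hit the player.
```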
