Solve LunarLander-v2 with Deep Q-learning

The LunarLander (LL) environment is a simulated environment in which the agent must safely land a spacecraft in a designated area. The lunar lander has 4 discrete actions: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine. The state of the lander is represented as an 8-dimensional vector (x, y, vx, vy, θ, vθ, leftleg, rightleg): x and y are the coordinates of the lander's position on the screen, vx and vy are its velocity components along the x and y axes, θ is the angle of the lander, vθ is its angular velocity, and leftleg and rightleg are binary values indicating whether the left or right leg is touching the ground.

The landing pad is always at coordinates (0, 0), and the lander starts at the top of the screen. The total reward for moving from the starting point to the landing pad ranges from about 100 to 140 points, depending on where the lander comes to rest on the pad. If the lander moves away from the pad, it loses the same amount of reward it would have gained by moving towards it. An episode finishes when the lander crashes or comes to rest, receiving an additional -100 or +100 points respectively. Each leg-ground contact is worth +10 points. Firing the main engine costs -0.3 points per frame, but fuel is infinite. The LL problem is considered solved when an average score of 200 points is achieved over 100 consecutive runs.
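As a quick sanity check (a minimal sketch, separate from the training code below), the spaces of the environment can be inspected directly; the classic Gym API used throughout this post returns the 8-dimensional state from reset() and a (state, reward, done, info) tuple from step():

import gym

env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,): (x, y, vx, vy, theta, v_theta, leftleg, rightleg)
print(env.action_space.n)           # 4 discrete actions

state = env.reset()                                              # initial 8-dimensional state
state, reward, done, info = env.step(env.action_space.sample())  # take one random action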

Trained Agent Performance

[Animation: one trial of the trained agent landing on the pad]

from collections import deque, namedtuple
import time
import math
import copy
import random
import gym
from gym import wrappers
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import Adam

_ = torch.manual_seed(1027)
np.random.seed(1027)
random.seed(1027)

Experience Replay

Experience replay is key to training a DQN: transitions are stored in a buffer and sampled at random for learning, which breaks the strong correlations between consecutive transitions.

Experience = namedtuple("Experience", ('state', 'action', 'reward', 'next_state', 'is_done'))

class ExpReplay:
    def __init__(self, capacity=100000, starting_size=20000):
        self.memory = deque(maxlen=capacity)
        self.starting_size = starting_size
        
    def add(self, transition):
        self.memory.append(transition)
            
    def sample(self, batch_size=16):
        if batch_size > len(self.memory):
            return None
        return random.sample(self.memory, batch_size)

    def canStart(self):
        return len(self.memory) >= self.starting_size
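A minimal usage sketch (with made-up transitions rather than real environment steps): transitions are pushed with add(), and training only begins once canStart() reports that the buffer holds at least starting_size transitions.

memo = ExpReplay(capacity=1000, starting_size=5)
for _ in range(10):
    # dummy 8-dimensional states; during training these come from env.step()
    memo.add(Experience(np.zeros(8), 0, 0.0, np.zeros(8), False))

print(memo.canStart())             # True: the buffer holds more than starting_size transitions
batch = memo.sample(batch_size=4)  # a list of 4 randomly drawn Experience tuples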

Function Approximation

An artificial neural network (ANN) is used as the function approximator. The QNN class below builds a fully connected network with tanh activations and an arbitrary number of hidden layers (the agent trained below uses a single hidden layer of 48 units). Notably, the last layer should have a linear activation, i.e. no activation function, because Q-values are unbounded and can be both positive and negative.

class QNN(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dims=[48, 48]):
        super(QNN, self).__init__()
        self.state_dim = state_dim
        self.n_action = action_dim
        
        n_nodes = [state_dim] + hidden_dims + [action_dim]
        self.layers = nn.ModuleList([])
        for i in range(1, len(n_nodes)):
            self.layers.append(nn.Linear(n_nodes[i-1], n_nodes[i]))
        
    def forward(self, X):
        
        # feed through all layers
        for i in range(len(self.layers) - 1):
            X = torch.tanh(self.layers[i](X))
        
        out = self.layers[-1](X)
        
        return out

# loss function: Huber (smooth L1) loss, less sensitive to outlier TD errors than MSE
def loss_fn(yPred, yTrue):
    return torch.nn.SmoothL1Loss()(yPred, yTrue)
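A quick shape check (a sketch, not part of the training pipeline): for a batch of states the network outputs one Q-value per action, and the greedy action is the argmax over the last dimension.

qnn = QNN(state_dim=8, action_dim=4, hidden_dims=[48])
dummy_states = torch.randn(16, 8)        # a batch of 16 fake 8-dimensional states
q_values = qnn(dummy_states)             # shape (16, 4): one Q-value per action
greedy_actions = q_values.argmax(dim=1)  # shape (16,)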

The Lunar Lander Agent

class LunarLanderAgent:
    def __init__(self, env, policy_net, target_net, exp_replay):
        self.env = env
        self.state_dim = env.observation_space.shape[0]
        self.n_action = env.action_space.n
        
        # initialize NN for Q function approximation, and the associated target net
        self.pn = policy_net.to(device)               
        self.tn = target_net.to(device)              
        self.tn.load_state_dict(self.pn.state_dict())
        self.tn.eval()
        
        # initialize experience replay
        self.memo = exp_replay   
    
    def __wrapState(self, state):
        """Wrap state tuples into a torch.tensor"""
        return torch.tensor(state, dtype=torch.float32, device=device).reshape(1, -1)
    
    def chooseAction(self, state, epsilon):
        """
        Choose action using the epsilon greedy policy, with epsilon probability to choose a random action, 
        otherwise stick with policy.
        """
        explore = (torch.rand(1) < epsilon)
        if explore:  
            # need to explore
            return torch.randint(0, self.n_action, (1,)).item()
        else:        
            # pick the best move
            state = self.__wrapState(state)
            with torch.no_grad():
                action = self.pn(state).detach().argmax().item()
            return action
        
    def play(self, render=False, sleep_time=0.1):
        '''
        Run the agent for one episode. If render is False, return the total reward;
        otherwise render the environment with the given time interval between frames
        and print the total reward at the end.
        '''
        total_reward = 0
        state = self.env.reset()
        while True:
            with torch.no_grad():
                action = self.pn(self.__wrapState(state)).argmax().item()
            new_state, reward, done, _ = self.env.step(action)
            if render:
                time.sleep(sleep_time)
                self.env.render()
            total_reward += reward
            state = new_state
            if done:
                break
        if render:
            print("Total reward: %d" % total_reward)
            self.env.close()
        else:
            return total_reward

Training

The training procedure follows the original DQN algorithm described in Mnih et al. (2015), outlined below:

[Figure: the DQN algorithm with experience replay and a target network, from Mnih et al. (2015)]

One difference is that an early-stopping flag is added: training stops early once the agent can already land reliably, i.e. the mean return of the last 100 episodes is above the 200-point threshold, at least 90 of those episodes individually exceed it, and the exploration rate has almost decayed to its minimum.
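The exploration rate follows the exponential decay schedule inside train(); a minimal illustration, assuming the hyperparameters chosen further down (MIN_EPSILON = 0.05, MAX_EPSILON = 1, EPS_DECAY = 50000):

# epsilon as a function of the total number of steps taken so far
for step in [0, 50000, 100000, 200000, 500000]:
    eps = 0.05 + (1 - 0.05) * math.exp(-step / 50000)
    print(step, round(eps, 3))
# prints approximately: 1.0, 0.399, 0.179, 0.067, 0.05

Note that train() keeps epsilon at max_epsilon until the replay buffer reaches its starting size, so the decay only kicks in once learning begins.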

def processBatch(batch):
    '''Wrap elements of mini-batches into torch tensors'''
    state_dim = batch[0].state.shape[0]

    X = torch.empty((len(batch), state_dim), dtype=torch.float32)
    X_next = torch.empty((len(batch), state_dim), dtype=torch.float32)
    rewards = torch.empty((len(batch), 1), dtype=torch.float32)
    actions = torch.zeros(len(batch), dtype=torch.long)
    not_dones = torch.tensor([True] * len(batch), dtype=torch.bool)
    for i, transition in enumerate(batch):
        X[i, :] = torch.tensor(transition.state, dtype=torch.float32)            # current state
        X_next[i, :] = torch.tensor(transition.next_state, dtype=torch.float32)  # next state
        actions[i] = transition.action                                           # action
        rewards[i, :] = transition.reward                                        # rewards
        not_dones[i] = not transition.is_done                                    # whether next state is non-terminal
        
    return X.to(device), X_next.to(device), actions.to(device), rewards.to(device), not_dones.to(device)
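A quick shape check for processBatch (illustrative only: the transitions below are random rather than real environment steps, and device is set to the CPU here; it is redefined in the hyperparameter block further down):

device = torch.device("cpu")  # placeholder; redefined in the hyperparameter block below
fake_batch = [Experience(np.random.randn(8).astype(np.float32), 0, 1.0,
                         np.random.randn(8).astype(np.float32), False)
              for _ in range(4)]
X, X_next, actions, rewards, not_dones = processBatch(fake_batch)
print(X.shape, X_next.shape, actions.shape, rewards.shape, not_dones.shape)
# torch.Size([4, 8]) torch.Size([4, 8]) torch.Size([4]) torch.Size([4, 1]) torch.Size([4])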


def train(agent, optimizer, episodes=500, batch_size=32, update_rate=1, gamma=0.99, 
            max_epsilon=1, min_epsilon=0.1, eps_decay=200, reward_thres=195, C=100):
    '''Train the agent'''

    print("Training start:")
    
    # counter of total steps taken
    step_counter = 0   
    
    # record the return from every episode, for diagnostic purposes
    episode_return = []
    
    # early stop flag
    early_stop = False
    
    for i in range(episodes):
        state = agent.env.reset()
        reward_rec = 0  # record reward of this episode
        if early_stop:
            break
        while True:
            # decay epsilon
            if agent.memo.canStart():
                # do not decay epsilon when the replay memory is not large enough, instead, use max_epsilon
                epsilon = min_epsilon + (max_epsilon - min_epsilon) * math.exp(-1. * step_counter / eps_decay)
            else:
                epsilon = max_epsilon
                
            action = agent.chooseAction(state, epsilon)
            new_state, reward, done, _ = agent.env.step(action)

            step_counter += 1      # increment the counter of steps taken
            reward_rec += reward   # record the reward of this transition
            
            # new_state = agent.__wrapState(new_state)
            agent.memo.add(Experience(state, action, reward, new_state, done))
            
            # move on to the next state (easy to forget, but essential)
            state = new_state
            
            if done:
                episode_return.append(reward_rec)
                if (i+1) % 100 == 0:
                    # print something useful every 100 episodes
                    mean_reward = np.mean(episode_return[-100:]).item()                  # mean reward of the last 100 episodes
                    num_solve = np.sum(np.array(episode_return[-100:]) >= reward_thres)  # count of last 100 episodes that solved
        
                    # early stop if the current policy already solved problem and the exploration rate is low
                    if mean_reward >= reward_thres and num_solve >= 90 and epsilon < min_epsilon + 0.05:
                        early_stop = True

                    print("  -Episode: {0}/{1};\tTotal steps: {2};\tMean rewards of last 100 runs: {3};\t\
                    Count of solved episodes in last 100 runs: {4}"\
                            .format(i+1, episodes, step_counter, mean_reward, num_solve))
                break
            
            
            # check if replay memory is big enough to start learning, learn every update_rate steps
            if not agent.memo.canStart() or step_counter % update_rate != 0:
                continue
            
            # =====================
            # sample from experience memory and start learning
            # =====================
            batch = agent.memo.sample(batch_size=batch_size)
            if batch is None:
                continue
            else:
                # transform batch into predicted and true Q values
                X, X_next, actions, rewards, not_dones = processBatch(batch)
                
                yPred = agent.pn(X).gather(1, actions.view(-1, 1))
                yTrue = torch.zeros((len(batch), 1), dtype=torch.float32, device=device)
                yTrue[not_dones, :] = agent.tn(X_next[not_dones, :]).detach().max(1)[0].view(-1, 1)
                yTrue = rewards + yTrue * gamma
    
                loss = loss_fn(yPred, yTrue)
                
                optimizer.zero_grad()
                loss.backward()
                
#                 for param in agent.pn.parameters():
#                     param.grad.data.clamp_(-1, 1)
                    
                optimizer.step()
            
            if step_counter % C == 0:   # update target network after C steps
                agent.tn.load_state_dict(agent.pn.state_dict())
    
    print("Training complete!")
    plot_training(episode_return)
    return episode_return


def plot_training(episode_rewards, solve_thres=200):
    '''Plot the return of every training episode against the solving threshold'''
    x_ticks = len(episode_rewards)

    plt.figure(figsize=(4, 3))
    plt.grid(True)
    plt.plot(range(1, x_ticks+1), episode_rewards, c='b', label="episode returns")
    plt.plot(range(1, x_ticks+1), [solve_thres] * x_ticks, c='r', label="solving threshold")
    plt.title("Episode Return during Training")
    plt.legend()
    # plt.savefig("illustrations/training.png", dpi=300)
    plt.show()

def plot_test(agent, solve_thres=200):
    rewards = []
    for i in range(100):
        rewards.append(agent.play())
    
    num_solved = sum(np.array(rewards) >= solve_thres)

    plt.figure(figsize=(3.54, 2.8))
    plt.grid(True)
    plt.scatter(range(1, 101), rewards, c='b', s=1, label="episode returns")
    plt.plot(range(1, 101), [solve_thres] * 100, c='r', label="Solving threshold")

    plt.title("Episode Return of 100 runs with greedy policy, solved {}/100".format(num_solved))
    plt.legend()
    # plt.savefig("illustrations/test.png", dpi=300)
    plt.show()

Start Training:

# Hyperparameters
# -----------------------
NUM_EPISODES = 3000
BATCH_SIZE = 32
LEARNING_RATE = 0.001
GAMMA = 0.99
MAX_EPSILON = 1
MIN_EPSILON = 0.05
EPS_DECAY = 50000
EXPERIENCE_CAPACITY = 500000
EXP_START_SIZE = 20000
UPDATE_RATE = 1
REWARD_THRES = 200
C = 10000
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

env = gym.make("LunarLander-v2")

# environment attributes 
state_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# The function approximation for Q function
policy_net = QNN(state_dim, n_actions, [48])
target_net = QNN(state_dim, n_actions, [48])

# Experience replay
exp_play = ExpReplay(EXPERIENCE_CAPACITY, EXP_START_SIZE)

# make the learning agent
agent = LunarLanderAgent(env, policy_net, target_net, exp_play)

# set-up optimizer
optimizer = Adam(agent.pn.parameters(), lr=LEARNING_RATE)

# start training
reward_records = train(agent, optimizer, episodes=NUM_EPISODES, batch_size=BATCH_SIZE, update_rate=UPDATE_RATE, 
            gamma=GAMMA, max_epsilon=MAX_EPSILON, min_epsilon=MIN_EPSILON, eps_decay=EPS_DECAY, reward_thres=REWARD_THRES, C=C)

# save training info
# torch.save(agent.pn.state_dict(), "checkpoint.pt")
Training start:
  -Episode: 100/10000;	Total steps: 9067;	Mean rewards of last 100 runs: -160.88986340660983;	Count of solved episodes in last 100 runs: 0
  -Episode: 200/10000;	Total steps: 17939;	Mean rewards of last 100 runs: -160.47751106315226;	Count of solved episodes in last 100 runs: 0
  -Episode: 300/10000;	Total steps: 30335;	Mean rewards of last 100 runs: -158.31169153656703;	Count of solved episodes in last 100 runs: 0
  -Episode: 400/10000;	Total steps: 42925;	Mean rewards of last 100 runs: -113.53394949576729;	Count of solved episodes in last 100 runs: 0
  -Episode: 500/10000;	Total steps: 58738;	Mean rewards of last 100 runs: -89.43146045843977;	Count of solved episodes in last 100 runs: 0
  -Episode: 600/10000;	Total steps: 79764;	Mean rewards of last 100 runs: -51.9893973266434;	Count of solved episodes in last 100 runs: 0
  -Episode: 700/10000;	Total steps: 130868;	Mean rewards of last 100 runs: -18.182056879256944;	Count of solved episodes in last 100 runs: 0
  -Episode: 800/10000;	Total steps: 221585;	Mean rewards of last 100 runs: -29.624186275988762;	Count of solved episodes in last 100 runs: 3
  -Episode: 900/10000;	Total steps: 302616;	Mean rewards of last 100 runs: 39.47518648175584;	Count of solved episodes in last 100 runs: 17
  -Episode: 1000/10000;	Total steps: 376357;	Mean rewards of last 100 runs: 101.38140501057693;	Count of solved episodes in last 100 runs: 21
  -Episode: 1100/10000;	Total steps: 453062;	Mean rewards of last 100 runs: 79.35989873385518;	Count of solved episodes in last 100 runs: 12
  -Episode: 1200/10000;	Total steps: 534772;	Mean rewards of last 100 runs: 59.02408128299986;	Count of solved episodes in last 100 runs: 14
  -Episode: 1300/10000;	Total steps: 617943;	Mean rewards of last 100 runs: 72.61160251534045;	Count of solved episodes in last 100 runs: 26
  -Episode: 1400/10000;	Total steps: 698440;	Mean rewards of last 100 runs: 121.4529622869516;	Count of solved episodes in last 100 runs: 26
  -Episode: 1500/10000;	Total steps: 762849;	Mean rewards of last 100 runs: 184.06175787646396;	Count of solved episodes in last 100 runs: 49
  -Episode: 1600/10000;	Total steps: 823373;	Mean rewards of last 100 runs: 198.621072098437;	Count of solved episodes in last 100 runs: 63
  -Episode: 1700/10000;	Total steps: 881620;	Mean rewards of last 100 runs: 197.92096811870906;	Count of solved episodes in last 100 runs: 60
  -Episode: 1800/10000;	Total steps: 940180;	Mean rewards of last 100 runs: 177.97471911411057;	Count of solved episodes in last 100 runs: 68
  -Episode: 1900/10000;	Total steps: 996553;	Mean rewards of last 100 runs: 211.3775879535419;	Count of solved episodes in last 100 runs: 66
  -Episode: 2000/10000;	Total steps: 1041880;	Mean rewards of last 100 runs: 227.34789115259966;	Count of solved episodes in last 100 runs: 90
Training complete!
plot_training(reward_records, solve_thres=200)
plot_test(agent, solve_thres=200)
agent.play(render=True)
Total reward: 221
# average reward of 100 trials
rewards = 0
for i in range(100):
    rewards += agent.play()

rewards/100
204.6834266441645
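To reuse the trained agent later without retraining, the policy network's weights can be saved and restored (a minimal sketch reusing the checkpoint.pt path from the commented-out line in the training cell):

# save the trained policy network
torch.save(agent.pn.state_dict(), "checkpoint.pt")

# later: rebuild the same architecture and load the weights for evaluation
restored_net = QNN(state_dim, n_actions, [48])
restored_net.load_state_dict(torch.load("checkpoint.pt", map_location=device))
restored_net.eval()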

Conclusion

From the test, only about 9 of the 100 greedy trials scored below 200 points, i.e. more than 90% of trials landed successfully with high returns. A single trial achieved 221 points, and the average return over 100 trials is 204.68, above the 200-point solving threshold. We can therefore conclude that the agent solves LunarLander-v2, succeeding in more than 90% of trials.