Cart Pole example

While applying neuro-evolution to datasets with a clear output for each input is interesting, it really shines in scenarios where there is no clear “good” answer. Typically reinforcement learning is used for such scenarios, but neuro-evolution is a worthy alternative. In this example we will apply NeuralFit to one of OpenAI’s Gym environments, in particular the Cart Pole environment, where a cart has to balance a pole subject to gravity. We start by importing neuralfit (for fitting), numpy (for arrays) and gym (for the environment).

import neuralfit as nf
import numpy as np
import gym

Next, we initialize the gym environment. We do not enable rendering because this considerably slows down training.

env = gym.make('CartPole-v1')

Afterwards, we create and define the model. The model has 4 inputs (the state) and 1 output (the action); for more information on this, visit the Cart Pole environment documentation. We then compile the model. Note that we should not include a metric, since we are going to use func_evolve, which uses a fitness function to assess the performance of a model instead. We do supply the size monitor, since we want to track the size of the best-performing model during evolution.

model = nf.Model(4,1)
model.compile(monitors=['size'])
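
If you want to verify where the 4 inputs and the binary action come from, Gym exposes the observation and action spaces on the environment itself; this quick check is optional and not part of the NeuralFit workflow:

print(env.observation_space.shape)  # (4,): cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space.n)           # 2: push the cart left (0) or right (1)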

Now comes the most important part: defining the function that evaluates a group of networks (called genomes; together they form the population). Essentially we loop over the population, playing a single epoch for each genome and recording its cumulative reward during that epoch. Every epoch runs until one of the termination conditions has been reached (see the Cart Pole documentation). Note that it is very important to seed the environments identically for each genome, so that their relative rewards reflect their relative performance (and do not include “luck” due to the starting conditions).

def evaluate(genomes):
    losses = np.zeros(len(genomes))

    # Use the same seed for every genome so their rewards are comparable
    random_seed = np.random.randint(0, 1000)

    for i in range(len(genomes)):
        observation, _ = env.reset(seed=random_seed)
        for t in range(1000):
            # Map the network output to a discrete action (0 or 1)
            observation = np.reshape(observation, (1, 4))
            action = int(np.clip(genomes[i].predict(observation), 0, 1)[0][0])

            # Take a step and accumulate the (negated) reward as loss
            observation, reward, terminated, truncated, _ = env.step(action)
            losses[i] -= reward
            if terminated or truncated:
                break

    return losses
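
Optionally, you can sanity-check the evaluation function on the still-untrained model before starting evolution; expect only a small (negative) cumulative reward, since an untrained network lets the pole fall almost immediately:

# Optional: run the evaluation once to confirm everything works end-to-end
print(evaluate([model]))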

At this point we can simply call model.func_evolve to train the model! We specify 50 epochs, which should usually be plenty of time to get good results.

model.func_evolve(evaluate, epochs=50)

If the model is trained successfully, you should find a score of -1000. This indicates that the model was able to balance the pole until the end of the epoch, which lasts for at most 1000 timesteps. We can visualize the control behavior of the resulting model by re-initializing the environment, this time with the human rendering mode. Afterwards, we call the evaluation function manually on the evolved model.

env = gym.make('CartPole-v1', render_mode='human')
evaluate([model])

If all went well, you should get an animation similar to the one below! To build some intuition, try changing the number of timesteps per epoch and study the control behavior. What happens if the number of timesteps is too low?
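
One way to experiment with the timestep budget is to wrap the evaluation function in a small factory. Note that make_evaluate and the value 100 below are our own illustrative choices, not part of NeuralFit:

# Hypothetical helper: build an evaluation function with a custom timestep cap
def make_evaluate(max_timesteps):
    def evaluate(genomes):
        losses = np.zeros(len(genomes))
        random_seed = np.random.randint(0, 1000)
        for i in range(len(genomes)):
            observation, _ = env.reset(seed=random_seed)
            for t in range(max_timesteps):
                observation = np.reshape(observation, (1, 4))
                action = int(np.clip(genomes[i].predict(observation), 0, 1)[0][0])
                observation, reward, terminated, truncated, _ = env.step(action)
                losses[i] -= reward
                if terminated or truncated:
                    break
        return losses
    return evaluate

# For example, evolve with very short epochs and compare the resulting behavior
model.func_evolve(make_evaluate(100), epochs=50)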