"""
Title: Proximal Policy Optimization
Author: [Ilias Chrysovergis](https://twitter.com/iliachry)
Date created: 2021/06/24
Last modified: 2024/03/12
Description: Implementation of a Proximal Policy Optimization agent for the CartPole-v1 environment.
Accelerator: None
"""
"""
## Introduction
This code example solves the CartPole-v1 environment using a Proximal Policy Optimization (PPO) agent.
### CartPole-v1
A pole is attached by an un-actuated joint to a cart, which moves along a frictionless track.
The system is controlled by applying a force of +1 or -1 to the cart.
The pendulum starts upright, and the goal is to prevent it from falling over.
A reward of +1 is provided for every timestep that the pole remains upright.
The episode ends when the pole is more than 12 degrees from vertical, or the cart moves more than 2.4 units from the center.
The episode is also truncated after 500 steps, so the highest return we can get is 500.
[CartPole-v1](https://gymnasium.farama.org/environments/classic_control/cart_pole/)
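As a quick point of reference (not part of the original example), the environment can be
driven directly through the `gymnasium` API; the short sketch below runs one random-action
episode, and the variable name `env_demo` is hypothetical.
"""

import gymnasium as gym

# Illustrative only: one random-policy episode in CartPole-v1.
env_demo = gym.make("CartPole-v1")
obs, info = env_demo.reset(seed=0)
done, total_reward = False, 0.0
while not done:
    action = env_demo.action_space.sample()  # 0: push cart to the left, 1: push to the right
    obs, reward, terminated, truncated, info = env_demo.step(action)
    total_reward += reward
    done = terminated or truncated
env_demo.close()

"""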
### Proximal Policy Optimization
PPO is a policy gradient method and can be used for environments with either discrete or continuous action spaces.
It trains a stochastic policy in an on-policy way and uses an actor-critic setup: the actor maps an
observation to a distribution over actions, while the critic estimates the expected return for the given observation.
At each epoch, the agent collects a set of trajectories by sampling from the latest version of the stochastic policy.
The rewards-to-go and the advantage estimates are then computed and used to update the policy and fit the value function.
The policy is updated via stochastic gradient ascent on a clipped surrogate objective, while the value function is fitted via gradient descent on a mean-squared-error loss.
This procedure is repeated for many epochs until the environment is solved.
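
For reference, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$
and advantage estimate $\hat{A}_t$, the clipped surrogate objective from the paper is

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\;
\text{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\,\hat{A}_t\right)\right],$$

and the advantages are computed with Generalized Advantage Estimation (GAE):

$$\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}, \qquad
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t).$$

In the code below, the clip term is implemented with an equivalent `keras.ops.where` on the sign of the advantage.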

- [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)
- [OpenAI Spinning Up docs - PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
### Note
This code example uses Keras and TensorFlow v2. It is based on the original PPO paper,
OpenAI's Spinning Up docs for PPO, and OpenAI's Spinning Up implementation of PPO using TensorFlow v1.
[OpenAI Spinning Up Github - PPO](https://github.com/openai/spinningup/blob/master/spinup/algos/tf1/ppo/ppo.py)
"""
"""
## Libraries
For this example the following libraries are used:
1. `numpy` for n-dimensional arrays
2. `tensorflow` and `keras` for building the deep RL PPO agent
3. `gymnasium` for the CartPole-v1 environment
4. `scipy.signal` for calculating the discounted cumulative sums of vectors
"""
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
from keras import layers
import numpy as np
import tensorflow as tf
import gymnasium as gym
import scipy.signal
"""
## Functions and class
"""
def discounted_cumulative_sums(x, discount):
    # Discounted cumulative sums of vectors for computing rewards-to-go and advantage estimates
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]
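
# Quick sanity check of the filter trick above (illustrative only; the values are hypothetical):
# each output entry equals x[t] + discount * x[t + 1] + discount**2 * x[t + 2] + ...
assert np.allclose(
    discounted_cumulative_sums(np.array([1.0, 1.0, 1.0], dtype=np.float32), 0.5),
    [1.75, 1.5, 1.0],
)
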
class Buffer:
    # Buffer for storing trajectories
    def __init__(self, observation_dimensions, size, gamma=0.99, lam=0.95):
        # Buffer initialization
        self.observation_buffer = np.zeros(
            (size, observation_dimensions), dtype=np.float32
        )
        self.action_buffer = np.zeros(size, dtype=np.int32)
        self.advantage_buffer = np.zeros(size, dtype=np.float32)
        self.reward_buffer = np.zeros(size, dtype=np.float32)
        self.return_buffer = np.zeros(size, dtype=np.float32)
        self.value_buffer = np.zeros(size, dtype=np.float32)
        self.logprobability_buffer = np.zeros(size, dtype=np.float32)
        self.gamma, self.lam = gamma, lam
        self.pointer, self.trajectory_start_index = 0, 0

    def store(self, observation, action, reward, value, logprobability):
        # Append one step of agent-environment interaction
        self.observation_buffer[self.pointer] = observation
        self.action_buffer[self.pointer] = action
        self.reward_buffer[self.pointer] = reward
        self.value_buffer[self.pointer] = value
        self.logprobability_buffer[self.pointer] = logprobability
        self.pointer += 1

    def finish_trajectory(self, last_value=0):
        # Finish the trajectory by computing advantage estimates and rewards-to-go
        path_slice = slice(self.trajectory_start_index, self.pointer)
        rewards = np.append(self.reward_buffer[path_slice], last_value)
        values = np.append(self.value_buffer[path_slice], last_value)

        deltas = rewards[:-1] + self.gamma * values[1:] - values[:-1]

        self.advantage_buffer[path_slice] = discounted_cumulative_sums(
            deltas, self.gamma * self.lam
        )
        self.return_buffer[path_slice] = discounted_cumulative_sums(
            rewards, self.gamma
        )[:-1]

        self.trajectory_start_index = self.pointer

    def get(self):
        # Get all data of the buffer and normalize the advantages
        self.pointer, self.trajectory_start_index = 0, 0
        advantage_mean, advantage_std = (
            np.mean(self.advantage_buffer),
            np.std(self.advantage_buffer),
        )
        self.advantage_buffer = (self.advantage_buffer - advantage_mean) / advantage_std
        return (
            self.observation_buffer,
            self.action_buffer,
            self.advantage_buffer,
            self.return_buffer,
            self.logprobability_buffer,
        )

def mlp(x, sizes, activation=keras.activations.tanh, output_activation=None):
    # Build a feedforward neural network
    for size in sizes[:-1]:
        x = layers.Dense(units=size, activation=activation)(x)
    return layers.Dense(units=sizes[-1], activation=output_activation)(x)

def logprobabilities(logits, a):
    # Compute the log-probabilities of taking actions a by using the logits (i.e. the output of the actor)
    logprobabilities_all = keras.ops.log_softmax(logits)
    logprobability = keras.ops.sum(
        keras.ops.one_hot(a, num_actions) * logprobabilities_all, axis=1
    )
    return logprobability

seed_generator = keras.random.SeedGenerator(1337)
# Sample action from actor
@tf.function
def sample_action(observation):
    logits = actor(observation)
    action = keras.ops.squeeze(
        keras.random.categorical(logits, 1, seed=seed_generator), axis=1
    )
    return logits, action

# Train the policy by maximizing the PPO-Clip objective
@tf.function
def train_policy(
    observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        ratio = keras.ops.exp(
            logprobabilities(actor(observation_buffer), action_buffer)
            - logprobability_buffer
        )
        min_advantage = keras.ops.where(
            advantage_buffer > 0,
            (1 + clip_ratio) * advantage_buffer,
            (1 - clip_ratio) * advantage_buffer,
        )

        policy_loss = -keras.ops.mean(
            keras.ops.minimum(ratio * advantage_buffer, min_advantage)
        )
    policy_grads = tape.gradient(policy_loss, actor.trainable_variables)
    policy_optimizer.apply_gradients(zip(policy_grads, actor.trainable_variables))

    kl = keras.ops.mean(
        logprobability_buffer
        - logprobabilities(actor(observation_buffer), action_buffer)
    )
    kl = keras.ops.sum(kl)
    return kl

# Train the value function by regression on mean-squared error
@tf.function
def train_value_function(observation_buffer, return_buffer):
    with tf.GradientTape() as tape:  # Record operations for automatic differentiation.
        value_loss = keras.ops.mean((return_buffer - critic(observation_buffer)) ** 2)
    value_grads = tape.gradient(value_loss, critic.trainable_variables)
    value_optimizer.apply_gradients(zip(value_grads, critic.trainable_variables))

"""
## Hyperparameters
"""
# Hyperparameters of the PPO algorithm
steps_per_epoch = 4000
epochs = 30
gamma = 0.99
clip_ratio = 0.2
policy_learning_rate = 3e-4
value_function_learning_rate = 1e-3
train_policy_iterations = 80
train_value_iterations = 80
lam = 0.97
target_kl = 0.01
hidden_sizes = (64, 64)
# True if you want to render the environment
render = False
"""
## Initializations
"""
# Initialize the environment and get the dimensionality of the
# observation space and the number of possible actions
env = gym.make("CartPole-v1")
observation_dimensions = env.observation_space.shape[0]
num_actions = env.action_space.n
# Initialize the buffer
buffer = Buffer(observation_dimensions, steps_per_epoch)
# Initialize the actor and the critic as keras models
observation_input = keras.Input(shape=(observation_dimensions,), dtype="float32")
logits = mlp(observation_input, list(hidden_sizes) + [num_actions])
actor = keras.Model(inputs=observation_input, outputs=logits)
value = keras.ops.squeeze(mlp(observation_input, list(hidden_sizes) + [1]), axis=1)
critic = keras.Model(inputs=observation_input, outputs=value)
# Initialize the policy and the value function optimizers
policy_optimizer = keras.optimizers.Adam(learning_rate=policy_learning_rate)
value_optimizer = keras.optimizers.Adam(learning_rate=value_function_learning_rate)
# Initialize the observation, episode return and episode length
observation, _ = env.reset()
episode_return, episode_length = 0, 0
"""
## Train
"""
# Iterate over the number of epochs
for epoch in range(epochs):
    # Initialize the sum of the returns, lengths and number of episodes for each epoch
    sum_return = 0
    sum_length = 0
    num_episodes = 0

    # Iterate over the steps of each epoch
    for t in range(steps_per_epoch):
        if render:
            env.render()

        # Get the logits, action, and take one step in the environment
        observation = observation.reshape(1, -1)
        logits, action = sample_action(observation)
        observation_new, reward, done, _, _ = env.step(action[0].numpy())
        episode_return += reward
        episode_length += 1

        # Get the value and log-probability of the action
        value_t = critic(observation)
        logprobability_t = logprobabilities(logits, action)

        # Store obs, act, rew, v_t, logp_pi_t
        buffer.store(observation, action, reward, value_t, logprobability_t)

        # Update the observation
        observation = observation_new

        # Finish trajectory if a terminal state is reached
        terminal = done
        if terminal or (t == steps_per_epoch - 1):
            last_value = 0 if done else critic(observation.reshape(1, -1))
            buffer.finish_trajectory(last_value)
            sum_return += episode_return
            sum_length += episode_length
            num_episodes += 1
            observation, _ = env.reset()
            episode_return, episode_length = 0, 0

    # Get values from the buffer
    (
        observation_buffer,
        action_buffer,
        advantage_buffer,
        return_buffer,
        logprobability_buffer,
    ) = buffer.get()

    # Update the policy and implement early stopping using KL divergence
    for _ in range(train_policy_iterations):
        kl = train_policy(
            observation_buffer, action_buffer, logprobability_buffer, advantage_buffer
        )
        if kl > 1.5 * target_kl:
            # Early Stopping
            break

    # Update the value function
    for _ in range(train_value_iterations):
        train_value_function(observation_buffer, return_buffer)

    # Print mean return and length for each epoch
    print(
        f" Epoch: {epoch + 1}. Mean Return: {sum_return / num_episodes}. Mean Length: {sum_length / num_episodes}"
    )
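
"""
The visualizations below show rollouts of the agent at different stages of training. As a
rough sketch (not part of the original script, and using the hypothetical names `eval_env`
and `num_eval_episodes`), the trained actor could be evaluated greedily like this:
"""

# Illustrative only: roll out the trained actor greedily for a few episodes.
eval_env = gym.make("CartPole-v1")
num_eval_episodes = 5
for _ in range(num_eval_episodes):
    obs, _ = eval_env.reset()
    done, eval_return = False, 0.0
    while not done:
        logits = actor(obs.reshape(1, -1))
        greedy_action = int(keras.ops.argmax(logits, axis=1)[0])
        obs, reward, terminated, truncated, _ = eval_env.step(greedy_action)
        eval_return += reward
        done = terminated or truncated
    print(f"Evaluation return: {eval_return}")
eval_env.close()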
"""
## Visualizations
Before training:

After 8 epochs of training:

After 20 epochs of training:

"""