Author: Jacob Chapman and Mathias Lechner
Date created: 2020/05/23
Last modified: 2024/03/16
Description: Play Atari Breakout with a Deep Q-Network.
This script shows an implementation of Deep Q-Learning on the
BreakoutNoFrameskip-v4 environment.
As an agent takes actions and moves through an environment, it learns to map the observed state of the environment to an action. An agent will choose an action in a given state based on a "Q-value", which is a weighted reward based on the expected highest long-term reward. A Q-Learning agent learns to perform its task such that the recommended action maximizes the potential future rewards. This method is considered an "off-policy" method, meaning its Q-values are updated assuming that the best action was chosen, even if the best action was not chosen.
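For reference, here is a minimal sketch of the tabular Q-learning update described above (illustrative only; the table size, alpha, and the q_update helper are assumptions and not part of this example, which approximates the Q-function with a neural network instead of a table):

import numpy as np

# Illustrative tabular Q-learning update (not used in the script below).
num_states, num_actions = 16, 4
q_table = np.zeros((num_states, num_actions))
alpha, gamma = 0.1, 0.99  # assumed learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Off-policy target: assume the best action is taken in the next state.
    target = reward + gamma * np.max(q_table[next_state])
    q_table[state, action] += alpha * (target - q_table[state, action])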
In this environment, a paddle moves along the bottom of the screen, returning a ball that destroys blocks at the top of the screen. The aim of the game is to remove all blocks and break out of the level. The agent must learn to control the paddle by moving it left and right, returning the ball and removing all the blocks without letting the ball pass the paddle.
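Below is a rough sketch of how an agent interacts with this environment through the Gymnasium API, using a purely random policy (illustrative only; it assumes the Atari ROMs are available, e.g. installed via the ale-py/AutoROM packages):

import gymnasium as gym

# Random agent, just to illustrate the observation/action/reward loop.
env = gym.make("BreakoutNoFrameskip-v4")
observation, info = env.reset(seed=0)
for _ in range(100):
    action = env.action_space.sample()  # Breakout has 4 discrete actions
    observation, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        observation, info = env.reset()
env.close()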
The Deepmind paper trained for "a total of 50 million frames (that is, around 38 days of game experience in total)". However, this script will give good results at around 10 million frames, which can be processed in less than 24 hours on a modern machine.
You can control the number of episodes by setting the max_episodes
variable to a value greater than 0.
import os
os.environ["KERAS_BACKEND"] = "tensorflow"
import keras
from keras import layers
import gymnasium as gym
from gymnasium.wrappers import AtariPreprocessing, FrameStack
import numpy as np
import tensorflow as tf
# Configuration parameters for the whole setup
seed = 42
gamma = 0.99 # Discount factor for past rewards
epsilon = 1.0 # Epsilon greedy parameter
epsilon_min = 0.1 # Minimum epsilon greedy parameter
epsilon_max = 1.0 # Maximum epsilon greedy parameter
epsilon_interval = (
    epsilon_max - epsilon_min
)  # Rate at which to reduce chance of random action being taken
batch_size = 32 # Size of batch taken from replay buffer
max_steps_per_episode = 10000
max_episodes = 10 # Limit training episodes, will run until solved if smaller than 1
# Use the Atari environment
# Specify the `render_mode` parameter to show the attempts of the agent in a pop up window.
env = gym.make("BreakoutNoFrameskip-v4") # , render_mode="human")
# Environment preprocessing
env = AtariPreprocessing(env)
# Stack four frames
env = FrameStack(env, 4)
env.seed(seed)
A.L.E: Arcade Learning Environment (version 0.8.1+unknown)
[Powered by Stella]
Game console created:
ROM file: /Users/luca/mambaforge/envs/keras-io/lib/python3.9/site-packages/AutoROM/roms/breakout.bin
Cart Name: Breakout - Breakaway IV (1978) (Atari)
Cart MD5: f34f08e5eb96e500e851a80be3277a56
Display Format: AUTO-DETECT ==> NTSC
ROM Size: 2048
Bankswitch Type: AUTO-DETECT ==> 2K
Running ROM file...
Random seed is -975249067
Game console created:
ROM file: /Users/luca/mambaforge/envs/keras-io/lib/python3.9/site-packages/AutoROM/roms/breakout.bin
Cart Name: Breakout - Breakaway IV (1978) (Atari)
Cart MD5: f34f08e5eb96e500e851a80be3277a56
Display Format: AUTO-DETECT ==> NTSC
ROM Size: 2048
Bankswitch Type: AUTO-DETECT ==> 2K
Running ROM file...
Random seed is -1625411987
(3444837047, 2669555309)
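As a quick check (a sketch, not part of the original script), the preprocessed observation returned by the wrapped environment should be four stacked 84x84 grayscale frames:

observation, _ = env.reset()
print(np.array(observation).shape)  # expected: (4, 84, 84)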
This network learns an approximation of the Q-table, which is a mapping between the states and the actions that the agent will take. For every state we'll have four actions that can be taken. The environment provides the state, and the action is chosen by selecting the largest of the four Q-values predicted in the output layer.
num_actions = 4
def create_q_model():
    # Network defined by the Deepmind paper
    return keras.Sequential(
        [
            layers.Lambda(
                lambda tensor: keras.ops.transpose(tensor, [0, 2, 3, 1]),
                output_shape=(84, 84, 4),
                input_shape=(4, 84, 84),
            ),
            # Convolutions on the frames on the screen
            layers.Conv2D(32, 8, strides=4, activation="relu"),
            layers.Conv2D(64, 4, strides=2, activation="relu"),
            layers.Conv2D(64, 3, strides=1, activation="relu"),
            layers.Flatten(),
            layers.Dense(512, activation="relu"),
            layers.Dense(num_actions, activation="linear"),
        ]
    )
# The first model makes the predictions for Q-values which are used to
# take an action.
model = create_q_model()
# Build a target model for the prediction of future rewards.
# The weights of a target model get updated every 10000 steps, so the target
# Q-value is stable when the loss between the Q-values is calculated.
model_target = create_q_model()
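As a quick sanity check (a sketch, not part of the original script), you can pass a dummy batch through the freshly created model and confirm it outputs one Q-value per action:

# Illustrative shape check: a batch of one stacked observation of shape (4, 84, 84).
dummy_state = np.zeros((1, 4, 84, 84), dtype="float32")
print(model(dummy_state).shape)  # expected: (1, num_actions), i.e. (1, 4)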
# In the Deepmind paper they use RMSProp, however the Adam optimizer
# improves training time
optimizer = keras.optimizers.Adam(learning_rate=0.00025, clipnorm=1.0)
# Experience replay buffers
action_history = []
state_history = []
state_next_history = []
rewards_history = []
done_history = []
episode_reward_history = []
running_reward = 0
episode_count = 0
frame_count = 0
# Number of frames to take random action and observe output
epsilon_random_frames = 50000
# Number of frames for exploration
epsilon_greedy_frames = 1000000.0
# Maximum replay length
# Note: The Deepmind paper suggests 1000000 however this causes memory issues
max_memory_length = 100000
# Train the model after 4 actions
update_after_actions = 4
# How often to update the target network
update_target_network = 10000
# Using huber loss for stability
loss_function = keras.losses.Huber()
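# Note (illustrative, not in the original script): epsilon is decayed by
# epsilon_interval / epsilon_greedy_frames on every frame in the loop below,
# so it reaches epsilon_min after roughly epsilon_greedy_frames = 1,000,000
# frames; during the first epsilon_random_frames frames the action is random
# regardless of the current epsilon.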
while True:
    observation, _ = env.reset()
    state = np.array(observation)
    episode_reward = 0
    for timestep in range(1, max_steps_per_episode):
        frame_count += 1
        # Use epsilon-greedy for exploration
        if frame_count < epsilon_random_frames or epsilon > np.random.rand(1)[0]:
            # Take random action
            action = np.random.choice(num_actions)
        else:
            # Predict action Q-values
            # From environment state
            state_tensor = keras.ops.convert_to_tensor(state)
            state_tensor = keras.ops.expand_dims(state_tensor, 0)
            action_probs = model(state_tensor, training=False)
            # Take best action
            action = keras.ops.argmax(action_probs[0]).numpy()
        # Decay probability of taking random action
        epsilon -= epsilon_interval / epsilon_greedy_frames
        epsilon = max(epsilon, epsilon_min)
        # Apply the sampled action in our environment
        state_next, reward, done, _, _ = env.step(action)
        state_next = np.array(state_next)
        episode_reward += reward
        # Save actions and states in replay buffer
        action_history.append(action)
        state_history.append(state)
        state_next_history.append(state_next)
        done_history.append(done)
        rewards_history.append(reward)
        state = state_next
        # Update every fourth frame and once batch size is over 32
        if frame_count % update_after_actions == 0 and len(done_history) > batch_size:
            # Get indices of samples for replay buffers
            indices = np.random.choice(range(len(done_history)), size=batch_size)
            # Using list comprehension to sample from replay buffer
            state_sample = np.array([state_history[i] for i in indices])
            state_next_sample = np.array([state_next_history[i] for i in indices])
            rewards_sample = [rewards_history[i] for i in indices]
            action_sample = [action_history[i] for i in indices]
            done_sample = keras.ops.convert_to_tensor(
                [float(done_history[i]) for i in indices]
            )
            # Build the updated Q-values for the sampled future states
            # Use the target model for stability
            future_rewards = model_target.predict(state_next_sample)
            # Q value = reward + discount factor * expected future reward
            updated_q_values = rewards_sample + gamma * keras.ops.amax(
                future_rewards, axis=1
            )
            # If final frame set the last value to -1
            updated_q_values = updated_q_values * (1 - done_sample) - done_sample
            # Create a mask so we only calculate loss on the updated Q-values
            masks = keras.ops.one_hot(action_sample, num_actions)
            with tf.GradientTape() as tape:
                # Train the model on the states and updated Q-values
                q_values = model(state_sample)
                # Apply the masks to the Q-values to get the Q-value for action taken
                q_action = keras.ops.sum(keras.ops.multiply(q_values, masks), axis=1)
                # Calculate loss between new Q-value and old Q-value
                loss = loss_function(updated_q_values, q_action)
            # Backpropagation
            grads = tape.gradient(loss, model.trainable_variables)
            optimizer.apply_gradients(zip(grads, model.trainable_variables))
        if frame_count % update_target_network == 0:
            # update the target network with new weights
            model_target.set_weights(model.get_weights())
            # Log details
            template = "running reward: {:.2f} at episode {}, frame count {}"
            print(template.format(running_reward, episode_count, frame_count))
        # Limit the state and reward history
        if len(rewards_history) > max_memory_length:
            del rewards_history[:1]
            del state_history[:1]
            del state_next_history[:1]
            del action_history[:1]
            del done_history[:1]
        if done:
            break
    # Update running reward to check condition for solving
    episode_reward_history.append(episode_reward)
    if len(episode_reward_history) > 100:
        del episode_reward_history[:1]
    running_reward = np.mean(episode_reward_history)
    episode_count += 1
    if running_reward > 40:  # Condition to consider the task solved
        print("Solved at episode {}!".format(episode_count))
        break
    if (
        max_episodes > 0 and episode_count >= max_episodes
    ):  # Maximum number of episodes reached
        print("Stopped at episode {}!".format(episode_count))
        break
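Once training stops, you could save the trained weights and watch the greedy policy play one episode. A minimal sketch (the filename and the evaluation loop are illustrative choices, not part of the original example):

# Save the trained Q-network (hypothetical filename).
model.save("breakout_qnet.keras")

# Play one episode greedily with the trained model.
observation, _ = env.reset()
state = np.array(observation)
eval_reward = 0.0
for _ in range(max_steps_per_episode):
    state_tensor = keras.ops.expand_dims(keras.ops.convert_to_tensor(state), 0)
    action = keras.ops.argmax(model(state_tensor, training=False)[0]).numpy()
    state, reward, done, _, _ = env.step(action)
    state = np.array(state)
    eval_reward += reward
    if done:
        break
print("Greedy episode reward:", eval_reward)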
Before any training:
In early stages of training:
In later stages of training: