Train a Deep Q Network with TF-Agents

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

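import os

# Quiet TensorFlow's C++ logging: '3' filters out INFO, WARNING, and ERROR messages.
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
# Silence initialization log warnings from the underlying gRPC and Abseil libraries.
os.environ["GRPC_VERBOSITY"] = "ERROR"
os.environ["GLOG_minloglevel"] = "3"  # 0: INFO, 1: WARNING, 2: ERROR, 3: FATAL
# Let TensorFlow grow GPU memory on demand. This must be set in the Python
# process itself; a shell `!export` only affects a short-lived subshell.
os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"

import logging
import tensorflow as tf

tf.get_logger().setLevel(logging.ERROR)
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)

from pathlib import Path

# Scratch directory for files produced while running the notebook.
temp_dir = Path(".temp")
temp_dir.mkdir(parents=True, exist_ok=True)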

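Introduction#

This example shows how to train a DQN (Deep Q Network) agent on the Cartpole environment using the TF-Agents library.

Cartpole environment

It walks you through every component of a reinforcement learning (RL) pipeline for training, evaluation, and data collection.

To run this code live, open the notebook in Google Colab.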

Setup#

If you have not installed the following dependencies, run:

sudo apt-get update
sudo apt-get install -y xvfb ffmpeg freeglut3-dev
pip install imageio
pip install pyvirtualdisplay
pip install tf-agents[reverb]
pip install pyglet

import base64
import imageio
import IPython
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import PIL.Image
import pyvirtualdisplay
import reverb

import tensorflow as tf

from tf_agents.agents.dqn import dqn_agent
from tf_agents.drivers import py_driver
from tf_agents.environments import suite_gym
from tf_agents.environments import tf_py_environment
from tf_agents.eval import metric_utils
from tf_agents.metrics import tf_metrics
from tf_agents.networks import sequential
from tf_agents.policies import py_tf_eager_policy
from tf_agents.policies import random_tf_policy
from tf_agents.replay_buffers import reverb_replay_buffer
from tf_agents.replay_buffers import reverb_utils
from tf_agents.trajectories import trajectory
from tf_agents.specs import tensor_spec
from tf_agents.utils import common

# Set up a virtual display for rendering OpenAI gym environments.
display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()

tf.version.VERSION

Hyperparameters#

num_iterations = 20000 # @param {type:"integer"}

initial_collect_steps = 100  # @param {type:"integer"}
collect_steps_per_iteration = 1  # @param {type:"integer"}
replay_buffer_max_length = 100000  # @param {type:"integer"}

batch_size = 64  # @param {type:"integer"}
learning_rate = 1e-3  # @param {type:"number"}
log_interval = 200  # @param {type:"integer"}

num_eval_episodes = 10  # @param {type:"integer"}
eval_interval = 1000  # @param {type:"integer"}

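Environment#

In reinforcement learning (RL), an environment represents the task or problem to be solved. Standard environments can be created in TF-Agents using the tf_agents.environments suites. TF-Agents has suites for loading environments from sources such as OpenAI Gym, Atari, and DM Control.

Load the CartPole environment from the OpenAI Gym suite.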

env_name = 'CartPole-v0'
env = suite_gym.load(env_name)

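You can render this environment to see how it looks. A free-swinging pole is attached to a cart. The goal is to move the cart right or left so that the pole keeps pointing up.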

#@test {"skip": true}
env.reset()
PIL.Image.fromarray(env.render())

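The environment.step method takes an action in the environment and returns a TimeStep tuple containing the next observation of the environment and the reward for the action.

The time_step_spec() method returns the specification for the TimeStep tuple. Its observation attribute shows the shape of the observations, their data types, and the ranges of allowed values. The reward attribute shows the same details for the reward.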

print('Observation Spec:')
print(env.time_step_spec().observation)

print('Reward Spec:')
print(env.time_step_spec().reward)

The action_spec() method returns the shape, data types, and allowed values of valid actions.

print('Action Spec:')
print(env.action_spec())

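In the Cartpole environment:

- observation is an array of 4 floats:
  - the position and velocity of the cart
  - the angular position and velocity of the pole
- reward is a scalar float value
- action is a scalar integer with only two possible values:
  - 0: "move left"
  - 1: "move right"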
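
As a rough illustration of the reset/step contract described above, the sketch below builds a toy environment in plain Python. It is not the TF-Agents implementation: `CountdownEnv`, `StepType`, and this `TimeStep` tuple are hypothetical stand-ins, and the "dynamics" are made up (no physics), chosen only so the shape of the data is visible.

```python
from collections import namedtuple
from enum import IntEnum


class StepType(IntEnum):
    """Marks where a time step sits within an episode (hypothetical stand-in)."""
    FIRST = 0
    MID = 1
    LAST = 2


# Hypothetical stand-in for the TimeStep tuple discussed in the text.
TimeStep = namedtuple("TimeStep", ["step_type", "reward", "discount", "observation"])


class CountdownEnv:
    """Toy episodic environment: every step earns +1 until max_steps is reached."""

    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.steps_taken = 0

    def reset(self):
        self.steps_taken = 0
        # Observation layout: cart position, cart velocity, pole angle, pole angular velocity.
        return TimeStep(StepType.FIRST, 0.0, 1.0, [0.0, 0.0, 0.0, 0.0])

    def step(self, action):
        if action not in (0, 1):
            raise ValueError("action must be 0 (move left) or 1 (move right)")
        self.steps_taken += 1
        done = self.steps_taken >= self.max_steps
        step_type = StepType.LAST if done else StepType.MID
        nudge = 0.1 if action == 1 else -0.1  # made-up dynamics, not real physics
        observation = [nudge * self.steps_taken, nudge, 0.0, 0.0]
        return TimeStep(step_type, 1.0, 0.0 if done else 1.0, observation)


toy_env = CountdownEnv(max_steps=3)
first = toy_env.reset()
second = toy_env.step(1)
print(first)
print(second)
```

The real environment returns NumPy arrays rather than Python lists, but the pattern is the same: `reset()` yields the first time step, and each `step(action)` yields the next observation together with the reward for that action.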

time_step = env.reset()
print('Time step:')
print(time_step)

action = np.array(1, dtype=np.int32)

next_time_step = env.step(action)
print('Next time step:')
print(next_time_step)

Usually two environments are instantiated: one for training and one for evaluation.

train_py_env = suite_gym.load(env_name)
eval_py_env = suite_gym.load(env_name)

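The Cartpole environment, like most environments, is written in pure Python. It is converted to TensorFlow using the TFPyEnvironment wrapper.

The original environment's API uses NumPy arrays. The TFPyEnvironment converts these to Tensors to make them compatible with TensorFlow agents and policies.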

train_env = tf_py_environment.TFPyEnvironment(train_py_env)
eval_env = tf_py_environment.TFPyEnvironment(eval_py_env)

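Agent#

The algorithm used to solve an RL problem is represented by an Agent. TF-Agents provides standard implementations of a variety of Agents, including:

- DQN (used in this tutorial)
- REINFORCE
- DDPG
- TD3
- PPO
- SAC

The DQN agent can be used in any environment that has a discrete action space.

At the heart of a DQN agent is a QNetwork, a neural network model that learns to predict QValues (expected returns) for all actions, given an observation from the environment.

We will use tf_agents.networks to create a QNetwork (built here with sequential.Sequential). The network consists of a sequence of tf.keras.layers.Dense layers, where the final layer has one output for each possible action.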

fc_layer_params = (100, 50)
action_tensor_spec = tensor_spec.from_spec(env.action_spec())
num_actions = action_tensor_spec.maximum - action_tensor_spec.minimum + 1

# Define a helper function to create Dense layers configured with the right
# activation and kernel initializer.
def dense_layer(num_units):
  return tf.keras.layers.Dense(
      num_units,
      activation=tf.keras.activations.relu,
      kernel_initializer=tf.keras.initializers.VarianceScaling(
          scale=2.0, mode='fan_in', distribution='truncated_normal'))

# QNetwork consists of a sequence of Dense layers followed by a dense layer
# with `num_actions` units to generate one q_value per available action as
# its output.
dense_layers = [dense_layer(num_units) for num_units in fc_layer_params]
q_values_layer = tf.keras.layers.Dense(
    num_actions,
    activation=None,
    kernel_initializer=tf.keras.initializers.RandomUniform(
        minval=-0.03, maxval=0.03),
    bias_initializer=tf.keras.initializers.Constant(-0.2))
q_net = sequential.Sequential(dense_layers + [q_values_layer])
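
To make the shape of this computation concrete, here is a minimal plain-Python sketch of a forward pass through one ReLU hidden layer followed by a linear output layer with one unit per action. The weights are invented for illustration (only the output bias of -0.2 mirrors the initializer above), and the helper names `relu` and `dense` are hypothetical; the real network is the Keras model defined above.

```python
def relu(values):
    """Element-wise max(0, v)."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Fully connected layer; weights[j][i] connects input i to output unit j."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


# Action spec bounds for CartPole: minimum 0, maximum 1.
action_min, action_max = 0, 1
n_actions = action_max - action_min + 1

observation = [0.1, -0.2, 0.05, 0.0]  # one made-up 4-float observation

# Made-up weights: a 3-unit hidden layer and a 2-unit output layer.
hidden_w = [[0.5, 0.0, 0.0, 0.0],
            [0.0, -1.0, 0.0, 0.0],
            [0.0, 0.0, 2.0, 0.0]]
hidden_b = [0.0, 0.0, 0.0]
out_w = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 1.0]]
out_b = [-0.2, -0.2]

hidden = relu(dense(observation, hidden_w, hidden_b))
q_values = dense(hidden, out_w, out_b)  # one Q-value per action

# The greedy choice is the action with the largest Q-value.
greedy_action = max(range(n_actions), key=lambda a: q_values[a])
print(q_values, greedy_action)
```

The point of the sketch is only the bookkeeping: the output layer has `maximum - minimum + 1` units, so each discrete action gets its own Q-value.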

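Now use tf_agents.agents.dqn.dqn_agent to instantiate a DqnAgent. In addition to the time_step_spec, action_spec, and the QNetwork, the agent constructor also requires an optimizer (in this case, Adam), a loss function, and an integer step counter.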

optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

train_step_counter = tf.Variable(0)

agent = dqn_agent.DqnAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    q_network=q_net,
    optimizer=optimizer,
    td_errors_loss_fn=common.element_wise_squared_loss,
    train_step_counter=train_step_counter)

agent.initialize()
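
For intuition about what the loss function is applied to, the following sketch computes a one-step temporal-difference target and an element-wise squared error in plain Python. It is a simplified illustration, not the agent's actual training code: the batch values are invented, `gamma = 1.0` is an illustrative choice, and refinements such as a separate target network are left out.

```python
def td_target(reward, discount, next_q_values, gamma=1.0):
    """One-step target: reward + gamma * discount * max over next-state Q-values."""
    return reward + gamma * discount * max(next_q_values)


def squared_errors(targets, predictions):
    """Element-wise squared difference between targets and predictions."""
    return [(t - p) ** 2 for t, p in zip(targets, predictions)]


# Each made-up entry: (reward, discount, Q-values of next state, Q-value of action taken).
batch = [
    (1.0, 1.0, [2.0, 3.0], 3.5),
    (1.0, 0.0, [5.0, 4.0], 0.5),  # discount 0.0 marks the end of an episode
]

targets = [td_target(r, d, next_q) for r, d, next_q, _ in batch]
predictions = [q_taken for _, _, _, q_taken in batch]
losses = squared_errors(targets, predictions)
mean_loss = sum(losses) / len(losses)
print(targets, losses, mean_loss)
```

Note how a discount of 0.0 on a terminal step removes the bootstrapped term, so the target is just the final reward.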

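Policies#

A policy defines the way an agent acts in an environment. Typically, the goal of reinforcement learning is to train the underlying model until the policy produces the desired outcome.

In this tutorial:

- The desired outcome is keeping the pole balanced upright over the cart.
- The policy returns an action (left or right) for each time_step observation.

Agents contain two policies:

- agent.policy: the main policy, used for evaluation and deployment.
- agent.collect_policy: a second policy, used for data collection.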

eval_policy = agent.policy
collect_policy = agent.collect_policy
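
One common way the two roles differ is exploration: an evaluation policy acts greedily on the Q-values, while a collection policy occasionally picks a random action. The sketch below shows that idea in plain Python as an assumption about the general technique (epsilon-greedy), with hypothetical function names and an illustrative epsilon of 0.1; it is not the library's policy classes.

```python
import random


def greedy_action(q_values):
    """Pick the action with the highest Q-value."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else act greedily."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return greedy_action(q_values)


rng = random.Random(0)   # seeded so the run is repeatable
q_values = [0.2, 0.9]    # made-up Q-values for actions 0 and 1

samples = [epsilon_greedy_action(q_values, 0.1, rng) for _ in range(1000)]
explore_fraction = samples.count(0) / len(samples)
print(greedy_action(q_values), explore_fraction)
```

With two actions and epsilon 0.1, the non-greedy action is expected roughly 5% of the time, since half of the random picks land on the greedy action anyway.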

Policies can be created independently of agents. For example, use tf_agents.policies.random_tf_policy to create a policy that randomly selects an action for each time_step.

random_policy = random_tf_policy.RandomTFPolicy(train_env.time_step_spec(),
                                                train_env.action_spec())

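To get an action from a policy, call the policy.action(time_step) method. The time_step contains the observation from the environment. This method returns a PolicyStep, which is a named tuple with three components:

- action: the action to be taken (in this case, 0 or 1)
- state: used for stateful (that is, RNN-based) policies
- info: auxiliary data, such as log probabilities of actions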

example_environment = tf_py_environment.TFPyEnvironment(
    suite_gym.load('CartPole-v0'))

time_step = example_environment.reset()

random_policy.action(time_step)
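
The three-part result can be pictured with a small plain-Python stand-in. `PolicyStepSketch` and `RandomPolicySketch` below are hypothetical names invented for this illustration; they only mimic the action/state/info layout described above and are not the TF-Agents classes.

```python
import random
from collections import namedtuple

# Hypothetical stand-in mirroring the three components named in the text.
PolicyStepSketch = namedtuple("PolicyStepSketch", ["action", "state", "info"])


class RandomPolicySketch:
    """Chooses a uniformly random action and ignores the observation."""

    def __init__(self, num_actions, seed=None):
        self._num_actions = num_actions
        self._rng = random.Random(seed)

    def action(self, time_step, policy_state=()):
        # A random policy is stateless, so the state passes through unchanged
        # and there is no auxiliary info to report.
        chosen = self._rng.randrange(self._num_actions)
        return PolicyStepSketch(action=chosen, state=policy_state, info=())


policy_sketch = RandomPolicySketch(num_actions=2, seed=0)
step = policy_sketch.action(time_step=None)
print(step)
```

A stateful policy would instead return an updated `state` that the caller passes back in on the next call.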

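Metrics and Evaluation#

The most common metric used to evaluate a policy is the average return. The return is the sum of rewards obtained while running a policy in an environment for one episode. Several episodes are run, and their returns are averaged.

The following function computes the average return of a policy, given the policy, an environment, and a number of episodes.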

#@test {"skip": true}
def compute_avg_return(environment, policy, num_episodes=10):

  total_return = 0.0
  for _ in range(num_episodes):

    time_step = environment.reset()
    episode_return = 0.0

    while not time_step.is_last():
      action_step = policy.action(time_step)
      time_step = environment.step(action_step.action)
      episode_return += time_step.reward
    total_return += episode_return

  avg_return = total_return / num_episodes
  return avg_return.numpy()[0]


# See also the metrics module for standard implementations of different metrics.
# https://github.com/tensorflow/agents/tree/master/tf_agents/metrics

Running this computation on the random_policy shows a baseline performance in the environment.

compute_avg_return(eval_env, random_policy, num_eval_episodes)
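
As a worked example of the arithmetic behind this metric, the snippet below averages the returns of three invented episodes in which every step earns a reward of +1. The episode lengths are made up for illustration and are not measured results.

```python
# Made-up reward sequences: the pole stays up for 12, 30, and 18 steps.
episode_rewards = [
    [1.0] * 12,
    [1.0] * 30,
    [1.0] * 18,
]

# The return of an episode is the sum of its rewards.
episode_returns = [sum(rewards) for rewards in episode_rewards]

# The average return is the mean over episodes.
average_return = sum(episode_returns) / len(episode_returns)
print(episode_returns, average_return)
```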

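Replay Buffer#

To keep track of the data collected from the environment, we will use Reverb, an efficient, extensible, and easy-to-use replay system by DeepMind. It stores experience data as we collect trajectories, and that data is consumed during training.

The replay buffer is constructed from specs describing the tensors to be stored, which can be obtained from the agent using agent.collect_data_spec.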

table_name = 'uniform_table'
replay_buffer_signature = tensor_spec.from_spec(
      agent.collect_data_spec)
replay_buffer_signature = tensor_spec.add_outer_dim(
    replay_buffer_signature)

table = reverb.Table(
    table_name,
    max_size=replay_buffer_max_length,
    sampler=reverb.selectors.Uniform(),
    remover=reverb.selectors.Fifo(),
    rate_limiter=reverb.rate_limiters.MinSize(1),
    signature=replay_buffer_signature)

reverb_server = reverb.Server([table])

replay_buffer = reverb_replay_buffer.ReverbReplayBuffer(
    agent.collect_data_spec,
    table_name=table_name,
    sequence_length=2,
    local_server=reverb_server)

rb_observer = reverb_utils.ReverbAddTrajectoryObserver(
  replay_buffer.py_client,
  table_name,
  sequence_length=2)
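
The table above combines three behaviors: a maximum size, first-in-first-out removal, and uniform sampling. A minimal in-memory sketch of those behaviors, using only the standard library, might look like the following. `UniformFifoBuffer` is a hypothetical class written for this illustration; it is not Reverb and has none of its server, signature, or rate-limiting machinery (it simply raises on an empty buffer).

```python
import random
from collections import deque


class UniformFifoBuffer:
    """Bounded buffer: oldest items are evicted first, samples are drawn uniformly."""

    def __init__(self, max_size, seed=None):
        # A deque with maxlen silently drops the oldest item once it is full.
        self._items = deque(maxlen=max_size)
        self._rng = random.Random(seed)

    def add(self, item):
        self._items.append(item)

    def sample(self, batch_size):
        if not self._items:
            raise ValueError("cannot sample from an empty buffer")
        # Sampling with replacement, each stored item equally likely.
        return [self._rng.choice(self._items) for _ in range(batch_size)]

    def __len__(self):
        return len(self._items)


buffer_sketch = UniformFifoBuffer(max_size=3, seed=0)
for step_index in range(5):
    buffer_sketch.add(step_index)  # items 0 and 1 are evicted along the way

batch = buffer_sketch.sample(batch_size=4)
print(len(buffer_sketch), batch)
```

After five insertions into a buffer of size 3, only the three most recent items remain available for sampling.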

For most agents, collect_data_spec is a named tuple called Trajectory, containing the specs for observations, actions, rewards, and other items.

agent.collect_data_spec

agent.collect_data_spec._fields

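Data Collection#

Now execute the random policy in the environment for a few steps, recording the data in the replay buffer.

Here we use PyDriver to run the experience collection loop. You can learn more about TF-Agents drivers in the drivers tutorial.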

#@test {"skip": true}
py_driver.PyDriver(
    train_py_env,
    py_tf_eager_policy.PyTFEagerPolicy(
      random_policy, use_tf_function=True),
    [rb_observer],
    max_steps=initial_collect_steps).run(train_py_env.reset())

The replay buffer is now a collection of Trajectories.

# For the curious:
# Uncomment to peel one of these off and inspect it.
# iter(replay_buffer.as_dataset()).next()

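The agent needs access to the replay buffer. This is provided by creating an iterable tf.data.Dataset pipeline that feeds data to the agent.

Each row of the replay buffer stores only a single observation step. But since the DQN agent needs both the current and the next observation to compute the loss, the dataset pipeline samples two adjacent rows for each item in the batch (num_steps=2).

This dataset is also optimized by running parallel calls and prefetching data.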

# Dataset generates trajectories with shape [Bx2x...]
dataset = replay_buffer.as_dataset(
    num_parallel_calls=3,
    sample_batch_size=batch_size,
    num_steps=2).prefetch(3)

dataset

iterator = iter(dataset)
print(iterator)

# For the curious:
# Uncomment to see what the dataset iterator is feeding to the agent.
# Compare this representation of replay data 
# to the collection of individual trajectories shown earlier.

# iterator.next()
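
The pairing of adjacent rows can be sketched without TensorFlow. In the plain-Python illustration below, each stored step is a small dictionary with invented values, and the field names and helper functions (`adjacent_pairs`, `to_transition`) are hypothetical; the sketch assumes that the reward stored with a step is the one earned by that step's action.

```python
def adjacent_pairs(steps):
    """Turns [s0, s1, s2, ...] into [(s0, s1), (s1, s2), ...]."""
    return list(zip(steps, steps[1:]))


def to_transition(current, following):
    """Combines two adjacent steps into the pieces a one-step update needs."""
    return {
        "observation": current["observation"],
        "action": current["action"],
        "reward": current["reward"],
        "next_observation": following["observation"],
    }


# Three made-up consecutive steps from one episode.
steps = [
    {"observation": [0.0], "action": 1, "reward": 1.0},
    {"observation": [0.1], "action": 0, "reward": 1.0},
    {"observation": [0.05], "action": 1, "reward": 1.0},
]

pairs = adjacent_pairs(steps)
transitions = [to_transition(current, following) for current, following in pairs]
print(transitions[0])
```

Three stored steps yield two overlapping pairs, which is why a sequence length of 2 is enough for a one-step update.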

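Training the agent#

Two things must happen during the training loop:

- collect data from the environment
- use that data to train the agent's neural network(s)

This example also periodically evaluates the policy and prints the current score.

The following will take about 5 minutes to run.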

#@test {"skip": true}
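# (Optional) To time this cell in a notebook, put the %%time cell magic on
# the first line of the cell; cell magics are not valid inside a try block.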

# (Optional) Optimize by wrapping some of the code in a graph using TF function.
agent.train = common.function(agent.train)

# Reset the train step.
agent.train_step_counter.assign(0)

# Evaluate the agent's policy once before training.
avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
returns = [avg_return]

# Reset the environment.
time_step = train_py_env.reset()

# Create a driver to collect experience.
collect_driver = py_driver.PyDriver(
    train_py_env,
    py_tf_eager_policy.PyTFEagerPolicy(
      agent.collect_policy, use_tf_function=True),
    [rb_observer],
    max_steps=collect_steps_per_iteration)

for _ in range(num_iterations):

  # Collect a few steps and save to the replay buffer.
  time_step, _ = collect_driver.run(time_step)

  # Sample a batch of data from the buffer and update the agent's network.
  experience, unused_info = next(iterator)
  train_loss = agent.train(experience).loss

  step = agent.train_step_counter.numpy()

  if step % log_interval == 0:
    print('step = {0}: loss = {1}'.format(step, train_loss))

  if step % eval_interval == 0:
    avg_return = compute_avg_return(eval_env, agent.policy, num_eval_episodes)
    print('step = {0}: Average Return = {1}'.format(step, avg_return))
    returns.append(avg_return)

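Visualization#

Plots#

Use matplotlib.pyplot to chart how the policy improved during training.

An episode of Cartpole-v0 lasts at most 200 time steps. The environment gives a reward of +1 for each step the pole stays up, so the maximum return for one episode is 200. The chart shows the return rising toward that maximum each time it is evaluated during training. (The rise may be somewhat unstable, and it is not monotonic every time.)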

#@test {"skip": true}

iterations = range(0, num_iterations + 1, eval_interval)
plt.plot(iterations, returns)
plt.ylabel('Average Return')
plt.xlabel('Iterations')
plt.ylim(top=250)
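
The x-axis values only line up with the recorded returns because there is one evaluation before training plus one every eval_interval steps. The short check below replays that bookkeeping in plain Python, using the hyperparameter values from this tutorial under different variable names and placeholder returns of 0.0 instead of real evaluations.

```python
num_iterations_sketch = 20000
eval_interval_sketch = 1000

# One evaluation happens before training starts.
returns_sketch = [0.0]

# During training, an evaluation is recorded whenever the step count
# is a multiple of the evaluation interval.
for step in range(1, num_iterations_sketch + 1):
    if step % eval_interval_sketch == 0:
        returns_sketch.append(0.0)  # placeholder for a measured average return

x_axis = list(range(0, num_iterations_sketch + 1, eval_interval_sketch))
print(len(x_axis), len(returns_sketch))
```

If num_iterations were not a multiple of eval_interval, the two sequences would still have equal length, but the last x value would fall short of num_iterations.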

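Videos#

Charts are nice, but it is more vivid to watch an agent actually performing a task in an environment.

First, create a function to embed videos in the notebook.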

def embed_mp4(filename):
  """Embeds an mp4 file in the notebook."""
  with open(filename, 'rb') as video_file:
    video = video_file.read()
  b64 = base64.b64encode(video)
  tag = '''
  <video width="640" height="480" controls>
    <source src="data:video/mp4;base64,{0}" type="video/mp4">
  Your browser does not support the video tag.
  </video>'''.format(b64.decode())

  return IPython.display.HTML(tag)
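
The embedding trick relies on a base64 data URI: the file's bytes are encoded as text and placed directly in the src attribute. The sketch below isolates that step with the standard library only. `video_data_uri` is a hypothetical helper name, and the byte string is a placeholder rather than a real MP4 file.

```python
import base64


def video_data_uri(raw_bytes):
    """Encodes raw bytes as a data URI with the video/mp4 media type."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return "data:video/mp4;base64," + encoded


fake_video = b"\x00\x01\x02not a real mp4"  # placeholder bytes for illustration
uri = video_data_uri(fake_video)

# Decoding the part after the comma recovers the original bytes.
round_trip = base64.b64decode(uri.split(",", 1)[1])
print(uri)
```

Because the whole file is inlined as text, this approach suits short clips; base64 makes the payload about one third larger than the original bytes.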

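Now iterate through a few episodes of the Cartpole game with the agent. The underlying Python environment (the one "inside" the TensorFlow environment wrapper) provides a render() method, which outputs an image of the environment state. These images can be collected into a video.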

def create_policy_eval_video(policy, filename, num_episodes=5, fps=30):
  filename = filename + ".mp4"
  with imageio.get_writer(filename, fps=fps) as video:
    for _ in range(num_episodes):
      time_step = eval_env.reset()
      video.append_data(eval_py_env.render())
      while not time_step.is_last():
        action_step = policy.action(time_step)
        time_step = eval_env.step(action_step.action)
        video.append_data(eval_py_env.render())
  return embed_mp4(filename)

create_policy_eval_video(agent.policy, "trained-agent")

You can compare the trained agent (above) with an agent moving randomly, which does not perform as well.

create_policy_eval_video(random_policy, "random-agent")