{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "klGNgWREsvQv" }, "source": [ "**Copyright 2023 The TF-Agents Authors.**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "nQnmcm0oI1Q-" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "lsaQlK8fFQqH" }, "source": [ "# SAC minitaur with the Actor-Learner API" ] },
{ "cell_type": "markdown", "metadata": { "id": "ZOUOQOrFs3zn" }, "source": [ "## Introduction" ] }, { "cell_type": "markdown", "metadata": { "id": "cKOCZlhUgXVK" }, "source": [ "This example shows how to train a [Soft Actor Critic](https://arxiv.org/abs/1812.05905) agent on the [Minitaur](https://github.com/bulletphysics/bullet3/blob/master/examples/pybullet/gym/pybullet_envs/bullet/minitaur.py) environment.\n", "\n", "If you have worked through the [DQN Colab](https://github.com/tensorflow/agents/blob/master/docs/tutorials/1_dqn_tutorial.ipynb), this should feel very familiar. Notable changes include:\n", "\n", "- The agent is changed from DQN to SAC.\n", "- Training is done on Minitaur, a much more complex environment than CartPole. The Minitaur environment aims to train a quadruped robot to move forward.\n", "- The TF-Agents Actor-Learner API is used for distributed Reinforcement Learning.\n", "\n", "The API supports both distributed data collection using an experience replay buffer and a variable container (parameter server), and distributed training across multiple devices. It is designed to be very simple and modular. We utilize [Reverb](https://deepmind.com/research/open-source/Reverb) for both the replay buffer and the variable container, and the [TF DistributionStrategy API](https://tensorflow.google.cn/guide/distributed_training) for distributed training on GPUs and TPUs." ] }, { "cell_type": "markdown", "metadata": { "id": "9vUQms4DAY5A" }, "source": [ "If you haven't installed the following dependencies, run:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fskoLlB-AZ9j" }, "outputs": [], "source": [ "!sudo apt-get update\n", "!sudo apt-get install -y xvfb ffmpeg\n", "!pip install 'imageio==2.4.0'\n", "!pip install matplotlib\n", "!pip install tf-agents[reverb]\n", "!pip install pybullet" ] }, { "cell_type": "markdown", "metadata": { "id": "1u9QVVsShC9X" }, "source": [ "## Setup" ] }, { "cell_type": "markdown", "metadata": { "id": "nNV5wyH3dyMl" }, "source": [ "First we will import the different tools that we need." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sMitx5qSgJk1" }, "outputs": [], "source": [ "import base64\n", "import imageio\n", "import IPython\n", "import matplotlib.pyplot as plt\n", "import os\n", "import reverb\n", "import tempfile\n", "import PIL.Image\n", "\n", "import tensorflow as tf\n", "\n", "from tf_agents.agents.ddpg import critic_network\n", "from tf_agents.agents.sac import sac_agent\n", "from tf_agents.agents.sac import tanh_normal_projection_network\n", "from tf_agents.environments import suite_pybullet\n", "from tf_agents.metrics import py_metrics\n", "from tf_agents.networks import actor_distribution_network\n", "from tf_agents.policies import greedy_policy\n", "from tf_agents.policies import py_tf_eager_policy\n", "from tf_agents.policies import random_py_policy\n", "from tf_agents.replay_buffers import reverb_replay_buffer\n", "from tf_agents.replay_buffers import reverb_utils\n", "from tf_agents.train import actor\n", "from tf_agents.train import learner\n", "from tf_agents.train import triggers\n", "from tf_agents.train.utils import spec_utils\n", "from tf_agents.train.utils import strategy_utils\n", "from tf_agents.train.utils import train_utils\n", "\n", "tempdir = tempfile.gettempdir()" ] }, { "cell_type": "markdown", "metadata": { "id": "LmC0NDhdLIKY" }, "source": [ "## Hyperparameters" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HC1kNrOsLSIZ" }, "outputs": [], "source": [ "env_name = \"MinitaurBulletEnv-v0\" # @param {type:\"string\"}\n", "\n", "# Use \"num_iterations = 1e6\" for better results (2 hrs)\n", "# 1e5 is just so this doesn't take too long (1 hr)\n", "num_iterations = 100000 # @param {type:\"integer\"}\n", "\n", "initial_collect_steps = 10000 # @param {type:\"integer\"}\n", "collect_steps_per_iteration = 1 # @param {type:\"integer\"}\n", "replay_buffer_capacity = 10000 # @param {type:\"integer\"}\n", "\n", "batch_size = 256 # @param {type:\"integer\"}\n", "\n", "critic_learning_rate = 3e-4 # @param {type:\"number\"}\n",
"actor_learning_rate = 3e-4 # @param {type:\"number\"}\n", "alpha_learning_rate = 3e-4 # @param {type:\"number\"}\n", "target_update_tau = 0.005 # @param {type:\"number\"}\n", "target_update_period = 1 # @param {type:\"number\"}\n", "gamma = 0.99 # @param {type:\"number\"}\n", "reward_scale_factor = 1.0 # @param {type:\"number\"}\n", "\n", "actor_fc_layer_params = (256, 256)\n", "critic_joint_fc_layer_params = (256, 256)\n", "\n", "log_interval = 5000 # @param {type:\"integer\"}\n", "\n", "num_eval_episodes = 20 # @param {type:\"integer\"}\n", "eval_interval = 10000 # @param {type:\"integer\"}\n", "\n", "policy_save_interval = 5000 # @param {type:\"integer\"}" ] }, { "cell_type": "markdown", "metadata": { "id": "VMsJC3DEgI0x" }, "source": [ "## Environment\n", "\n", "In Reinforcement Learning (RL), an environment represents the task or problem to be solved. Standard environments can be created easily in TF-Agents using `suites`. We have different `suites` for loading environments from sources such as OpenAI Gym, Atari, DM Control, etc., given a string environment name.\n", "\n", "Now let's load the Minitaur environment from the Pybullet suite." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "RlO7WIQHu_7D" }, "outputs": [], "source": [ "env = suite_pybullet.load(env_name)\n", "env.reset()\n", "PIL.Image.fromarray(env.render())" ] }, { "cell_type": "markdown", "metadata": { "id": "gY179d1xlmoM" }, "source": [ "In this environment the goal of the agent is to train a policy that controls the Minitaur robot and has it move forward as fast as possible. Episodes last 1000 steps and the return is the sum of rewards over the episode.\n", "\n", "Let's look at the information the environment provides as an `observation`, which the policy will use to generate `actions`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "exDv57iHfwQV" }, "outputs": [], "source": [ "print('Observation Spec:')\n", "print(env.time_step_spec().observation)\n", "print('Action Spec:')\n", "print(env.action_spec())" ] }, { "cell_type": "markdown", "metadata": { "id": "Wg5ysVTnctIm" }, "source": [ "This is quite a complex observation. We receive 28 values representing the angles, velocities, and torques of all the motors. In return, the environment expects 8 values for the action, each in `[-1, 1]`. These are the desired motor angles.\n", "\n", "Usually we create two environments: one for collecting data during training and one for evaluation. The environments are written in pure Python and use NumPy arrays, which the Actor-Learner API consumes directly." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Xp-Y4mD6eDhF" }, "outputs": [], "source": [ "collect_env = suite_pybullet.load(env_name)\n", "eval_env = suite_pybullet.load(env_name)" ] }, { "cell_type": "markdown", "metadata": { "id": "Da-z2yF66FR9" }, "source": [ "## Distribution Strategy\n", "\n", "We use the DistributionStrategy API to run the train-step computation across multiple devices, such as multiple GPUs or TPUs, using data parallelism. The train step:\n", "\n", "- Receives a batch of training data\n", "- Splits it across the devices\n", "- Computes the forward step\n", "- Aggregates and computes the mean of the loss\n", "- Computes the backward step and performs the gradient variable update\n", "\n", "With the TF-Agents Learner API and the DistributionStrategy API it is quite easy to switch between running the train step on GPUs (using MirroredStrategy) and on TPUs (using TPUStrategy) without changing any of the training logic below." ] }, { "cell_type": "markdown", "metadata": { "id": "wGREYZCaDB1h" }, "source": [ "### Enabling the GPU\n", "\n", "If you want to try running on a GPU, you'll first need to enable GPUs for the notebook:\n", "\n", "- Navigate to Edit→Notebook Settings\n", "- Select GPU from the Hardware Accelerator drop-down" ] }, { "cell_type": "markdown", "metadata": { "id": "5ZuvwDV66Mn1" }, "source": [ "### Picking a strategy\n", "\n", "Use `strategy_utils` to generate a strategy. Under the hood, passing the parameter:\n", "\n", "- `use_gpu = False` returns `tf.distribute.get_strategy()`, which uses the CPU\n", "- `use_gpu = True` returns `tf.distribute.MirroredStrategy()`, which uses all GPUs that are visible to TensorFlow on one machine" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ff5ZZRZI15ds" }, "outputs": [], "source": [ "use_gpu = True #@param {type:\"boolean\"}\n", "\n", "strategy = strategy_utils.get_strategy(tpu=False, use_gpu=use_gpu)" ] }, { "cell_type": "markdown", "metadata": { "id": "fMn5FTs5kHvt" }, "source": [ "All variables and agents need to be created under `strategy.scope()`, as you'll see below." ] }, { "cell_type": "markdown", "metadata": { "id": "E9lW_OZYFR8A" }, "source": [ "## Agent\n", "\n", "To create an SAC agent, we first need to create the networks that it will train. SAC is an actor-critic agent, so we need two networks.\n", "\n", "The critic will give us value estimates for `Q(s,a)`. That is, it receives an observation and an action as input, and it gives us an estimate of how good that action is for the given state.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TgkdEPg_muzV" }, "outputs": [], "source": [ "observation_spec, action_spec, time_step_spec = (\n", "    spec_utils.get_tensor_specs(collect_env))\n", "\n", "with strategy.scope():\n", "  critic_net = critic_network.CriticNetwork(\n", "      (observation_spec, action_spec),\n", "      observation_fc_layer_params=None,\n", "      action_fc_layer_params=None,\n", "      joint_fc_layer_params=critic_joint_fc_layer_params,\n", "      kernel_initializer='glorot_uniform',\n", "      last_kernel_initializer='glorot_uniform')" ] }, { "cell_type": "markdown", "metadata": { "id": "pYy4AH4V7Ph4" }, "source": [ "We will use this critic to train an `actor` network, which lets us generate actions given an observation.\n", "\n", "The `ActorNetwork` predicts the parameters of a tanh-squashed [MultivariateNormalDiag](https://tensorflow.google.cn/probability/api_docs/python/tfp/distributions/MultivariateNormalDiag) distribution. This distribution, conditioned on the current observation, is then sampled whenever we need to generate actions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "TB5Y3Oub4u7f" }, "outputs": [], "source": [ "with strategy.scope():\n", "  actor_net = actor_distribution_network.ActorDistributionNetwork(\n", "      observation_spec,\n", "      action_spec,\n", "      fc_layer_params=actor_fc_layer_params,\n", "      continuous_projection_net=(\n", "          tanh_normal_projection_network.TanhNormalProjectionNetwork))" ] }, { "cell_type": "markdown", "metadata": { "id": "z62u55hSmviJ" }, "source": [ "With these networks at hand we can now instantiate the agent.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jbY4yrjTEyc9" }, "outputs": [], "source": [ "with strategy.scope():\n", "  train_step = train_utils.create_train_step()\n", "\n", "  tf_agent = sac_agent.SacAgent(\n", "      time_step_spec,\n", "      action_spec,\n", "      actor_network=actor_net,\n", "      critic_network=critic_net,\n", "      actor_optimizer=tf.keras.optimizers.Adam(\n", "          learning_rate=actor_learning_rate),\n", "      critic_optimizer=tf.keras.optimizers.Adam(\n", "          learning_rate=critic_learning_rate),\n", "      alpha_optimizer=tf.keras.optimizers.Adam(\n", "          learning_rate=alpha_learning_rate),\n", "      target_update_tau=target_update_tau,\n", "      target_update_period=target_update_period,\n", "      td_errors_loss_fn=tf.math.squared_difference,\n", "      gamma=gamma,\n", "      reward_scale_factor=reward_scale_factor,\n", "      train_step_counter=train_step)\n", "\n", "  tf_agent.initialize()" ] }, { "cell_type": "markdown", "metadata": { "id": "NLva6g2jdWgr" }, "source": [ "## Replay Buffer\n", "\n", "In order to keep track of the data collected from the environment, we will use [Reverb](https://deepmind.com/research/open-source/Reverb), an efficient, extensible, and easy-to-use replay system by DeepMind. It stores experience data collected by the Actors and consumed by the Learner during training.\n", "\n", "In this tutorial the rate limiter matters less than `max_size`, but in a distributed setting with asynchronous collection and training you will probably want to experiment with `rate_limiters.SampleToInsertRatio`, using a `samples_per_insert` somewhere between 2 and 1000. For example:\n", "\n", "```\n", "rate_limiter=reverb.rate_limiters.SampleToInsertRatio(samples_per_insert=3.0, min_size_to_sample=3, error_buffer=3.0)\n", "```\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vX2zGUWJGWAl" }, "outputs": [], "source": [ "table_name = 'uniform_table'\n", "table = reverb.Table(\n", "    table_name,\n", "    max_size=replay_buffer_capacity,\n", "    sampler=reverb.selectors.Uniform(),\n", "    remover=reverb.selectors.Fifo(),\n", "    rate_limiter=reverb.rate_limiters.MinSize(1))\n", "\n", "reverb_server = reverb.Server([table])" ] },
{ "cell_type": "markdown", "metadata": { "id": "LRNvAnkO7JK2" }, "source": [ "The replay buffer is constructed using specs describing the tensors that are to be stored, which can be obtained from the agent via `tf_agent.collect_data_spec`.\n", "\n", "Since the SAC agent needs both the current and the next observation to compute the loss, we set `sequence_length=2`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xVLUxyUo7HQR" }, "outputs": [], "source": [ "reverb_replay = reverb_replay_buffer.ReverbReplayBuffer(\n", "    tf_agent.collect_data_spec,\n", "    sequence_length=2,\n", "    table_name=table_name,\n", "    local_server=reverb_server)" ] }, { "cell_type": "markdown", "metadata": { "id": "rVD5nQ9ZGo8_" }, "source": [ "Now we generate a TensorFlow dataset from the Reverb replay buffer. The Learner will sample batches of experience from this dataset when it trains the agent." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ba7bilizt_qW" }, "outputs": [], "source": [ "dataset = reverb_replay.as_dataset(\n", "    sample_batch_size=batch_size, num_steps=2).prefetch(50)\n", "experience_dataset_fn = lambda: dataset" ] }, { "cell_type": "markdown", "metadata": { "id": "I0KLrEPwkn5x" }, "source": [ "## Policies\n", "\n", "In TF-Agents, policies represent the standard notion of a policy in RL: given a `time_step`, produce an action or a distribution over actions. The main method is `policy_step = policy.action(time_step)`, where `policy_step` is a named tuple `PolicyStep(action, state, info)`. `policy_step.action` is the `action` to be applied to the environment, `state` represents the state for stateful (RNN) policies, and `info` may contain auxiliary information such as log probabilities of the actions.\n", "\n", "Agents contain two policies:\n", "\n", "- `agent.policy`: the main policy that is used for evaluation and deployment.\n", "- `agent.collect_policy`: a second policy that is used for data collection." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "yq7JE8IwFe0E" }, "outputs": [], "source": [ "tf_eval_policy = tf_agent.policy\n", "eval_policy = py_tf_eager_policy.PyTFEagerPolicy(\n", "    tf_eval_policy, use_tf_function=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "f_A4rZveEQzW" }, "outputs": [], "source": [ "tf_collect_policy = tf_agent.collect_policy\n", "collect_policy = py_tf_eager_policy.PyTFEagerPolicy(\n", "    tf_collect_policy, use_tf_function=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "azkJZ8oaF8uc" }, "source": [ "Policies can also be created independently of agents. For example, `tf_agents.policies.random_py_policy` creates a policy that randomly selects an action for each `time_step`." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BwY7StuMkuV4" }, "outputs": [], "source": [ "random_policy = random_py_policy.RandomPyPolicy(\n", "    collect_env.time_step_spec(), collect_env.action_spec())" ] }, { "cell_type": "markdown", "metadata": { "id": "l1LMqw60Kuso" }, "source": [ "## Actor\n", "\n", "The Actor manages the interactions between a policy and an environment.\n", "\n", "- The Actor components contain an instance of the environment (as a `py_environment`) and a copy of the policy variables.\n", "- Each Actor worker runs a sequence of data collection steps given the local values of the policy variables.\n", "- Variable updates are done explicitly using the variable container client instance in the training script, before calling `actor.run()`.\n", "- The observed experience is written into the replay buffer in each data collection step." ] }, { "cell_type": "markdown", "metadata": { "id": "XjE59ct9fU7W" }, "source": [ "As the Actors run data collection steps, they pass trajectories of (state, action, reward) to the observer, which caches them and writes them to the Reverb replay system.\n", "\n", "Because `stride_length=1`, we store trajectories for the frames [(t0,t1) (t1,t2) (t2,t3), ...]." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "HbyGmdiNfNDc" }, "outputs": [], "source": [ "rb_observer = reverb_utils.ReverbAddTrajectoryObserver(\n", "    reverb_replay.py_client,\n", "    table_name,\n", "    sequence_length=2,\n", "    stride_length=1)" ] }, { "cell_type": "markdown", "metadata": { "id": "6yaVVC22fOcF" }, "source": [ "We create an Actor with the random policy and collect experiences to seed the replay buffer with." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ZGq3SY0kKwsa" }, "outputs": [], "source": [ "initial_collect_actor = actor.Actor(\n", "    collect_env,\n", "    random_policy,\n", "    train_step,\n", "    steps_per_run=initial_collect_steps,\n", "    observers=[rb_observer])\n", "initial_collect_actor.run()" ] }, { "cell_type": "markdown", "metadata": { "id": "6Pkg-0vZP_Ya" }, "source": [ "Instantiate an Actor with the collect policy to gather more experiences during training." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "A6ooXyk0FZ5j" }, "outputs": [], "source": [ "env_step_metric = py_metrics.EnvironmentSteps()\n", "collect_actor = actor.Actor(\n", "    collect_env,\n", "    collect_policy,\n", "    train_step,\n", "    steps_per_run=1,\n", "    metrics=actor.collect_metrics(10),\n", "    summary_dir=os.path.join(tempdir, learner.TRAIN_DIR),\n", "    observers=[rb_observer, env_step_metric])" ] }, { "cell_type": "markdown", "metadata": { "id": "FR9CZ-jfPN2T" }, "source": [ "Create an Actor that will be used to evaluate the policy during training. We pass in `actor.eval_metrics(num_eval_episodes)` so that we can log the metrics later." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "vHY2BT5lFhgL" }, "outputs": [], "source": [ "eval_actor = actor.Actor(\n", "    eval_env,\n", "    eval_policy,\n", "    train_step,\n", "    episodes_per_run=num_eval_episodes,\n", "    metrics=actor.eval_metrics(num_eval_episodes),\n", "    summary_dir=os.path.join(tempdir, 'eval'),\n", ")" ] }, { "cell_type": "markdown", "metadata": { "id": "y6eBGSYiOf83" }, "source": [ "## Learner\n", "\n", "The Learner component contains the agent and performs gradient-step updates to the policy variables using experience data from the replay buffer. After one or more training steps, the Learner can push a new set of variable values to the variable container." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "gi37YicSFTfF" }, "outputs": [], "source": [ "saved_model_dir = os.path.join(tempdir, learner.POLICY_SAVED_MODEL_DIR)\n", "\n", "# Triggers to save the agent's policy checkpoints.\n", "learning_triggers = [\n", "    triggers.PolicySavedModelTrigger(\n", "        saved_model_dir,\n", "        tf_agent,\n", "        train_step,\n", "        interval=policy_save_interval),\n", "    triggers.StepPerSecondLogTrigger(train_step, interval=1000),\n", "]\n", "\n", "agent_learner = learner.Learner(\n", "    tempdir,\n", "    train_step,\n", "    tf_agent,\n", "    experience_dataset_fn,\n", "    triggers=learning_triggers,\n", "    strategy=strategy)" ] }, { "cell_type": "markdown", "metadata": { "id": "94rCXQtbUbXv" }, "source": [ "## Metrics and Evaluation\n", "\n", "We instantiated the eval Actor with `actor.eval_metrics` above, which creates the metrics most commonly used during policy evaluation:\n", "\n", "- Average return. The return is the sum of rewards obtained while running a policy in an environment for an episode, and we usually average this over a few episodes.\n", "- Average episode length.\n", "\n", "We run the Actor to generate these metrics." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "83iMSHUC71RG" }, "outputs": [], "source": [ "def get_eval_metrics():\n", "  eval_actor.run()\n", "  results = {}\n", "  for metric in eval_actor.metrics:\n", "    results[metric.name] = metric.result()\n", "  return results\n", "\n", "metrics = get_eval_metrics()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "jnOMvX_eZvOW" }, "outputs": [], "source": [ "def log_eval_metrics(step, metrics):\n", "  eval_results = (', ').join(\n", "      '{} = {:.6f}'.format(name, result) for name, result in metrics.items())\n", "  print('step = {0}: {1}'.format(step, eval_results))\n", "\n", "log_eval_metrics(0, metrics)" ] }, { "cell_type": "markdown", "metadata": { "id": "hWWURm_rXG-f" }, "source": [ "Check out the [metrics module](https://github.com/tensorflow/agents/blob/master/tf_agents/metrics/tf_metrics.py) for other standard implementations of different metrics." ] }, { "cell_type": "markdown", "metadata": { "id": "hBc9lj9VWWtZ" }, "source": [ "## Training the agent\n", "\n", "The training loop involves both collecting data from the environment and optimizing the agent's networks. Along the way, we occasionally evaluate the agent's policy to see how we are doing." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "0pTbJ3PeyF-u" }, "outputs": [], "source": [ "#@test {\"skip\": true}\n", "try:\n", "  %%time\n", "except:\n", "  pass\n", "\n", "# Reset the train step\n", "tf_agent.train_step_counter.assign(0)\n", "\n", "# Evaluate the agent's policy once before training.\n", "avg_return = get_eval_metrics()[\"AverageReturn\"]\n", "returns = [avg_return]\n", "\n", "for _ in range(num_iterations):\n", "  # Training.\n", "  collect_actor.run()\n", "  loss_info = agent_learner.run(iterations=1)\n", "\n", "  # Evaluating.\n", "  step = agent_learner.train_step_numpy\n", "\n", "  if eval_interval and step % eval_interval == 0:\n", "    metrics = get_eval_metrics()\n", "    log_eval_metrics(step, metrics)\n", "    returns.append(metrics[\"AverageReturn\"])\n", "\n", "  if log_interval and step % log_interval == 0:\n", "    print('step = {0}: loss = {1}'.format(step, loss_info.loss.numpy()))\n", "\n", "rb_observer.close()\n", "reverb_server.stop()" ] }, { "cell_type": "markdown", "metadata": { "id": "68jNcA_TiJDq" }, "source": [ "## Visualization\n" ] }, { "cell_type": "markdown", "metadata": { "id": "aO-LWCdbbOIC" }, "source": [ "### Plots\n", "\n", "We can plot average return versus global steps to see the performance of our agent. In `Minitaur`, the reward function is based on how far the Minitaur walks in 1000 steps, with a penalty for energy expenditure." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "rXKzyGt72HS8" }, "outputs": [], "source": [ "#@test {\"skip\": true}\n", "\n", "steps = range(0, num_iterations + 1, eval_interval)\n", "plt.plot(steps, returns)\n", "plt.ylabel('Average Return')\n", "plt.xlabel('Step')\n", "plt.ylim()" ] }, { "cell_type": "markdown", "metadata": { "id": "M7-XpPP99Cy7" }, "source": [ "### Videos" ] }, { "cell_type": "markdown", "metadata": { "id": "9pGfGxSH32gn" }, "source": [ "It is helpful to visualize the performance of an agent by rendering the environment at each step. Before we do that, let us first create a function to embed videos in this colab." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "ULaGr8pvOKbl" }, "outputs": [], "source": [ "def embed_mp4(filename):\n", "  \"\"\"Embeds an mp4 file in the notebook.\"\"\"\n", "  video = open(filename,'rb').read()\n", "  b64 = base64.b64encode(video)\n", "  tag = '''\n", "  <video width=\"640\" height=\"480\" controls>\n", "    <source src=\"data:video/mp4;base64,{0}\" type=\"video/mp4\">\n", "  Your browser does not support the video tag.\n", "  </video>'''.format(b64.decode())\n", "\n", "  return IPython.display.HTML(tag)" ] }, { "cell_type": "markdown", "metadata": { "id": "9c_PH-pX4Pr5" }, "source": [ "The following code visualizes the agent's policy for a few episodes:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "PSgaQN1nXT-h" }, "outputs": [], "source": [ "num_episodes = 3\n", "video_filename = 'sac_minitaur.mp4'\n", "with imageio.get_writer(video_filename, fps=60) as video:\n", "  for _ in range(num_episodes):\n", "    time_step = eval_env.reset()\n", "    video.append_data(eval_env.render())\n", "    while not time_step.is_last():\n", "      action_step = eval_actor.policy.action(time_step)\n", "      time_step = eval_env.step(action_step.action)\n", "      video.append_data(eval_env.render())\n", "\n", "embed_mp4(video_filename)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "7_SAC_minitaur_tutorial.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }