{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "beObUOFyuRjT" }, "source": [ "##### Copyright 2023 The TF-Agents Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "nQnmcm0oI1Q-" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "s6D70EeAZe-Q" }, "source": [ "# Drivers" ] },
{ "cell_type": "markdown", "metadata": { "id": "8aPHF9kXFggA" }, "source": [ "## Introduction\n", "\n", "A common pattern in reinforcement learning is to execute a policy in an environment for a specified number of steps or episodes. This happens, for example, during data collection, evaluation, and when generating a video of the agent.\n", "\n", "While this is relatively straightforward to write in Python, it is much more complex to write and debug in TensorFlow because it involves `tf.while` loops, `tf.cond` and `tf.control_dependencies`. Therefore we abstract the notion of a run loop into a class called `driver`, and provide well-tested implementations both in Python and TensorFlow.\n", "\n", "Additionally, the data encountered by the driver at each step is saved in a named tuple called Trajectory and broadcast to a set of observers such as replay buffers and metrics. This data includes the observation from the environment, the action recommended by the policy, the reward obtained, the type of the current and the next step, and so on." ] },
{ "cell_type": "markdown", "metadata": { "id": "t7PM1QfMZqkS" }, "source": [ "## Setup" ] },
{ "cell_type": "markdown", "metadata": { "id": "0w-Ykwl1bn4v" }, "source": [ "If you haven't installed TF-Agents or Gym yet, run:" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "TnE2CgilrngG" }, "outputs": [], "source": [ "!pip install tf-agents\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "whYNP894FSkA" }, "outputs": [], "source": [ "from __future__ import absolute_import\n", "from __future__ import division\n", "from __future__ import print_function\n", "\n", "import tensorflow as tf\n", "\n", "\n", "from tf_agents.environments import suite_gym\n", "from tf_agents.environments import tf_py_environment\n", "from tf_agents.policies import random_py_policy\n", "from tf_agents.policies import random_tf_policy\n", "from tf_agents.metrics import py_metrics\n", "from tf_agents.metrics import tf_metrics\n", "from tf_agents.drivers import py_driver\n", "from tf_agents.drivers import dynamic_episode_driver" ] },
{ "cell_type": "markdown", "metadata": { "id": "9V7DEcB8IeiQ" }, "source": [ "## Python Drivers\n", "\n", "The `PyDriver` class takes a Python environment, a Python policy and a list of observers to update at each step. The main method is `run()`, which steps the environment using actions from the policy until at least one of the following termination criteria is met: the number of steps reaches `max_steps` or the number of episodes reaches `max_episodes`.\n", "\n", "The implementation is roughly as follows:\n", "\n", "```python\n", "import numpy as np\n", "from tf_agents.trajectories import trajectory\n", "\n", "\n", "class PyDriver(object):\n", "\n", "  def __init__(self, env, policy, observers, max_steps=1, max_episodes=1):\n", "    self._env = env\n", "    self._policy = policy\n", "    self._observers = observers or []\n", "    self._max_steps = max_steps or np.inf\n", "    self._max_episodes = max_episodes or np.inf\n", "\n", "  def run(self, time_step, policy_state=()):\n", "    num_steps = 0\n", "    num_episodes = 0\n", "    while num_steps < self._max_steps and num_episodes < self._max_episodes:\n", "\n", "      # Compute an action using the policy for the given time_step\n", "      action_step = self._policy.action(time_step, policy_state)\n", "\n", "      # Apply the action to the environment and get the next step\n", "      next_time_step = self._env.step(action_step.action)\n", "\n", "      # Package information into a trajectory\n", "      traj = trajectory.Trajectory(\n", "          time_step.step_type,\n", "          time_step.observation,\n", "          action_step.action,\n", "          action_step.info,\n", "          next_time_step.step_type,\n", "          next_time_step.reward,\n", "          next_time_step.discount)\n", "\n", "      for observer in self._observers:\n", "        observer(traj)\n", "\n", "      # Update statistics to check termination\n", "      num_episodes += np.sum(traj.is_last())\n", "      num_steps += np.sum(~traj.is_boundary())\n", "\n", "      time_step = next_time_step\n", "      policy_state = action_step.state\n", "\n", "    return time_step, policy_state\n", "\n", "```\n", "\n", "In the following example, we run a random policy in the CartPole environment, saving the results to a replay buffer and computing some metrics." ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "Dj4_-77_5ExP" }, "outputs": [], "source": [ "env = suite_gym.load('CartPole-v0')\n", "policy = random_py_policy.RandomPyPolicy(time_step_spec=env.time_step_spec(),\n", "                                         action_spec=env.action_spec())\n", "replay_buffer = []\n", "metric = py_metrics.AverageReturnMetric()\n", "observers = [replay_buffer.append, metric]\n", "driver = py_driver.PyDriver(\n", "    env, policy, observers, max_steps=20, max_episodes=1)\n", "\n", "initial_time_step = env.reset()\n", "final_time_step, _ = driver.run(initial_time_step)\n", "\n", "print('Replay Buffer:')\n", "for traj in replay_buffer:\n", "  print(traj)\n", "\n", "print('Average Return: ', metric.result())" ] }
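, { "cell_type": "markdown", "metadata": {}, "source": [ "Observers are simply callables that receive the `Trajectory` emitted by the driver at every step; `replay_buffer.append` and the metric above are two examples. As a minimal sketch (not part of the original tutorial), the cell below writes such an observer by hand and records the `reward` field of each Trajectory; the names `reward_log` and `log_rewards` are purely illustrative." ] },
{ "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch (illustrative only): a custom observer is any callable that\n", "# accepts the Trajectory produced by the driver at each step.\n", "env = suite_gym.load('CartPole-v0')\n", "policy = random_py_policy.RandomPyPolicy(time_step_spec=env.time_step_spec(),\n", "                                         action_spec=env.action_spec())\n", "\n", "reward_log = []\n", "\n", "def log_rewards(traj):\n", "  # Each Trajectory carries the observation, action, reward, step types, etc.\n", "  reward_log.append(traj.reward)\n", "\n", "driver = py_driver.PyDriver(\n", "    env, policy, observers=[log_rewards], max_steps=10, max_episodes=1)\n", "final_time_step, _ = driver.run(env.reset())\n", "\n", "print('Rewards seen by the observer: ', reward_log)" ] }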
, { "cell_type": "markdown", "metadata": { "id": "X3Yrxg36Ik1x" }, "source": [ "## TensorFlow Drivers\n", "\n", "We also have drivers in TensorFlow that are functionally similar to Python drivers, but use TF environments, TF policies, TF observers etc. We currently have 2 TensorFlow drivers: `DynamicStepDriver`, which terminates after a given number of (valid) environment steps, and `DynamicEpisodeDriver`, which terminates after a given number of episodes. Let us look at an example of the DynamicEpisodeDriver in action.\n" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "WC4ba3ObSceA" }, "outputs": [], "source": [ "env = suite_gym.load('CartPole-v0')\n", "tf_env = tf_py_environment.TFPyEnvironment(env)\n", "\n", "tf_policy = random_tf_policy.RandomTFPolicy(action_spec=tf_env.action_spec(),\n", "                                            time_step_spec=tf_env.time_step_spec())\n", "\n", "\n", "num_episodes = tf_metrics.NumberOfEpisodes()\n", "env_steps = tf_metrics.EnvironmentSteps()\n", "observers = [num_episodes, env_steps]\n", "driver = dynamic_episode_driver.DynamicEpisodeDriver(\n", "    tf_env, tf_policy, observers, num_episodes=2)\n", "\n", "# Initial driver.run will reset the environment and initialize the policy.\n", "final_time_step, policy_state = driver.run()\n", "\n", "print('final_time_step', final_time_step)\n", "print('Number of Steps: ', env_steps.result().numpy())\n", "print('Number of Episodes: ', num_episodes.result().numpy())" ] },
{ "cell_type": "code", "execution_count": null, "metadata": { "id": "Sz5jhHnU0fX1" }, "outputs": [], "source": [ "# Continue running from previous state\n", "final_time_step, _ = driver.run(final_time_step, policy_state)\n", "\n", "print('final_time_step', final_time_step)\n", "print('Number of Steps: ', env_steps.result().numpy())\n", "print('Number of Episodes: ', num_episodes.result().numpy())" ] }
], "metadata": { "colab": { "collapsed_sections": [], "name": "4_drivers_tutorial.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }