{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "id": "_jQ1tEQCxwRx" }, "outputs": [], "source": [ "##### Copyright 2020 The TensorFlow Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "V_sgB_5dx1f1" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import os\n", "os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # 设置日志级别为ERROR,以减少警告信息\n", "# 禁用 Gemini 的底层库(gRPC 和 Abseil)在初始化日志警告\n", "os.environ[\"GRPC_VERBOSITY\"] = \"ERROR\"\n", "os.environ[\"GLOG_minloglevel\"] = \"3\" # 0: INFO, 1: WARNING, 2: ERROR, 3: FATAL\n", "os.environ[\"GLOG_minloglevel\"] = \"true\"\n", "import logging\n", "import tensorflow as tf\n", "tf.get_logger().setLevel(logging.ERROR)\n", "tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.ERROR)\n", "!export TF_FORCE_GPU_ALLOW_GROWTH=true\n", "from pathlib import Path\n", "\n", "temp_dir = Path(\".temp\")\n", "temp_dir.mkdir(parents=True, exist_ok=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "p62G8M_viUJp" }, "source": [ "# 使用 Actor-Critic 方法玩 CartPole 游戏\n" ] }, { "cell_type": "markdown", "metadata": { "id": "-mJ2i6jvZ3sK" }, "source": [ "\n", " \n", " \n", " \n", " \n", "
View on TensorFlow.org | Run in Google Colab | View source on GitHub | Download notebook
" ] }, { "cell_type": "markdown", "metadata": { "id": "kFgN7h_wiUJq" }, "source": [ "本教程演示如何使用 TensorFlow 实现 [Actor-Critic](https://papers.nips.cc/paper/1786-actor-critic-algorithms.pdf) 方法以在 [Open AI Gym](https://www.gymlibrary.dev/) [`CartPole-v0`](https://www.gymlibrary.dev/environments/classic_control/cart_pole/) 环境中训练代理。假定读者对[(深度)强化学习](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf)的[策略梯度方法](https://en.wikipedia.org/wiki/Deep_reinforcement_learning)有所了解。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "_kA10ZKRR0hi" }, "source": [ "**Actor-Critic 方法**\n", "\n", "Actor-Critic 方法是表示与价值函数无关的策略函数的[时间差分 (TD) 学习](https://en.wikipedia.org/wiki/Temporal_difference_learning)方法。\n", "\n", "策略函数(或策略)返回代理基于给定状态可以采取的动作的概率分布。价值函数确定从给定状态开始并在此后永远根据特定策略运行的代理的预期回报。\n", "\n", "在 Actor-Critic 方法中,策略被称为*行动者*,它根据给定状态提出一组可能的动作,而估算的价值函数则被称为*评价者*,它根据给定的策略评估*行动者*所采取的动作。\n", "\n", "在本教程中,将使用一个具有两个输出的神经网络来表示*行动者*和*评价者*。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "rBfiafKSRs2k" }, "source": [ "**`CartPole-v0`**\n", "\n", "在 [`CartPole-v0` 环境](https://www.gymlibrary.dev/environments/classic_control/cart_pole/)中,将长杆连接到沿无摩擦轨道移动的小车上。长杆开始时是直立的,代理的目标是通过对小车施加 `-1` 或 `+1` 的力来防止其倒下。对于长杆保持直立的每个时间步骤,奖励为 `+1`。当 1) 长杆与垂直方向的夹角超过 15 度或 2) 小车从中心移出超过 2.4 个单位时,片段结束。\n", "\n", "
\n", "
<figure>\n",
        "    <image src=\"https://tensorflow.org/tutorials/reinforcement_learning/images/cartpole-v0.gif\">\n",
        "    <figcaption>\n",
        "      Trained actor-critic model in Cartpole-v0 environment\n",
        "    </figcaption>\n",
        "  </figure>
\n", "
\n" ] }, { "cell_type": "markdown", "metadata": { "id": "XSNVK0AeRoJd" }, "source": [ "在 100 次连续试验中,当片段的平均总奖励达到 195 时,问题被视为“已解决”。" ] }, { "cell_type": "markdown", "metadata": { "id": "glLwIctHiUJq" }, "source": [ "## 安装\n", "\n", "导入必要的软件包并配置全局设置。\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "id": "13l6BbxKhCKp" }, "outputs": [], "source": [ "!pip install -q gymnasium[classic_control]\n", "!pip install -q pyglet" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "id": "WBeQhPi2S4m5" }, "outputs": [], "source": [ "%%bash\n", "# Install additional packages for visualization\n", "sudo apt-get install -y python-opengl > /dev/null 2>&1\n", "pip install git+https://github.com/tensorflow/docs > /dev/null 2>&1" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "tT4N3qYviUJr" }, "outputs": [], "source": [ "import collections\n", "import gymnasium as gym\n", "import numpy as np\n", "import statistics\n", "import tensorflow as tf\n", "import tqdm\n", "\n", "from matplotlib import pyplot as plt\n", "from tensorflow.keras import layers\n", "from typing import Any, List, Sequence, Tuple\n", "\n", "\n", "# Create the environment\n", "env = gym.make(\"CartPole-v1\")\n", "\n", "# Set seed for experiment reproducibility\n", "seed = 42\n", "tf.random.set_seed(seed)\n", "np.random.seed(seed)\n", "\n", "# Small epsilon value for stabilizing division operations\n", "eps = np.finfo(np.float32).eps.item()" ] }, { "cell_type": "markdown", "metadata": { "id": "AOUCe2D0iUJu" }, "source": [ "## 模型\n", "\n", "将使用分别生成动作概率和评价者值的神经网络来对*行动者*和*评价者*进行建模。本教程使用模型子类化来定义模型。\n", "\n", "在前向传递过程中,模型将状态作为输入,并将输出动作概率和评价者价值 $V$,后者会对与状态相关的[价值函数](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#value-functions)进行建模。目标是训练一个基于可使预期[回报](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#reward-and-return)最大化的策略 $\\pi$ 来选择动作的模型。\n", "\n", "对于 `CartPole-v0`,有四个值来表示状态:小车位置、小车速度、长杆角度和长杆速度。代理可以采取两种动作分别将小车向左 (`0`) 和向右 (`1`) 推。\n", "\n", "有关更多信息,请参阅 [Gym 的 Cart Pole 文档页面](https://www.gymlibrary.dev/environments/classic_control/cart_pole/)和 Barto、Sutton 和 Anderson (1983) 所著的 [*Neuronlike adaptive elements that can solve difficult learning control problems*](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)。\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "aXKbbMC-kmuv" }, "outputs": [], "source": [ "class ActorCritic(tf.keras.Model):\n", " \"\"\"Combined actor-critic network.\"\"\"\n", "\n", " def __init__(\n", " self, \n", " num_actions: int, \n", " num_hidden_units: int):\n", " \"\"\"Initialize.\"\"\"\n", " super().__init__()\n", "\n", " self.common = layers.Dense(num_hidden_units, activation=\"relu\")\n", " self.actor = layers.Dense(num_actions)\n", " self.critic = layers.Dense(1)\n", "\n", " def call(self, inputs: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:\n", " x = self.common(inputs)\n", " return self.actor(x), self.critic(x)" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "nWyxJgjLn68c" }, "outputs": [], "source": [ "num_actions = env.action_space.n # 2\n", "num_hidden_units = 128\n", "\n", "model = ActorCritic(num_actions, num_hidden_units)" ] }, { "cell_type": "markdown", "metadata": { "id": "hk92njFziUJw" }, "source": [ "## 训练代理\n", "\n", "要训练代理,您将按照以下步骤操作:\n", "\n", "1. 在环境上运行代理以收集每个片段的训练数据。\n", "2. 计算每个时间步骤的预期回报。\n", "3. 计算组合的 Actor-Critic 模型的损失。\n", "4. 计算梯度并更新网络参数。\n", "5. 重复第 1-4 步,直至达到成功标准或最大片段数。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "R2nde2XDs8Gh" }, "source": [ "### 1. 
{ "cell_type": "markdown", "metadata": { "id": "R2nde2XDs8Gh" }, "source": [ "### 1. Collecting training data\n", "\n", "As in supervised learning, in order to train the Actor-Critic model you need training data. However, to collect such data, the model needs to be \"run\" in the environment.\n", "\n", "Training data is collected for each episode. At each time step, the model's forward pass is run on the environment's state in order to generate action probabilities and the critic value based on the current policy parameterized by the model's weights.\n", "\n", "The next action is sampled from the action probabilities generated by the model, and is then applied to the environment, which produces the next state and reward.\n", "\n", "This process is implemented in the `run_episode` function, which uses TensorFlow operations so that it can later be compiled into a TensorFlow graph for faster training. Note that `tf.TensorArray`s are used to support Tensor iteration over variable-length arrays." ] }, 
{ "cell_type": "code", "execution_count": 8, "metadata": { "id": "5URrbGlDSAGx" }, "outputs": [], "source": [ "# Wrap Gym's `env.step` call as an operation in a TensorFlow function.\n", "# This would allow it to be included in a callable TensorFlow graph.\n", "\n", "def env_step(action: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:\n", "  \"\"\"Returns state, reward and done flag given an action.\"\"\"\n", "\n", "  state, reward, done, truncated, info = env.step(action)\n", "  return (state.astype(np.float32),\n", "          np.array(reward, np.int32),\n", "          np.array(done, np.int32))\n", "\n", "\n", "def tf_env_step(action: tf.Tensor) -> List[tf.Tensor]:\n", "  return tf.numpy_function(env_step, [action],\n", "                           [tf.float32, tf.int32, tf.int32])" ] }, 
{ "cell_type": "code", "execution_count": 9, "metadata": { "id": "a4qVRV063Cl9" }, "outputs": [], "source": [ "def run_episode(\n", "    initial_state: tf.Tensor,\n", "    model: tf.keras.Model,\n", "    max_steps: int) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:\n", "  \"\"\"Runs a single episode to collect training data.\"\"\"\n", "\n", "  action_probs = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)\n", "  values = tf.TensorArray(dtype=tf.float32, size=0, dynamic_size=True)\n", "  rewards = tf.TensorArray(dtype=tf.int32, size=0, dynamic_size=True)\n", "\n", "  initial_state_shape = initial_state.shape\n", "  state = initial_state\n", "\n", "  for t in tf.range(max_steps):\n", "    # Convert state into a batched tensor (batch size = 1)\n", "    state = tf.expand_dims(state, 0)\n", "\n", "    # Run the model to get action probabilities and the critic value\n", "    action_logits_t, value = model(state)\n", "\n", "    # Sample the next action from the action probability distribution\n", "    action = tf.random.categorical(action_logits_t, 1)[0, 0]\n", "    action_probs_t = tf.nn.softmax(action_logits_t)\n", "\n", "    # Store the critic value\n", "    values = values.write(t, tf.squeeze(value))\n", "\n", "    # Store the probability of the action chosen\n", "    action_probs = action_probs.write(t, action_probs_t[0, action])\n", "\n", "    # Apply the action to the environment to get the next state and reward\n", "    state, reward, done = tf_env_step(action)\n", "    state.set_shape(initial_state_shape)\n", "\n", "    # Store the reward\n", "    rewards = rewards.write(t, reward)\n", "\n", "    if tf.cast(done, tf.bool):\n", "      break\n", "\n", "  action_probs = action_probs.stack()\n", "  values = values.stack()\n", "  rewards = rewards.stack()\n", "\n", "  return action_probs, values, rewards" ] }, 
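{ "cell_type": "markdown", "metadata": {}, "source": [
    "Although `run_episode` is written with TensorFlow ops so that it can later be traced into a graph, it also works eagerly, which makes it easy to check the shapes of the collected tensors. A minimal sketch, assuming the cells above have been run:\n",
    "\n",
    "```python\n",
    "# Collect one episode with the (still untrained) model and inspect the results.\n",
    "initial_state, info = env.reset()\n",
    "initial_state = tf.constant(initial_state, dtype=tf.float32)\n",
    "\n",
    "action_probs, values, rewards = run_episode(initial_state, model, max_steps=500)\n",
    "print(action_probs.shape, values.shape, rewards.shape)  # all (T,), with T the episode length\n",
    "print(int(tf.reduce_sum(rewards)))  # undiscounted reward for this episode\n",
    "```" ] }, 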
{ "cell_type": "markdown", "metadata": { "id": "lBnIHdz22dIx" }, "source": [ "### 2. Computing the expected returns\n", "\n", "The sequence of rewards collected at each time step $t$ during one episode, $\\{r_{t}\\}^{T}_{t=1}$, is converted into a sequence of expected returns $\\{G_{t}\\}^{T}_{t=1}$, in which the sum of rewards is taken from the current time step $t$ to $T$ and each reward is multiplied by an exponentially decaying discount factor $\\gamma$:\n", "\n", "$$G_{t} = \\sum^{T}_{t'=t} \\gamma^{t'-t}r_{t'}$$\n", "\n", "Since $\\gamma\\in(0,1)$, rewards further out from the current time step are given less weight.\n", "\n", "Intuitively, the expected return simply implies that rewards now are better than rewards later. In a mathematical sense, it ensures that the sum of the rewards converges.\n", "\n", "To stabilize training, the resulting sequence of returns is also standardized (i.e. made to have zero mean and unit standard deviation).\n" ] }, 
{ "cell_type": "code", "execution_count": 10, "metadata": { "id": "jpEwFyl315dl" }, "outputs": [], "source": [ "def get_expected_return(\n", "    rewards: tf.Tensor,\n", "    gamma: float,\n", "    standardize: bool = True) -> tf.Tensor:\n", "  \"\"\"Compute expected returns per timestep.\"\"\"\n", "\n", "  n = tf.shape(rewards)[0]\n", "  returns = tf.TensorArray(dtype=tf.float32, size=n)\n", "\n", "  # Start from the end of `rewards` and accumulate reward sums\n", "  # into the `returns` array\n", "  rewards = tf.cast(rewards[::-1], dtype=tf.float32)\n", "  discounted_sum = tf.constant(0.0)\n", "  discounted_sum_shape = discounted_sum.shape\n", "  for i in tf.range(n):\n", "    reward = rewards[i]\n", "    discounted_sum = reward + gamma * discounted_sum\n", "    discounted_sum.set_shape(discounted_sum_shape)\n", "    returns = returns.write(i, discounted_sum)\n", "  returns = returns.stack()[::-1]\n", "\n", "  if standardize:\n", "    returns = ((returns - tf.math.reduce_mean(returns)) /\n", "               (tf.math.reduce_std(returns) + eps))\n", "\n", "  return returns" ] }, 
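{ "cell_type": "markdown", "metadata": {}, "source": [
    "As a concrete check, the formula can be applied by hand to a short reward sequence. A minimal sketch, assuming the cell above has been run (standardization is turned off so the raw discounted sums are visible):\n",
    "\n",
    "```python\n",
    "# For rewards [1, 1, 1] and gamma = 0.99 the discounted sums are:\n",
    "#   G_3 = 1.0\n",
    "#   G_2 = 1 + 0.99 * 1.0  = 1.99\n",
    "#   G_1 = 1 + 0.99 * 1.99 = 2.9701\n",
    "rewards = tf.constant([1, 1, 1], dtype=tf.int32)\n",
    "print(get_expected_return(rewards, gamma=0.99, standardize=False).numpy())\n",
    "# -> [2.9701 1.99   1.    ] (up to floating-point rounding)\n",
    "```" ] }, 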
" ] }, { "cell_type": "markdown", "metadata": { "id": "Y304O4OAxiAv" }, "source": [ "##### 优势\n", "\n", "$L_{actor}$ 公式中的 $G - V$ 项称为[优势](https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#advantage-functions),它表示针对给定的特定状态采取的动作要比根据用于该状态的策略 $\\pi$ 选择的随机动作好得多。\n", "\n", "尽管可以排除基线,但这可能会导致训练期间出现高方差。选择评价者 $V$ 作为基线的好处是,它可以训练得尽可能接近 $G$,从而降低方差。\n", "\n", "此外,在没有评价者时,该算法将尝试根据预期回报来提高针对特定状态采取的动作的概率,如果动作之间的相对概率保持相同,可能不会有太大的区别。\n", "\n", "例如,假设给定状态的两个动作将产生相同的预期回报。在没有评价者时,该算法将基于目标 $J$ 尝试提高这些动作的概率。在有评价者时,结果可能是不存在优势 ($G - V = 0$),这对于提高动作的概率没有任何益处,并且算法会将梯度设置为零。\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": { "id": "1hrPLrgGxlvb" }, "source": [ "#### The Critic loss\n", "\n", "可以将 $V$ 训练得尽可能接近 $G$,这样便可将评价者损失设置为具有以下损失函数的回归问题:\n", "\n", "$$L_{critic} = L_{\\delta}(G, V^{\\pi}_{\\theta})$$\n", "\n", "其中 $L_{\\delta}$ 是 [Huber 损失](https://en.wikipedia.org/wiki/Huber_loss),它对数据中离群值的敏感性低于平方误差损失。\n" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "9EXwbEez6n9m" }, "outputs": [], "source": [ "huber_loss = tf.keras.losses.Huber(reduction=tf.keras.losses.Reduction.SUM)\n", "\n", "def compute_loss(\n", " action_probs: tf.Tensor, \n", " values: tf.Tensor, \n", " returns: tf.Tensor) -> tf.Tensor:\n", " \"\"\"Computes the combined Actor-Critic loss.\"\"\"\n", "\n", " advantage = returns - values\n", "\n", " action_log_probs = tf.math.log(action_probs)\n", " actor_loss = -tf.math.reduce_sum(action_log_probs * advantage)\n", "\n", " critic_loss = huber_loss(values, returns)\n", "\n", " return actor_loss + critic_loss" ] }, { "cell_type": "markdown", "metadata": { "id": "HSYkQOmRfV75" }, "source": [ "### 4. 定义训练步骤以更新参数\n", "\n", "将上述所有步骤组合成一个每个片段都会运行的训练步骤。使用 `tf.GradientTape` 上下文来执行损失函数之前的所有步骤,以实现自动微分。\n", "\n", "本教程使用 Adam 优化器将梯度应用于模型参数。\n", "\n", "无折扣奖励的总和 `` 也会在此步骤中计算。此值稍后将用于评估是否满足成功标准。\n", "\n", "将 `tf.function` 上下文应用于 `train_step` 函数,这样便可将其编译为可调用的 TensorFlow 计算图,进而可以将训练速度提高 10 倍。\n" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "QoccrkF3IFCg" }, "outputs": [], "source": [ "optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)\n", "\n", "\n", "@tf.function\n", "def train_step(\n", " initial_state: tf.Tensor, \n", " model: tf.keras.Model, \n", " optimizer: tf.keras.optimizers.Optimizer, \n", " gamma: float, \n", " max_steps_per_episode: int) -> tf.Tensor:\n", " \"\"\"Runs a model training step.\"\"\"\n", "\n", " with tf.GradientTape() as tape:\n", "\n", " # Run the model for one episode to collect training data\n", " action_probs, values, rewards = run_episode(\n", " initial_state, model, max_steps_per_episode) \n", "\n", " # Calculate the expected returns\n", " returns = get_expected_return(rewards, gamma)\n", "\n", " # Convert training data to appropriate TF tensor shapes\n", " action_probs, values, returns = [\n", " tf.expand_dims(x, 1) for x in [action_probs, values, returns]] \n", "\n", " # Calculate the loss values to update our network\n", " loss = compute_loss(action_probs, values, returns)\n", "\n", " # Compute the gradients from the loss\n", " grads = tape.gradient(loss, model.trainable_variables)\n", "\n", " # Apply the gradients to the model's parameters\n", " optimizer.apply_gradients(zip(grads, model.trainable_variables))\n", "\n", " episode_reward = tf.math.reduce_sum(rewards)\n", "\n", " return episode_reward" ] }, { "cell_type": "markdown", "metadata": { "id": "HFvZiDoAflGK" }, "source": [ "### 5. 
{ "cell_type": "markdown", "metadata": { "id": "HFvZiDoAflGK" }, "source": [ "### 5. Running the training loop\n", "\n", "Training is executed by running the training step until either the success criterion or the maximum number of episodes is reached.\n", "\n", "A queue keeps a running record of episode rewards. Once it holds 100 trials, the oldest reward is dropped from the left (tail) end of the queue while the newest reward is added at the head (right). A running average of these rewards is then used to check the success criterion.\n", "\n", "Depending on your runtime, training can take several minutes to complete." ] }, 
{ "cell_type": "code", "execution_count": 13, "metadata": { "id": "kbmBxnzLiUJx" }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " 15%|█▍ | 1458/10000 [11:21<1:06:34, 2.14it/s, episode_reward=500, running_reward=478]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "\n", "Solved at episode 1458: average reward: 478.01!\n", "CPU times: user 27min 22s, sys: 2min 6s, total: 29min 28s\n", "Wall time: 11min 21s\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "%%time\n", "\n", "min_episodes_criterion = 100\n", "max_episodes = 10000\n", "max_steps_per_episode = 500\n", "\n", "# `CartPole-v1` is considered solved if the average reward is >= 475 over 100\n", "# consecutive trials\n", "reward_threshold = 475\n", "running_reward = 0\n", "\n", "# The discount factor for future rewards\n", "gamma = 0.99\n", "\n", "# Keep the rewards of the most recent episodes\n", "episodes_reward: collections.deque = collections.deque(maxlen=min_episodes_criterion)\n", "\n", "t = tqdm.trange(max_episodes)\n", "for i in t:\n", "  initial_state, info = env.reset()\n", "  initial_state = tf.constant(initial_state, dtype=tf.float32)\n", "  episode_reward = int(train_step(\n", "      initial_state, model, optimizer, gamma, max_steps_per_episode))\n", "\n", "  episodes_reward.append(episode_reward)\n", "  running_reward = statistics.mean(episodes_reward)\n", "\n", "  t.set_postfix(\n", "      episode_reward=episode_reward, running_reward=running_reward)\n", "\n", "  # Show the average episode reward every 10 episodes\n", "  if i % 10 == 0:\n", "    pass  # print(f'Episode {i}: average reward: {running_reward}')\n", "\n", "  if running_reward > reward_threshold and i >= min_episodes_criterion:\n", "    break\n", "\n", "print(f'\\nSolved at episode {i}: average reward: {running_reward:.2f}!')" ] }, 
{ "cell_type": "markdown", "metadata": { "id": "ru8BEwS1EmAv" }, "source": [ "## Visualization\n", "\n", "After training, it is good to visualize how the model performs in the environment. You can run the cells below to generate a GIF animation of one episode run of the model. Note that additional packages need to be installed for Gym to render the environment's images correctly in Colab." ] }, 
{ "cell_type": "code", "execution_count": 15, "metadata": { "id": "qbIMMkfmRHyC" }, "outputs": [], "source": [ "# Render an episode and save it as a GIF file\n", "\n", "from IPython import display as ipythondisplay\n", "from PIL import Image\n", "\n", "render_env = gym.make(\"CartPole-v1\", render_mode='rgb_array')\n", "\n", "def render_episode(env: gym.Env, model: tf.keras.Model, max_steps: int):\n", "  state, info = env.reset()\n", "  state = tf.constant(state, dtype=tf.float32)\n", "  screen = env.render()\n", "  images = [Image.fromarray(screen)]\n", "\n", "  for i in range(1, max_steps + 1):\n", "    state = tf.expand_dims(state, 0)\n", "    action_probs, _ = model(state)\n", "    action = np.argmax(np.squeeze(action_probs))\n", "\n", "    state, reward, done, truncated, info = env.step(action)\n", "    state = tf.constant(state, dtype=tf.float32)\n", "\n", "    # Render the screen every 10 steps\n", "    if i % 10 == 0:\n", "      screen = env.render()\n", "      images.append(Image.fromarray(screen))\n", "\n", "    if done:\n", "      break\n", "\n", "  return images\n", "\n", "\n", "# Save the frames as a GIF\n", "images = render_episode(render_env, model, max_steps_per_episode)\n", "image_file = temp_dir / 'cartpole-v1.gif'\n", "# loop=0: loop forever, duration=1: play each frame for 1 ms\n", "images[0].save(\n", "    image_file, save_all=True, append_images=images[1:], loop=0, duration=1)" ] }, 
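{ "cell_type": "markdown", "metadata": {}, "source": [
    "Pillow interprets `duration` as milliseconds per frame, so `duration=1` plays the animation about as fast as the viewer allows. If the resulting GIF looks too fast, the frames can be saved again with a larger value; a sketch reusing `images` from the cell above (the file name here is just an example):\n",
    "\n",
    "```python\n",
    "# Roughly 30 ms per frame (about 33 fps); adjust to taste.\n",
    "images[0].save(\n",
    "    temp_dir / 'cartpole-v1-slow.gif',\n",
    "    save_all=True, append_images=images[1:], loop=0, duration=30)\n",
    "```" ] }, 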
{ "cell_type": "code", "execution_count": 16, "metadata": { "id": "TLd720SejKmf" }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import tensorflow_docs.vis.embed as embed\n", "embed.embed_file(image_file)" ] }, 
{ "cell_type": "markdown", "metadata": { "id": "lnq9Hzo1Po6X" }, "source": [ "## Next steps\n", "\n", "This tutorial demonstrated how to implement the Actor-Critic method using TensorFlow.\n", "\n", "As a next step, you could try training a model on a different environment in Gym.\n", "\n", "For additional information regarding Actor-Critic methods and the CartPole-v0 problem, you may refer to the following resources:\n", "\n", "- [The Actor-Critic method](https://hal.inria.fr/hal-00840470/document)\n", "- [The Actor-Critic lecture (CAL)](https://www.youtube.com/watch?v=EKqxumCuAAY&list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A&index=7&t=0s)\n", "- [Cart Pole learning control problem [Barto, et al. 1983]](http://www.derongliu.org/adp/adp-cdrom/Barto1983.pdf)\n", "\n", "For more reinforcement learning examples in TensorFlow, you can check out the following resources:\n", "\n", "- [Reinforcement learning code examples (keras.io)](https://keras.io/examples/rl/)\n", "- [TF-Agents reinforcement learning library](https://tensorflow.google.cn/agents)\n" ] } ], "metadata": { "colab": { "collapsed_sections": [ "_jQ1tEQCxwRx" ], "name": "actor_critic.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 0 }