{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "1Pi_B2cvdBiW" }, "source": [ "##### Copyright 2023 The TF-Agents Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "nQnmcm0oI1Q-" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "3NFuTvWVZG_B" }, "source": [ "# 策略\n", "\n", "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看 在 Google Colab 运行 在 Github 上查看源代码 下载笔记本
" ] }, { "cell_type": "markdown", "metadata": { "id": "31uij8nIo5bG" }, "source": [ "## 简介" ] }, { "cell_type": "markdown", "metadata": { "id": "PqFn7q5bs3BF" }, "source": [ "在强化学习中,“策略”一词是指将环境的观测值映射到操作或操作分布。在 TF-Agents 中,环境的观测值包含在命名元组 `TimeStep('step_type', 'discount', 'reward', 'observation')` 内,而策略会将时间步骤映射到操作或操作分布。大多数策略使用 `timestep.observation`,某些策略会使用 `timestep.step_type`(例如,在有状态策略中,在片段开始时重置状态),但 `timestep.discount` 和 `timestep.reward` 通常会被忽略。\n", "\n", "策略与 TF-Agents 中其他组成部分的相关性如下。大多数策略都通过神经网络来计算时间步骤的操作和/或操作分布。代理可以包含一个或多个用于不同目的的策略,例如采用一个主要策略进行训练用以部署,并采用一个噪声策略进行数据收集。策略可以保存/恢复,并且可以在不依赖于代理的状态下用于数据收集、评估等工作。\n", "\n", "一些策略在 Tensorflow 中更容易编写(例如神经网络策略),而其他策略在 Python 中更容易编写(例如基于操作脚本的策略)。 因此,在 TF-Agent 中,Python 和 Tensorflow 策略均支持使用。此外,有时可能需要在 Python 环境中使用通过 TensorFlow 编写的策略,反之亦然(例如使用 TensorFlow 策略进行训练,但之后需要在生产 python 环境中进行部署)。为了简化这一步骤,我们提供了用于在 python 和 TensorFlow 策略之间进行转换的包装器。\n", "\n", "与策略有关的另一项有趣内容是策略包装器,它们能够以某种方式修改给定策略,例如添加特定类型的噪声、制作随机策略的贪心或 ϵ 贪心版本、随机混合多个策略等。 " ] }, { "cell_type": "markdown", "metadata": { "id": "HdnG_TT_amWH" }, "source": [ "## 设置" ] }, { "cell_type": "markdown", "metadata": { "id": "9Meq2nT_aquh" }, "source": [ "如果尚未安装 TF-Agents,请运行以下命令:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xsLTHlVdiZP3" }, "outputs": [], "source": [ "!pip install tf-agents" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sdvop99JlYSM" }, "outputs": [], "source": [ "from __future__ import absolute_import\n", "from __future__ import division\n", "from __future__ import print_function\n", "\n", "import abc\n", "import tensorflow as tf\n", "import tensorflow_probability as tfp\n", "import numpy as np\n", "\n", "from tf_agents.specs import array_spec\n", "from tf_agents.specs import tensor_spec\n", "from tf_agents.networks import network\n", "\n", "from tf_agents.policies import py_policy\n", "from tf_agents.policies import random_py_policy\n", "from tf_agents.policies import scripted_py_policy\n", "\n", "from tf_agents.policies import tf_policy\n", "from tf_agents.policies import random_tf_policy\n", "from tf_agents.policies import actor_policy\n", "from tf_agents.policies import q_policy\n", "from tf_agents.policies import greedy_policy\n", "\n", "from tf_agents.trajectories import time_step as ts" ] }, { "cell_type": "markdown", "metadata": { "id": "NyXO5-Aalb-6" }, "source": [ "## Python 策略" ] }, { "cell_type": "markdown", "metadata": { "id": "DOtUZ1hs02bu" }, "source": [ "Python 策略的接口在 `policies/py_policy.PyPolicy` 中进行定义。主要方法为:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "4PqNEVls1uqc" }, "outputs": [], "source": [ "class Base(object):\n", "\n", " @abc.abstractmethod\n", " def __init__(self, time_step_spec, action_spec, policy_state_spec=()):\n", " self._time_step_spec = time_step_spec\n", " self._action_spec = action_spec\n", " self._policy_state_spec = policy_state_spec\n", "\n", " @abc.abstractmethod\n", " def reset(self, policy_state=()):\n", " # return initial_policy_state.\n", " pass\n", "\n", " @abc.abstractmethod\n", " def action(self, time_step, policy_state=()):\n", " # return a PolicyStep(action, state, info) named tuple.\n", " pass\n", "\n", " @abc.abstractmethod\n", " def distribution(self, time_step, policy_state=()):\n", " # Not implemented in python, only for TF policies.\n", " pass\n", "\n", " @abc.abstractmethod\n", " def update(self, policy):\n", " # update self to be similar to the input `policy`.\n", " pass\n", "\n", " @property\n", " def time_step_spec(self):\n", " return self._time_step_spec\n", "\n", " @property\n", " def action_spec(self):\n", " return self._action_spec\n", "\n", " @property\n", " def policy_state_spec(self):\n", " return self._policy_state_spec" ] }, { "cell_type": "markdown", "metadata": { "id": "16kyDKk65bka" }, "source": [ "最重要的方法为 `action(time_step)`,该方法可将包含环境观测值的 `time_step` 映射到包含以下特性的 PolicyStep 命名元组:\n", "\n", "- `action`:应用于环境的操作。\n", "- `state`:将被馈入下一个操作调用的策略状态(例如 RNN 状态)。\n", "- `info`:可选辅助信息,例如操作日志概率。\n", "\n", "`time_step_spec` 和 `action_spec` 分别为输入时间步骤规范和输出操作规范。策略还具有 `reset` 函数,通常用于重置有状态策略中的状态。`update(new_policy)` 函数可将 `self` 朝着 `new_policy` 更新。\n", "\n", "现在,让我们看看 Python 策略的两个示例。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "YCH1Hs_WlmDT" }, "source": [ "### 示例 1:随机 Python 策略" ] }, { "cell_type": "markdown", "metadata": { "id": "lbnQ0BQ3_0N2" }, "source": [ "`RandomPyPolicy` 是 `PyPolicy` 的一个简单示例, 它可以为给定的离散/连续 action_spec 生成随机操作。将忽略输入 `time_step`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QX8M4Nl-_0uu" }, "outputs": [], "source": [ "action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)\n", "my_random_py_policy = random_py_policy.RandomPyPolicy(time_step_spec=None,\n", " action_spec=action_spec)\n", "time_step = None\n", "action_step = my_random_py_policy.action(time_step)\n", "print(action_step)\n", "action_step = my_random_py_policy.action(time_step)\n", "print(action_step)" ] }, { "cell_type": "markdown", "metadata": { "id": "B8WrFOR1lz31" }, "source": [ "### 示例 2:脚本化 Python 策略" ] }, { "cell_type": "markdown", "metadata": { "id": "AJ0Br1lGBnTT" }, "source": [ "脚本化策略可回放以 `(num_repeats, action)` 元组列表形式呈现的操作脚本。每次调用 `action` 函数时,它都会返回列表内的下一个操作并使其完成指定的重复次数,然后再继续执行列表中的下一个操作。调用 `reset` 方法即可从列表开头重新开始执行。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_mZ244m4BUYv" }, "outputs": [], "source": [ "action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)\n", "action_script = [(1, np.array([5, 2], dtype=np.int32)), \n", " (0, np.array([0, 0], dtype=np.int32)), # Setting `num_repeats` to 0 will skip this action.\n", " (2, np.array([1, 2], dtype=np.int32)), \n", " (1, np.array([3, 4], dtype=np.int32))]\n", "\n", "my_scripted_py_policy = scripted_py_policy.ScriptedPyPolicy(\n", " time_step_spec=None, action_spec=action_spec, action_script=action_script)\n", "\n", "policy_state = my_scripted_py_policy.get_initial_state()\n", "time_step = None\n", "print('Executing scripted policy...')\n", "action_step = my_scripted_py_policy.action(time_step, policy_state)\n", "print(action_step)\n", "action_step= my_scripted_py_policy.action(time_step, action_step.state)\n", "print(action_step)\n", "action_step = my_scripted_py_policy.action(time_step, action_step.state)\n", "print(action_step)\n", "\n", "print('Resetting my_scripted_py_policy...')\n", "policy_state = my_scripted_py_policy.get_initial_state()\n", "action_step = my_scripted_py_policy.action(time_step, policy_state)\n", "print(action_step)" ] }, { "cell_type": "markdown", "metadata": { "id": "3Dz7HSTZl6aU" }, "source": [ "## TensorFlow 策略" ] }, { "cell_type": "markdown", "metadata": { "id": "LwcoBXqKl8Yb" }, "source": [ "TensorFlow 策略采用与 Python 策略相同的接口。让我们看几个例子。" ] }, { "cell_type": "markdown", "metadata": { "id": "3x8pDWEFrQ5C" }, "source": [ "### 示例 1:随机 TF 策略\n", "\n", "使用 RandomTFPolicy 可根据给定的离散/连续 `action_spec` 生成随机操作。将忽略输入 `time_step`。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "nZ3pe5G4rjrW" }, "outputs": [], "source": [ "action_spec = tensor_spec.BoundedTensorSpec(\n", " (2,), tf.float32, minimum=-1, maximum=3)\n", "input_tensor_spec = tensor_spec.TensorSpec((2,), tf.float32)\n", "time_step_spec = ts.time_step_spec(input_tensor_spec)\n", "\n", "my_random_tf_policy = random_tf_policy.RandomTFPolicy(\n", " action_spec=action_spec, time_step_spec=time_step_spec)\n", "observation = tf.ones(time_step_spec.observation.shape)\n", "time_step = ts.restart(observation)\n", "action_step = my_random_tf_policy.action(time_step)\n", "\n", "print('Action:')\n", "print(action_step.action)" ] }, { "cell_type": "markdown", "metadata": { "id": "GOBoWETprWCB" }, "source": [ "### 示例 2:参与者策略\n", "\n", "可以使用将 `time_steps` 映射到操作的网络或将 `time_steps` 映射到操作分布的网络创建参与者策略。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "2S94E5zQgge_" }, "source": [ "#### 使用操作网络" ] }, { "cell_type": "markdown", "metadata": { "id": "X2LM5STNgv1u" }, "source": [ "让我们按照以下方式定义网络:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "S2wFgzJFteQX" }, "outputs": [], "source": [ "class ActionNet(network.Network):\n", "\n", " def __init__(self, input_tensor_spec, output_tensor_spec):\n", " super(ActionNet, self).__init__(\n", " input_tensor_spec=input_tensor_spec,\n", " state_spec=(),\n", " name='ActionNet')\n", " self._output_tensor_spec = output_tensor_spec\n", " self._sub_layers = [\n", " tf.keras.layers.Dense(\n", " action_spec.shape.num_elements(), activation=tf.nn.tanh),\n", " ]\n", "\n", " def call(self, observations, step_type, network_state):\n", " del step_type\n", "\n", " output = tf.cast(observations, dtype=tf.float32)\n", " for layer in self._sub_layers:\n", " output = layer(output)\n", " actions = tf.reshape(output, [-1] + self._output_tensor_spec.shape.as_list())\n", "\n", " # Scale and shift actions to the correct range if necessary.\n", " return actions, network_state" ] }, { "cell_type": "markdown", "metadata": { "id": "k7fIn-ybVdC6" }, "source": [ "在 TensorFlow 中,大多数网络层都是针对批量运算而设计的,因此我们希望输入 time_step 得到批处理,网络的输出也得到批处理。另外,网络还负责在给定 action_spec 的正确范围内生成操作。常用方法是对最后一层使用 tanh 激活函数以在 [-1, 1] 区间内生成操作,然后将其缩放并移动到正确的范围作为输入 action_spec(例如,请参阅 `tf_agents/agents/ddpg/networks.actor_network()`)。\n", "\n", "现在,我们可以使用以上网络创建参与者策略。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "0UGmFTe7a5VQ" }, "outputs": [], "source": [ "input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)\n", "time_step_spec = ts.time_step_spec(input_tensor_spec)\n", "action_spec = tensor_spec.BoundedTensorSpec((3,),\n", " tf.float32,\n", " minimum=-1,\n", " maximum=1)\n", "\n", "action_net = ActionNet(input_tensor_spec, action_spec)\n", "\n", "my_actor_policy = actor_policy.ActorPolicy(\n", " time_step_spec=time_step_spec,\n", " action_spec=action_spec,\n", " actor_network=action_net)" ] }, { "cell_type": "markdown", "metadata": { "id": "xlmGPTAmfPK3" }, "source": [ "我们可以将其应用于遵循 time_step_spec 的任何 time_step 批次:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "fvsIsR0VfOA4" }, "outputs": [], "source": [ "batch_size = 2\n", "observations = tf.ones([2] + time_step_spec.observation.shape.as_list())\n", "\n", "time_step = ts.restart(observations, batch_size)\n", "\n", "action_step = my_actor_policy.action(time_step)\n", "print('Action:')\n", "print(action_step.action)\n", "\n", "distribution_step = my_actor_policy.distribution(time_step)\n", "print('Action distribution:')\n", "print(distribution_step.action)" ] }, { "cell_type": "markdown", "metadata": { "id": "lumtyhejZOXR" }, "source": [ "在以上示例中,我们是使用生成操作张量的操作网络来创建的策略。在这种情况下,`policy.distribution(time_step)` 是围绕 `policy.action(time_step)` 输出的确定性(增量)分布。生成随机策略的一种方法是在策略包装器内包装参与者策略,为操作增加噪声。另一种方法是使用操作分布网络而非操作网络来创建参与者策略,如下所示。" ] }, { "cell_type": "markdown", "metadata": { "id": "_eNrJ5gKgl3W" }, "source": [ "#### 使用操作分布网络" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "sSYzC9LobVsK" }, "outputs": [], "source": [ "class ActionDistributionNet(ActionNet):\n", "\n", " def call(self, observations, step_type, network_state):\n", " action_means, network_state = super(ActionDistributionNet, self).call(\n", " observations, step_type, network_state)\n", "\n", " action_std = tf.ones_like(action_means)\n", " return tfp.distributions.MultivariateNormalDiag(action_means, action_std), network_state\n", "\n", "\n", "action_distribution_net = ActionDistributionNet(input_tensor_spec, action_spec)\n", "\n", "my_actor_policy = actor_policy.ActorPolicy(\n", " time_step_spec=time_step_spec,\n", " action_spec=action_spec,\n", " actor_network=action_distribution_net)\n", "\n", "action_step = my_actor_policy.action(time_step)\n", "print('Action:')\n", "print(action_step.action)\n", "distribution_step = my_actor_policy.distribution(time_step)\n", "print('Action distribution:')\n", "print(distribution_step.action)" ] }, { "cell_type": "markdown", "metadata": { "id": "BzoNGJnlibtz" }, "source": [ "请注意,上例中的操作被裁剪到给定操作规范 [-1, 1] 区间内。这是因为 ActorPolicy clip 的构造函数参数的默认值为 True。将其设置为 False 将返回网络生成的未裁剪操作。 " ] }, { "cell_type": "markdown", "metadata": { "id": "PLj6A-5domNG" }, "source": [ "可以使用 GreedyPolicy 包装器将随机策略转换为确定性策略,该包装器选择 `stochastic_policy.distribution().mode()` 作为其操作,并选择围绕此贪心操作的确定性/增量分布作为其 `distribution()`。" ] }, { "cell_type": "markdown", "metadata": { "id": "4Xxzo2a7rZ7v" }, "source": [ "### 示例 3:Q 策略" ] }, { "cell_type": "markdown", "metadata": { "id": "79eGLqpOhQVp" }, "source": [ "Q 策略适用于 DQN 一类的代理,并基于预测每个离散操作 Q 值的 Q 网络。对于给定的时间步骤,Q 策略中的操作分布是使用 Q 值作为 logits 创建的分类分布。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Haakr2VvjqKC" }, "outputs": [], "source": [ "input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)\n", "time_step_spec = ts.time_step_spec(input_tensor_spec)\n", "action_spec = tensor_spec.BoundedTensorSpec((),\n", " tf.int32,\n", " minimum=0,\n", " maximum=2)\n", "num_actions = action_spec.maximum - action_spec.minimum + 1\n", "\n", "\n", "class QNetwork(network.Network):\n", "\n", " def __init__(self, input_tensor_spec, action_spec, num_actions=num_actions, name=None):\n", " super(QNetwork, self).__init__(\n", " input_tensor_spec=input_tensor_spec,\n", " state_spec=(),\n", " name=name)\n", " self._sub_layers = [\n", " tf.keras.layers.Dense(num_actions),\n", " ]\n", "\n", " def call(self, inputs, step_type=None, network_state=()):\n", " del step_type\n", " inputs = tf.cast(inputs, tf.float32)\n", " for layer in self._sub_layers:\n", " inputs = layer(inputs)\n", " return inputs, network_state\n", "\n", "\n", "batch_size = 2\n", "observation = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())\n", "time_steps = ts.restart(observation, batch_size=batch_size)\n", "\n", "my_q_network = QNetwork(\n", " input_tensor_spec=input_tensor_spec,\n", " action_spec=action_spec)\n", "my_q_policy = q_policy.QPolicy(\n", " time_step_spec, action_spec, q_network=my_q_network)\n", "action_step = my_q_policy.action(time_steps)\n", "distribution_step = my_q_policy.distribution(time_steps)\n", "\n", "print('Action:')\n", "print(action_step.action)\n", "\n", "print('Action distribution:')\n", "print(distribution_step.action)" ] }, { "cell_type": "markdown", "metadata": { "id": "Xpu9m6mvqJY-" }, "source": [ "## 策略包装器" ] }, { "cell_type": "markdown", "metadata": { "id": "OfaUrqRAoigk" }, "source": [ "策略包装器可用于包装和修改给定策略,例如增加噪声。策略包装器是策略 (Python/TensorFlow) 的子类,因此可以像其他任何策略一样使用。 " ] }, { "cell_type": "markdown", "metadata": { "id": "-JJVVAALqVNQ" }, "source": [ "### 示例:贪心策略\n", "\n", "贪心包装器可用于包装实现 `distribution()` 的任何 TensorFlow 策略。`GreedyPolicy.action()` 将返回 `wrapped_policy.distribution().mode()`,而 `GreedyPolicy.distribution()` 是围绕 `GreedyPolicy.action()` 的确定性/增量分布:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "xsRPBeLZtXvu" }, "outputs": [], "source": [ "my_greedy_policy = greedy_policy.GreedyPolicy(my_q_policy)\n", "\n", "action_step = my_greedy_policy.action(time_steps)\n", "print('Action:')\n", "print(action_step.action)\n", "\n", "distribution_step = my_greedy_policy.distribution(time_steps)\n", "print('Action distribution:')\n", "print(distribution_step.action)" ] } ], "metadata": { "colab": { "collapsed_sections": [], "name": "3_policies_tutorial.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }