{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1Pi_B2cvdBiW"
      },
      "source": [
        "##### Copyright 2023 The TF-Agents Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "nQnmcm0oI1Q-"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3NFuTvWVZG_B"
      },
      "source": [
        "# 策略\n",
        "\n",
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>     <a target=\"_blank\" href=\"https://tensorflow.google.cn/agents/tutorials/3_policies_tutorial\"><img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\">在 TensorFlow.org 上查看</a> </td>\n",
        "  <td>     <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/agents/tutorials/3_policies_tutorial.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\">在 Google Colab 运行</a> </td>\n",
        "  <td>     <a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/agents/tutorials/3_policies_tutorial.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\">在 Github 上查看源代码</a> </td>\n",
        "  <td>     <a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/agents/tutorials/3_policies_tutorial.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\">下载笔记本</a>   </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "31uij8nIo5bG"
      },
      "source": [
        "## 简介"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PqFn7q5bs3BF"
      },
      "source": [
        "在强化学习中，“策略”一词是指将环境的观测值映射到操作或操作分布。在 TF-Agents 中，环境的观测值包含在命名元组 `TimeStep('step_type', 'discount', 'reward', 'observation')` 内，而策略会将时间步骤映射到操作或操作分布。大多数策略使用 `timestep.observation`，某些策略会使用 `timestep.step_type`（例如，在有状态策略中，在片段开始时重置状态），但 `timestep.discount` 和 `timestep.reward` 通常会被忽略。\n",
        "\n",
        "策略与 TF-Agents 中其他组成部分的相关性如下。大多数策略都通过神经网络来计算时间步骤的操作和/或操作分布。代理可以包含一个或多个用于不同目的的策略，例如采用一个主要策略进行训练用以部署，并采用一个噪声策略进行数据收集。策略可以保存/恢复，并且可以在不依赖于代理的状态下用于数据收集、评估等工作。\n",
        "\n",
        "一些策略在 Tensorflow 中更容易编写（例如神经网络策略），而其他策略在 Python 中更容易编写（例如基于操作脚本的策略）。 因此，在 TF-Agent 中，Python 和 Tensorflow 策略均支持使用。此外，有时可能需要在 Python 环境中使用通过 TensorFlow 编写的策略，反之亦然（例如使用 TensorFlow 策略进行训练，但之后需要在生产 python 环境中进行部署）。为了简化这一步骤，我们提供了用于在 python 和 TensorFlow 策略之间进行转换的包装器。\n",
        "\n",
        "与策略有关的另一项有趣内容是策略包装器，它们能够以某种方式修改给定策略，例如添加特定类型的噪声、制作随机策略的贪心或 ϵ 贪心版本、随机混合多个策略等。 "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HdnG_TT_amWH"
      },
      "source": [
        "## 设置"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9Meq2nT_aquh"
      },
      "source": [
        "如果尚未安装 TF-Agents，请运行以下命令："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xsLTHlVdiZP3"
      },
      "outputs": [],
      "source": [
        "!pip install tf-agents"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sdvop99JlYSM"
      },
      "outputs": [],
      "source": [
        "from __future__ import absolute_import\n",
        "from __future__ import division\n",
        "from __future__ import print_function\n",
        "\n",
        "import abc\n",
        "import tensorflow as tf\n",
        "import tensorflow_probability as tfp\n",
        "import numpy as np\n",
        "\n",
        "from tf_agents.specs import array_spec\n",
        "from tf_agents.specs import tensor_spec\n",
        "from tf_agents.networks import network\n",
        "\n",
        "from tf_agents.policies import py_policy\n",
        "from tf_agents.policies import random_py_policy\n",
        "from tf_agents.policies import scripted_py_policy\n",
        "\n",
        "from tf_agents.policies import tf_policy\n",
        "from tf_agents.policies import random_tf_policy\n",
        "from tf_agents.policies import actor_policy\n",
        "from tf_agents.policies import q_policy\n",
        "from tf_agents.policies import greedy_policy\n",
        "\n",
        "from tf_agents.trajectories import time_step as ts"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NyXO5-Aalb-6"
      },
      "source": [
        "## Python 策略"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DOtUZ1hs02bu"
      },
      "source": [
        "Python 策略的接口在 `policies/py_policy.PyPolicy` 中进行定义。主要方法为：\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "4PqNEVls1uqc"
      },
      "outputs": [],
      "source": [
        "class Base(object):\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def __init__(self, time_step_spec, action_spec, policy_state_spec=()):\n",
        "    self._time_step_spec = time_step_spec\n",
        "    self._action_spec = action_spec\n",
        "    self._policy_state_spec = policy_state_spec\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def reset(self, policy_state=()):\n",
        "    # return initial_policy_state.\n",
        "    pass\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def action(self, time_step, policy_state=()):\n",
        "    # return a PolicyStep(action, state, info) named tuple.\n",
        "    pass\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def distribution(self, time_step, policy_state=()):\n",
        "    # Not implemented in python, only for TF policies.\n",
        "    pass\n",
        "\n",
        "  @abc.abstractmethod\n",
        "  def update(self, policy):\n",
        "    # update self to be similar to the input `policy`.\n",
        "    pass\n",
        "\n",
        "  @property\n",
        "  def time_step_spec(self):\n",
        "    return self._time_step_spec\n",
        "\n",
        "  @property\n",
        "  def action_spec(self):\n",
        "    return self._action_spec\n",
        "\n",
        "  @property\n",
        "  def policy_state_spec(self):\n",
        "    return self._policy_state_spec"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "16kyDKk65bka"
      },
      "source": [
        "最重要的方法为 `action(time_step)`，该方法可将包含环境观测值的 `time_step` 映射到包含以下特性的 PolicyStep 命名元组：\n",
        "\n",
        "- `action`：应用于环境的操作。\n",
        "- `state`：将被馈入下一个操作调用的策略状态（例如 RNN 状态）。\n",
        "- `info`：可选辅助信息，例如操作日志概率。\n",
        "\n",
        "`time_step_spec` 和 `action_spec` 分别为输入时间步骤规范和输出操作规范。策略还具有 `reset` 函数，通常用于重置有状态策略中的状态。`update(new_policy)` 函数可将 `self` 朝着 `new_policy` 更新。\n",
        "\n",
        "现在，让我们看看 Python 策略的两个示例。\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YCH1Hs_WlmDT"
      },
      "source": [
        "### 示例 1：随机 Python 策略"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lbnQ0BQ3_0N2"
      },
      "source": [
        "`RandomPyPolicy` 是 `PyPolicy` 的一个简单示例， 它可以为给定的离散/连续 action_spec 生成随机操作。将忽略输入 `time_step`。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QX8M4Nl-_0uu"
      },
      "outputs": [],
      "source": [
        "action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)\n",
        "my_random_py_policy = random_py_policy.RandomPyPolicy(time_step_spec=None,\n",
        "    action_spec=action_spec)\n",
        "time_step = None\n",
        "action_step = my_random_py_policy.action(time_step)\n",
        "print(action_step)\n",
        "action_step = my_random_py_policy.action(time_step)\n",
        "print(action_step)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "B8WrFOR1lz31"
      },
      "source": [
        "### 示例 2：脚本化 Python 策略"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "AJ0Br1lGBnTT"
      },
      "source": [
        "脚本化策略可回放以 `(num_repeats, action)` 元组列表形式呈现的操作脚本。每次调用 `action` 函数时，它都会返回列表内的下一个操作并使其完成指定的重复次数，然后再继续执行列表中的下一个操作。调用 `reset` 方法即可从列表开头重新开始执行。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "_mZ244m4BUYv"
      },
      "outputs": [],
      "source": [
        "action_spec = array_spec.BoundedArraySpec((2,), np.int32, -10, 10)\n",
        "action_script = [(1, np.array([5, 2], dtype=np.int32)), \n",
        "                 (0, np.array([0, 0], dtype=np.int32)), # Setting `num_repeats` to 0 will skip this action.\n",
        "                 (2, np.array([1, 2], dtype=np.int32)), \n",
        "                 (1, np.array([3, 4], dtype=np.int32))]\n",
        "\n",
        "my_scripted_py_policy = scripted_py_policy.ScriptedPyPolicy(\n",
        "    time_step_spec=None, action_spec=action_spec, action_script=action_script)\n",
        "\n",
        "policy_state = my_scripted_py_policy.get_initial_state()\n",
        "time_step = None\n",
        "print('Executing scripted policy...')\n",
        "action_step = my_scripted_py_policy.action(time_step, policy_state)\n",
        "print(action_step)\n",
        "action_step= my_scripted_py_policy.action(time_step, action_step.state)\n",
        "print(action_step)\n",
        "action_step = my_scripted_py_policy.action(time_step, action_step.state)\n",
        "print(action_step)\n",
        "\n",
        "print('Resetting my_scripted_py_policy...')\n",
        "policy_state = my_scripted_py_policy.get_initial_state()\n",
        "action_step = my_scripted_py_policy.action(time_step, policy_state)\n",
        "print(action_step)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3Dz7HSTZl6aU"
      },
      "source": [
        "## TensorFlow 策略"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LwcoBXqKl8Yb"
      },
      "source": [
        "TensorFlow 策略采用与 Python 策略相同的接口。让我们看几个例子。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "3x8pDWEFrQ5C"
      },
      "source": [
        "### 示例 1：随机 TF 策略\n",
        "\n",
        "使用 RandomTFPolicy 可根据给定的离散/连续 `action_spec` 生成随机操作。将忽略输入 `time_step`。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "nZ3pe5G4rjrW"
      },
      "outputs": [],
      "source": [
        "action_spec = tensor_spec.BoundedTensorSpec(\n",
        "    (2,), tf.float32, minimum=-1, maximum=3)\n",
        "input_tensor_spec = tensor_spec.TensorSpec((2,), tf.float32)\n",
        "time_step_spec = ts.time_step_spec(input_tensor_spec)\n",
        "\n",
        "my_random_tf_policy = random_tf_policy.RandomTFPolicy(\n",
        "    action_spec=action_spec, time_step_spec=time_step_spec)\n",
        "observation = tf.ones(time_step_spec.observation.shape)\n",
        "time_step = ts.restart(observation)\n",
        "action_step = my_random_tf_policy.action(time_step)\n",
        "\n",
        "print('Action:')\n",
        "print(action_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GOBoWETprWCB"
      },
      "source": [
        "### 示例 2：参与者策略\n",
        "\n",
        "可以使用将 `time_steps` 映射到操作的网络或将 `time_steps` 映射到操作分布的网络创建参与者策略。\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2S94E5zQgge_"
      },
      "source": [
        "#### 使用操作网络"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X2LM5STNgv1u"
      },
      "source": [
        "让我们按照以下方式定义网络："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "S2wFgzJFteQX"
      },
      "outputs": [],
      "source": [
        "class ActionNet(network.Network):\n",
        "\n",
        "  def __init__(self, input_tensor_spec, output_tensor_spec):\n",
        "    super(ActionNet, self).__init__(\n",
        "        input_tensor_spec=input_tensor_spec,\n",
        "        state_spec=(),\n",
        "        name='ActionNet')\n",
        "    self._output_tensor_spec = output_tensor_spec\n",
        "    self._sub_layers = [\n",
        "        tf.keras.layers.Dense(\n",
        "            action_spec.shape.num_elements(), activation=tf.nn.tanh),\n",
        "    ]\n",
        "\n",
        "  def call(self, observations, step_type, network_state):\n",
        "    del step_type\n",
        "\n",
        "    output = tf.cast(observations, dtype=tf.float32)\n",
        "    for layer in self._sub_layers:\n",
        "      output = layer(output)\n",
        "    actions = tf.reshape(output, [-1] + self._output_tensor_spec.shape.as_list())\n",
        "\n",
        "    # Scale and shift actions to the correct range if necessary.\n",
        "    return actions, network_state"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "k7fIn-ybVdC6"
      },
      "source": [
        "在 TensorFlow 中，大多数网络层都是针对批量运算而设计的，因此我们希望输入 time_step 得到批处理，网络的输出也得到批处理。另外，网络还负责在给定 action_spec 的正确范围内生成操作。常用方法是对最后一层使用 tanh 激活函数以在 [-1, 1] 区间内生成操作，然后将其缩放并移动到正确的范围作为输入 action_spec（例如，请参阅 `tf_agents/agents/ddpg/networks.actor_network()`）。\n",
        "\n",
        "现在，我们可以使用以上网络创建参与者策略。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "0UGmFTe7a5VQ"
      },
      "outputs": [],
      "source": [
        "input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)\n",
        "time_step_spec = ts.time_step_spec(input_tensor_spec)\n",
        "action_spec = tensor_spec.BoundedTensorSpec((3,),\n",
        "                                            tf.float32,\n",
        "                                            minimum=-1,\n",
        "                                            maximum=1)\n",
        "\n",
        "action_net = ActionNet(input_tensor_spec, action_spec)\n",
        "\n",
        "my_actor_policy = actor_policy.ActorPolicy(\n",
        "    time_step_spec=time_step_spec,\n",
        "    action_spec=action_spec,\n",
        "    actor_network=action_net)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xlmGPTAmfPK3"
      },
      "source": [
        "我们可以将其应用于遵循 time_step_spec 的任何 time_step 批次："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "fvsIsR0VfOA4"
      },
      "outputs": [],
      "source": [
        "batch_size = 2\n",
        "observations = tf.ones([2] + time_step_spec.observation.shape.as_list())\n",
        "\n",
        "time_step = ts.restart(observations, batch_size)\n",
        "\n",
        "action_step = my_actor_policy.action(time_step)\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "\n",
        "distribution_step = my_actor_policy.distribution(time_step)\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "lumtyhejZOXR"
      },
      "source": [
        "在以上示例中，我们是使用生成操作张量的操作网络来创建的策略。在这种情况下，`policy.distribution(time_step)` 是围绕 `policy.action(time_step)` 输出的确定性（增量）分布。生成随机策略的一种方法是在策略包装器内包装参与者策略，为操作增加噪声。另一种方法是使用操作分布网络而非操作网络来创建参与者策略，如下所示。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_eNrJ5gKgl3W"
      },
      "source": [
        "#### 使用操作分布网络"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "sSYzC9LobVsK"
      },
      "outputs": [],
      "source": [
        "class ActionDistributionNet(ActionNet):\n",
        "\n",
        "  def call(self, observations, step_type, network_state):\n",
        "    action_means, network_state = super(ActionDistributionNet, self).call(\n",
        "        observations, step_type, network_state)\n",
        "\n",
        "    action_std = tf.ones_like(action_means)\n",
        "    return tfp.distributions.MultivariateNormalDiag(action_means, action_std), network_state\n",
        "\n",
        "\n",
        "action_distribution_net = ActionDistributionNet(input_tensor_spec, action_spec)\n",
        "\n",
        "my_actor_policy = actor_policy.ActorPolicy(\n",
        "    time_step_spec=time_step_spec,\n",
        "    action_spec=action_spec,\n",
        "    actor_network=action_distribution_net)\n",
        "\n",
        "action_step = my_actor_policy.action(time_step)\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "distribution_step = my_actor_policy.distribution(time_step)\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BzoNGJnlibtz"
      },
      "source": [
        "请注意，上例中的操作被裁剪到给定操作规范 [-1, 1] 区间内。这是因为 ActorPolicy clip 的构造函数参数的默认值为 True。将其设置为 False 将返回网络生成的未裁剪操作。 "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PLj6A-5domNG"
      },
      "source": [
        "可以使用 GreedyPolicy 包装器将随机策略转换为确定性策略，该包装器选择 `stochastic_policy.distribution().mode()` 作为其操作，并选择围绕此贪心操作的确定性/增量分布作为其 `distribution()`。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4Xxzo2a7rZ7v"
      },
      "source": [
        "### 示例 3：Q 策略"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "79eGLqpOhQVp"
      },
      "source": [
        "Q 策略适用于 DQN 一类的代理，并基于预测每个离散操作 Q 值的 Q 网络。对于给定的时间步骤，Q 策略中的操作分布是使用 Q 值作为 logits 创建的分类分布。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Haakr2VvjqKC"
      },
      "outputs": [],
      "source": [
        "input_tensor_spec = tensor_spec.TensorSpec((4,), tf.float32)\n",
        "time_step_spec = ts.time_step_spec(input_tensor_spec)\n",
        "action_spec = tensor_spec.BoundedTensorSpec((),\n",
        "                                            tf.int32,\n",
        "                                            minimum=0,\n",
        "                                            maximum=2)\n",
        "num_actions = action_spec.maximum - action_spec.minimum + 1\n",
        "\n",
        "\n",
        "class QNetwork(network.Network):\n",
        "\n",
        "  def __init__(self, input_tensor_spec, action_spec, num_actions=num_actions, name=None):\n",
        "    super(QNetwork, self).__init__(\n",
        "        input_tensor_spec=input_tensor_spec,\n",
        "        state_spec=(),\n",
        "        name=name)\n",
        "    self._sub_layers = [\n",
        "        tf.keras.layers.Dense(num_actions),\n",
        "    ]\n",
        "\n",
        "  def call(self, inputs, step_type=None, network_state=()):\n",
        "    del step_type\n",
        "    inputs = tf.cast(inputs, tf.float32)\n",
        "    for layer in self._sub_layers:\n",
        "      inputs = layer(inputs)\n",
        "    return inputs, network_state\n",
        "\n",
        "\n",
        "batch_size = 2\n",
        "observation = tf.ones([batch_size] + time_step_spec.observation.shape.as_list())\n",
        "time_steps = ts.restart(observation, batch_size=batch_size)\n",
        "\n",
        "my_q_network = QNetwork(\n",
        "    input_tensor_spec=input_tensor_spec,\n",
        "    action_spec=action_spec)\n",
        "my_q_policy = q_policy.QPolicy(\n",
        "    time_step_spec, action_spec, q_network=my_q_network)\n",
        "action_step = my_q_policy.action(time_steps)\n",
        "distribution_step = my_q_policy.distribution(time_steps)\n",
        "\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Xpu9m6mvqJY-"
      },
      "source": [
        "## 策略包装器"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OfaUrqRAoigk"
      },
      "source": [
        "策略包装器可用于包装和修改给定策略，例如增加噪声。策略包装器是策略 (Python/TensorFlow) 的子类，因此可以像其他任何策略一样使用。 "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-JJVVAALqVNQ"
      },
      "source": [
        "### 示例：贪心策略\n",
        "\n",
        "贪心包装器可用于包装实现 `distribution()` 的任何 TensorFlow 策略。`GreedyPolicy.action()` 将返回 `wrapped_policy.distribution().mode()`，而 `GreedyPolicy.distribution()` 是围绕 `GreedyPolicy.action()` 的确定性/增量分布："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "xsRPBeLZtXvu"
      },
      "outputs": [],
      "source": [
        "my_greedy_policy = greedy_policy.GreedyPolicy(my_q_policy)\n",
        "\n",
        "action_step = my_greedy_policy.action(time_steps)\n",
        "print('Action:')\n",
        "print(action_step.action)\n",
        "\n",
        "distribution_step = my_greedy_policy.distribution(time_steps)\n",
        "print('Action distribution:')\n",
        "print(distribution_step.action)"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [],
      "name": "3_policies_tutorial.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}