{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nPjtEgqN4SjA"
      },
      "source": [
        "##### Copyright 2023 The TF-Agents Authors."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "cellView": "form",
        "id": "6AZJOyCA4NpL"
      },
      "outputs": [],
      "source": [
        "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GdBl31Dqwomt"
      },
      "source": [
        "# 具有按臂特征的多臂老虎机教程"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Y2vvG61d35bG"
      },
      "source": [
        "### 开始\n",
        "\n",
        "<table class=\"tfo-notebook-buttons\" align=\"left\">\n",
        "  <td>     <a target=\"_blank\" href=\"https://tensorflow.google.cn/agents/tutorials/per_arm_bandits_tutorial\"><img src=\"https://tensorflow.google.cn/images/tf_logo_32px.png\">在 TensorFlow.org 上查看</a>   </td>\n",
        "  <td>     <a target=\"_blank\" href=\"https://colab.research.google.com/github/tensorflow/docs-l10n/blob/master/site/zh-cn/agents/tutorials/per_arm_bandits_tutorial.ipynb\"><img src=\"https://tensorflow.google.cn/images/colab_logo_32px.png\">在 Google Colab 运行</a> </td>\n",
        "  <td>     <a target=\"_blank\" href=\"https://github.com/tensorflow/docs-l10n/blob/master/site/zh-cn/agents/tutorials/per_arm_bandits_tutorial.ipynb\"><img src=\"https://tensorflow.google.cn/images/GitHub-Mark-32px.png\">在 Github 上查看源代码</a> </td>\n",
        "  <td>     <a href=\"https://storage.googleapis.com/tensorflow_docs/docs-l10n/site/zh-cn/agents/tutorials/per_arm_bandits_tutorial.ipynb\"><img src=\"https://tensorflow.google.cn/images/download_logo_32px.png\">下载笔记本</a>   </td>\n",
        "</table>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ddRJNIiEwu9O"
      },
      "source": [
        "本教程将逐步指导您使用 TF-Agents 库来解决每个动作（臂）都具有自己的特征的上下文老虎机问题，例如通过特征（类型、发行年份等）表示的电影列表。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I6mUk-hZa3pB"
      },
      "source": [
        "### 前提条件\n",
        "\n",
        "我们假定读者对于 TF-Agents 的 Bandit 库有一定了解，特别是在阅读本教程之前已完成 [TF-Agents 中的多臂老虎机教程](https://github.com/tensorflow/agents/tree/master/docs/tutorials/bandits_tutorial.ipynb)。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kscmAIA5xtJW"
      },
      "source": [
        "## 具有臂特征的多臂老虎机\n",
        "\n",
        "在“经典”上下文多臂老虎机环境中，代理在每个时间步都会接收一个上下文向量（也称观测值），并且需要从一组有限的带编号动作（臂）中进行选择以最大化其累积奖励。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "BDkno4bQ1vPE"
      },
      "source": [
        "现在请考虑一种场景：代理向用户推荐下一部要观看的电影。每次需要做出决定时，代理都会接收一些用户相关信息（观影历史记录、类型偏好等）作为上下文，以及可供选择的电影列表。\n",
        "\n",
        "我们可以尝试通过将用户信息作为上下文来分析这个问题，老虎机臂将为 `movie_1, movie_2, ..., movie_K`，但这种方式有多个缺点：\n",
        "\n",
        "- 动作的数量必须是系统中的电影总数，而添加新电影将非常麻烦。\n",
        "- 代理必须针对每一部电影学习一个模型。\n",
        "- 不会考虑到电影之间的相似度。\n",
        "\n",
        "与其对电影进行编号，我们不如使用更加直观的方式：我们可以用一组特征来表示电影，包括类型、时长、演员、评分、年份等。这种方式具有多个优点：\n",
        "\n",
        "- 在不同电影之间进行泛化。\n",
        "- 代理仅学习一种使用用户和电影特征对奖励进行建模的奖励函数。\n",
        "- 易于从系统中移除电影或在系统中引入新电影。\n",
        "\n",
        "在这种新的环境中，每个时间步的动作数量甚至不必相同。\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hMXxIHXNRP5_"
      },
      "source": [
        "## TF-Agents 中的按臂老虎机\n",
        "\n",
        "开发的 TF-Agents Bandit 套件同样可用于按臂案例。它提供了按臂环境，并且大多数策略和代理都可以在按臂模式下运行。\n",
        "\n",
        "在我们深入编写示例之前，我们需要导入必要内容。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Cl5_CCIWSFvn"
      },
      "source": [
        "### 安装"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "QxiNIm5XSIIp"
      },
      "outputs": [],
      "source": [
        "!pip install tf-agents"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EDKNu5JTSDmf"
      },
      "source": [
        "### 导入"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "BbQXsoeKR2ui"
      },
      "outputs": [],
      "source": [
        "import functools\n",
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import tensorflow as tf\n",
        "\n",
        "from tf_agents.bandits.agents import lin_ucb_agent\n",
        "from tf_agents.bandits.environments import stationary_stochastic_per_arm_py_environment as p_a_env\n",
        "from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics\n",
        "from tf_agents.drivers import dynamic_step_driver\n",
        "from tf_agents.environments import tf_py_environment\n",
        "from tf_agents.replay_buffers import tf_uniform_replay_buffer\n",
        "from tf_agents.specs import tensor_spec\n",
        "from tf_agents.trajectories import time_step as ts\n",
        "\n",
        "nest = tf.nest"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Q4cVyq3JMM7Z"
      },
      "source": [
        "### 参数 -- 请随意调整"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "bfna8xm0MSCn"
      },
      "outputs": [],
      "source": [
        "# The dimension of the global features.\n",
        "GLOBAL_DIM = 40  #@param {type:\"integer\"}\n",
        "# The elements of the global feature will be integers in [-GLOBAL_BOUND, GLOBAL_BOUND).\n",
        "GLOBAL_BOUND = 10  #@param {type:\"integer\"}\n",
        "# The dimension of the per-arm features.\n",
        "PER_ARM_DIM = 50  #@param {type:\"integer\"}\n",
        "# The elements of the PER-ARM feature will be integers in [-PER_ARM_BOUND, PER_ARM_BOUND).\n",
        "PER_ARM_BOUND = 6  #@param {type:\"integer\"}\n",
        "# The variance of the Gaussian distribution that generates the rewards.\n",
        "VARIANCE = 100.0  #@param {type: \"number\"}\n",
        "# The elements of the linear reward parameter will be integers in [-PARAM_BOUND, PARAM_BOUND).\n",
        "PARAM_BOUND = 10  #@param {type: \"integer\"}\n",
        "\n",
        "NUM_ACTIONS = 70  #@param {type:\"integer\"}\n",
        "BATCH_SIZE = 20  #@param {type:\"integer\"}\n",
        "\n",
        "# Parameter for linear reward function acting on the\n",
        "# concatenation of global and per-arm features.\n",
        "reward_param = list(np.random.randint(\n",
        "      -PARAM_BOUND, PARAM_BOUND, [GLOBAL_DIM + PER_ARM_DIM]))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "N-mEz1HvRIBC"
      },
      "source": [
        "### 简单的按臂环境\n",
        "\n",
        "我们在另一个[教程](https://github.com/tensorflow/agents/tree/master/docs/tutorials/bandits_tutorial.ipynb)中讲解过平稳随机环境，而它也具有相对应的按臂平稳随机环境。\n",
        "\n",
        "要初始化按臂环境，必须定义函数来生成以下内容：\n",
        "\n",
        "- *全局和按臂特征*：这些函数没有输入参数，并会在调用时生成单个（全局或按臂）特征向量。\n",
        "- *奖励*：此函数会将全局和按臂特征向量的串联作为参数，并生成奖励。基本上，这是代理需要“猜测”的函数。这里值得注意的是，在按臂案例中，奖励函数对于每个老虎机臂都是相同的。这是与经典老虎机案例的根本区别，在经典老虎机案例中，代理必须针对每个老虎机臂独立估计奖励函数。\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "kfTa5Y4ZYjhO"
      },
      "outputs": [],
      "source": [
        "def global_context_sampling_fn():\n",
        "  \"\"\"This function generates a single global observation vector.\"\"\"\n",
        "  return np.random.randint(\n",
        "      -GLOBAL_BOUND, GLOBAL_BOUND, [GLOBAL_DIM]).astype(np.float32)\n",
        "\n",
        "def per_arm_context_sampling_fn():\n",
        "  \"\"\"\"This function generates a single per-arm observation vector.\"\"\"\n",
        "  return np.random.randint(\n",
        "      -PER_ARM_BOUND, PER_ARM_BOUND, [PER_ARM_DIM]).astype(np.float32)\n",
        "\n",
        "def linear_normal_reward_fn(x):\n",
        "  \"\"\"This function generates a reward from the concatenated global and per-arm observations.\"\"\"\n",
        "  mu = np.dot(x, reward_param)\n",
        "  return np.random.normal(mu, VARIANCE)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "P2vpRPhheTo5"
      },
      "source": [
        "现在，我们可以初始化我们的环境了。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "Ny-yikfQQi9l"
      },
      "outputs": [],
      "source": [
        "per_arm_py_env = p_a_env.StationaryStochasticPerArmPyEnvironment(\n",
        "    global_context_sampling_fn,\n",
        "    per_arm_context_sampling_fn,\n",
        "    NUM_ACTIONS,\n",
        "    linear_normal_reward_fn,\n",
        "    batch_size=BATCH_SIZE\n",
        ")\n",
        "per_arm_tf_env = tf_py_environment.TFPyEnvironment(per_arm_py_env)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iIzFl8HiAIxg"
      },
      "source": [
        "下面我们可以检查此环境能够生成哪些内容。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "J8ZVqMU5AOzs"
      },
      "outputs": [],
      "source": [
        "print('observation spec: ', per_arm_tf_env.observation_spec())\n",
        "print('\\nAn observation: ', per_arm_tf_env.reset().observation)\n",
        "\n",
        "action = tf.zeros(BATCH_SIZE, dtype=tf.int32)\n",
        "time_step = per_arm_tf_env.step(action)\n",
        "print('\\nRewards after taking an action: ', time_step.reward)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eIlCGssRAzIr"
      },
      "source": [
        "我们看到观测值规范是一个包含两个元素的字典：\n",
        "\n",
        "- 一个包含键 `'global'`：这是全局上下文部分，其形状与参数 `GLOBAL_DIM` 匹配。\n",
        "- 一个包含键 `'per_arm'`：这是按臂上下文，其形状为 `[NUM_ACTIONS, PER_ARM_DIM]`。此部分为一个时间步中每个老虎机臂的臂特征占位符。\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eTpWbNVeS6ci"
      },
      "source": [
        "### LinUCB 代理"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "q83tgZR9LLUx"
      },
      "source": [
        "LinUCB 代理可实现同名的 Bandit 算法，能够估计线性奖励函数的参数，同时会在估计周围保持一个置信椭圆。代理会选择具有最高估计期望奖励的臂，假定参数位于置信椭圆内。\n",
        "\n",
        "创建代理需要了解观测值和动作规范。定义代理时，我们要将布尔参数 `accepts_per_arm_features` 设置为 `True`。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "MqzA9Zi0Q2No"
      },
      "outputs": [],
      "source": [
        "observation_spec = per_arm_tf_env.observation_spec()\n",
        "time_step_spec = ts.time_step_spec(observation_spec)\n",
        "action_spec = tensor_spec.BoundedTensorSpec(\n",
        "    dtype=tf.int32, shape=(), minimum=0, maximum=NUM_ACTIONS - 1)\n",
        "\n",
        "agent = lin_ucb_agent.LinearUCBAgent(time_step_spec=time_step_spec,\n",
        "                                     action_spec=action_spec,\n",
        "                                     accepts_per_arm_features=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MaQlv1lpC-oc"
      },
      "source": [
        "### 训练数据流\n",
        "\n",
        "本部分将简要介绍按臂特征从策略到训练的机制。您可随意跳到下一部分（定义后悔值指标），如有兴趣可稍后回来阅读。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gUSy2IFK5NpU"
      },
      "source": [
        "首先，让我们看一下代理中的数据规范。代理的 `training_data_spec` 特性用于指定训练数据应具有哪些元素和什么结构。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "YQRZd43o5M0j"
      },
      "outputs": [],
      "source": [
        "print('training data spec: ', agent.training_data_spec)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "UyEFAHUg-m9V"
      },
      "source": [
        "如果我们仔细查看规范的 `observation` 部分，我们会发现它并不包含按臂特征！"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "lTBR1vgG-2AM"
      },
      "outputs": [],
      "source": [
        "print('observation spec in training: ', agent.training_data_spec.observation)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eDtmUgeJA_DN"
      },
      "source": [
        "按臂特征发生了什么？要回答这个问题，我们首先要注意到，LinUCB 代理进行训练时，它并不需要**所有**老虎机臂的按臂特征，而是只需要**所选**老虎机臂的按臂特征。因此，有道理丢弃形状为 `[BATCH_SIZE, NUM_ACTIONS, PER_ARM_DIM]` 的张量，因为它非常浪费资源，尤其是在动作数量较大的情况下。\n",
        "\n",
        "但是，所选老虎机臂的按臂特征必须位于某个位置！为此，我们要确保 LinUCB 策略将所选老虎机臂的特征存储在训练数据的 `policy_info` 字段中："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "C0fHvLm0Cpq9"
      },
      "outputs": [],
      "source": [
        "print('chosen arm features: ', agent.training_data_spec.policy_info.chosen_arm_features)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4psGJKGIDVgN"
      },
      "source": [
        "我们从形状看出，`chosen_arm_features` 字段只有一个老虎机臂的特征向量，它将是所选老虎机臂。请注意，正如我们在查看训练数据规范时所见，`policy_info` 以及随后的 `chosen_arm_features` 是训练数据的一部分，因此在训练时可用。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ancYxxEHc-6Q"
      },
      "source": [
        "### 定义后悔值指标"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4j2RTrBfzVJQ"
      },
      "source": [
        "在开始训练循环之前，我们定义了一些效用函数来帮助计算代理的后悔值。这些函数有助于在给定一组动作（由其臂特征给出）和对代理隐藏的线性参数的情况下确定最佳预期奖励。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "N1_kJAOS6VPo"
      },
      "outputs": [],
      "source": [
        "def _all_rewards(observation, hidden_param):\n",
        "  \"\"\"Outputs rewards for all actions, given an observation.\"\"\"\n",
        "  hidden_param = tf.cast(hidden_param, dtype=tf.float32)\n",
        "  global_obs = observation['global']\n",
        "  per_arm_obs = observation['per_arm']\n",
        "  num_actions = tf.shape(per_arm_obs)[1]\n",
        "  tiled_global = tf.tile(\n",
        "      tf.expand_dims(global_obs, axis=1), [1, num_actions, 1])\n",
        "  concatenated = tf.concat([tiled_global, per_arm_obs], axis=-1)\n",
        "  rewards = tf.linalg.matvec(concatenated, hidden_param)\n",
        "  return rewards\n",
        "\n",
        "def optimal_reward(observation):\n",
        "  \"\"\"Outputs the maximum expected reward for every element in the batch.\"\"\"\n",
        "  return tf.reduce_max(_all_rewards(observation, reward_param), axis=1)\n",
        "\n",
        "regret_metric = tf_bandit_metrics.RegretMetric(optimal_reward)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "i06WDbjrUSog"
      },
      "source": [
        "现在我们已准备就绪，可以开始我们的老虎机训练循环了。下面的驱动器负责使用策略选择动作，将所选动作的奖励存储在重播缓冲区中，计算预定义的后悔值指标，以及执行代理的训练步。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "j2Iimtmkzs0-"
      },
      "outputs": [],
      "source": [
        "num_iterations = 20 # @param\n",
        "steps_per_loop = 1 # @param\n",
        "\n",
        "replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(\n",
        "    data_spec=agent.policy.trajectory_spec,\n",
        "    batch_size=BATCH_SIZE,\n",
        "    max_length=steps_per_loop)\n",
        "\n",
        "observers = [replay_buffer.add_batch, regret_metric]\n",
        "\n",
        "driver = dynamic_step_driver.DynamicStepDriver(\n",
        "    env=per_arm_tf_env,\n",
        "    policy=agent.collect_policy,\n",
        "    num_steps=steps_per_loop * BATCH_SIZE,\n",
        "    observers=observers)\n",
        "\n",
        "regret_values = []\n",
        "\n",
        "for _ in range(num_iterations):\n",
        "  driver.run()\n",
        "  loss_info = agent.train(replay_buffer.gather_all())\n",
        "  replay_buffer.clear()\n",
        "  regret_values.append(regret_metric.result())\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yG5VMgSlUqYS"
      },
      "source": [
        "现在让我们看看结果。如果所做工作全部正确，代理将能够有效估计线性奖励函数，因此策略可以选择预期奖励接近最优值的动作。我们上面定义的后悔值指标可以表明这点，该指标逐渐下降并趋近于零。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "id": "p4DOYwhMSUVh"
      },
      "outputs": [],
      "source": [
        "plt.plot(regret_values)\n",
        "plt.title('Regret of LinUCB on the Linear per-arm environment')\n",
        "plt.xlabel('Number of Iterations')\n",
        "_ = plt.ylabel('Average Regret')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "t7ZgHgOx5Ojq"
      },
      "source": [
        "### 后续步骤\n",
        "\n",
        "我们的代码库中[实现](https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/agents/examples/v2/train_eval_per_arm_stationary_linear.py)了上面的示例，您也可以选择其他代理，包括[神经 epsilon 贪心算法代理](https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/agents/neural_epsilon_greedy_agent.py)。"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "t7ZgHgOx5Ojq"
      ],
      "name": "per_arm_bandits_tutorial.ipynb",
      "toc_visible": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}