{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "nPjtEgqN4SjA" }, "source": [ "##### Copyright 2023 The TF-Agents Authors." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "form", "id": "6AZJOyCA4NpL" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "markdown", "metadata": { "id": "GdBl31Dqwomt" }, "source": [ "# 具有按臂特征的多臂老虎机教程" ] }, { "cell_type": "markdown", "metadata": { "id": "Y2vvG61d35bG" }, "source": [ "### 开始\n", "\n", "\n", " \n", " \n", " \n", " \n", "
在 TensorFlow.org 上查看 在 Google Colab 运行 在 Github 上查看源代码 下载笔记本
" ] }, { "cell_type": "markdown", "metadata": { "id": "ddRJNIiEwu9O" }, "source": [ "本教程将逐步指导您使用 TF-Agents 库来解决每个动作(臂)都具有自己的特征的上下文老虎机问题,例如通过特征(类型、发行年份等)表示的电影列表。" ] }, { "cell_type": "markdown", "metadata": { "id": "I6mUk-hZa3pB" }, "source": [ "### 前提条件\n", "\n", "我们假定读者对于 TF-Agents 的 Bandit 库有一定了解,特别是在阅读本教程之前已完成 [TF-Agents 中的多臂老虎机教程](https://github.com/tensorflow/agents/tree/master/docs/tutorials/bandits_tutorial.ipynb)。" ] }, { "cell_type": "markdown", "metadata": { "id": "kscmAIA5xtJW" }, "source": [ "## 具有臂特征的多臂老虎机\n", "\n", "在“经典”上下文多臂老虎机环境中,代理在每个时间步都会接收一个上下文向量(也称观测值),并且需要从一组有限的带编号动作(臂)中进行选择以最大化其累积奖励。" ] }, { "cell_type": "markdown", "metadata": { "id": "BDkno4bQ1vPE" }, "source": [ "现在请考虑一种场景:代理向用户推荐下一部要观看的电影。每次需要做出决定时,代理都会接收一些用户相关信息(观影历史记录、类型偏好等)作为上下文,以及可供选择的电影列表。\n", "\n", "我们可以尝试通过将用户信息作为上下文来分析这个问题,老虎机臂将为 `movie_1, movie_2, ..., movie_K`,但这种方式有多个缺点:\n", "\n", "- 动作的数量必须是系统中的电影总数,而添加新电影将非常麻烦。\n", "- 代理必须针对每一部电影学习一个模型。\n", "- 不会考虑到电影之间的相似度。\n", "\n", "与其对电影进行编号,我们不如使用更加直观的方式:我们可以用一组特征来表示电影,包括类型、时长、演员、评分、年份等。这种方式具有多个优点:\n", "\n", "- 在不同电影之间进行泛化。\n", "- 代理仅学习一种使用用户和电影特征对奖励进行建模的奖励函数。\n", "- 易于从系统中移除电影或在系统中引入新电影。\n", "\n", "在这种新的环境中,每个时间步的动作数量甚至不必相同。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "hMXxIHXNRP5_" }, "source": [ "## TF-Agents 中的按臂老虎机\n", "\n", "开发的 TF-Agents Bandit 套件同样可用于按臂案例。它提供了按臂环境,并且大多数策略和代理都可以在按臂模式下运行。\n", "\n", "在我们深入编写示例之前,我们需要导入必要内容。" ] }, { "cell_type": "markdown", "metadata": { "id": "Cl5_CCIWSFvn" }, "source": [ "### 安装" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "QxiNIm5XSIIp" }, "outputs": [], "source": [ "!pip install tf-agents" ] }, { "cell_type": "markdown", "metadata": { "id": "EDKNu5JTSDmf" }, "source": [ "### 导入" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "BbQXsoeKR2ui" }, "outputs": [], "source": [ "import functools\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import tensorflow as tf\n", "\n", "from tf_agents.bandits.agents import lin_ucb_agent\n", "from tf_agents.bandits.environments import stationary_stochastic_per_arm_py_environment as p_a_env\n", "from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics\n", "from tf_agents.drivers import dynamic_step_driver\n", "from tf_agents.environments import tf_py_environment\n", "from tf_agents.replay_buffers import tf_uniform_replay_buffer\n", "from tf_agents.specs import tensor_spec\n", "from tf_agents.trajectories import time_step as ts\n", "\n", "nest = tf.nest" ] }, { "cell_type": "markdown", "metadata": { "id": "Q4cVyq3JMM7Z" }, "source": [ "### 参数 -- 请随意调整" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "bfna8xm0MSCn" }, "outputs": [], "source": [ "# The dimension of the global features.\n", "GLOBAL_DIM = 40 #@param {type:\"integer\"}\n", "# The elements of the global feature will be integers in [-GLOBAL_BOUND, GLOBAL_BOUND).\n", "GLOBAL_BOUND = 10 #@param {type:\"integer\"}\n", "# The dimension of the per-arm features.\n", "PER_ARM_DIM = 50 #@param {type:\"integer\"}\n", "# The elements of the PER-ARM feature will be integers in [-PER_ARM_BOUND, PER_ARM_BOUND).\n", "PER_ARM_BOUND = 6 #@param {type:\"integer\"}\n", "# The variance of the Gaussian distribution that generates the rewards.\n", "VARIANCE = 100.0 #@param {type: \"number\"}\n", "# The elements of the linear reward parameter will be integers in [-PARAM_BOUND, PARAM_BOUND).\n", "PARAM_BOUND = 10 #@param {type: \"integer\"}\n", "\n", "NUM_ACTIONS = 70 #@param {type:\"integer\"}\n", "BATCH_SIZE = 20 #@param {type:\"integer\"}\n", "\n", "# Parameter for linear reward function acting on the\n", "# concatenation of global and per-arm features.\n", "reward_param = list(np.random.randint(\n", " -PARAM_BOUND, PARAM_BOUND, [GLOBAL_DIM + PER_ARM_DIM]))" ] }, { "cell_type": "markdown", "metadata": { "id": "N-mEz1HvRIBC" }, "source": [ "### 简单的按臂环境\n", "\n", "我们在另一个[教程](https://github.com/tensorflow/agents/tree/master/docs/tutorials/bandits_tutorial.ipynb)中讲解过平稳随机环境,而它也具有相对应的按臂平稳随机环境。\n", "\n", "要初始化按臂环境,必须定义函数来生成以下内容:\n", "\n", "- *全局和按臂特征*:这些函数没有输入参数,并会在调用时生成单个(全局或按臂)特征向量。\n", "- *奖励*:此函数会将全局和按臂特征向量的串联作为参数,并生成奖励。基本上,这是代理需要“猜测”的函数。这里值得注意的是,在按臂案例中,奖励函数对于每个老虎机臂都是相同的。这是与经典老虎机案例的根本区别,在经典老虎机案例中,代理必须针对每个老虎机臂独立估计奖励函数。\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "kfTa5Y4ZYjhO" }, "outputs": [], "source": [ "def global_context_sampling_fn():\n", " \"\"\"This function generates a single global observation vector.\"\"\"\n", " return np.random.randint(\n", " -GLOBAL_BOUND, GLOBAL_BOUND, [GLOBAL_DIM]).astype(np.float32)\n", "\n", "def per_arm_context_sampling_fn():\n", " \"\"\"\"This function generates a single per-arm observation vector.\"\"\"\n", " return np.random.randint(\n", " -PER_ARM_BOUND, PER_ARM_BOUND, [PER_ARM_DIM]).astype(np.float32)\n", "\n", "def linear_normal_reward_fn(x):\n", " \"\"\"This function generates a reward from the concatenated global and per-arm observations.\"\"\"\n", " mu = np.dot(x, reward_param)\n", " return np.random.normal(mu, VARIANCE)" ] }, { "cell_type": "markdown", "metadata": { "id": "P2vpRPhheTo5" }, "source": [ "现在,我们可以初始化我们的环境了。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Ny-yikfQQi9l" }, "outputs": [], "source": [ "per_arm_py_env = p_a_env.StationaryStochasticPerArmPyEnvironment(\n", " global_context_sampling_fn,\n", " per_arm_context_sampling_fn,\n", " NUM_ACTIONS,\n", " linear_normal_reward_fn,\n", " batch_size=BATCH_SIZE\n", ")\n", "per_arm_tf_env = tf_py_environment.TFPyEnvironment(per_arm_py_env)" ] }, { "cell_type": "markdown", "metadata": { "id": "iIzFl8HiAIxg" }, "source": [ "下面我们可以检查此环境能够生成哪些内容。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "J8ZVqMU5AOzs" }, "outputs": [], "source": [ "print('observation spec: ', per_arm_tf_env.observation_spec())\n", "print('\\nAn observation: ', per_arm_tf_env.reset().observation)\n", "\n", "action = tf.zeros(BATCH_SIZE, dtype=tf.int32)\n", "time_step = per_arm_tf_env.step(action)\n", "print('\\nRewards after taking an action: ', time_step.reward)" ] }, { "cell_type": "markdown", "metadata": { "id": "eIlCGssRAzIr" }, "source": [ "我们看到观测值规范是一个包含两个元素的字典:\n", "\n", "- 一个包含键 `'global'`:这是全局上下文部分,其形状与参数 `GLOBAL_DIM` 匹配。\n", "- 一个包含键 `'per_arm'`:这是按臂上下文,其形状为 `[NUM_ACTIONS, PER_ARM_DIM]`。此部分为一个时间步中每个老虎机臂的臂特征占位符。\n" ] }, { "cell_type": "markdown", "metadata": { "id": "eTpWbNVeS6ci" }, "source": [ "### LinUCB 代理" ] }, { "cell_type": "markdown", "metadata": { "id": "q83tgZR9LLUx" }, "source": [ "LinUCB 代理可实现同名的 Bandit 算法,能够估计线性奖励函数的参数,同时会在估计周围保持一个置信椭圆。代理会选择具有最高估计期望奖励的臂,假定参数位于置信椭圆内。\n", "\n", "创建代理需要了解观测值和动作规范。定义代理时,我们要将布尔参数 `accepts_per_arm_features` 设置为 `True`。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MqzA9Zi0Q2No" }, "outputs": [], "source": [ "observation_spec = per_arm_tf_env.observation_spec()\n", "time_step_spec = ts.time_step_spec(observation_spec)\n", "action_spec = tensor_spec.BoundedTensorSpec(\n", " dtype=tf.int32, shape=(), minimum=0, maximum=NUM_ACTIONS - 1)\n", "\n", "agent = lin_ucb_agent.LinearUCBAgent(time_step_spec=time_step_spec,\n", " action_spec=action_spec,\n", " accepts_per_arm_features=True)" ] }, { "cell_type": "markdown", "metadata": { "id": "MaQlv1lpC-oc" }, "source": [ "### 训练数据流\n", "\n", "本部分将简要介绍按臂特征从策略到训练的机制。您可随意跳到下一部分(定义后悔值指标),如有兴趣可稍后回来阅读。" ] }, { "cell_type": "markdown", "metadata": { "id": "gUSy2IFK5NpU" }, "source": [ "首先,让我们看一下代理中的数据规范。代理的 `training_data_spec` 特性用于指定训练数据应具有哪些元素和什么结构。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "YQRZd43o5M0j" }, "outputs": [], "source": [ "print('training data spec: ', agent.training_data_spec)" ] }, { "cell_type": "markdown", "metadata": { "id": "UyEFAHUg-m9V" }, "source": [ "如果我们仔细查看规范的 `observation` 部分,我们会发现它并不包含按臂特征!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "lTBR1vgG-2AM" }, "outputs": [], "source": [ "print('observation spec in training: ', agent.training_data_spec.observation)" ] }, { "cell_type": "markdown", "metadata": { "id": "eDtmUgeJA_DN" }, "source": [ "按臂特征发生了什么?要回答这个问题,我们首先要注意到,LinUCB 代理进行训练时,它并不需要**所有**老虎机臂的按臂特征,而是只需要**所选**老虎机臂的按臂特征。因此,有道理丢弃形状为 `[BATCH_SIZE, NUM_ACTIONS, PER_ARM_DIM]` 的张量,因为它非常浪费资源,尤其是在动作数量较大的情况下。\n", "\n", "但是,所选老虎机臂的按臂特征必须位于某个位置!为此,我们要确保 LinUCB 策略将所选老虎机臂的特征存储在训练数据的 `policy_info` 字段中:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "C0fHvLm0Cpq9" }, "outputs": [], "source": [ "print('chosen arm features: ', agent.training_data_spec.policy_info.chosen_arm_features)" ] }, { "cell_type": "markdown", "metadata": { "id": "4psGJKGIDVgN" }, "source": [ "我们从形状看出,`chosen_arm_features` 字段只有一个老虎机臂的特征向量,它将是所选老虎机臂。请注意,正如我们在查看训练数据规范时所见,`policy_info` 以及随后的 `chosen_arm_features` 是训练数据的一部分,因此在训练时可用。" ] }, { "cell_type": "markdown", "metadata": { "id": "ancYxxEHc-6Q" }, "source": [ "### 定义后悔值指标" ] }, { "cell_type": "markdown", "metadata": { "id": "4j2RTrBfzVJQ" }, "source": [ "在开始训练循环之前,我们定义了一些效用函数来帮助计算代理的后悔值。这些函数有助于在给定一组动作(由其臂特征给出)和对代理隐藏的线性参数的情况下确定最佳预期奖励。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "N1_kJAOS6VPo" }, "outputs": [], "source": [ "def _all_rewards(observation, hidden_param):\n", " \"\"\"Outputs rewards for all actions, given an observation.\"\"\"\n", " hidden_param = tf.cast(hidden_param, dtype=tf.float32)\n", " global_obs = observation['global']\n", " per_arm_obs = observation['per_arm']\n", " num_actions = tf.shape(per_arm_obs)[1]\n", " tiled_global = tf.tile(\n", " tf.expand_dims(global_obs, axis=1), [1, num_actions, 1])\n", " concatenated = tf.concat([tiled_global, per_arm_obs], axis=-1)\n", " rewards = tf.linalg.matvec(concatenated, hidden_param)\n", " return rewards\n", "\n", "def optimal_reward(observation):\n", " \"\"\"Outputs the maximum expected reward for every element in the batch.\"\"\"\n", " return tf.reduce_max(_all_rewards(observation, reward_param), axis=1)\n", "\n", "regret_metric = tf_bandit_metrics.RegretMetric(optimal_reward)" ] }, { "cell_type": "markdown", "metadata": { "id": "i06WDbjrUSog" }, "source": [ "现在我们已准备就绪,可以开始我们的老虎机训练循环了。下面的驱动器负责使用策略选择动作,将所选动作的奖励存储在重播缓冲区中,计算预定义的后悔值指标,以及执行代理的训练步。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "j2Iimtmkzs0-" }, "outputs": [], "source": [ "num_iterations = 20 # @param\n", "steps_per_loop = 1 # @param\n", "\n", "replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(\n", " data_spec=agent.policy.trajectory_spec,\n", " batch_size=BATCH_SIZE,\n", " max_length=steps_per_loop)\n", "\n", "observers = [replay_buffer.add_batch, regret_metric]\n", "\n", "driver = dynamic_step_driver.DynamicStepDriver(\n", " env=per_arm_tf_env,\n", " policy=agent.collect_policy,\n", " num_steps=steps_per_loop * BATCH_SIZE,\n", " observers=observers)\n", "\n", "regret_values = []\n", "\n", "for _ in range(num_iterations):\n", " driver.run()\n", " loss_info = agent.train(replay_buffer.gather_all())\n", " replay_buffer.clear()\n", " regret_values.append(regret_metric.result())\n" ] }, { "cell_type": "markdown", "metadata": { "id": "yG5VMgSlUqYS" }, "source": [ "现在让我们看看结果。如果所做工作全部正确,代理将能够有效估计线性奖励函数,因此策略可以选择预期奖励接近最优值的动作。我们上面定义的后悔值指标可以表明这点,该指标逐渐下降并趋近于零。" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "p4DOYwhMSUVh" }, "outputs": [], "source": [ "plt.plot(regret_values)\n", "plt.title('Regret of LinUCB on the Linear per-arm environment')\n", "plt.xlabel('Number of Iterations')\n", "_ = plt.ylabel('Average Regret')" ] }, { "cell_type": "markdown", "metadata": { "id": "t7ZgHgOx5Ojq" }, "source": [ "### 后续步骤\n", "\n", "我们的代码库中[实现](https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/agents/examples/v2/train_eval_per_arm_stationary_linear.py)了上面的示例,您也可以选择其他代理,包括[神经 epsilon 贪心算法代理](https://github.com/tensorflow/agents/blob/master/tf_agents/bandits/agents/neural_epsilon_greedy_agent.py)。" ] } ], "metadata": { "colab": { "collapsed_sections": [ "t7ZgHgOx5Ojq" ], "name": "per_arm_bandits_tutorial.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "Python 3", "name": "python3" } }, "nbformat": 4, "nbformat_minor": 0 }