Copyright 2023 The TF-Agents Authors.


#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

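A Tutorial on Multi-Armed Bandits with Per-Arm Features

Get Started

This tutorial is a step-by-step guide to using the TF-Agents library for contextual bandit problems in which each action (arm) has its own features, such as a list of movies represented by features (genre, year of release, and so on).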

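Prerequisites

We assume the reader is somewhat familiar with the Bandit library of TF-Agents and, in particular, has worked through the tutorial for Bandits in TF-Agents before reading this one.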

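Multi-Armed Bandits with Arm Features

In the "classic" contextual multi-armed bandit setting, an agent receives a context vector (also called an observation) at every time step and has to choose from a finite set of numbered actions (arms) so as to maximize its cumulative reward.

Now consider a scenario in which an agent recommends the next movie for a user to watch. Every time a decision has to be made, the agent receives some information about the user (watch history, genre preferences, etc.) as context, together with the list of movies to choose from.

We could try to formulate this problem with the user information as the context and movie_1, movie_2, ..., movie_K as the arms, but this approach has several shortcomings: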

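The number of actions would have to equal the total number of movies in the system, and adding a new movie would be cumbersome.
The agent has to learn a separate model for every single movie.
Similarity between movies is not taken into account.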

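Instead of numbering the movies, we can do something more intuitive: represent each movie by a set of features such as genre, length, cast, rating, and year. This approach has several advantages: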

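The model generalizes across movies.
The agent learns just one reward function, which models the reward from user and movie features.
Movies are easy to remove from the system, and new ones are easy to introduce.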

In this new setting, the number of actions does not even have to be the same at every time step.
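
To make the contrast with numbered arms concrete, the following NumPy sketch scores a varying number of movies with one shared linear model. It is only an illustration: the dimensions, the random features, and the weight vector `theta` are assumptions made up for this example and are not part of TF-Agents.

```python
import numpy as np

rng = np.random.default_rng(0)

USER_DIM = 4   # hypothetical size of the user (global) feature vector
MOVIE_DIM = 3  # hypothetical size of each movie (per-arm) feature vector

# One shared parameter vector models the reward of every (user, movie) pair.
theta = rng.normal(size=USER_DIM + MOVIE_DIM)


def score_arms(user_features, movie_features, theta):
  """Returns one predicted reward per movie using a single linear model."""
  num_movies = movie_features.shape[0]
  # Repeat the user features once per movie, then append each movie's features.
  tiled_user = np.tile(user_features, (num_movies, 1))
  joint = np.concatenate([tiled_user, movie_features], axis=1)
  return joint @ theta


user = rng.normal(size=USER_DIM)
# The same model handles 5 movies at one step and 8 movies at the next.
scores_step_1 = score_arms(user, rng.normal(size=(5, MOVIE_DIM)), theta)
scores_step_2 = score_arms(user, rng.normal(size=(8, MOVIE_DIM)), theta)

print(scores_step_1.shape, scores_step_2.shape)
print('shared parameters:', theta.size)
```

With numbered arms, a linear model per movie would need K * USER_DIM weights and would have to be extended whenever a movie is added. The shared model keeps USER_DIM + MOVIE_DIM weights no matter how many movies are offered.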

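Per-Arm Bandits in TF-Agents

The TF-Agents Bandit suite is designed so that it can also be used for the per-arm case. It provides per-arm environments, and most policies and agents can operate in per-arm mode.

Before we dive into coding an example, we need the necessary imports.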

Installation

!pip install tf-agents

Imports

import functools
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

from tf_agents.bandits.agents import lin_ucb_agent
from tf_agents.bandits.environments import stationary_stochastic_per_arm_py_environment as p_a_env
from tf_agents.bandits.metrics import tf_metrics as tf_bandit_metrics
from tf_agents.drivers import dynamic_step_driver
from tf_agents.environments import tf_py_environment
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts

nest = tf.nest

Parameters: Feel Free to Play Around

# The dimension of the global features.
GLOBAL_DIM = 40  #@param {type:"integer"}
# The elements of the global feature will be integers in [-GLOBAL_BOUND, GLOBAL_BOUND).
GLOBAL_BOUND = 10  #@param {type:"integer"}
# The dimension of the per-arm features.
PER_ARM_DIM = 50  #@param {type:"integer"}
# The elements of the per-arm feature will be integers in [-PER_ARM_BOUND, PER_ARM_BOUND).
PER_ARM_BOUND = 6  #@param {type:"integer"}
# The variance of the Gaussian distribution that generates the rewards.
VARIANCE = 100.0  #@param {type: "number"}
# The elements of the linear reward parameter will be integers in [-PARAM_BOUND, PARAM_BOUND).
PARAM_BOUND = 10  #@param {type: "integer"}

NUM_ACTIONS = 70  #@param {type:"integer"}
BATCH_SIZE = 20  #@param {type:"integer"}

# Parameter for linear reward function acting on the
# concatenation of global and per-arm features.
reward_param = list(np.random.randint(
      -PARAM_BOUND, PARAM_BOUND, [GLOBAL_DIM + PER_ARM_DIM]))

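A Simple Per-Arm Environment

The stationary stochastic environment explained in the other tutorial has a per-arm counterpart.

To initialize the per-arm environment, we have to define functions that generate the following:

Global and per-arm features: these functions take no input parameters and generate a single (global or per-arm) feature vector when called.
Rewards: this function takes the concatenation of a global and a per-arm feature vector as its argument and generates a reward. Essentially, this is the function the agent has to "guess". Note that in the per-arm case the reward function is identical for every arm. This is a fundamental difference from the classic bandit case, where the agent has to estimate a reward function for each arm independently.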

def global_context_sampling_fn():
  """This function generates a single global observation vector."""
  return np.random.randint(
      -GLOBAL_BOUND, GLOBAL_BOUND, [GLOBAL_DIM]).astype(np.float32)

def per_arm_context_sampling_fn():
  """This function generates a single per-arm observation vector."""
  return np.random.randint(
      -PER_ARM_BOUND, PER_ARM_BOUND, [PER_ARM_DIM]).astype(np.float32)

def linear_normal_reward_fn(x):
  """This function generates a reward from the concatenated global and per-arm observations."""
  mu = np.dot(x, reward_param)
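  # np.random.normal expects the standard deviation as its second argument,
  # so convert the configured variance before sampling.
  return np.random.normal(mu, np.sqrt(VARIANCE))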

Now we are ready to initialize our environment.

per_arm_py_env = p_a_env.StationaryStochasticPerArmPyEnvironment(
    global_context_sampling_fn,
    per_arm_context_sampling_fn,
    NUM_ACTIONS,
    linear_normal_reward_fn,
    batch_size=BATCH_SIZE
)
per_arm_tf_env = tf_py_environment.TFPyEnvironment(per_arm_py_env)

Below we can check what this environment produces.

print('observation spec: ', per_arm_tf_env.observation_spec())
print('\nAn observation: ', per_arm_tf_env.reset().observation)

action = tf.zeros(BATCH_SIZE, dtype=tf.int32)
time_step = per_arm_tf_env.step(action)
print('\nRewards after taking an action: ', time_step.reward)

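We see that the observation spec is a dictionary with two elements:

One with key 'global': this is the global context part, whose shape matches the parameter GLOBAL_DIM.
One with key 'per_arm': this is the per-arm context, whose shape is [NUM_ACTIONS, PER_ARM_DIM]. This part is the placeholder for the features of every arm at a time step.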
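
The structure described above can be mimicked in plain NumPy. The sketch below is not the library's implementation. It only assembles one batched observation with the same two keys from sampling functions like the ones defined earlier, using small hypothetical dimensions so the shapes are easy to read.

```python
import numpy as np

# Hypothetical small sizes, chosen only for readability.
global_dim, per_arm_dim, num_actions, batch_size = 4, 3, 5, 2


def sample_global():
  """Generates a single global feature vector."""
  return np.random.randint(-10, 10, [global_dim]).astype(np.float32)


def sample_per_arm():
  """Generates a single per-arm feature vector."""
  return np.random.randint(-6, 6, [per_arm_dim]).astype(np.float32)


def sample_observation():
  """Builds a batched observation dict with 'global' and 'per_arm' entries."""
  global_obs = np.stack([sample_global() for _ in range(batch_size)])
  per_arm_obs = np.stack([
      np.stack([sample_per_arm() for _ in range(num_actions)])
      for _ in range(batch_size)
  ])
  return {'global': global_obs, 'per_arm': per_arm_obs}


observation = sample_observation()
print(observation['global'].shape)   # batch, then global_dim
print(observation['per_arm'].shape)  # batch, then num_actions, then per_arm_dim
```

The spec printed by the environment describes a single element and therefore omits the leading batch dimension, while actual observations carry it.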

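The LinUCB Agent

The LinUCB agent implements the bandit algorithm of the same name, which estimates the parameter of the linear reward function while maintaining a confidence ellipsoid around the estimate. The agent then chooses the arm whose expected reward is highest under the most optimistic parameter inside that ellipsoid.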
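
For intuition, here is a minimal NumPy sketch of the textbook LinUCB rule applied to arm feature vectors. It is not the TF-Agents implementation: the exploration weight `alpha`, the identity ridge term, the toy dimensions, and the noiseless rewards are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, num_arms, alpha = 5, 4, 1.0  # hypothetical sizes and exploration weight

theta_true = rng.integers(-3, 4, size=dim).astype(float)  # hidden parameter
a_matrix = np.eye(dim)    # identity (ridge term) plus the sum of x x^T
b_vector = np.zeros(dim)  # sum of reward * x


def select_arm(arm_features, a_matrix, b_vector, alpha):
  """Picks the arm with the highest upper confidence bound."""
  a_inv = np.linalg.inv(a_matrix)
  theta_hat = a_inv @ b_vector
  means = arm_features @ theta_hat
  # sqrt(x^T A^{-1} x) per arm: the extent of the confidence ellipsoid along x.
  widths = np.sqrt(np.einsum('kd,de,ke->k', arm_features, a_inv, arm_features))
  return int(np.argmax(means + alpha * widths))


for _ in range(500):
  arm_features = rng.normal(size=(num_arms, dim))
  chosen = select_arm(arm_features, a_matrix, b_vector, alpha)
  x = arm_features[chosen]
  reward = x @ theta_true  # noiseless reward keeps the sketch simple
  # Only the chosen arm's features enter the update.
  a_matrix += np.outer(x, x)
  b_vector += reward * x

theta_hat = np.linalg.solve(a_matrix, b_vector)
print(np.round(theta_hat, 2))
print(theta_true)
```

The bonus term shrinks in directions that have been observed often, so the rule explores less as the estimate improves.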

Creating an agent requires the observation and action specs. When defining the agent, we set the boolean parameter accepts_per_arm_features to True.

observation_spec = per_arm_tf_env.observation_spec()
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    dtype=tf.int32, shape=(), minimum=0, maximum=NUM_ACTIONS - 1)

agent = lin_ucb_agent.LinearUCBAgent(time_step_spec=time_step_spec,
                                     action_spec=action_spec,
                                     accepts_per_arm_features=True)

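The Flow of Training Data

This section gives a brief look at how per-arm features travel from the policy to training. Feel free to jump to the next section (Defining the Regret Metric) and come back here later if you are interested.

First, let us look at the data specification in the agent. The training_data_spec attribute of the agent specifies which elements the training data should contain and how they are structured.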

print('training data spec: ', agent.training_data_spec)

If we take a closer look at the observation part of the spec, we see that it does not contain per-arm features!

print('observation spec in training: ', agent.training_data_spec.observation)

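What happened to the per-arm features? To answer this question, note first that when the LinUCB agent trains, it does not need the per-arm features of all arms; it only needs those of the chosen arm. It therefore makes sense to drop the tensor of shape [BATCH_SIZE, NUM_ACTIONS, PER_ARM_DIM], since keeping it would be wasteful, especially when the number of actions is large.

Still, the per-arm features of the chosen arm must be stored somewhere! To this end, the LinUCB policy stores the features of the chosen arm in the policy_info field of the training data: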

print('chosen arm features: ', agent.training_data_spec.policy_info.chosen_arm_features)

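We see from the shape that the chosen_arm_features field holds the feature vector of only one arm, namely the chosen one. Note that policy_info, and with it chosen_arm_features, is part of the training data, as we saw when inspecting the training data spec, and is therefore available at training time.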
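
The reduction from all arm features to the chosen arm's features amounts to a gather along the action axis. The NumPy sketch below illustrates the idea with small hypothetical sizes and hand-picked action indices. It is a sketch of the concept, not the code the policy actually runs.

```python
import numpy as np

batch_size, num_actions, per_arm_dim = 2, 4, 3  # hypothetical small sizes

# All per-arm features seen at one step: [batch, num_actions, per_arm_dim].
per_arm_obs = np.arange(
    batch_size * num_actions * per_arm_dim, dtype=np.float32
).reshape(batch_size, num_actions, per_arm_dim)

# One chosen action index per batch element (picked by hand here).
actions = np.array([3, 1])

# Keep only the chosen arm's feature vector for each batch element.
chosen_arm_features = per_arm_obs[np.arange(batch_size), actions]

print(chosen_arm_features.shape)
print(chosen_arm_features)
```

Per batch element, the stored data shrinks from num_actions * per_arm_dim values to per_arm_dim values.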

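Defining the Regret Metric

Before starting the training loop, we define some utility functions that help calculate the regret of our agent. These functions determine the optimal expected reward given the set of actions (described by their arm features) and the linear parameter that is hidden from the agent.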

def _all_rewards(observation, hidden_param):
  """Outputs rewards for all actions, given an observation."""
  hidden_param = tf.cast(hidden_param, dtype=tf.float32)
  global_obs = observation['global']
  per_arm_obs = observation['per_arm']
  num_actions = tf.shape(per_arm_obs)[1]
  tiled_global = tf.tile(
      tf.expand_dims(global_obs, axis=1), [1, num_actions, 1])
  concatenated = tf.concat([tiled_global, per_arm_obs], axis=-1)
  rewards = tf.linalg.matvec(concatenated, hidden_param)
  return rewards

def optimal_reward(observation):
  """Outputs the maximum expected reward for every element in the batch."""
  return tf.reduce_max(_all_rewards(observation, reward_param), axis=1)

regret_metric = tf_bandit_metrics.RegretMetric(optimal_reward)
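
As a worked example of the quantity involved, the NumPy sketch below mirrors the computation in _all_rewards on a tiny hand-made problem and derives the per-element regret. The parameter values, the observation, and the chosen actions are hypothetical. The sketch uses the expected reward of the chosen arm, whereas a metric computed from observed rewards would also include the reward noise.

```python
import numpy as np

# Hypothetical toy problem: 1 global feature, 2 per-arm features, 3 arms.
hidden_param = np.array([1.0, 2.0, -1.0])
observation = {
    'global': np.array([[1.0], [0.0]]),  # [batch, global_dim]
    'per_arm': np.array([[[1.0, 0.0], [0.0, 1.0], [2.0, 1.0]],
                         [[0.0, 0.0], [1.0, 3.0], [0.0, 2.0]]]),
}
chosen_actions = np.array([0, 1])  # what some policy happened to pick


def all_rewards(observation, hidden_param):
  """Expected reward of every arm, shape [batch, num_actions]."""
  per_arm = observation['per_arm']
  num_actions = per_arm.shape[1]
  # Copy the global features next to every arm's features.
  tiled_global = np.repeat(
      observation['global'][:, None, :], num_actions, axis=1)
  return np.concatenate([tiled_global, per_arm], axis=-1) @ hidden_param


rewards = all_rewards(observation, hidden_param)
optimal = rewards.max(axis=1)
obtained = rewards[np.arange(len(chosen_actions)), chosen_actions]
regret = optimal - obtained

print(rewards)  # rows: [3, 0, 4] and [0, -1, -2]
print(regret)   # [1, 1]
```

In the first batch element the best arm has expected reward 4 while the chosen arm yields 3. In the second, the best is 0 and the chosen arm yields -1. Both choices therefore incur a regret of 1.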

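Now we are all set to start our bandit training loop. The driver below takes care of choosing actions with the policy, storing the resulting trajectories (including the rewards of the chosen actions) in the replay buffer, and computing the predefined regret metric. After each driver run, the loop executes the agent's training step on the collected data and clears the buffer.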

num_iterations = 20 # @param
steps_per_loop = 1 # @param

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
    data_spec=agent.policy.trajectory_spec,
    batch_size=BATCH_SIZE,
    max_length=steps_per_loop)

observers = [replay_buffer.add_batch, regret_metric]

driver = dynamic_step_driver.DynamicStepDriver(
    env=per_arm_tf_env,
    policy=agent.collect_policy,
    num_steps=steps_per_loop * BATCH_SIZE,
    observers=observers)

regret_values = []

for _ in range(num_iterations):
  driver.run()
  loss_info = agent.train(replay_buffer.gather_all())
  replay_buffer.clear()
  regret_values.append(regret_metric.result())

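Now let's look at the result. If we did everything right, the agent is able to estimate the linear reward function well, and the policy can therefore pick actions whose expected reward is close to the optimum. This shows up in the regret metric defined above, which should decrease and approach zero.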

plt.plot(regret_values)
plt.title('Regret of LinUCB on the Linear per-arm environment')
plt.xlabel('Number of Iterations')
_ = plt.ylabel('Average Regret')

What's Next?

The example above is implemented in our codebase, where you can also choose from other agents, including the Neural epsilon-Greedy agent.