Mixed precision

#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

Mixed precision#



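Overview#

Mixed precision is the use of both 16-bit and 32-bit floating-point types in a model during training to make it run faster and use less memory. By keeping certain parts of the model in 32-bit types for numeric stability, the model has a lower step time while training equally well in terms of evaluation metrics such as accuracy. This guide describes how to use the Keras mixed precision API to speed up your models. Using this API can improve performance by more than 3 times on modern GPUs and by more than 2 times on the latest Intel CPUs.

Today, most models use the float32 dtype, which takes 32 bits of memory. However, there are two lower-precision dtypes, float16 and bfloat16, each of which takes 16 bits of memory instead. Modern accelerators can run operations faster in the 16-bit dtypes, as they have specialized hardware to run 16-bit computations, and 16-bit dtypes can be read from memory faster.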
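The storage and precision trade-off between these dtypes can be checked without any accelerator. The following is a minimal sketch using only the Python standard library; the helper names to_float16 and to_bfloat16 are made up for this illustration, and bfloat16 is emulated by truncating a float32 to its top 16 bits (real hardware usually rounds to nearest instead).

```python
import struct

def to_float16(x):
  """Round a Python float to the nearest IEEE 754 float16 value."""
  return struct.unpack('<e', struct.pack('<e', x))[0]

def to_bfloat16(x):
  """Emulate bfloat16 by keeping only the top 16 bits of the float32 encoding.

  This truncates instead of rounding, which is an assumption made to keep
  the sketch short.
  """
  bits = struct.unpack('<I', struct.pack('<f', x))[0]
  return struct.unpack('<f', struct.pack('<I', bits & 0xFFFF0000))[0]

# Storage cost per element, in bytes.
float32_bytes = struct.calcsize('<f')
float16_bytes = struct.calcsize('<e')
print(float32_bytes, float16_bytes)

# float16 has more mantissa bits than bfloat16, so it stores 0.1 more exactly.
value = 0.1
float16_error = abs(to_float16(value) - value)
bfloat16_error = abs(to_bfloat16(value) - value)
print(float16_error < bfloat16_error)

# bfloat16 keeps the float32 exponent, so very large values stay finite.
large = to_bfloat16(1e30)
print(large)
```

Both 16-bit formats halve the storage of float32; they differ in how the 16 bits are split between exponent (range) and mantissa (precision).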

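NVIDIA GPUs can run operations in float16 faster than in float32, and TPUs can run operations in bfloat16 faster than in float32. Therefore, these lower-precision dtypes should be used whenever possible on those devices. However, some variables and computations should still be in float32 for numeric reasons so that the model trains to the same quality. The Keras mixed precision API allows you to use a mix of either float16 or bfloat16 with float32, to get the performance benefits of float16/bfloat16 and the numeric stability benefits of float32.

Note: In this guide, the term "numeric stability" refers to how a model's quality is affected by the use of a lower-precision dtype instead of a higher-precision dtype. An operation is "numerically unstable" in float16 or bfloat16 if running it in one of those dtypes causes the model to have worse evaluation accuracy or other metrics compared to running the operation in float32.

Setup#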

import tensorflow as tf

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras import mixed_precision

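Supported hardware#

While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs, Cloud TPUs and recent Intel CPUs. NVIDIA GPUs support using a mix of float16 and float32, while TPUs support a mix of bfloat16 and float32.

Among NVIDIA GPUs, those with compute capability 7.0 or higher will see the greatest performance benefit from mixed precision because they have special hardware units, called Tensor Cores, to accelerate float16 matrix multiplications and convolutions. Older GPUs offer no math performance benefit for using mixed precision, but memory and bandwidth savings can still enable some speedup. You can look up the compute capability of your GPU on NVIDIA's CUDA GPU web page. Examples of GPUs that will benefit most from mixed precision include RTX GPUs, the V100, and the A100.

Among Intel CPUs, starting with the 4th Gen Intel Xeon Processors (code name Sapphire Rapids), mixed precision will give the greatest performance benefit because they can accelerate bfloat16 computations using AMX instructions (requires TensorFlow 2.12 or later).

Note: If running the examples in this guide in Google Colab, the GPU runtime typically has a P100 connected. The P100 has compute capability 6.0 and is not expected to show a significant speedup. If running on a CPU runtime, there may be a slowdown, as the runtime likely has a CPU without AMX.

You can check your GPU type with the following command. The command only exists if the NVIDIA drivers are installed; otherwise it raises an error.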

!nvidia-smi -L

GPU 0: NVIDIA GeForce RTX 3090 (UUID: GPU-8bb0882b-61a3-d7c6-5636-ba3ad1416223)
GPU 1: NVIDIA GeForce RTX 2080 Ti (UUID: GPU-7315b749-8f79-5d30-7955-4466dedba028)

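All Cloud TPUs support bfloat16.

Even on older Intel CPUs, other x86 CPUs without AMX, and older GPUs, where no speedup is expected, the mixed precision API can still be used for unit testing, debugging, or just to try out the API. However, mixed_bfloat16 on CPUs without AMX instructions and mixed_float16 on all x86 CPUs will run significantly slower.

Setting the dtype policy#

To use mixed precision in Keras, you need to create a tf.keras.mixed_precision.Policy, typically referred to as a dtype policy. A dtype policy specifies the dtypes that layers will run in. In this guide, you will construct a policy from the string 'mixed_float16' and set it as the global policy. This will cause subsequently created layers to use mixed precision with a mix of float16 and float32.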

policy = mixed_precision.Policy('mixed_float16')
mixed_precision.set_global_policy(policy)

For short, you can directly pass a string to set_global_policy, which is what is typically done in practice.

# Equivalent to the two lines above
mixed_precision.set_global_policy('mixed_float16')

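The policy specifies two important aspects of a layer: the dtype the layer's computations are done in, and the dtype of the layer's variables. Above, you created a mixed_float16 policy (that is, a mixed_precision.Policy created by passing the string 'mixed_float16' to its constructor). With this policy, layers use float16 computations and float32 variables. Computations are done in float16 for performance, while variables are kept in float32 for numeric stability. You can directly query these properties of the policy.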

print('Compute dtype: %s' % policy.compute_dtype)
print('Variable dtype: %s' % policy.variable_dtype)

Compute dtype: float16
Variable dtype: float32

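As mentioned before, the mixed_float16 policy will most significantly improve performance on NVIDIA GPUs with compute capability of at least 7.0. The policy will run on other GPUs and on CPUs but may not improve performance. For TPUs and CPUs, the mixed_bfloat16 policy should be used instead.

Building the model#

Next, let's start building a simple model. Very small toy models typically do not benefit from mixed precision, because overhead from the TensorFlow runtime typically dominates the execution time, making any performance improvement on the GPU negligible. Therefore, if a GPU is being used, build two large Dense layers with 4096 units each.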

inputs = keras.Input(shape=(784,), name='digits')
if tf.config.list_physical_devices('GPU'):
  print('The model will run with 4096 units on a GPU')
  num_units = 4096
else:
  # Use fewer units on CPUs so the model finishes in a reasonable amount of time
  print('The model will run with 64 units on a CPU')
  num_units = 64
dense1 = layers.Dense(num_units, activation='relu', name='dense_1')
x = dense1(inputs)
dense2 = layers.Dense(num_units, activation='relu', name='dense_2')
x = dense2(x)

The model will run with 4096 units on a GPU

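Each layer has a policy and uses the global policy by default. Each of the Dense layers therefore has the mixed_float16 policy, because you set the global policy to mixed_float16 previously. This causes the dense layers to do float16 computations and have float32 variables. They cast their inputs to float16 in order to do float16 computations, so their outputs are float16 as well. Their variables are float32 and are cast to float16 when the layers are called, to avoid errors from dtype mismatches.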

print(dense1.dtype_policy)
print('x.dtype: %s' % x.dtype)
# 'kernel' is dense1's variable
print('dense1.kernel.dtype: %s' % dense1.kernel.dtype)

<DTypePolicy "mixed_float16">
x.dtype: float16
dense1.kernel.dtype: float32

Next, create the output predictions. Normally, you can create the output predictions as follows, but this is not always numerically stable with float16.

# INCORRECT: softmax and model output will be float16, when it should be float32
outputs = layers.Dense(10, activation='softmax', name='predictions')(x)
print('Outputs dtype: %s' % outputs.dtype)

Outputs dtype: float16

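A softmax activation at the end of the model should be float32. Because the dtype policy is mixed_float16, the softmax activation would normally have a float16 compute dtype and output float16 tensors.

This can be fixed by separating the Dense and softmax layers, and by passing dtype='float32' to the softmax layer.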

# CORRECT: softmax and model output are float32
x = layers.Dense(10, name='dense_logits')(x)
outputs = layers.Activation('softmax', dtype='float32', name='predictions')(x)
print('Outputs dtype: %s' % outputs.dtype)

Outputs dtype: float32

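Passing dtype='float32' to the softmax layer constructor overrides the layer's dtype policy to be the float32 policy, which does computations and keeps variables in float32. Equivalently, you could have passed dtype=mixed_precision.Policy('float32'); layers always convert the dtype argument to a policy. Because the Activation layer has no variables, the policy's variable dtype is ignored, but the policy's compute dtype of float32 causes the softmax and the model output to be float32.

Adding a float16 softmax in the middle of a model is fine, but a softmax at the end of the model should be in float32. The reason is that if the intermediate tensor flowing from the softmax to the loss is float16 or bfloat16, numeric issues may occur.

You can override the dtype of any layer to be float32 by passing dtype='float32' if you think it will not be numerically stable with float16 computations. Typically, though, this is only necessary on the last layer of the model, as most layers have sufficient precision with mixed_float16 and mixed_bfloat16.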
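As a small illustration of such a per-layer override, the sketch below builds two throwaway Dense layers under the mixed_float16 global policy and compares their compute dtypes. The layer names and the all-ones input are placeholders chosen for this example only.

```python
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

# Inherits the global policy: float16 computations, float32 variables.
default_layer = layers.Dense(4, name='default_policy_dense')
# Overridden to float32: both computations and variables are float32.
float32_layer = layers.Dense(4, dtype='float32', name='float32_dense')

sample = tf.ones((2, 3))
default_output = default_layer(sample)
float32_output = float32_layer(sample)

print('default compute dtype: %s' % default_layer.compute_dtype)
print('override compute dtype: %s' % float32_layer.compute_dtype)
print('default output dtype: %s' % default_output.dtype.name)
print('override output dtype: %s' % float32_output.dtype.name)
```

Only the layer that received dtype='float32' leaves the mixed policy; every other layer keeps the faster float16 compute path.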

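Even if the model does not end in a softmax, the outputs should still be float32. While unnecessary for this specific model, the model outputs can be cast to float32 with the following: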

# The linear activation is an identity function. So this simply casts 'outputs'
# to float32. In this particular case, 'outputs' is already float32 so this is a
# no-op.
outputs = layers.Activation('linear', dtype='float32')(outputs)

Next, finish and compile the model, and generate input data:

model = keras.Model(inputs=inputs, outputs=outputs)
model.compile(loss='sparse_categorical_crossentropy',
              optimizer=keras.optimizers.RMSprop(),
              metrics=['accuracy'])

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(60000, 784).astype('float32') / 255
x_test = x_test.reshape(10000, 784).astype('float32') / 255

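This example casts the input data from uint8 to float32. It does not cast to float16 because the division by 255 runs on the CPU, which runs float16 operations slower than float32 operations. In this case, the performance difference is negligible, but in general you should run input processing math in float32 if it runs on the CPU. The first layer of the model will cast the inputs to float16, as each layer casts floating-point inputs to its compute dtype.

The initial weights of the model are retrieved. This will allow training from scratch again by loading the weights.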

initial_weights = model.get_weights()

Training the model with Model.fit#

Next, train the model:

history = model.fit(x_train, y_train,
                    batch_size=8192,
                    epochs=5,
                    validation_split=0.2)
test_scores = model.evaluate(x_test, y_test, verbose=2)
print('Test loss:', test_scores[0])
print('Test accuracy:', test_scores[1])

Epoch 1/5
5/6 ━━━━━━━━━━━━━━━━━━━━ 0s 26ms/step - accuracy: 0.2630 - loss: 2.4553
6/6 ━━━━━━━━━━━━━━━━━━━━ 16s 2s/step - accuracy: 0.2971 - loss: 2.4333 - val_accuracy: 0.7223 - val_loss: 0.9311
Epoch 2/5
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step - accuracy: 0.6974 - loss: 0.9873 - val_accuracy: 0.8834 - val_loss: 0.4373
Epoch 3/5
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - accuracy: 0.8505 - loss: 0.5173 - val_accuracy: 0.8801 - val_loss: 0.4080
Epoch 4/5
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step - accuracy: 0.8796 - loss: 0.4115 - val_accuracy: 0.8468 - val_loss: 0.4852
Epoch 5/5
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 34ms/step - accuracy: 0.8788 - loss: 0.4000 - val_accuracy: 0.9222 - val_loss: 0.2629
313/313 - 4s - 14ms/step - accuracy: 0.9218 - loss: 0.2638
Test loss: 0.2638075053691864
Test accuracy: 0.9218000173568726

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1729856101.862005 4130438 service.cc:146] XLA service 0x7f1784008f80 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1729856101.862057 4130438 service.cc:154]   StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
I0000 00:00:1729856101.862069 4130438 service.cc:154]   StreamExecutor device (1): NVIDIA GeForce RTX 2080 Ti, Compute Capability 7.5
I0000 00:00:1729856107.058069 4130438 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.

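Notice that the model prints the time per step in the logs: for example, "34ms/step". The first epoch may be slower as TensorFlow spends some time optimizing the model, but afterwards the time per step should stabilize.

If you are running this guide in Colab, you can compare the performance of mixed precision with float32. To do so, change the policy from mixed_float16 to float32 in the "Setting the dtype policy" section, then rerun all the cells up to this point. On GPUs with compute capability of at least 7.0, you should see the time per step increase significantly, indicating that mixed precision sped up the model. Make sure to change the policy back to mixed_float16 and rerun the cells before continuing with the guide.

On GPUs with compute capability of at least 8.0 (Ampere GPUs and above), you will likely see no performance improvement in the toy model in this guide when using mixed precision compared to float32. This is due to the use of TensorFloat-32, which automatically uses lower-precision math in certain float32 ops such as tf.linalg.matmul. TensorFloat-32 gives some of the performance advantages of mixed precision when using float32. However, in real-world models, you will still typically see significant performance improvements from mixed precision due to memory bandwidth savings and ops which TensorFloat-32 does not support.

If running mixed precision on a TPU, you will not see as much of a performance gain as when running mixed precision on GPUs, especially pre-Ampere GPUs. This is because TPUs do certain ops in bfloat16 under the hood even with the default dtype policy of float32. This is similar to how Ampere GPUs use TensorFloat-32 by default. Compared to Ampere GPUs, TPUs typically see smaller performance gains from mixed precision on real-world models.

For many real-world models, mixed precision also allows you to double the batch size without running out of memory, as float16 tensors take half the memory of float32 tensors. This does not apply to this toy model, however, as you can likely run the model in any dtype with each batch consisting of the entire MNIST dataset of 60,000 images.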

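Loss scaling#

Loss scaling is a technique which tf.keras.Model.fit automatically performs with the mixed_float16 policy to avoid numeric underflow. This section describes what loss scaling is, and the next section describes how to use it with a custom training loop.

Note: When using the mixed_bfloat16 policy, there is no need to do loss scaling.

Underflow and overflow#

The float16 data type has a narrow dynamic range compared to float32. This means values above $65504$ will overflow to infinity and values below $6.0 \times 10^{-8}$ will underflow to zero. float32 and bfloat16 have a much higher dynamic range, so overflow and underflow are generally not a problem.

For example: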

x = tf.constant(256, dtype='float16')
(x ** 2).numpy()  # Overflow

inf

x = tf.constant(1e-5, dtype='float16')
(x ** 2).numpy()  # Underflow

0.0
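The two limits quoted above follow from the IEEE 754 float16 layout (1 sign bit, 5 exponent bits, 10 mantissa bits). As a short worked derivation, the largest finite value and the smallest positive (subnormal) value are:

$$
x_{\max} = \left(2 - 2^{-10}\right) \times 2^{15} = 65504,
\qquad
x_{\min} = 2^{-10} \times 2^{-14} = 2^{-24} \approx 5.96 \times 10^{-8}
$$

This matches the two examples: $256^2 = 65536 > x_{\max}$ overflows to infinity, while $(10^{-5})^2 = 10^{-10}$ is far below $x_{\min}$ and rounds to zero.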

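In practice, overflow with float16 rarely occurs. Additionally, underflow also rarely occurs during the forward pass. However, during the backward pass, gradients can underflow to zero. Loss scaling is a technique to prevent this underflow.

Loss scaling overview#

The basic concept of loss scaling is simple: multiply the loss by some large number, say $1024$, which is called the loss scale. This causes the gradients to be scaled by $1024$ as well, greatly reducing the chance of underflow. Once the final gradients are computed, divide them by $1024$ to bring them back to their correct values.

The pseudocode for this process is: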

loss_scale = 1024
loss = model(inputs)
loss *= loss_scale
# Assume `grads` are float32. You do not want to divide float16 gradients.
grads = compute_gradient(loss, model.trainable_variables)
grads /= loss_scale

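Choosing a loss scale can be tricky. If the loss scale is too low, gradients may still underflow to zero. If it is too high, the opposite problem occurs: the gradients may overflow to infinity.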
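The three cases (no scaling, a workable scale, and scales that are too low or too high) can be reproduced with plain Python. This is a minimal sketch that emulates float16 rounding with the standard struct module; the gradient value 1e-8 and the helper names are illustrative assumptions, not values taken from a real model.

```python
import struct

def to_float16(x):
  """Round to the nearest float16 value; overflow becomes +/- infinity."""
  try:
    return struct.unpack('<e', struct.pack('<e', x))[0]
  except OverflowError:
    return float('inf') if x > 0 else float('-inf')

def gradient_after_scaling(true_gradient, loss_scale):
  """Store the scaled gradient in float16, then unscale it in full precision."""
  scaled = to_float16(true_gradient * loss_scale)
  return scaled / loss_scale

true_gradient = 1e-8  # below the smallest positive float16 value

unscaled = gradient_after_scaling(true_gradient, 1)      # underflows to 0.0
too_low = gradient_after_scaling(true_gradient, 2)       # still underflows
workable = gradient_after_scaling(true_gradient, 1024)   # close to 1e-8
too_high = gradient_after_scaling(true_gradient, 1e13)   # overflows to inf

print(unscaled, too_low, workable, too_high)
```

Note that the division by the loss scale happens outside float16, mirroring the pseudocode's advice to unscale float32 gradients rather than float16 ones.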

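To solve this, TensorFlow dynamically determines the loss scale so you do not have to choose one manually. If you use tf.keras.Model.fit, loss scaling is done for you, so you do not have to do any extra work. If you use a custom training loop, you must explicitly use the special optimizer wrapper tf.keras.mixed_precision.LossScaleOptimizer in order to use loss scaling. This is described in the next section.

Training the model with a custom training loop#

So far, you have trained a Keras model with mixed precision using tf.keras.Model.fit. Next, you will use mixed precision with a custom training loop. If you do not already know what a custom training loop is, please read the Custom training guide first.

Running a custom training loop with mixed precision requires two changes over running it in float32:

1. Build the model with mixed precision (you already did this).
2. Explicitly use loss scaling if mixed_float16 is used.

For step (2), you will use the tf.keras.mixed_precision.LossScaleOptimizer class, which wraps an optimizer and applies loss scaling. By default, it dynamically determines the loss scale, so you do not have to choose one. Construct a LossScaleOptimizer as follows.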

optimizer = keras.optimizers.RMSprop()
optimizer = mixed_precision.LossScaleOptimizer(optimizer)

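If you want, it is possible to choose an explicit loss scale or otherwise customize the loss scaling behavior, but it is highly recommended to keep the default loss scaling behavior, as it has been found to work well on all known models. See the tf.keras.mixed_precision.LossScaleOptimizer documentation if you want to customize the loss scaling behavior.

Next, define the loss object and the tf.data.Datasets: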

loss_object = tf.keras.losses.SparseCategoricalCrossentropy()
train_dataset = (tf.data.Dataset.from_tensor_slices((x_train, y_train))
                 .shuffle(10000).batch(8192))
test_dataset = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(8192)

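Next, define the training step function. You will use two new methods from the loss scale optimizer to scale the loss and unscale the gradients:

get_scaled_loss(loss): Multiplies the loss by the loss scale
get_unscaled_gradients(gradients): Takes in a list of scaled gradients as input, and divides each one by the loss scale to unscale it

These functions must be used in order to prevent underflow in the gradients. LossScaleOptimizer.apply_gradients will then apply the gradients if none of them have Infs or NaNs. It will also update the loss scale, halving it if the gradients had Infs or NaNs and potentially increasing it otherwise.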
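The update rule described above can be sketched in plain Python. This is a toy model, not the library's implementation: the class name DynamicLossScale, its attributes, and the choice to double the scale after a fixed number of consecutive finite steps are assumptions made for illustration (the defaults mimic the documented LossScaleOptimizer arguments initial_scale and dynamic_growth_steps).

```python
import math

class DynamicLossScale:
  """Toy model of dynamic loss scaling (hypothetical helper)."""

  def __init__(self, initial_scale=2.0 ** 15, growth_steps=2000):
    self.scale = initial_scale
    self.growth_steps = growth_steps
    self.good_steps = 0  # consecutive steps with finite gradients

  def update(self, gradients):
    """Returns True if the step should be applied, False if it is skipped."""
    if all(math.isfinite(g) for g in gradients):
      self.good_steps += 1
      if self.good_steps >= self.growth_steps:
        self.scale *= 2.0  # try a larger scale after a run of good steps
        self.good_steps = 0
      return True
    # Inf or NaN found: skip the step and halve the scale.
    self.scale /= 2.0
    self.good_steps = 0
    return False

scaler = DynamicLossScale(initial_scale=2.0 ** 15, growth_steps=3)
steps = ([float('inf')], [0.5], [0.25], [0.125])
applied = [scaler.update(gradients) for gradients in steps]
print(applied, scaler.scale)
```

In this run the first step is skipped and the scale is halved; three finite steps later the scale doubles back to its starting value.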

@tf.function
def train_step(x, y):
  with tf.GradientTape() as tape:
    predictions = model(x)
    loss = loss_object(y, predictions)
    scaled_loss = optimizer.get_scaled_loss(loss)
  scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
  gradients = optimizer.get_unscaled_gradients(scaled_gradients)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  return loss

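The LossScaleOptimizer will likely skip the first few steps at the start of training. The loss scale starts out high so that the optimal loss scale can be determined quickly. After a few steps, the loss scale will stabilize and very few steps will be skipped. This process happens automatically and does not affect training quality.

Now, define the test step: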

@tf.function
def test_step(x):
  return model(x, training=False)

Load the initial weights of the model, so you can retrain from scratch:

model.set_weights(initial_weights)

Finally, run the custom training loop:

for epoch in range(5):
  epoch_loss_avg = tf.keras.metrics.Mean()
  test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      name='test_accuracy')
  for x, y in train_dataset:
    loss = train_step(x, y)
    epoch_loss_avg(loss)
  for x, y in test_dataset:
    predictions = test_step(x)
    test_accuracy.update_state(y, predictions)
  print('Epoch {}: loss={}, test accuracy={}'.format(epoch, epoch_loss_avg.result(), test_accuracy.result()))

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Cell In[27], line 6
      3 test_accuracy = tf.keras.metrics.SparseCategoricalAccuracy(
      4     name='test_accuracy')
      5 for x, y in train_dataset:
----> 6   loss = train_step(x, y)
      7   epoch_loss_avg(loss)
      8 for x, y in test_dataset:

File /media/pc/data/lxw/envs/anaconda3x/envs/xxx/lib/python3.12/site-packages/tensorflow/python/util/traceback_utils.py:153, in filter_traceback.<locals>.error_handler(*args, **kwargs)
    151 except Exception as e:
    152   filtered_tb = _process_traceback_frames(e.__traceback__)
--> 153   raise e.with_traceback(filtered_tb) from None
    154 finally:
    155   del filtered_tb

File /tmp/__autograph_generated_file05u43bl5.py:13, in outer_factory.<locals>.inner_factory.<locals>.tf__train_step(x, y)
     11     predictions = ag__.converted_call(ag__.ld(model), (ag__.ld(x),), None, fscope)
     12     loss = ag__.converted_call(ag__.ld(loss_object), (ag__.ld(y), ag__.ld(predictions)), None, fscope)
---> 13     scaled_loss = ag__.converted_call(ag__.ld(optimizer).get_scaled_loss, (ag__.ld(loss),), None, fscope)
     14 scaled_gradients = ag__.converted_call(ag__.ld(tape).gradient, (ag__.ld(scaled_loss), ag__.ld(model).trainable_variables), None, fscope)
     15 gradients = ag__.converted_call(ag__.ld(optimizer).get_unscaled_gradients, (ag__.ld(scaled_gradients),), None, fscope)

AttributeError: in user code:

    File "/tmp/ipykernel_4130203/1373614888.py", line 6, in train_step  *
        scaled_loss = optimizer.get_scaled_loss(loss)

    AttributeError: 'LossScaleOptimizer' object has no attribute 'get_scaled_loss'
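
The AttributeError above suggests that this run used a Keras release in which LossScaleOptimizer no longer provides get_scaled_loss or get_unscaled_gradients (the DTypePolicy repr printed earlier points to Keras 3). Below is a minimal sketch of a training step that works with either API. It assumes that, in Keras 3, the optimizer's scale_loss method scales the loss and apply_gradients unscales the gradients internally; the tiny model and the random batch are stand-ins rather than the model from this guide.

```python
import math

import tensorflow as tf
from tensorflow import keras

keras.mixed_precision.set_global_policy('mixed_float16')

# Tiny stand-in model; the final softmax is kept in float32.
inputs = keras.Input(shape=(8,))
hidden = keras.layers.Dense(16, activation='relu')(inputs)
logits = keras.layers.Dense(4)(hidden)
outputs = keras.layers.Activation('softmax', dtype='float32')(logits)
model = keras.Model(inputs, outputs)

loss_object = keras.losses.SparseCategoricalCrossentropy()
optimizer = keras.mixed_precision.LossScaleOptimizer(keras.optimizers.RMSprop())

# Older tf.keras releases expose get_scaled_loss / get_unscaled_gradients.
HAS_LEGACY_API = hasattr(optimizer, 'get_scaled_loss')
if not HAS_LEGACY_API:
  # Assumption: building the optimizer before tracing makes scale_loss read
  # the dynamic loss scale variable rather than a constant initial value.
  optimizer.build(model.trainable_variables)

@tf.function
def train_step(x, y):
  with tf.GradientTape() as tape:
    predictions = model(x, training=True)
    loss = loss_object(y, predictions)
    if HAS_LEGACY_API:
      scaled_loss = optimizer.get_scaled_loss(loss)
    else:
      scaled_loss = optimizer.scale_loss(loss)
  gradients = tape.gradient(scaled_loss, model.trainable_variables)
  if HAS_LEGACY_API:
    gradients = optimizer.get_unscaled_gradients(gradients)
  # In the non-legacy path, the gradients are assumed to be unscaled here.
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  return loss

x_batch = tf.random.uniform((32, 8))
y_batch = tf.random.uniform((32,), maxval=4, dtype=tf.int32)
loss_value = float(train_step(x_batch, y_batch))
print('loss after one step:', loss_value)
```

The hasattr check is evaluated in Python when the function is traced, so each environment compiles only the branch it supports.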

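GPU performance tips#

Here are some performance tips when using mixed precision on GPUs.

Increasing your batch size#

If it does not affect model quality, try running with double the batch size when using mixed precision. As float16 tensors use half the memory, this often allows you to double your batch size without running out of memory. Increasing the batch size typically increases training throughput, that is, the number of training elements per second your model can process.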
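The arithmetic behind this tip is simple enough to check by hand. The sketch below uses a hypothetical helper, activation_bytes, to compare the memory taken by one layer's output activations; it deliberately ignores variables and optimizer state, which stay in float32 and therefore do not shrink.

```python
BYTES_PER_ELEMENT = {'float32': 4, 'float16': 2, 'bfloat16': 2}

def activation_bytes(batch_size, units, dtype):
  """Bytes needed for one (batch_size, units) activation tensor."""
  return batch_size * units * BYTES_PER_ELEMENT[dtype]

# One 4096-unit Dense output at the batch size used in this guide.
float32_batch = activation_bytes(8192, 4096, 'float32')
# The same layer in float16 with the batch size doubled.
float16_doubled = activation_bytes(2 * 8192, 4096, 'float16')

print(float32_batch // 2 ** 20, 'MiB vs', float16_doubled // 2 ** 20, 'MiB')
```

Doubling the batch in float16 costs the same activation memory as the original batch in float32, which is why the doubled batch often still fits.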

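Ensuring GPU Tensor Cores are used#

As mentioned previously, modern NVIDIA GPUs use special hardware units called Tensor Cores that can multiply float16 matrices very quickly. However, Tensor Cores require certain dimensions of tensors to be a multiple of 8. In the examples below, the arguments that need to be a multiple of 8 for Tensor Cores to be used are units, filters and batch_size.

tf.keras.layers.Dense(units=64)
tf.keras.layers.Conv2D(filters=48, kernel_size=7, strides=3)
- And similarly for other convolutional layers, such as tf.keras.layers.Conv3D
tf.keras.layers.LSTM(units=64)
- And similarly for other RNNs, such as tf.keras.layers.GRU
tf.keras.Model.fit(epochs=2, batch_size=128)

You should try to use Tensor Cores when possible. If you want to learn more, the NVIDIA deep learning performance guide describes the exact requirements for using Tensor Cores as well as other Tensor Core-related performance information.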
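When a dimension such as a vocabulary size or a unit count is not naturally a multiple of 8, one common approach is to pad it up to the next multiple. A minimal sketch follows; round_up_to_multiple is a hypothetical helper, not part of TensorFlow, and the size 33708 is just an example value.

```python
def round_up_to_multiple(value, multiple=8):
  """Smallest multiple of `multiple` that is greater than or equal to value."""
  return ((value + multiple - 1) // multiple) * multiple

padded_vocab = round_up_to_multiple(33708)  # example size, not from a real model
already_aligned = round_up_to_multiple(64)
print(padded_vocab, already_aligned)
```

The padded size can then be passed as units or filters, with the extra entries simply left unused.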

XLA#

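XLA is a compiler that can further increase mixed precision performance, as well as float32 performance to a lesser extent. Refer to the XLA guide for details.

Cloud TPU performance tips#

As with GPUs, you should try doubling your batch size when using Cloud TPUs, because bfloat16 tensors use half the memory. Doubling the batch size may increase training throughput.

TPUs do not require any other mixed precision-specific tuning to get optimal performance. They already require the use of XLA. TPUs benefit from having certain dimensions be multiples of $128$, but this applies equally to float32 as it does to mixed precision. Check the Cloud TPU performance guide for general TPU performance tips, which apply to mixed precision as well as float32 tensors.

Summary#

If you use TPUs, NVIDIA GPUs with compute capability of at least 7.0, or Intel CPUs with support for AMX instructions, you should use mixed precision, as it can improve performance by up to 3x.

You can use mixed precision with the following lines: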

# On TPUs and CPUs, use 'mixed_bfloat16' instead
mixed_precision.set_global_policy('mixed_float16')

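If your model ends in softmax, make sure it is float32. And regardless of what your model ends in, make sure the output is float32.
If you use a custom training loop with mixed_float16, in addition to the above lines, you need to wrap your optimizer with a tf.keras.mixed_precision.LossScaleOptimizer. Then call optimizer.get_scaled_loss to scale the loss, and optimizer.get_unscaled_gradients to unscale the gradients.
If you use a custom training loop with mixed_bfloat16, setting the global policy mentioned above is sufficient.
Double the training batch size if it does not reduce evaluation accuracy.
On GPUs, ensure most tensor dimensions are a multiple of $8$ to maximize performance.

For an example of mixed precision using the tf.keras.mixed_precision API, check the functions and classes related to training performance. Check out the official models, such as Transformer, for details.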