{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Tce3stUlHN0L" }, "outputs": [], "source": [ "##### Copyright 2018 The TensorFlow Authors.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "cellView": "form", "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/media/pc/data/lxw/ai/d2py/doc/libs/tf-chaos/guide\n" ] } ], "source": [ "%cd ..\n", "from set_env import temp_dir" ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "# 使用 TensorFlow 进行分布式训练" ] }, { "cell_type": "markdown", "metadata": { "id": "r6P32iYYV27b" }, "source": [ "
",
"\n",
"This notebook accompanies the [Distributed training with TensorFlow](https://tensorflow.google.cn/guide/distributed_training) guide."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ParameterServerStrategy\n",
"\n",
"Parameter server training is a common data-parallel method to scale up model training on multiple machines. A parameter server training cluster consists of workers and parameter servers. Variables are created on parameter servers and are read and updated by the workers in each step. In TensorFlow 2, parameter server training uses a central coordinator-based architecture via the `tf.distribute.experimental.coordinator.ClusterCoordinator` class. This implementation requires the `'TF_CONFIG'` environment variable, like `'TF_CONFIG'` in `MultiWorkerMirroredStrategy`, but with additional caveats.\n",
"\n",
"在 TensorFlow 1 中,`ParameterServerStrategy`只能通过 `tf.compat.v1.distribute.experimental.ParameterServerStrategy` 符号在 Estimator 中使用。"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "E20tG21LFfv1"
},
"source": [
"注:此策略是 [`experimental`](https://tensorflow.google.cn/guide/versions#what_is_not_covered),因为它目前正在进行积极开发。"
]
},
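{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch (the host addresses below are hypothetical), the cluster is described to each task through the `'TF_CONFIG'` environment variable: a JSON document listing the cluster's jobs and this task's own role:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import json\n",
"import os\n",
"\n",
"# Hypothetical cluster: one chief (coordinator), two workers, one parameter server.\n",
"os.environ['TF_CONFIG'] = json.dumps({\n",
"    'cluster': {\n",
"        'chief': ['host0.example.com:2222'],\n",
"        'worker': ['host1.example.com:2222', 'host2.example.com:2222'],\n",
"        'ps': ['host3.example.com:2222']\n",
"    },\n",
"    # This process runs as worker 0.\n",
"    'task': {'type': 'worker', 'index': 0}\n",
"})"
]
},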
{
"cell_type": "markdown",
"metadata": {
"id": "45H0Wa8WKI8z"
},
"source": [
"### CentralStorageStrategy\n",
"\n",
"`tf.distribute.experimental.CentralStorageStrategy` 也执行同步训练。变量不会被镜像,而是放在 CPU 上,且运算会复制到所有本地 GPU 。如果只有一个 GPU,则所有变量和运算都将被放在该 GPU 上。\n",
"\n",
"请通过以下代码创建 `CentralStorageStrategy` 实例:\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"id": "rtjZOyaoMWrP"
},
"outputs": [],
"source": [
"central_storage_strategy = tf.distribute.experimental.CentralStorageStrategy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "KY1nJHNkMl7b"
},
"source": [
"这会创建一个 `CentralStorageStrategy` 实例,该实例将使用所有可见的 GPU 和 CPU。在副本上对变量的更新将先进行聚合,然后再应用于变量。"
]
},
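{
"cell_type": "markdown",
"metadata": {},
"source": [
"Conceptually, \"aggregate, then apply\" can be sketched in plain Python (an illustration only, not the TensorFlow implementation): each replica computes its own gradient, the gradients are summed across replicas, and the single central copy of the variable is updated once."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical per-replica gradients for one training step.\n",
"replica_grads = [0.5, 0.25, 0.25]\n",
"\n",
"central_var = 10.0   # the single copy kept on the CPU\n",
"learning_rate = 1.0\n",
"\n",
"# Aggregate across replicas first, then apply one SGD update.\n",
"aggregated = sum(replica_grads)\n",
"central_var -= learning_rate * aggregated\n",
"print(central_var)  # 9.0"
]
},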
{
"cell_type": "markdown",
"metadata": {
"id": "aAFycYUiNCUb"
},
"source": [
"注:此策略是 [`experimental`](https://tensorflow.google.cn/guide/versions#what_is_not_covered),因为它目前正在进行开发。"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "t2XUdmIxKljq"
},
"source": [
"### 其他策略\n",
"\n",
"除上述策略外,还有其他两种策略可能对使用 `tf.distribute` API 进行原型设计和调试有所帮助。"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "UD5I1beTpc7a"
},
"source": [
"#### 默认策略\n",
"\n",
"默认策略是一种分布策略,当作用域内没有显式分布策略时就会出现。此策略会实现 `tf.distribute.Strategy` 接口,但只具有传递功能,不提供实际分布。例如,`Strategy.run(fn)` 只会调用 `fn`。使用该策略编写的代码与未使用任何策略编写的代码完全一样。您可以将其视为“无运算”策略。\n",
"\n",
"默认策略是一种单一实例,无法创建它的更多实例。可以在任何显式策略范围之外使用 `tf.distribute.get_strategy` 来获取它(可用于在显式策略范围内获取当前策略的相同 API)。"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"id": "ibHleFOOmPn9"
},
"outputs": [],
"source": [
"default_strategy = tf.distribute.get_strategy()"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "EkxPl_5ImLzc"
},
"source": [
"此策略有两个主要用途:\n",
"\n",
"- 它允许无条件地编写可感知分布的库代码。例如,在 `tf.keras.optimizers` 中,您可以使用 `tf.distribute.get_strategy`,并用此策略来降低梯度 – 它将始终返回一个策略对象,您可以在该对象上调用 `Strategy.reduce` API。\n"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"id": "WECeRzUdT6bU"
},
"outputs": [
{
"data": {
"text/plain": [
"1.0"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# In optimizer or other library code\n",
"# Get currently active strategy\n",
"strategy = tf.distribute.get_strategy()\n",
"strategy.reduce(\"SUM\", 1., axis=None) # reduce some values"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "JURbH-pUT51B"
},
"source": [
"- 与库代码类似,它可用于编写最终用户的程序以便使用或不使用分布策略,而无需条件逻辑。下面是一个说明了这一点的示例代码段:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"id": "O4Vmae5jmSE6"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"MirroredVariable:{\n",
" 0: Model.fit
来分布以 Keras 训练框架编写的训练。\n",
"\n",
"您需要对代码进行以下更改:\n",
"\n",
"1. 创建一个合适的 `tf.distribute.Strategy` 实例。\n",
"2. 将 Keras 模型、优化器和指标的创建移到 `strategy.scope` 中。因此,模型的 `call()`、`train_step()` 和 `test_step()` 方法中的代码都将在加速器上分布和执行。\n",
"\n",
"TensorFlow 分布策略支持所有类型的 Keras 模型 - [序贯](https://tensorflow.google.cn/guide/keras/sequential_model)、[函数式](https://tensorflow.google.cn/guide/keras/sequential_model)和[子类化](https://tensorflow.google.cn/guide/keras/sequential_model)\n",
"\n",
"下面是一段代码,执行该代码会创建一个非常简单的带有一个 `Dense` 层的 Keras 模型:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"id": "gbbcpzRnPZ6V"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/media/pc/data/lxw/envs/anaconda3x/envs/xxx/lib/python3.12/site-packages/keras/src/layers/core/dense.py:87: UserWarning: Do not pass an `input_shape`/`input_dim` argument to a layer. When using Sequential models, prefer using an `Input(shape)` object as the first layer in the model instead.\n",
" super().__init__(activity_regularizer=activity_regularizer, **kwargs)\n"
]
}
],
"source": [
"mirrored_strategy = tf.distribute.MirroredStrategy()\n",
"\n",
"with mirrored_strategy.scope():\n",
" model = tf.keras.Sequential([\n",
" tf.keras.layers.Dense(1, input_shape=(1,),\n",
" kernel_regularizer=tf.keras.regularizers.L2(1e-4))])\n",
" model.compile(loss='mse', optimizer='sgd')"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "773EOxCRVlTg"
},
"source": [
"此示例使用 `MirroredStrategy`,因此您可以在具有多个 GPU 的计算机上运行。`strategy.scope()` 指示 Keras 使用哪种策略来分布训练。通过在此作用域内创建模型/优化器/指标,您可以创建分布式变量而不是常规变量。设置完成后,您可以像往常一样拟合模型。`MirroredStrategy` 负责在可用 GPU 上复制模型的训练、聚合梯度等。"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"id": "ZMmxEFRTEjH5"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/2\n",
"\u001b[1m10/10\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m2s\u001b[0m 5ms/step - loss: 0.2835\n",
"Epoch 2/2\n",
"\u001b[1m10/10\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 3ms/step - loss: 0.0536 \n",
"\u001b[1m10/10\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m1s\u001b[0m 4ms/step - loss: 0.0143 \n"
]
},
{
"data": {
"text/plain": [
"0.014308060519397259"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(100).batch(10)\n",
"model.fit(dataset, epochs=2)\n",
"model.evaluate(dataset)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nofTLwyXWHK8"
},
"source": [
"我们在这里使用了 `tf.data.Dataset` 来提供训练和评估输入。您还可以使用 Numpy 数组:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"id": "Lqgd9SdxW5OW"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Epoch 1/2\n",
"\u001b[1m10/10\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 3ms/step - loss: 0.0103 \n",
"Epoch 2/2\n",
"\u001b[1m10/10\u001b[0m \u001b[32m━━━━━━━━━━━━━━━━━━━━\u001b[0m\u001b[37m\u001b[0m \u001b[1m0s\u001b[0m 2ms/step - loss: 0.0021 \n"
]
},
{
"data": {
"text/plain": [
"