{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "id": "Tce3stUlHN0L" }, "outputs": [], "source": [ "##### Copyright 2018 The TensorFlow Authors.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "cellView": "form", "id": "tuOe1ymfHZPu" }, "outputs": [], "source": [ "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", "# https://www.apache.org/licenses/LICENSE-2.0\n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", "# See the License for the specific language governing permissions and\n", "# limitations under the License." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "/media/pc/data/lxw/ai/d2py/doc/libs/tf-chaos/guide\n" ] } ], "source": [ "%cd ..\n", "from set_env import temp_dir" ] }, { "cell_type": "markdown", "metadata": { "id": "MfBg1C5NB3X0" }, "source": [ "# Use a GPU" ] }, { "cell_type": "markdown", "metadata": { "id": "SoYIwe40vEPI" }, "source": [ "TensorFlow code, and `tf.keras` models, run transparently on a single GPU with no code changes required.\n", "\n", "Note: Use `tf.config.list_physical_devices('GPU')` to confirm that TensorFlow is using the GPU.\n", "\n", "The simplest way to run on multiple GPUs, on one or many machines, is to use [Distribution Strategies](https://render.githubusercontent.com/view/distributed_training.ipynb).\n", "\n", "This guide is for users who have tried these approaches and found that they need fine-grained control over how TensorFlow uses the GPU. To learn how to debug performance issues for single- and multi-GPU scenarios, see the [Optimize TensorFlow GPU Performance](gpu_performance_analysis.md) guide." ] }, { "cell_type": "markdown", "metadata": { "id": "MUXex9ctTuDB" }, "source": [ "## Setup\n", "\n", "Ensure you have the latest TensorFlow GPU release installed." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "id": "IqR2PQG4ZaZ0" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Num GPUs Available: 2\n" ] } ], "source": [ "import tensorflow as tf\n", "print(\"Num GPUs Available: \", len(tf.config.list_physical_devices('GPU')))" ] }, { "cell_type": "markdown", "metadata": { "id": "ZELutYNetv-v" }, "source": [ "## Overview\n" ] }, { "cell_type": "markdown", "metadata": { "id": "xHxb-dlhMIzW" }, "source": [ "TensorFlow supports running computations on a variety of device types, including CPU and GPU. Devices are identified by string names, for example:\n", "\n", "- `\"/device:CPU:0\"`: The CPU of your machine.\n", "- `\"/GPU:0\"`: Short-hand notation for the first GPU of your machine that is visible to TensorFlow.\n", "- `\"/job:localhost/replica:0/task:0/device:GPU:1\"`: Fully qualified name of the second GPU of your machine that is visible to TensorFlow.\n", "\n", "If a TensorFlow operation has both CPU and GPU implementations, the GPU device is prioritized by default when the operation is assigned. For example, `tf.matmul` has both CPU and GPU kernels, and on a system with devices `CPU:0` and `GPU:0`, the `GPU:0` device is selected to run `tf.matmul` unless you explicitly request to run it on another device.\n", "\n", "If a TensorFlow operation has no corresponding GPU implementation, the operation falls back to the CPU device. For example, since `tf.cast` only has a CPU kernel, on a system with devices `CPU:0` and `GPU:0`, the `CPU:0` device is selected to run `tf.cast`, even if requested to run on the `GPU:0` device." ] }, { "cell_type": "markdown", "metadata": { "id": "UhNtHfuxCGVy" }, "source": [ "## Logging device placement\n", "\n", "To find out which devices your operations and tensors are assigned to, put `tf.debugging.set_log_device_placement(True)` as the first statement of your program. Enabling device placement logging causes any tensor allocations or operations to be printed." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "id": "2Dbw0tpEirCd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "_EagerConst: (_EagerConst): /job:localhost/replica:0/task:0/device:GPU:0\n", "output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "a: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "b: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:0\n", "product_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0\n", "tf.Tensor(\n", "[[22. 28.]\n", " [49. 64.]], shape=(2, 2), dtype=float32)\n" ] } ], "source": [ "tf.debugging.set_log_device_placement(True)\n", "\n", "# Create some tensors\n", "a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n", "b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])\n", "c = tf.matmul(a, b)\n", "\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": { "id": "kKhmFeraTdEI" }, "source": [ "The above code prints an indication that the `MatMul` op was executed on `GPU:0`." ] }, { "cell_type": "markdown", "metadata": { "id": "U88FspwGjB7W" }, "source": [ "## Manual device placement\n", "\n", "If you would like a particular operation to run on a device of your choice instead of the one automatically selected for you, you can use `with tf.device` to create a device context. All operations within that context run on the same designated device." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "id": "8wqaQfEhjHit" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0\n", "tf.Tensor(\n", "[[22. 28.]\n", " [49. 
64.]], shape=(2, 2), dtype=float32)\n" ] } ], "source": [ "tf.debugging.set_log_device_placement(True)\n", "\n", "# Place tensors on the CPU\n", "with tf.device('/CPU:0'):\n", "  a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n", "  b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])\n", "\n", "c = tf.matmul(a, b)\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": { "id": "8ixO89gRjJUu" }, "source": [ "You will see that now `a` and `b` are assigned to `CPU:0`. Since a device was not explicitly specified for the `MatMul` operation, the TensorFlow runtime chooses one based on the operation and the available devices (`GPU:0` in this example) and automatically copies tensors between devices if required." ] }, { "cell_type": "markdown", "metadata": { "id": "ARrRhwqijPzN" }, "source": [ "## Limiting GPU memory growth\n", "\n", "By default, TensorFlow maps nearly all of the GPU memory of all GPUs (subject to [`CUDA_VISIBLE_DEVICES`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars)) visible to the process. This is done to use the relatively precious GPU memory resources on the devices more efficiently by reducing memory fragmentation. To limit TensorFlow to a specific set of GPUs, use the `tf.config.set_visible_devices` method." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "id": "hPI--n_jhZhv" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Visible devices cannot be modified after being initialized\n" ] } ], "source": [ "gpus = tf.config.list_physical_devices('GPU')\n", "if gpus:\n", "  # Restrict TensorFlow to only use the first GPU\n", "  try:\n", "    tf.config.set_visible_devices(gpus[0], 'GPU')\n", "    logical_gpus = tf.config.list_logical_devices('GPU')\n", "    print(len(gpus), \"Physical GPUs,\", len(logical_gpus), \"Logical GPU\")\n", "  except RuntimeError as e:\n", "    # Visible devices must be set before GPUs have been initialized\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": { "id": "N3x4M55DhYk9" }, "source": [ "In some cases it is desirable for the process to only allocate a subset of the available memory, or to only grow the memory usage as the process needs it. TensorFlow provides two methods to control this.\n", "\n", "The first option is to turn on memory growth by calling `tf.config.experimental.set_memory_growth`, which attempts to allocate only as much GPU memory as the runtime allocations need: it starts out allocating very little memory, and as the program runs and more GPU memory is needed, the GPU memory region is extended for the TensorFlow process. Memory is not released, since that can lead to memory fragmentation. To turn on memory growth for a specific GPU, use the following code prior to allocating any tensors or executing any ops." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "id": "jr3Kf1boFnCO" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Physical devices cannot be modified after being initialized\n" ] } ], "source": [ "gpus = tf.config.list_physical_devices('GPU')\n", "if gpus:\n", "  try:\n", "    # Currently, memory growth needs to be the same across GPUs\n", "    for gpu in gpus:\n", "      tf.config.experimental.set_memory_growth(gpu, True)\n", "    logical_gpus = tf.config.list_logical_devices('GPU')\n", "    print(len(gpus), \"Physical GPUs,\", len(logical_gpus), \"Logical GPUs\")\n", "  except RuntimeError as e:\n", "    # Memory growth must be set before GPUs have been initialized\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": { "id": "I1o8t51QFnmv" }, "source": [ "Another way to enable this option is to set the environment variable `TF_FORCE_GPU_ALLOW_GROWTH` to `true`. This configuration is platform specific.\n", "\n", "The second method is to configure a virtual GPU device with `tf.config.set_logical_device_configuration` and set a hard limit on the total memory to allocate on the GPU." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "id": "2qO2cS9QFn42" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Virtual devices cannot be modified after being initialized\n" ] } ], "source": [ "gpus = tf.config.list_physical_devices('GPU')\n", "if gpus:\n", "  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU\n", "  try:\n", "    tf.config.set_logical_device_configuration(\n", "        gpus[0],\n", "        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)])\n", "    logical_gpus = tf.config.list_logical_devices('GPU')\n", "    print(len(gpus), \"Physical GPUs,\", len(logical_gpus), \"Logical GPUs\")\n", "  except RuntimeError as e:\n", "    # Virtual devices must be set before GPUs have been initialized\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": { "id": "Bsg1iLuHFoLW" }, "source": [ "This is useful if you want to truly bound the amount of GPU memory available to the TensorFlow process. This is common practice for local development when the GPU is shared with other applications such as a workstation GUI." ] }, { "cell_type": "markdown", "metadata": { "id": "B27_-1gyjf-t" }, "source": [ "## Using a single GPU on a multi-GPU system\n", "\n", "If you have more than one GPU in your system, the GPU with the lowest ID is selected by default. If you would like to run on a different GPU, you need to specify the preference explicitly:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "id": "wep4iteljjG1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0\n" ] } ], "source": [ "tf.debugging.set_log_device_placement(True)\n", "\n", "try:\n", "  # Specify an invalid GPU device\n", "  with tf.device('/device:GPU:2'):\n", "    a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n", "    b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])\n", "    c = tf.matmul(a, b)\n", "except RuntimeError as e:\n", "  print(e)" ] }, { "cell_type": "markdown", "metadata": { "id": "jy-4cCO_jn4G" }, "source": [ "If the device you have specified does not exist, you will get a `RuntimeError`: `.../device:GPU:2 unknown device`.\n", "\n", "If you would like TensorFlow to automatically choose an existing and supported device to run the operations in case the specified one does not exist, you can call `tf.config.set_soft_device_placement(True)`." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "id": "sut_UHlkjvWd" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0\n", "tf.Tensor(\n", "[[22. 28.]\n", " [49. 
64.]], shape=(2, 2), dtype=float32)\n" ] } ], "source": [ "tf.config.set_soft_device_placement(True)\n", "tf.debugging.set_log_device_placement(True)\n", "\n", "# Creates some tensors\n", "a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n", "b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])\n", "c = tf.matmul(a, b)\n", "\n", "print(c)" ] }, { "cell_type": "markdown", "metadata": { "id": "sYTYPrQZj2d9" }, "source": [ "## Using multiple GPUs\n", "\n", "Developing for multiple GPUs allows a model to scale with the additional resources. If you are developing on a system with a single GPU, you can simulate multiple GPUs with virtual devices. This enables easy testing of multi-GPU setups without requiring additional resources." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "id": "8EMGuGKbNkc6" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Virtual devices cannot be modified after being initialized\n" ] } ], "source": [ "gpus = tf.config.list_physical_devices('GPU')\n", "if gpus:\n", "  # Create 2 virtual GPUs with 1GB memory each\n", "  try:\n", "    tf.config.set_logical_device_configuration(\n", "        gpus[0],\n", "        [tf.config.LogicalDeviceConfiguration(memory_limit=1024),\n", "         tf.config.LogicalDeviceConfiguration(memory_limit=1024)])\n", "    logical_gpus = tf.config.list_logical_devices('GPU')\n", "    print(len(gpus), \"Physical GPU,\", len(logical_gpus), \"Logical GPUs\")\n", "  except RuntimeError as e:\n", "    # Virtual devices must be set before GPUs have been initialized\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": { "id": "xmNzO0FxNkol" }, "source": [ "Once there are multiple logical GPUs available to the runtime, you can utilize them with `tf.distribute.Strategy` or with manual placement." ] }, { "cell_type": "markdown", "metadata": { "id": "IDZmEGq4j6kG" }, "source": [ "#### With `tf.distribute.Strategy`\n", "\n", "The best practice for using multiple GPUs is to use `tf.distribute.Strategy`. Here is a simple example:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "id": "1KgzY8V2AvRv" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource_RetVal: (_Retval): 
/job:localhost/replica:0/task:0/device:GPU:0\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "input: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "_EagerConst: (_EagerConst): /job:localhost/replica:0/task:0/device:GPU:1\n", "output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "input: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "_EagerConst: (_EagerConst): /job:localhost/replica:0/task:0/device:GPU:0\n", "output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "x: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "y: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "FloorMod: (FloorMod): /job:localhost/replica:0/task:0/device:GPU:0\n", "z_RetVal: (_Retval): 
/job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op FloorMod in device /job:localhost/replica:0/task:0/device:GPU:0\n", "x: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "Cast: (Cast): /job:localhost/replica:0/task:0/device:GPU:0\n", "y_RetVal: (_DeviceRetval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Cast in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "input: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0\n", "_EagerConst: (_EagerConst): /job:localhost/replica:0/task:0/device:GPU:0\n", "output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "seed: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0\n", "StatelessRandomGetKeyCounter: (StatelessRandomGetKeyCounter): /job:localhost/replica:0/task:0/device:GPU:0\n", "key_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "counter_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op StatelessRandomGetKeyCounter in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "shape: (_DeviceArg): /job:localhost/replica:0/task:0/device:CPU:0\n", "key: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "counter: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "alg: (_DeviceArg): /job:localhost/replica:0/task:0/device:CPU:0\n", "StatelessRandomUniformV2: (StatelessRandomUniformV2): /job:localhost/replica:0/task:0/device:GPU:0\n", "output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op StatelessRandomUniformV2 in device /job:localhost/replica:0/task:0/device:GPU:0\n", "x: (_Arg): 
/job:localhost/replica:0/task:0/device:GPU:0\n", "y: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "Sub: (Sub): /job:localhost/replica:0/task:0/device:GPU:0\n", "z_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Sub in device /job:localhost/replica:0/task:0/device:GPU:0\n", "x: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "y: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "Mul: (Mul): /job:localhost/replica:0/task:0/device:GPU:0\n", "z_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Mul in device /job:localhost/replica:0/task:0/device:GPU:0\n", "x: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "y: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "AddV2: (AddV2): /job:localhost/replica:0/task:0/device:GPU:0\n", "z_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AddV2 in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "ReadVariableOp: (ReadVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "value_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "input: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "Identity: (Identity): /job:localhost/replica:0/task:0/device:GPU:1\n", 
"output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "NoOp: (NoOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "dims: (_DeviceArg): /job:localhost/replica:0/task:0/device:CPU:0\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "Fill: (Fill): /job:localhost/replica:0/task:0/device:GPU:0\n", "output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AssignVariableOp in device 
/job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "ReadVariableOp: (ReadVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "value_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "ReadVariableOp: (ReadVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "value_RetVal: (_Retval): 
/job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "input: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "Identity: (Identity): /job:localhost/replica:0/task:0/device:GPU:1\n", "output_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "value: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "AssignVariableOp: (AssignVariableOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource: (_Arg): /job:localhost/replica:0/task:0/device:GPU:0\n", "ReadVariableOp: (ReadVariableOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "value_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource_RetVal: (_Retval): 
/job:localhost/replica:0/task:0/device:GPU:1\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Fill in device /job:localhost/replica:0/task:0/device:GPU:0\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:0\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op VarHandleOp in device 
/job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:1\n", "resource_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "VarHandleOp: (VarHandleOp): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op VarHandleOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op AssignVariableOp in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op NoOp in device /job:localhost/replica:0/task:0/device:GPU:0\n" ] } ], "source": [ "tf.debugging.set_log_device_placement(True)\n", "gpus = tf.config.list_logical_devices('GPU')\n", "strategy = tf.distribute.MirroredStrategy(gpus)\n", "with strategy.scope():\n", "  inputs = tf.keras.layers.Input(shape=(1,))\n", "  predictions = tf.keras.layers.Dense(1)(inputs)\n", "  model = tf.keras.models.Model(inputs=inputs, outputs=predictions)\n", "  model.compile(loss='mse',\n", "                optimizer=tf.keras.optimizers.SGD(learning_rate=0.2))" ] }, { "cell_type": "markdown", "metadata": { "id": "Dy7nxlKsAxkK" }, "source": [ "This program runs a copy of your model on each GPU and splits the input data between them, an approach known as [data parallelism](https://en.wikipedia.org/wiki/Data_parallelism).\n", "\n", "For more information about distribution strategies, check out the guide [here](https://render.githubusercontent.com/view/distributed_training.ipynb)." ] }, { "cell_type": "markdown", "metadata": { "id": "8phxM5TVkAY_" }, "source": [ "#### Manual placement\n", "\n", "`tf.distribute.Strategy` works under the hood by replicating computation across devices. You can manually implement replication by constructing your model on each GPU. For example:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": { "id": "AqPo9ltUA_EY" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in 
device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op _EagerConst in device /job:localhost/replica:0/task:0/device:GPU:1\n", "a: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "b: (_Arg): /job:localhost/replica:0/task:0/device:GPU:1\n", "MatMul: (MatMul): /job:localhost/replica:0/task:0/device:GPU:1\n", "product_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:GPU:1\n", "Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:1\n", "inputs_0: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0\n", "inputs_1: (_Arg): /job:localhost/replica:0/task:0/device:CPU:0\n", "AddN: (AddN): /job:localhost/replica:0/task:0/device:CPU:0\n", "sum_RetVal: (_Retval): /job:localhost/replica:0/task:0/device:CPU:0\n", "Executing op AddN in device /job:localhost/replica:0/task:0/device:CPU:0\n", "tf.Tensor(\n", "[[ 44. 56.]\n", " [ 98. 
128.]], shape=(2, 2), dtype=float32)\n" ] } ], "source": [ "tf.debugging.set_log_device_placement(True)\n", "\n", "gpus = tf.config.list_logical_devices('GPU')\n", "if gpus:\n", "  # Replicate your computation on multiple GPUs\n", "  c = []\n", "  for gpu in gpus:\n", "    with tf.device(gpu.name):\n", "      a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])\n", "      b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])\n", "      c.append(tf.matmul(a, b))\n", "\n", "  with tf.device('/CPU:0'):\n", "    matmul_sum = tf.add_n(c)\n", "\n", "  print(matmul_sum)" ] } ], "metadata": { "accelerator": "GPU", "colab": { "collapsed_sections": [], "name": "gpu.ipynb", "toc_visible": true }, "kernelspec": { "display_name": "xxx", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.2" } }, "nbformat": 4, "nbformat_minor": 0 }