{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 自动微分\n", "\n", "{guilabel}`参考`:[autogradqs](https://pytorch.org/tutorials/beginner/basics/autogradqs_tutorial.html)\n", "\n", "在训练神经网络时,最常用的算法是 **反向传播** (back propagation)。在该算法中,参数(模型权值)根据损失函数相对于给定参数的 **梯度** (gradient)进行调整。\n", "\n", "为了计算这些梯度,PyTorch 内置了名为 {mod}`torch.autograd` 的微分引擎。它对任何计算图,支持自动计算梯度。\n", "\n", "考虑最简单的单层神经网络,输入 `x`,参数 `w` 和 `b`,以及一些损失函数。它可以在 PyTorch 中以如下方式定义:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import torch\n", "\n", "x = torch.ones(5) # 输入 tensor\n", "y = torch.zeros(3) # 期望的 output\n", "w = torch.randn(5, 3, requires_grad=True)\n", "b = torch.randn(3, requires_grad=True)\n", "z = torch.matmul(x, w)+b\n", "loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 张量、函数和计算图\n", "\n", "上面的代码定义了如下计算图:\n", "\n", "```{figure} images/comp-graph.png\n", "```\n", "\n", "在这个网络中,`w` 和 `b` 是需要优化的参数。因此,需要能够计算相对于这些变量的损失函数的梯度。为了做到这一点,设置这些张量的 `requires_grad` 属性。\n", "\n", "```{note}\n", ":class: alert-info\n", "可以在创建张量时设置 `requires_grad` 的值,或者稍后使用 `x.requires_grad_(True)` 方法。\n", "```\n", "\n", "应用在张量上构造计算图的函数实际上是 {class}`~torch.autograd.Function` 类的对象。该对象知道如何在前向过程计算损失,也知道如何在反向传播步骤中计算其导数。对向后传播函数的引用存储在张量的 ``grad_fn`` 属性中。" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Gradient function for z = \n", "Gradient function for loss = \n" ] } ], "source": [ "print(f\"Gradient function for z = {z.grad_fn}\")\n", "print(f\"Gradient function for loss = {loss.grad_fn}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 计算梯度\n", "\n", "为了优化神经网络中参数的权值,需要计算损失函数对参数的导数,即在 ``x`` 和 ``y`` 的某些固定值下,需要计算 $\\frac{\\partial loss}{\\partial b}$ 和 $\\frac{\\partial loss}{\\partial w}$。为了计算这些导数,调用 {func}`loss.backward`,然后从 `w.grad` 和 `b.grad` 检索值:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "tensor([[0.0234, 0.2205, 0.2532],\n", " [0.0234, 0.2205, 0.2532],\n", " [0.0234, 0.2205, 0.2532],\n", " [0.0234, 0.2205, 0.2532],\n", " [0.0234, 0.2205, 0.2532]])\n", "tensor([0.0234, 0.2205, 0.2532])\n" ] } ], "source": [ "loss.backward()\n", "print(w.grad)\n", "print(b.grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{note}\n", ":class: alert-info\n", "\n", "- 只能获得计算图的叶节点的 `grad` 属性,它们的 `requires_grad` 属性设置为 `True`。对于图中的所有其他节点,梯度将不可用。\n", "- 由于性能原因,只能在给定的图上使用一次 `backward` 梯度计算。如果需要对同一个图进行多次 `backward` 调用,则需要将 `retain_graph=True` 传递给 `backward` 调用。\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 禁用梯度追踪\n", "\n", "默认情况下,所有 `requires_grad=True` 的张量都会跟踪它们的计算历史,并支持梯度计算。但是,在某些情况下,不需要这样做,例如,训练模型后,只是想把它应用到一些输入数据上,即只想通过网络进行正向计算。可以通过使用 {func}`torch.no_grad` 块包围计算代码来停止跟踪计算:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "True\n", "False\n" ] } ], "source": [ "z = torch.matmul(x, w)+b\n", "print(z.requires_grad)\n", "\n", "with torch.no_grad():\n", " z = torch.matmul(x, w)+b\n", "print(z.requires_grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "另一种实现相同结果的方法是对张量使用 {meth}`detach` 方法:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False\n" ] } ], "source": [ "z = torch.matmul(x, w)+b\n", "z_det = z.detach()\n", "print(z_det.requires_grad)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "以下是禁用梯度跟踪的原因:\n", "\n", "- 将神经网络中的一些参数标记为 **冻结参数** (frozen parameters)。这是对[预训练的网络进行微调](https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html)的非常常见的场景。\n", "- 在只进行正向传播的情况下 **加快计算速度**,因为在不跟踪梯度的张量上的计算将更加有效。\n", "\n", "## 更多关于计算图的内容\n", "\n", "从概念上讲,{mod}`torch.autograd` 在由 {class}`torch.autograd.Function` 对象组成的有向无环图(DAG)中保存数据(张量)和所有执行的运算(以及产生的新张量)的记录。在这个 DAG 中,叶是输入张量,根是输出张量。通过从根到叶跟踪这个图,可以使用链式法则自动计算梯度。\n", "\n", "在forward 传播时,`autograd` 会同时做两件事:\n", "\n", "- 运行请求的运算来计算结果张量。\n", "- 在 DAG 中维护运算的 *梯度函数*。\n", "\n", "当在 DAG 根上调用 `.backward()` 时,后向传播开始。然后,`autograd`:从每个 ``.grad_fn`` 计算梯度,使用链式规则将它们累加到各自张量的 `.grad` 属性中,并一路传播到叶张量。\n", "\n", "```{note}\n", ":class: alert-info\n", "\n", "在 PyTorch 中,DAG 是动态的。在每次 `.backward()` 调用之后,`autograd` 开始填充新的图。这正是允许你在模型中使用控制流语句的原因;如果需要,您可以在每次迭代中更改形状、大小和运算。\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 选读:张量梯度和雅可比积\n", "\n", "在很多情况下,损失函数是标量的,需要计算关于一些参数的梯度。然而,也有输出函数是任意张量的情况。在这种情况下,PyTorch 允许你计算所谓的雅可比乘积(Jacobian product),而不是实际的梯度。\n", "\n", "对于向量函数 $\\vec{y}=f(\\vec{x})$,其中 $\\vec{x}=\\langle x_1,\\dots,x_n\\rangle$ 和 $\\vec{y}=\\langle y_1,\\dots,y_m\\rangle$,$\\vec{y}$ 对 $\\vec{x}$ 的梯度雅可比矩阵:\n", "\n", "$$\n", "\\begin{align}J=\\left(\\begin{array}{ccc}\n", " \\frac{\\partial y_{1}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{1}}{\\partial x_{n}}\\\\\n", " \\vdots & \\ddots & \\vdots\\\\\n", " \\frac{\\partial y_{m}}{\\partial x_{1}} & \\cdots & \\frac{\\partial y_{m}}{\\partial x_{n}}\n", " \\end{array}\\right)\n", "\\end{align}\n", "$$\n", "\n", "与计算雅可比矩阵本身不同,PyTorch 允许你为给定的输入向量 $v=(v_1 \\dots v_m)$ 计算雅可比积 $v^T\\cdot J$。这是通过 ``backward`` 调用参数 $v$ 实现的。$v$ 的大小应该和原始张量的大小一样,要根据它来计算乘积:\n" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "First call\n", "tensor([[4., 2., 2., 2., 2.],\n", " [2., 4., 2., 2., 2.],\n", " [2., 2., 4., 2., 2.],\n", " [2., 2., 2., 4., 2.],\n", " [2., 2., 2., 2., 4.]])\n", "\n", "Second call\n", "tensor([[8., 4., 4., 4., 4.],\n", " [4., 8., 4., 4., 4.],\n", " [4., 4., 8., 4., 4.],\n", " [4., 4., 4., 8., 4.],\n", " [4., 4., 4., 4., 8.]])\n", "\n", "Call after zeroing gradients\n", "tensor([[4., 2., 2., 2., 2.],\n", " [2., 4., 2., 2., 2.],\n", " [2., 2., 4., 2., 2.],\n", " [2., 2., 2., 4., 2.],\n", " [2., 2., 2., 2., 4.]])\n" ] } ], "source": [ "inp = torch.eye(5, requires_grad=True)\n", "out = (inp+1).pow(2)\n", "out.backward(torch.ones_like(inp), retain_graph=True)\n", "print(f\"First call\\n{inp.grad}\")\n", "out.backward(torch.ones_like(inp), retain_graph=True)\n", "print(f\"\\nSecond call\\n{inp.grad}\")\n", "inp.grad.zero_()\n", "out.backward(torch.ones_like(inp), retain_graph=True)\n", "print(f\"\\nCall after zeroing gradients\\n{inp.grad}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "```{caution}\n", "当使用相同的参数第二次调用 `backward` 时,梯度的值是不同的。这是因为在做反向传播时,PyTorch 会对梯度进行累加,即计算出的梯度的值被添加到计算图的所有叶子节点的 `grad` 属性中。如果你想计算正确的梯度,你需要在此之前将 `grad` 属性归零。在现实训练中,优化器可以帮助做到这一点。\n", "```\n", "\n", "\n", "```{note}\n", ":class: alert-info\n", "以前调用的是不带参数的 {func}`backward` 函数。这本质上相当于调用 `backward(torch.tensor(1.0))`,对于标量值函数(如神经网络训练期间的 loss),这是一种计算梯度的有用方法。\n", "```" ] } ], "metadata": { "interpreter": { "hash": "7a45eadec1f9f49b0fdfd1bc7d360ac982412448ce738fa321afc640e3212175" }, "kernelspec": { "display_name": "Python 3.10.4 ('torchx')", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.4" } }, "nbformat": 4, "nbformat_minor": 0 }