Theory

Given a matrix $\mathbf{A} = [\mathbf{a}_1, \mathbf{a}_2, \cdots, \mathbf{a}_m]^T \in \mathbb{R}^{m \times n}$ and a vector $\mathbf{x} = [x_1, x_2, \cdots, x_n]^T \in \mathbb{R}^{n}$, we have:
$$
\mathbf{Ax} = \begin{bmatrix}
\mathbf{a}_1^T \mathbf{x} \\
\mathbf{a}_2^T \mathbf{x} \\
\vdots \\
\mathbf{a}_m^T \mathbf{x}
\end{bmatrix}
$$
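This row-wise view of the matrix-vector product is easy to check numerically. A minimal sketch with plain NumPy (shapes chosen arbitrarily for illustration):

```python
import numpy as np

# Random A (m x n) and x (n,) just for the check.
rng = np.random.default_rng(0)
m, n = 3, 4
A = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Each entry of Ax is the inner product of a row a_i^T with x.
row_wise = np.array([A[i] @ x for i in range(m)])
assert np.allclose(A @ x, row_wise)
```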
We also have:
$$
\nabla_{\mathbf{x}} \mathbf{a}_i^T \mathbf{x} = \mathbf{a}_i
$$
$$
\mathbf{1}^T \mathbf{A} \mathbf{x} = \langle \mathbf{1}, \mathbf{Ax} \rangle = \langle \mathbf{A}^T \mathbf{1}, \mathbf{x} \rangle
$$
Therefore,
$$
\nabla_{\mathbf{x}} \mathbf{1}^T \mathbf{A} \mathbf{x} =
\nabla_{\mathbf{x}} \sum_{i=1}^m \mathbf{a}_i^T \mathbf{x} = \sum_{i=1}^m \mathbf{a}_i = \mathbf{A}^T \mathbf{1}_{m \times 1} = \mathbf{A}^T \mathbf{1}
$$
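A quick numerical check of this gradient with plain PyTorch autograd (assumed here as a sketch; the tutorial's own code below uses the `xint` wrappers):

```python
import torch

m, n = 3, 4
A = torch.randn(m, n)
x = torch.randn(n, requires_grad=True)

# 1^T A x is a scalar, so backward() gives its gradient with respect to x.
y = torch.ones(m) @ (A @ x)
y.backward()
assert torch.allclose(x.grad, A.T @ torch.ones(m))
```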
Thus we have:
$$
\nabla_{\mathbf{x}} \mathbf{Ax} = \nabla_{\mathbf{x}} \langle \mathbf{A}^T, \mathbf{x} \mathbf{1}^T \rangle = \mathbf{A}^T
$$

$$
\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} = \mathbf{A}
$$

$$
\nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{A} \mathbf{x} = (\mathbf{A} + \mathbf{A}^\top)\mathbf{x}
$$

$$
\nabla_{\mathbf{x}} \|\mathbf{x}\|^2 = \nabla_{\mathbf{x}} \mathbf{x}^\top \mathbf{x} = 2\mathbf{x}
$$

$$
\nabla_{\mathbf{X}} \|\mathbf{X}\|_F^2 = 2\mathbf{X}
$$
More generally, the product rule for inner products holds:
$$
\nabla_{\mathbf{x}} \langle f(\mathbf{x}), g(\mathbf{x}) \rangle = \langle \nabla_{\mathbf{x}} f(\mathbf{x}), g(\mathbf{x}) \rangle + \langle f(\mathbf{x}), \nabla_{\mathbf{x}} g(\mathbf{x}) \rangle
$$
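The identities above can likewise be verified with autograd. A small sketch using plain PyTorch (an assumption of this example, not the tutorial's `xint` wrappers):

```python
import torch

n = 5
A = torch.randn(n, n)
x = torch.randn(n, requires_grad=True)
X = torch.randn(3, 4, requires_grad=True)

# Gradient of the quadratic form x^T A x.
(x @ A @ x).backward()
assert torch.allclose(x.grad, (A + A.T) @ x)

# Gradient of the squared Euclidean norm ||x||^2 = x^T x.
x.grad = None
(x @ x).backward()
assert torch.allclose(x.grad, 2 * x)

# Gradient of the squared Frobenius norm ||X||_F^2.
(X * X).sum().backward()
assert torch.allclose(X.grad, 2 * X)
```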
Let us look at an example:
An Example
MXNet

```python
from xint import utils
from xint import mxnet as xint

np = xint.np
```

TensorFlow

```python
from xint import utils
from xint import tensorflow as xint

np = xint.np
```

PyTorch

```python
from xint import utils
from xint import torch as xint

np = xint.np
```
Create the tensor $\mathbf{x}$:
```python
x = np.arange(4.0).reshape(4, 1)
```
We want to compute the gradient of the function $y = 2\mathbf{x}^{\top}\mathbf{x}$. First, tell the framework to track gradients with respect to $\mathbf{x}$:
MXNet

```python
# Allocate storage for the gradient of x (initialized to zeros).
x.attach_grad()
x.grad
```

TensorFlow

```python
import tensorflow as tf

# Wrap x in a tf.Variable so that tf.GradientTape tracks gradients for it.
x = tf.Variable(x)
```

PyTorch

```python
# Record operations on x; gradients will accumulate in x.grad.
x.requires_grad_(True)
x.grad   # no gradient has been computed yet, so this is None
```
Now let us compute $y$:
MXNet

```python
from mxnet import autograd

with autograd.record():
    y = 2 * x.T @ x
float(y)
```

TensorFlow

```python
with tf.GradientTape() as t:
    y = 2 * tf.transpose(x) @ x
float(y)
```

PyTorch

```python
y = 2 * x.T @ x
float(y)
```

```
28.0
```
Next, we can automatically compute the gradient of $y$ with respect to each component of $\mathbf{x}$ by calling the backpropagation function, and then print these gradients:
MXNet

```python
y.backward()
x.grad
```

```
array([[ 0.],
       [ 4.],
       [ 8.],
       [12.]])
```
TensorFlow

```python
with tf.GradientTape() as t:
    y = 2 * tf.transpose(x) @ x
t.gradient(y, x)
```

```
<tf.Tensor: shape=(4, 1), dtype=float64, numpy=
array([[ 0.],
       [ 4.],
       [ 8.],
       [12.]])>
```
PyTorch

```python
y.backward()
x.grad
```

```
tensor([[ 0.],
        [ 4.],
        [ 8.],
        [12.]])
```
Now let us compute another function of $\mathbf{x}$ and take its gradient:
MXNet

```python
with autograd.record():
    y = x.sum()
y.backward()
x.grad
```

```
array([[1.],
       [1.],
       [1.],
       [1.]])
```
TensorFlow

```python
with tf.GradientTape() as t:
    y = tf.reduce_sum(x)
t.gradient(y, x)
```

```
<tf.Tensor: shape=(4, 1), dtype=float64, numpy=
array([[1.],
       [1.],
       [1.],
       [1.]])>
```
PyTorch

```python
x.grad.zero_()   # clear the gradient accumulated by the previous backward pass
y = x.sum()
y.backward()
x.grad
```

```
tensor([[1.],
        [1.],
        [1.],
        [1.]])
```
Note: for backpropagation from a non-scalar variable, MXNet and TensorFlow can simply call the corresponding function as above to obtain a gradient (effectively summing over the output's elements), but PyTorch does not support backpropagating directly from a non-scalar, so we first sum it and then compute the gradient. For example:
MXNet

```python
with autograd.record():
    y = x * x
y.backward()
x.grad
```

TensorFlow

```python
with tf.GradientTape() as t:
    y = x * x
t.gradient(y, x)
```
PyTorch

```python
x.grad.zero_()
y = x * x
y.sum().backward()   # reduce to a scalar first, then backpropagate
x.grad
```
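As an aside to the note above, PyTorch also accepts an explicit weight vector for the vector-Jacobian product; with all-ones weights this is equivalent to `y.sum().backward()`. A standalone sketch in plain PyTorch (assumed here instead of the `xint` wrappers):

```python
import torch

x = torch.arange(4.0).reshape(4, 1).requires_grad_(True)
y = x * x
# Passing ones as the weight vector computes the same gradient as y.sum().backward().
y.backward(torch.ones_like(y))
print(x.grad)   # 2 * x
```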
Detaching Computation

Sometimes we wish to move some calculations outside of the recorded computational graph. For example, suppose $\mathbf{y}$ is computed as a function of $\mathbf{x}$, and that $\mathbf{z}$ is then computed as a function of both $\mathbf{y}$ and $\mathbf{x}$. Now, imagine that we want to compute the gradient of $\mathbf{z}$ with respect to $\mathbf{x}$, but for some reason we want to treat $\mathbf{y}$ as a constant and only account for the role that $\mathbf{x}$ plays after $\mathbf{y}$ has been computed.
Here we can detach $\mathbf{y}$ to return a new variable $u$ that has the same value as $\mathbf{y}$ but discards any information about how $\mathbf{y}$ was computed in the computational graph. In other words, gradients do not flow backwards through $u$ to $\mathbf{x}$. Thus, the backpropagation below computes the partial derivative of $\mathbf{z} = u * \mathbf{x}$ with respect to $\mathbf{x}$, treating $u$ as a constant, rather than the partial derivative of $\mathbf{z} = \mathbf{x} * \mathbf{x} * \mathbf{x}$ with respect to $\mathbf{x}$.
MXNet

```python
with autograd.record():
    y = x * x
    u = y.detach()
    z = u * x
z.backward()
x.grad == u
```

TensorFlow

```python
with tf.GradientTape(persistent=True) as t:
    y = x * x
    u = tf.stop_gradient(y)
    z = u * x
x_grad = t.gradient(z, x)
x_grad == u
```

PyTorch

```python
x.grad.zero_()
y = x * x
u = y.detach()
z = u * x

z.sum().backward()
x.grad == u
```
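Note that detaching $u$ does not destroy the recorded graph for $\mathbf{y}$ itself, so we can still backpropagate through $\mathbf{y}$ afterwards and recover $2\mathbf{x}$. A standalone sketch in plain PyTorch (assumed, not the `xint` wrappers):

```python
import torch

x = torch.arange(4.0, requires_grad=True)
y = x * x
u = y.detach()          # same values as y, but cut off from the graph
z = u * x

z.sum().backward()
assert torch.allclose(x.grad, u)        # dz/dx = u, since u is treated as a constant

x.grad.zero_()
y.sum().backward()                      # the graph of y is still intact
assert torch.allclose(x.grad, 2 * x)
```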
General Differentiation

Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ and $\mathbf{x} = [x_1, x_2, \ldots, x_n]^\top$. Then
$$
\nabla_{\mathbf{x}} f(\mathbf{x}) = \bigg[\frac{\partial f(\mathbf{x})}{\partial x_1}, \frac{\partial f(\mathbf{x})}{\partial x_2}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_n}\bigg]^\top
$$
If $\mathbf{y} = [y_1, y_2, \ldots, y_m]^\top$ is a vector-valued function of a scalar $x \in \mathbb{R}$, then:
$$
\frac{\partial \mathbf{y}}{\partial x} = \bigg[\frac{\partial y_1}{\partial x}, \frac{\partial y_2}{\partial x}, \ldots, \frac{\partial y_m}{\partial x}\bigg]^\top
$$
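A small sketch of this vector-by-scalar derivative, using plain PyTorch's `torch.autograd.functional.jacobian` (an assumption of this example) on $\mathbf{y}(x) = [x, x^2, x^3]^\top$, whose derivative is $[1, 2x, 3x^2]^\top$:

```python
import torch
from torch.autograd.functional import jacobian

def y(x):
    # y: R -> R^3
    return torch.stack([x, x ** 2, x ** 3])

dy_dx = jacobian(y, torch.tensor(2.0))
print(dy_dx)   # tensor([ 1.,  4., 12.]) == [1, 2x, 3x^2] at x = 2
```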
Furthermore, when $\mathbf{x} \in \mathbb{R}^n$ is itself a vector,
$$
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \bigg[\frac{\partial y_1}{\partial \mathbf{x}}, \frac{\partial y_2}{\partial \mathbf{x}}, \ldots, \frac{\partial y_m}{\partial \mathbf{x}}\bigg]^\top = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\
\frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n}
\end{bmatrix}
$$
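The same utility gives the full Jacobian in exactly this layout (rows indexed by $y_i$, columns by $x_j$). A minimal sketch in plain PyTorch (again an assumption, not the tutorial's `xint` code), for $\mathbf{y} = \mathbf{A}\mathbf{x}$ with a fixed $\mathbf{A}$:

```python
import torch
from torch.autograd.functional import jacobian

m, n = 3, 4
A = torch.randn(m, n)
x = torch.randn(n)

J = jacobian(lambda v: A @ v, x)   # shape (m, n): J[i, j] = d y_i / d x_j
assert torch.allclose(J, A)        # in this layout, d(Ax)/dx = A
```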