Quantized Tensors

Quantized Tensors#

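PyTorch supports per-tensor and per-channel asymmetric linear quantization. Per-tensor means that all values in the tensor are scaled in the same way. Per-channel means that each slice along one dimension, typically the channel dimension of the tensor, is scaled and offset by its own values (in effect, the scale and the offset become vectors). This reduces the error introduced when converting the tensor to quantized values.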

The mapping is performed by transforming the float tensor with:

\[ Q(x, \operatorname{scale}, \operatorname{zero\_point}) = \operatorname{round}(\frac{x}{\operatorname{scale}} + \operatorname{zero\_point}) \]
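To make the formula concrete, here is a minimal pure-Python sketch of the affine mapping and its inverse. The helper names `quantize` and `dequantize` are illustrative and are not PyTorch functions. The clamp to the integer range of the target dtype (0 to 255 here, as for torch.quint8) is a step the formula above leaves implicit.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Affine-quantize one float: round(x / scale + zero_point), then clamp."""
    q = round(x / scale + zero_point)  # Python's round() rounds halves to even
    return max(qmin, min(qmax, q))     # saturate to the integer range


def dequantize(q, scale, zero_point):
    """Map a stored integer back to an approximate float value."""
    return (q - zero_point) * scale


scale, zero_point = 0.1, 10
q_small = quantize(1.5, scale, zero_point)    # 25: representable exactly
q_big = quantize(100.0, scale, zero_point)    # 255: saturates at qmax
print(q_small, dequantize(q_small, scale, zero_point))
print(q_big, dequantize(q_big, scale, zero_point))
```

Values outside the representable range lose information when they saturate: 100.0 comes back as roughly 24.5 here.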

创建量化张量#

Get a quantized tensor by quantizing a float tensor

import torch

float_tensor = torch.randn(2, 2, 3)

scale, zero_point = 1e-4, 2
dtype = torch.qint32
q_per_tensor = torch.quantize_per_tensor(float_tensor, scale, zero_point, dtype)
q_per_tensor

tensor([[[ 0.6471, -0.7330,  2.7292],
         [ 2.5431, -0.5702,  0.1856]],

        [[ 0.1423, -0.5374,  0.5672],
         [-0.4811, -2.0259, -0.3243]]], size=(2, 2, 3), dtype=torch.qint32,
       quantization_scheme=torch.per_tensor_affine, scale=0.0001, zero_point=2)

Per-channel quantization is also supported:

scales = torch.tensor([1e-1, 1e-2, 1e-3])
zero_points = torch.tensor([-1, 0, 1])
channel_axis = 2
q_per_channel = torch.quantize_per_channel(float_tensor,
                                           scales,
                                           zero_points,
                                           axis=channel_axis,
                                           dtype=dtype)
q_per_channel

tensor([[[ 0.6000, -0.7300,  2.7290],
         [ 2.5000, -0.5700,  0.1860]],

        [[ 0.1000, -0.5400,  0.5670],
         [-0.5000, -2.0300, -0.3240]]], size=(2, 2, 3), dtype=torch.qint32,
       quantization_scheme=torch.per_channel_affine,
       scale=tensor([0.1000, 0.0100, 0.0010], dtype=torch.float64),
       zero_point=tensor([-1,  0,  1]), axis=2)
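The claim that per-channel quantization reduces error can be checked with a small experiment. The sketch below is illustrative only: the input data is made up, and the scale choice (largest absolute value divided by 127, zero_point 0, dtype torch.qint8) is an assumption rather than anything prescribed by PyTorch.

```python
import torch

torch.manual_seed(0)
# Three channels (last axis) whose magnitudes differ by orders of magnitude.
x = torch.randn(8, 3) * torch.tensor([0.01, 1.0, 100.0])

# Assumed scale choice: symmetric qint8 range, scale = max|x| / 127, zero_point = 0.
scale_t = (x.abs().max() / 127).item()
q_t = torch.quantize_per_tensor(x, scale_t, 0, torch.qint8)

# One scale per channel, computed over the rows of each column.
scales_c = x.abs().amax(dim=0) / 127
zero_points_c = torch.zeros(3, dtype=torch.int64)
q_c = torch.quantize_per_channel(x, scales_c, zero_points_c, axis=1, dtype=torch.qint8)

# Worst-case reconstruction error for each channel.
err_per_tensor = (q_t.dequantize() - x).abs().amax(dim=0)
err_per_channel = (q_c.dequantize() - x).abs().amax(dim=0)
print("per-tensor :", err_per_tensor)
print("per-channel:", err_per_channel)
```

With a single shared scale, the small-magnitude channel is rounded almost entirely to zero. With one scale per channel, its error stays proportional to its own range.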

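Create a quantized tensor directly from the empty_quantized functions

Note that _empty_affine_quantized is a private API. It is planned to be replaced by something more torch-like, such as empty_quantized_tensor(sizes, quantizer), in the future: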

q = torch._empty_affine_quantized([10],
                                  scale=scale,
                                  zero_point=zero_point,
                                  dtype=dtype)
q

tensor([ 1.1369e+04, -2.0000e-04,  1.1369e+04, -2.0000e-04,  6.2000e-03,
        -2.0000e-04,  1.1000e-02, -2.0000e-04,  1.1372e+04, -2.0000e-04],
       size=(10,), dtype=torch.qint32,
       quantization_scheme=torch.per_tensor_affine, scale=0.0001, zero_point=2)

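Create a quantized tensor by assembling an int tensor and quantization parameters

Note

Note that this is a private API (the original note names it _per_tensor_affine_qtensor, while the code in this section calls torch._make_per_tensor_quantized_tensor). It is planned to be replaced by something torch-like, such as torch.form_tensor(int_tensor, quantizer).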

int_tensor = torch.randint(0, 100, size=(10,), dtype=torch.uint8)

The resulting dtype is torch.quint8, the quantized counterpart of torch.uint8. The torch int types correspond to the torch quantized int types as follows:

torch.uint8 -> torch.quint8
torch.int8 -> torch.qint8
torch.int32 -> torch.qint32

q = torch._make_per_tensor_quantized_tensor(int_tensor, scale, zero_point)  # Note no `dtype`
q 

tensor([0.0063, 0.0037, 0.0081, 0.0077, 0.0089, 0.0034, 0.0063, 0.0021, 0.0045,
        0.0057], size=(10,), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.0001, zero_point=2)
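The relationship between the stored integers and the printed float values can be verified directly. The following sketch uses hand-picked integer values (an assumption for illustration) and checks that dequantizing yields (int_value - zero_point) * scale, and that int_repr() returns the original integers unchanged.

```python
import torch

int_tensor = torch.tensor([0, 2, 65, 255], dtype=torch.uint8)
scale, zero_point = 1e-4, 2

# Private API, as used in this section; the dtype is inferred from the int dtype.
q = torch._make_per_tensor_quantized_tensor(int_tensor, scale, zero_point)

# Inverse of the quantization formula: (q - zero_point) * scale.
expected = (int_tensor.to(torch.float32) - zero_point) * scale

print(q.dtype)          # torch.quint8
print(q.dequantize())
print(expected)
```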

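In the current API, the functions have to be specialized for each quantization scheme. For example, to quantize a tensor there are both quantize_per_tensor and quantize_per_channel, and the same applies to q_scale and q_zero_point. Ideally there would be a single quantize function that takes a Quantizer as an argument. To inspect quantization parameters, a quantized tensor should return a Quantizer object, so that the parameters can be inspected on that object instead of putting everything into the tensor API. The current infrastructure is not yet ready for this kind of support, which is still under development.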

Operations on quantized tensors#

Dequantizing a quantized tensor#

dequantized_tensor = q.dequantize()
dequantized_tensor

tensor([0.0063, 0.0037, 0.0081, 0.0077, 0.0089, 0.0034, 0.0063, 0.0021, 0.0045,
        0.0057])

Quantized tensors support slicing#

Quantized tensors support slicing just like ordinary tensors:

s = q[2]
s

tensor(0.0081, size=(), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.0001, zero_point=2)

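Note

s is a quantized tensor with the same scale and zero_point as q. Because q is one-dimensional here, s holds the single element at index 2. For a multi-dimensional quantized tensor, q[2] would select the slice at index 2 along the first dimension, the same as q[2, :].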

Assigning to a quantized tensor#

q[0] = 3.5  # quantizes 3.5 and stores the resulting int value in the quantized tensor
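With the parameters used in this section (scale=1e-4, zero_point=2, torch.quint8), the representable range is only about -0.0002 to 0.0253, so assigning 3.5 saturates the stored integer at 255. That is why the first element later prints as 0.0253. A small sketch with a fresh tensor and made-up integer values:

```python
import torch

q = torch._make_per_tensor_quantized_tensor(
    torch.tensor([10, 20, 30], dtype=torch.uint8), 1e-4, 2
)

q[0] = 3.5  # far above the largest representable value, so it saturates

print(q.int_repr())   # first entry is the quint8 maximum
print(q[0].item())    # roughly (255 - 2) * 1e-4
```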

Copying quantized tensors#

We can copy from a quantized tensor that has the same size and dtype but a different scale and zero_point:

scale1, zero_point1 = 1e-1, 0
scale2, zero_point2 = 1, -1
q1 = torch._empty_affine_quantized([2, 3],
                                   scale=scale1,
                                   zero_point=zero_point1,
                                   dtype=torch.qint8)
q2 = torch._empty_affine_quantized([2, 3],
                                   scale=scale2,
                                   zero_point=zero_point2,
                                   dtype=torch.qint8)
q2.copy_(q1)

tensor([[  9.6000,  -3.6000,  -4.9000],
        [-11.2000,   7.5000,  12.7000]], size=(2, 3), dtype=torch.qint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.1, zero_point=0)
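The output above shows q2 with scale=0.1 and zero_point=0 after the copy, which suggests that copy_ carries over the source's quantization parameters along with its data. The sketch below checks this with deterministic input values (the values themselves are illustrative):

```python
import torch

# Source tensor with known contents, scale=0.1, zero_point=0.
q1 = torch.quantize_per_tensor(
    torch.tensor([[0.5, -1.0, 1.2]]), 0.1, 0, torch.qint8
)
# Destination with different quantization parameters and uninitialized data.
q2 = torch._empty_affine_quantized(
    [1, 3], scale=1.0, zero_point=-1, dtype=torch.qint8
)

q2.copy_(q1)

print(q2.q_scale(), q2.q_zero_point())
print(q2.int_repr())
```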

Permuting quantized tensors#

q1.transpose(0, 1)  # see https://pytorch.org/docs/stable/torch.html#torch.transpose
q1.permute([1, 0])  # https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute
q1.contiguous()  # Convert to contiguous Tensor

tensor([[  9.6000,  -3.6000,  -4.9000],
        [-11.2000,   7.5000,  12.7000]], size=(2, 3), dtype=torch.qint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.1, zero_point=0)

Serializing and deserializing quantized tensors#

import tempfile
with tempfile.NamedTemporaryFile() as f:
    torch.save(q2, f)
    f.seek(0)
    q3 = torch.load(f)

Inspecting quantized tensors#

# Check size of Tensor
q.numel(), q.size()

(10, torch.Size([10]))

# Check whether the tensor is quantized
q.is_quantized

True

# Get the scale of the quantized Tensor, only works for affine quantized tensor
q.q_scale()

0.0001

# Get the zero_point of quantized Tensor
q.q_zero_point()

# get the underlying integer representation of the quantized Tensor
# int_repr() returns a Tensor of the corresponding data type of the quantized data type
# e.g. for a quint8 Tensor it returns a uint8 Tensor while preserving the MemoryFormat when possible
q.int_repr()

tensor([255,  39,  83,  79,  91,  36,  65,  23,  47,  59], dtype=torch.uint8)

# If a quantized Tensor is a scalar we can print the value:
# item() will dequantize the current tensor and return a Scalar of float
q[0].item()

0.025299999862909317

# printing
print(q)

tensor([0.0253, 0.0037, 0.0081, 0.0077, 0.0089, 0.0034, 0.0063, 0.0021, 0.0045,
        0.0057], size=(10,), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.0001, zero_point=2)

# indexing
print(q[0]) # q[0] is a quantized Tensor with one value

tensor(0.0253, size=(), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.0001, zero_point=2)

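Quantized operators/kernels#

Work is also under way on quantized operators such as QRelu, QAdd, QCat, QLinear, QConv and so on. Each operator either has a naive implementation or wraps an fbgemm implementation. All operators are registered in C10, and for now they are available on CPU only. There are also instructions on how to write quantized operators/kernels.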
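As a brief sketch of calling such operators directly, the example below applies torch.relu to a quantized tensor and adds two quantized tensors through torch.ops.quantized.add, which takes the output scale and zero_point as extra arguments. The input values and quantization parameters are illustrative assumptions, and operator availability may vary across PyTorch builds.

```python
import torch

scale, zero_point = 0.1, 10
qa = torch.quantize_per_tensor(torch.tensor([-1.0, 0.0, 1.5]), scale, zero_point, torch.quint8)
qb = torch.quantize_per_tensor(torch.tensor([0.5, 0.5, 0.5]), scale, zero_point, torch.quint8)

# ReLU on a quantized tensor returns a quantized tensor.
r = torch.relu(qa)

# Quantized add: the caller chooses the quantization parameters of the result.
s = torch.ops.quantized.add(qa, qb, scale, zero_point)

print(r.dequantize())
print(s.dequantize())
```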

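Quantized models#

There are also quantized modules that wrap these kernel implementations. They live in the torch.nn.quantized namespace and are meant to be used in model development. Utility functions will be provided to replace torch.nn.Module instances with their torch.nn.quantized.Module counterparts, but users are free to use the quantized modules directly as well. The APIs of the quantized modules will match those of their torch.nn.Module counterparts as closely as possible.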

torch.nn.qat

<module 'torch.nn.qat' from '/media/pc/data/tmp/cache/conda/envs/tvmz/lib/python3.10/site-packages/torch/nn/qat/__init__.py'>

import torch.nn.quantized as nnq
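As a small sketch of how these modules compose, the pipeline below quantizes its input, applies a ReLU to the quantized tensor, and dequantizes the result. The pipeline and its parameters are illustrative assumptions. Newer PyTorch releases place these modules under torch.ao.nn.quantized, so the import tries that location first.

```python
import torch

try:
    import torch.ao.nn.quantized as nnq  # location in newer releases
except ImportError:
    import torch.nn.quantized as nnq     # namespace used in this section

model = torch.nn.Sequential(
    nnq.Quantize(scale=0.1, zero_point=10, dtype=torch.quint8),  # float -> quantized
    torch.nn.ReLU(),                                             # runs on the quantized tensor
    nnq.DeQuantize(),                                            # quantized -> float
)

out = model(torch.tensor([-1.0, 0.0, 1.5]))
print(out)
```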