基于 x86 CPU 的神经网络自动调度

基于 x86 CPU 的神经网络自动调度#

原作者: Lianmin Zheng, Chengfan Jia

针对特定设备和工作负载的自动调优对于获得最佳性能至关重要。下面是关于如何用自动调度器调优 x86 CPU 的整个神经网络的教程。

为了自动调优神经网络,需要将网络划分为小的子图,并独立地调优它们。每个子图被视为一个搜索任务。任务调度程序对时间进行切片,并动态地为这些任务分配时间资源。任务调度器预测每个任务对端到端执行时间的影响,并优先考虑能够最大程度减少执行时间的任务。

对于每个子图,使用 tvm/python/topi 中的 compute 声明来获得张量表达式形式的计算 DAG。然后,使用自动调度器来构造 DAG 的搜索空间,并搜索良好的调度(低级优化)。

与基于模板的 autotvm 依赖手动模板定义搜索空间不同,自动调度程序不需要任何调度模板。换句话说,自动调度器只使用 tvm/python/topi 中的 compute,而不使用现有的调度模板。

import numpy as np

import tvm
from tvm import relay, auto_scheduler
from tvm.relay import data_dep_optimization as ddo
import tvm.relay.testing
from tvm.contrib import graph_executor

定义网络#

首先,需要用 relay 前端 AP I定义网络。可以从 tvm.relay.testing 加载一些预定义的网络。还可以从 MXNet、ONNX、PyTorch 和 TensorFlow 加载模型(参见前端教程)。

对于卷积神经网络,尽管自动调度器可以在任何布局下正确工作,但我们发现 NHWC 布局通常能获得最佳性能。我们还通过自动调度器实现了对 NHWC 布局的更多优化。因此,建议将模型转换为 NHWC 布局以使用自动调度器。可以使用 ConvertLayout pass 在 TVM 中进行布局转换。

def get_network(name, batch_size, layout="NHWC", dtype="float32", use_sparse=False):
    """获取网络的符号定义和随机权值"""

    # 自动调度首选 NHWC 布局
    if layout == "NHWC":
        image_shape = (224, 224, 3)
    elif layout == "NCHW":
        image_shape = (3, 224, 224)
    else:
        raise ValueError("无效布局: " + layout)

    input_shape = (batch_size,) + image_shape
    output_shape = (batch_size, 1000)

    if name.startswith("resnet-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name.startswith("resnet3d-"):
        n_layer = int(name.split("-")[1])
        mod, params = relay.testing.resnet.get_workload(
            num_layers=n_layer,
            batch_size=batch_size,
            layout=layout,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "mobilenet":
        mod, params = relay.testing.mobilenet.get_workload(
            batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape
        )
    elif name == "squeezenet_v1.1":
        assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout"
        mod, params = relay.testing.squeezenet.get_workload(
            version="1.1",
            batch_size=batch_size,
            dtype=dtype,
            image_shape=image_shape,
        )
    elif name == "inception_v3":
        input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3)
        mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype)
    elif name == "mxnet":
        # an example for mxnet model
        from mxnet.gluon.model_zoo.vision import get_model

        assert layout == "NCHW"

        block = get_model("resnet50_v1", pretrained=True)
        mod, params = relay.frontend.from_mxnet(block, shape={"data": input_shape}, dtype=dtype)
        net = mod["main"]
        net = relay.Function(
            net.params, relay.nn.softmax(net.body), None, net.type_params, net.attrs
        )
        mod = tvm.IRModule.from_expr(net)
    elif name == "mlp":
        mod, params = relay.testing.mlp.get_workload(
            batch_size=batch_size, dtype=dtype, image_shape=image_shape, num_classes=1000
        )
    else:
        raise ValueError("Network not found.")

    if use_sparse:
        from tvm.topi.sparse.utils import convert_model_dense_to_sparse

        mod, params = convert_model_dense_to_sparse(mod, params, bs_r=4, random_params=True)

    return mod, params, input_shape, output_shape


# 定义神经网络和编译目标。
# 如果目标机器支持 avx512 指令,将 "llvm -mcpu=core-avx2" 替换为 "llvm -mcpu=skylake-avx512"
network = "resnet-50"
use_sparse = False
batch_size = 1
layout = "NHWC"
target = tvm.target.Target("llvm -mcpu=core-avx2")
dtype = "float32"
log_file = f"{network}-{layout}-B{batch_size:d}-{target.kind.name}.json"

提取搜索任务#

接下来,从网络中提取搜索任务及其权重。任务的权重是该任务的子图在整个网络中出现的次数。通过使用权重,可以将网络的端到端延迟近似为 sum(latency[t] * weight[t]),其中 latency[t] 是任务的延迟,weight[t] 是任务的权重。任务调度器会优化这个目标。

# 从网络中提取任务
print("获取模型...")
mod, params, input_shape, output_shape = get_network(
    network,
    batch_size,
    layout,
    dtype=dtype,
    use_sparse=use_sparse,
)
print("提取任务...")
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

for idx, task in enumerate(tasks):
    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
    print(task.compute_dag)
获取模型...
提取任务...
========== Task 0  (workload key: ["2d10de6646307f0e3e5cf4b31c20e69b", [1, 56, 56, 64], [1, 1, 64, 256], [1, 56, 56, 256]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 64]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 64, 256]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])

========== Task 1  (workload key: ["3060808fc5c74e18b1276729071fbae0", [1, 14, 14, 256], [1, 1, 256, 1024], [1, 14, 14, 1024], [1, 14, 14, 1024]]) ==========
p0 = PLACEHOLDER [1, 14, 14, 256]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 256, 1024]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 14, 14, 1024]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])

========== Task 2  (workload key: ["76afb7bf408a1ffa0b8b7bc09d077dc3", [1, 14, 14, 256], [1, 1, 256, 1024], [1, 14, 14, 1024], [1, 1, 1, 1024], [1, 14, 14, 1024]]) ==========
p0 = PLACEHOLDER [1, 14, 14, 256]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 256, 1024]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 14, 14, 1024]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])
p3 = PLACEHOLDER [1, 1, 1, 1024]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + p3[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 3  (workload key: ["2beb39e9afe4c74822fffbcbb8533595", [1, 14, 14, 1024], [1, 1, 1024, 512], [1, 1, 1, 512], [1, 7, 7, 512]]) ==========
p0 = PLACEHOLDER [1, 14, 14, 1024]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 1024, 512]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 4  (workload key: ["0fad1b42d0d33418e0a8d15d3bbad3c9", [1, 56, 56, 256], [1, 1, 256, 512], [1, 28, 28, 512]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 256]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 256, 512]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff])

========== Task 5  (workload key: ["38552500208b25b4035682b0e93cbce3", [1, 14, 14, 256], [6, 6, 256, 256], [1, 1, 1, 256], [1, 14, 14, 256]]) ==========
p0 = PLACEHOLDER [1, 14, 14, 256]
data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 16), ((floormod(floordiv(p, 4), 4)*4) + eps), ((floormod(p, 4)*4) + nu), ci]
B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)),  ..(OMITTED)..  (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f))))))))))))))))))))))))))))))))))))
data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
p1 = PLACEHOLDER [6, 6, 256, 256]
bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci])
A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)),  ..(OMITTED)..  6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*4)*4) + (floordiv(h, 4)*4)) + floordiv(w, 4)), co]
p2 = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 6  (workload key: ["6d628209072e3e3dd8f49359935acea6", [1, 14, 14, 1024], [1, 1, 1024, 256], [1, 1, 1, 256], [1, 14, 14, 256]]) ==========
p0 = PLACEHOLDER [1, 14, 14, 1024]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 1024, 256]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 7  (workload key: ["cfd09cf1ca9e943f0ee12a18813a5c75", [1, 28, 28, 128], [6, 6, 128, 128], [1, 1, 1, 128], [1, 28, 28, 128]]) ==========
p0 = PLACEHOLDER [1, 28, 28, 128]
data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 49), ((floormod(floordiv(p, 7), 7)*4) + eps), ((floormod(p, 7)*4) + nu), ci]
B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)),  ..(OMITTED)..  (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f))))))))))))))))))))))))))))))))))))
data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
p1 = PLACEHOLDER [6, 6, 128, 128]
bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci])
A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)),  ..(OMITTED)..  6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*7)*7) + (floordiv(h, 4)*7)) + floordiv(w, 4)), co]
p2 = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 8  (workload key: ["3060808fc5c74e18b1276729071fbae0", [1, 56, 56, 64], [1, 1, 64, 256], [1, 56, 56, 256], [1, 56, 56, 256]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 64]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 64, 256]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 56, 56, 256]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])

========== Task 9  (workload key: ["3060808fc5c74e18b1276729071fbae0", [1, 28, 28, 128], [1, 1, 128, 512], [1, 28, 28, 512], [1, 28, 28, 512]]) ==========
p0 = PLACEHOLDER [1, 28, 28, 128]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 128, 512]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 28, 28, 512]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])

========== Task 10  (workload key: ["7d79c516e212fe1d73f5dbb90eaca2cf", [1, 1000], [1, 1000]]) ==========
p0 = PLACEHOLDER [1, 1000]
T_softmax_maxelem(i0) max= p0[i0, k]
T_softmax_exp(i0, i1) = tir.exp((p0[i0, i1] - T_softmax_maxelem[i0]))
T_softmax_expsum(i0) += T_softmax_exp[i0, k]
T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])

========== Task 11  (workload key: ["8c53ca2904398da2889aa7508082d7bb", [1, 7, 7, 2048], [1, 1, 1, 2048]]) ==========
p0 = PLACEHOLDER [1, 7, 7, 2048]
adaptive_pool_sum(ax0, ax1, ax2, ax3) += p0[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3]
adaptive_pool_avg(ax0, ax1, ax2, ax3) = (adaptive_pool_sum[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7)))))

========== Task 12  (workload key: ["3060808fc5c74e18b1276729071fbae0", [1, 7, 7, 512], [1, 1, 512, 2048], [1, 7, 7, 2048], [1, 7, 7, 2048]]) ==========
p0 = PLACEHOLDER [1, 7, 7, 512]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 512, 2048]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 7, 7, 2048]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])

========== Task 13  (workload key: ["6d628209072e3e3dd8f49359935acea6", [1, 28, 28, 512], [1, 1, 512, 128], [1, 1, 1, 128], [1, 28, 28, 128]]) ==========
p0 = PLACEHOLDER [1, 28, 28, 512]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 512, 128]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 14  (workload key: ["2beb39e9afe4c74822fffbcbb8533595", [1, 28, 28, 512], [1, 1, 512, 256], [1, 1, 1, 256], [1, 14, 14, 256]]) ==========
p0 = PLACEHOLDER [1, 28, 28, 512]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 512, 256]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 15  (workload key: ["76afb7bf408a1ffa0b8b7bc09d077dc3", [1, 56, 56, 64], [1, 1, 64, 256], [1, 56, 56, 256], [1, 1, 1, 256], [1, 56, 56, 256]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 64]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 64, 256]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 56, 56, 256]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])
p3 = PLACEHOLDER [1, 1, 1, 256]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + p3[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 16  (workload key: ["6d628209072e3e3dd8f49359935acea6", [1, 56, 56, 256], [1, 1, 256, 64], [1, 1, 1, 64], [1, 56, 56, 64]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 256]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 256, 64]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 17  (workload key: ["6d628209072e3e3dd8f49359935acea6", [1, 7, 7, 2048], [1, 1, 2048, 512], [1, 1, 1, 512], [1, 7, 7, 512]]) ==========
p0 = PLACEHOLDER [1, 7, 7, 2048]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 2048, 512]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 18  (workload key: ["0fad1b42d0d33418e0a8d15d3bbad3c9", [1, 28, 28, 512], [1, 1, 512, 1024], [1, 14, 14, 1024]]) ==========
p0 = PLACEHOLDER [1, 28, 28, 512]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 512, 1024]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff])

========== Task 19  (workload key: ["76afb7bf408a1ffa0b8b7bc09d077dc3", [1, 28, 28, 128], [1, 1, 128, 512], [1, 28, 28, 512], [1, 1, 1, 512], [1, 28, 28, 512]]) ==========
p0 = PLACEHOLDER [1, 28, 28, 128]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 128, 512]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 28, 28, 512]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])
p3 = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + p3[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 20  (workload key: ["d37380659057397544e056461ea3bad3", [1, 56, 56, 64], [3, 3, 64, 64], [1, 1, 1, 64], [1, 56, 56, 64]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 64]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
p1 = PLACEHOLDER [3, 3, 64, 64]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 21  (workload key: ["d37380659057397544e056461ea3bad3", [1, 7, 7, 512], [3, 3, 512, 512], [1, 1, 1, 512], [1, 7, 7, 512]]) ==========
p0 = PLACEHOLDER [1, 7, 7, 512]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f)
p1 = PLACEHOLDER [3, 3, 512, 512]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 512]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 22  (workload key: ["2beb39e9afe4c74822fffbcbb8533595", [1, 56, 56, 256], [1, 1, 256, 128], [1, 1, 1, 128], [1, 28, 28, 128]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 256]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 256, 128]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 128]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 23  (workload key: ["f07e228ef5f642b386d23a62df615e7b", [1, 7, 7, 512], [1, 1, 512, 2048], [1, 7, 7, 2048], [1, 1, 1, 2048], [1, 1, 1, 2048], [1, 7, 7, 2048]]) ==========
p0 = PLACEHOLDER [1, 7, 7, 512]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 512, 2048]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 7, 7, 2048]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3])
p3 = PLACEHOLDER [1, 1, 1, 2048]
T_multiply(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3]*p3[ax0, 0, 0, ax3])
p4 = PLACEHOLDER [1, 1, 1, 2048]
T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + p4[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 24  (workload key: ["07f9fcad27bdd3233f86fe35a5185d33", [1, 224, 224, 3], [7, 7, 3, 64], [1, 1, 1, 64], [1, 112, 112, 64]]) ==========
p0 = PLACEHOLDER [1, 224, 224, 3]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 3) && (i1 < 227)) && (i2 >= 3)) && (i2 < 227)), p0[i0, (i1 - 3), (i2 - 3), i3], 0f)
p1 = PLACEHOLDER [7, 7, 3, 64]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 25  (workload key: ["0fad1b42d0d33418e0a8d15d3bbad3c9", [1, 14, 14, 1024], [1, 1, 1024, 2048], [1, 7, 7, 2048]]) ==========
p0 = PLACEHOLDER [1, 14, 14, 1024]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 1024, 2048]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff])

========== Task 26  (workload key: ["00a059b856ac30ac172b6252254479a6", [1, 2048], [1000, 2048], [1, 1000], [1, 1000]]) ==========
p0 = PLACEHOLDER [1, 2048]
p1 = PLACEHOLDER [1000, 2048]
T_matmul_NT(i, j) += (p0[i, k]*p1[j, k])
p2 = PLACEHOLDER [1, 1000]
T_add(ax0, ax1) = (T_matmul_NT[ax0, ax1] + p2[ax0, ax1])

========== Task 27  (workload key: ["6d012ba18a086c11ee2b85c7324e16f2", [1, 112, 112, 64], [1, 1, 1, 64], [1, 56, 56, 64]]) ==========
p0 = PLACEHOLDER [1, 112, 112, 64]
pad_temp(ax0, ax1, ax2, ax3) = tir.if_then_else(((((ax1 >= 1) && (ax1 < 113)) && (ax2 >= 1)) && (ax2 < 113)), p0[ax0, (ax1 - 1), (ax2 - 1), ax3], -3.40282e+38f)
pool_max(ax0, ax1, ax2, ax3) max= pad_temp[ax0, ((ax1*2) + rv0), ((ax2*2) + rv1), ax3]
p1 = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (pool_max[ax0, ax1, ax2, ax3] + p1[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

========== Task 28  (workload key: ["6d628209072e3e3dd8f49359935acea6", [1, 56, 56, 64], [1, 1, 64, 64], [1, 1, 1, 64], [1, 56, 56, 64]]) ==========
p0 = PLACEHOLDER [1, 56, 56, 64]
pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3]
p1 = PLACEHOLDER [1, 1, 64, 64]
conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff])
p2 = PLACEHOLDER [1, 1, 1, 64]
T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3])
T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f)

开始调优#

现在,设置了一些调优选项并启动搜索任务

  • num_measure_trials 是在调优期间可以使用的度量试验的数量。您可以将它设置为小的数字(例如,200)以进行快速演示运行。在实践中,建议将它设置在 800 * len(tasks) 左右,这通常足以让搜索收敛。例如 resnet-50 中有 29 个任务,所以可以将其设置为 20000。可以根据自己的时间预算调整该参数。

  • 此外,使用 RecordToFile 将测量记录转储到日志文件中,测量记录可以用于查询历史,恢复搜索,并在以后进行更多的分析。

  • 查阅 tvm.auto_scheduler.TuningOptionstvm.auto_scheduler.LocalRunner 获取更多参数。

def run_tuning():
    print("开始调优...")
    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,  # 将其更改为 20000 以实现最佳性能
        runner=auto_scheduler.LocalRunner(repeat=10, enable_cpu_cache_flush=True),
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )

    if use_sparse:
        from tvm.topi.sparse.utils import sparse_sketch_rules

        search_policy = [
            auto_scheduler.SketchPolicy(
                task,
                program_cost_model=auto_scheduler.XGBModel(),
                init_search_callbacks=sparse_sketch_rules(),
            )
            for task in tasks
        ]

        tuner.tune(tune_option, search_policy=search_policy)
    else:
        tuner.tune(tune_option)

run_tuning()

解释在调优期间的打印信息

在调优过程中,控制台上将打印大量信息。它们用于调试目的。最重要的信息是任务调度器的输出。下表是示例输出。

----------------------------------------------------------------------
------------------------------  [ Task Scheduler ]
----------------------------------------------------------------------
|  ID  | Latency (ms) | Speed (GFLOPS) | Trials |
-------------------------------------------------
|    0 |        0.010 |           0.40 |     64 |
|    1 |        0.087 |          47.19 |     64 |
|    2 |        0.008 |          -0.00 |     64 |
|    3 |        0.177 |         582.07 |     64 |
|    4 |        0.268 |         862.37 |    256 |
|    5 |        0.166 |         621.13 |    128 |
|    6 |        0.170 |         605.10 |    128 |
|    7 |        0.128 |         403.20 |     64 |
|    8 |        0.189 |         545.71 |     64 |
|    9 |        0.231 |        1001.01 |    448 |
|   10 |        0.155 |         664.80 |    256 |
|   11 |        0.155 |         662.86 |    256 |
|   12 |        0.119 |         434.08 |     64 |
|   13 |        0.199 |         522.13 |     64 |
|   14 |        0.235 |         986.56 |    320 |
|   15 |        0.149 |         689.13 |    128 |
|   16 |        0.155 |         664.80 |    192 |
|   17 |        0.151 |         340.64 |     64 |
|   18 |        0.176 |         597.55 |    128 |
|   19 |        0.220 |        1054.37 |    192 |
|   20 |        0.150 |         686.01 |    128 |
|   21 |        0.159 |         650.88 |    128 |
|   22 |        0.073 |         358.19 |     64 |
|   23 |        0.031 |          70.63 |     64 |
|   24 |        0.251 |         947.73 |    128 |
|   25 |        0.157 |         652.47 |    128 |
|   26 |        0.215 |         954.84 |    128 |
|   27 |        0.237 |         868.92 |    128 |
|   28 |        0.266 |         774.06 |    128 |
-------------------------------------------------
Estimated total latency: 10.016 ms      Trials: 3992    Used time : 1131 s      Next ID: 15

该表列出了所有任务的延迟和(估计的)速度。它还列出了所有任务的测量试验分配。最后一行打印这些任务的总加权延迟,这可以粗略估计网络的端到端执行时间。最后一行还输出测试的总数、自动调优所花费的总时间和下一个要调优的任务的 id。

也会有一些 “tvm::Error” 的错误,因为自动调度程序将尝试一些无效的调度。如果可以继续进行调优,则可以安全地忽略它们,因为这些错误与主进程隔离开来。

提前终止调优

可以通过强制终止此进程提前终止调优。只要为日志文件中的每个任务获得至少一个有效的调度,就应该能够进行编译(见下面的部分)。

编译和评估#

在自动调优之后,可以用找到的最佳调度来编译网络。在自动调优期间,所有测量记录都被转储到日志文件中,因此可以读取日志文件并加载最佳调度。

# 使用最佳调度编译
print("编译...")
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}):
        lib = relay.build(mod, target=target, params=params)

# 创建 graph executor
dev = tvm.device(str(target), 0)
module = graph_executor.GraphModule(lib["default"](dev))
data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype))
module.set_input("data", data_tvm)

# Evaluate
print("评估推理时间成本...")
print(module.benchmark(dev, repeat=3, min_repeat_ms=500))
编译...
Evaluate inference time cost...
Execution time summary:
 mean (ms)   median (ms)    max (ms)     min (ms)     std (ms)  
  673.4945     672.7222     675.7006     672.0608      1.5831   
               

其他技巧#

  1. 在调优过程中,自动调度器需要编译许多程序并从中提取特征。该部分是 CPU 密集型的,因此建议使用多核的高性能 CPU,以加快搜索速度。

  2. 可以使用 python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json 提取大的日志文件,只保存最好的有用的记录。

  3. 可以从上一个日志文件恢复搜索。在函数 run_tuning 中创建任务调度器时,只需要添加新的参数 load_log_file,即 tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)

  4. 如果有多个目标 CPU,您可以将它们都用于测量,从而使测量并行化。请查阅如何使用 RPC 跟踪器和 RPC 服务器。要在自动调度器中使用 RPC 跟踪器,请将 TuningOptions 中的运行器替换为 auto_scheduler.RPCRunner