Relay Arm® Compute Library Integration#

Author: Luke Hutton

Introduction#

Arm Compute Library (ACL) is an open source project that provides accelerated kernels for Arm CPUs and GPUs. Currently the integration offloads operators to ACL to use hand-crafted assembler routines in the library. By offloading selected operators from a relay graph to ACL we can achieve a performance boost on such devices.

Installing Arm Compute Library#

Before installing Arm Compute Library, it is important to know what architecture to build for. One way to determine this is to use lscpu and look for the "Model name" of the CPU; you can then use this information to determine the architecture by looking online.

TVM only supports a single version of ACL, currently this is v21.08. There are two recommended ways to build and install the required libraries:

  • Use the script located at docker/install/ubuntu_download_arm_compute_lib_binaries.sh. You can use this script to download ACL binaries for the architecture and extensions specified in target_lib; these will be installed to the location denoted by install_path.

  • Alternatively, you can download prebuilt binaries from ARM-software/ComputeLibrary. When using this package you will need to select the binaries for the architecture and extensions you require, then make sure they are visible to CMake:

    cd <acl-prebuilt-package>/lib
    mv ./<architecture-and-extensions-required>/* .
    

In both cases you will need to set USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR to the path where the ACL package is located. CMake will look in /path-to-acl/ along with /path-to-acl/lib and /path-to-acl/build for the required binaries. See the section below for more information on how to use these configuration options.

Building with ACL support#

The current implementation has two separate build options in CMake. The reason for this split is that ACL cannot be used on an x86 machine. However, we still want to be able to compile the ACL runtime module on an x86 machine.

  • USE_ARM_COMPUTE_LIB=ON/OFF - Enabling this flag will add support for compiling the ACL runtime module.

  • USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR=ON/OFF/path-to-acl - Enabling this flag will allow the graph executor to compute the ACL offloaded functions.

These flags can be used in different scenarios depending on your setup. For example, if you want to compile an ACL module on an x86 machine and then run the module on a remote Arm device via RPC, you will need to use USE_ARM_COMPUTE_LIB=ON on the x86 machine and USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR=ON on the remote AArch64 device.

By default both options are set to OFF. Using USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR=ON will mean that ACL binaries are searched for by CMake in the default locations (see https://cmake.org/cmake/help/v3.4/command/find_library.html). In addition to this, /path-to-tvm-project/acl/ will also be searched. It is likely that you will need to set your own path to locate ACL. This can be done by specifying a path in the place of ON.

These flags should be set in your config.cmake file. For example:

set(USE_ARM_COMPUTE_LIB ON)
set(USE_ARM_COMPUTE_LIB_GRAPH_EXECUTOR /path/to/acl)

Usage#

Note

This section may not stay up-to-date with changes to the API.

Create a relay graph. This may be a single operator or a whole graph. The intention is that any relay graph can be input. The ACL integration will only pick supported operators to be offloaded whilst the rest will be computed via TVM. (For this example we will use a single max_pool2d operator).

import tvm
from tvm import relay

data_type = "float32"
data_shape = (1, 14, 14, 512)
strides = (2, 2)
padding = (0, 0, 0, 0)
pool_size = (2, 2)
layout = "NHWC"
output_shape = (1, 7, 7, 512)

data = relay.var('data', shape=data_shape, dtype=data_type)
out = relay.nn.max_pool2d(data, pool_size=pool_size, strides=strides, layout=layout, padding=padding)
module = tvm.IRModule.from_expr(out)

Annotate and partition the graph for ACL.

from tvm.relay.op.contrib.arm_compute_lib import partition_for_arm_compute_lib
module = partition_for_arm_compute_lib(module)

Build the Relay graph.

target = "llvm -mtriple=aarch64-linux-gnu -mattr=+neon"
with tvm.transform.PassContext(opt_level=3, disabled_pass=["AlterOpLayout"]):
    lib = relay.build(module, target=target)

Export the module.

import os

lib_path = os.path.expanduser('~/lib_acl.so')  # expand '~' so the file lands in the home directory
cross_compile = 'aarch64-linux-gnu-c++'
lib.export_library(lib_path, cc=cross_compile)

Run inference. This must be on an Arm device. If compiling on an x86 device and running on AArch64, consider using the RPC mechanism (see TVM's tutorials on cross compilation and RPC).

import numpy as np
import tvm
from tvm.contrib import graph_executor

dev = tvm.cpu(0)
loaded_lib = tvm.runtime.load_module('lib_acl.so')
gen_module = graph_executor.GraphModule(loaded_lib['default'](dev))
d_data = np.random.uniform(0, 1, data_shape).astype(data_type)
map_inputs = {'data': d_data}
gen_module.set_input(**map_inputs)
gen_module.run()

More examples#

The example above shows only a basic example of how ACL can be used for offloading a single max_pool2d. If you would like to see more examples for each implemented operator and for networks, refer to the tests: tests/python/contrib/test_arm_compute_lib. Here you can modify test_config.json to configure how a remote device is created in infrastructure.py and, as a result, how runtime tests will be run.

An example configuration for test_config.json:

  • connection_type - The type of RPC connection. Options: local, tracker, remote.

  • host - The host device to connect to.

  • port - The port to use when connecting.

  • target - The target to use for compilation.

  • device_key - The device key when connecting via a tracker.

  • cross_compile - Path to a cross compiler when connecting from a non-Arm platform, e.g. aarch64-linux-gnu-g++.

{
  "connection_type": "local",
  "host": "127.0.0.1",
  "port": 9090,
  "target": "llvm -mtriple=aarch64-linux-gnu -mattr=+neon",
  "device_key": "",
  "cross_compile": ""
}
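Since a malformed test_config.json typically only surfaces once the tests try to connect, it can help to validate it up front. The following is a hypothetical checker, not part of the TVM test suite, covering the fields listed above:

```python
import json

# Hypothetical checker, not part of the TVM test suite: verifies that a
# test_config.json document carries the fields infrastructure.py reads,
# with a recognised connection_type.
REQUIRED_KEYS = {"connection_type", "host", "port", "target",
                 "device_key", "cross_compile"}
CONNECTION_TYPES = {"local", "tracker", "remote"}

def validate_test_config(text):
    config = json.loads(text)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError("missing keys: %s" % sorted(missing))
    if config["connection_type"] not in CONNECTION_TYPES:
        raise ValueError("unknown connection_type: %r" % config["connection_type"])
    return config
```

Running this over the example configuration above succeeds, while a file with an unrecognised connection_type or missing fields raises a ValueError before any connection is attempted.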

Operator support#

Each entry below gives the Relay node followed by remarks on the supported data types and composite patterns:

  • nn.conv2d - fp32: Simple: nn.conv2d; Composite: nn.pad?, nn.conv2d, nn.bias_add?, nn.relu?. Normal and depth-wise (when kernel is 3x3 or 5x5 and strides are 1x1 or 2x2) convolution supported. Grouped convolution is not supported.

  • qnn.conv2d - uint8: Composite: nn.pad?, nn.conv2d, nn.bias_add?, nn.relu?, qnn.requantize. Normal and depth-wise (when kernel is 3x3 or 5x5 and strides are 1x1 or 2x2) convolution supported. Grouped convolution is not supported.

  • nn.dense - fp32: Simple: nn.dense; Composite: nn.dense, nn.bias_add?

  • qnn.dense - uint8: Composite: qnn.dense, nn.bias_add?, qnn.requantize

  • nn.max_pool2d - fp32, uint8

  • nn.global_max_pool2d - fp32, uint8

  • nn.avg_pool2d - fp32: Simple: nn.avg_pool2d. uint8: Composite: cast(int32), nn.avg_pool2d, cast(uint8)

  • nn.global_avg_pool2d - fp32: Simple: nn.global_avg_pool2d. uint8: Composite: cast(int32), nn.avg_pool2d, cast(uint8)

  • power(of 2) + nn.avg_pool2d + sqrt - A special case for L2 pooling. fp32: Composite: power(of 2), nn.avg_pool2d, sqrt

  • reshape - fp32, uint8

  • maximum - fp32

  • add - fp32

  • qnn.add - uint8

Note

A composite operator is a series of operators that map to a single Arm Compute Library operator. You can view this as being a single fused operator from the view point of Arm Compute Library. ‘?’ denotes an optional operator in the series of operators that make up a composite operator.
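The '?' notation can be illustrated with a small self-contained sketch. This mimics the concept only (a teaching aid, not TVM code; the real composite matching is done by TVM's pattern machinery during partitioning):

```python
# Stand-alone illustration of the '?' notation above: match an ordered
# sequence of Relay op names against a composite pattern in which some
# members are optional. This is a teaching sketch, not TVM code.
def matches_composite(ops, pattern):
    """pattern is a list of (op_name, optional) pairs, in order."""
    i = 0
    for name, optional in pattern:
        if i < len(ops) and ops[i] == name:
            i += 1                      # consume a matching op
        elif not optional:
            return False                # a required op is missing
    return i == len(ops)                # no unmatched ops left over

# The fp32 conv2d composite from the table above:
# nn.pad?, nn.conv2d, nn.bias_add?, nn.relu?
CONV_PATTERN = [("nn.pad", True), ("nn.conv2d", False),
                ("nn.bias_add", True), ("nn.relu", True)]
```

Both a bare ["nn.conv2d"] and the full ["nn.pad", "nn.conv2d", "nn.bias_add", "nn.relu"] sequence satisfy CONV_PATTERN, while a sequence missing the required nn.conv2d does not.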

Adding a new operator#

Adding a new operator requires changes in a series of places. This section gives a hint on what needs to be changed and where; it does not go into the complexities of an individual operator. That is left to the developer.

A series of files need to be changed:

  • python/relay/op/contrib/arm_compute_lib.py - define the operators you wish to offload using the op.register decorator. This means the annotation pass will recognize the operator as offloadable to ACL.

  • src/relay/backend/contrib/arm_compute_lib/codegen.cc - implement a Create[OpName]JSONNode method. This is where you declare how the operator should be represented in JSON; it will be used to create the ACL module.

  • src/runtime/contrib/arm_compute_lib/acl_runtime.cc - implement a Create[OpName]Layer method. This is where you define how the JSON representation is used to create an ACL function; in short, how to translate from the JSON representation to the ACL API.

  • tests/python/contrib/test_arm_compute_lib - add unit tests for the given operator.