Device/Target Interactions
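This document is intended for developers who want to understand how the TVM framework interacts with specific device APIs, or who want to implement support for a new API or new hardware.
Three main components must be implemented for any new runtime environment:
The DeviceAPI class provides a handle to a specific device and the API used to interact with it. It defines a common interface for querying device parameters (e.g. available memory, number of threads) and for performing simple actions (e.g. copying memory from the host, or between buffers on the device).
The Target class contains a description of the device on which a function will run. It is exposed both to the target code generators and to the optimization passes.
The target code generators construct a Module, consisting of one or more PackedFunc, from an IRModule.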
DeviceAPI
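The DeviceAPI represents a handle to a specific hardware device API. (For example, CUDADeviceAPI handles all interactions through the CUDA framework.) Most DeviceAPI methods accept a device_id parameter to specify which device should be accessed. In Python, these are typically accessed through the tvm.runtime.device() function, which returns a handle to a specific device, accessed through a specific API. (For example, tvm.runtime.device('cuda', 0) gives access to physical device 0, accessed through the CUDA API.)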
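Attribute queries - GetAttr allows different device-specific parameters to be queried, such as the device name, number of threads, etc. The parameters that can be queried are defined in enum DeviceAttrKind in device_api.h. Not all queryable parameters are supported by all devices. If a parameter cannot be queried (e.g. kMaxClockRate on Vulkan), or if a parameter is not applicable (e.g. kWarpSize on CPU), then those queries should return nullptr.
Setting active device - SetDevice should set a particular device as being active. If a PackedFunc generated by the target-specific code gen requires execution on a device, it should run on the active device.
Memory management - Utilities for allocating and deallocating memory on the device.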
    Allocate data space - AllocDataSpace and FreeDataSpace allocate and free space on the device. These allocations can be provided as inputs and outputs to an operator and make up the primary data flow of the operator graph. It must be possible to transfer data between the host and a data space. The return value is an opaque void*. While some implementations return a memory address, this is not required, and the void* may be an opaque handle that is interpretable only by the device backend that generated it. The void* is used as an argument to other backend-specific functions, such as CopyDataFromTo.
    Allocate work space - AllocWorkspace and FreeWorkspace allocate and free space on the device. Unlike data space, these are used for storage of intermediate values within an operator definition, and are not required to be transferable to/from the host device. If a DeviceAPI subclass does not implement these methods, they will default to calling the corresponding DataSpace functions.
    Copy data - CopyDataFromTo should copy data from one location to another. The type of copy is determined by the dev_from and dev_to parameters. Implementations should support copying memory from CPU to device, from device to CPU, and from one buffer to another on a single device. If the source or destination location is on the CPU, the corresponding void* points to a CPU address that can be passed into memcpy. If the source or destination location is on the device, the corresponding void* was previously generated by either AllocDataSpace or AllocWorkspace.
    These copies are queued to execute on a specific TVMStreamHandle. However, implementations should not assume that CPU buffers remain valid or accessible after the call to CopyDataFromTo completes.
Execution stream management - Utilities for handling TVMStreamHandle, which represents parallel streams of execution used to execute commands.
    Create stream - CreateStream and FreeStream should allocate/free a handle to a stream of execution. If a device implements only a single queue of commands, then CreateStream should return nullptr.
    Set active stream - SetStream should set a stream as being active. While active, if a PackedFunc generated by the target-specific code gen requires execution on a device, the work should be submitted to the active stream.
    Synchronize to CPU - StreamSync should synchronize a stream of execution to the CPU. The call to StreamSync should return once all memory transfers and computations submitted prior to the StreamSync call have completed.
    Synchronize between streams - SyncStreamFromTo should introduce a synchronization barrier between the source and destination stream. That is, the destination stream may not proceed beyond commands currently queued until the source stream has completed all commands that are currently queued.
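The opaque-handle contract described for AllocDataSpace and CopyDataFromTo can be sketched with a toy backend. This is a minimal illustration only: ToyDeviceBackend and all of its methods are hypothetical stand-ins, not TVM's DeviceAPI, and "device memory" is simulated with host-side vectors. It assumes a backend that hands out integer ids cast to void*, so the returned value is not a dereferenceable address. The copies here run synchronously, which is one simple way to avoid relying on a CPU buffer after the call returns.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <unordered_map>
#include <vector>

// Hypothetical toy backend (NOT TVM's DeviceAPI). Device buffers are
// simulated by host vectors stored in a table keyed by an integer id.
class ToyDeviceBackend {
 public:
  // Returns an opaque handle: an id cast to void*, not a memory address.
  void* AllocDataSpace(std::size_t nbytes) {
    const std::uintptr_t id = next_id_++;
    buffers_[id].resize(nbytes);
    return reinterpret_cast<void*>(id);
  }

  void FreeDataSpace(void* handle) {
    buffers_.erase(reinterpret_cast<std::uintptr_t>(handle));
  }

  // CPU -> device: host_src is a real CPU pointer, dev_dst is a handle.
  void CopyHostToDevice(const void* host_src, void* dev_dst, std::size_t nbytes) {
    std::vector<unsigned char>& dst = Lookup(dev_dst, nbytes);
    if (nbytes != 0) std::memcpy(dst.data(), host_src, nbytes);
  }

  // Device -> CPU: dev_src is a handle, host_dst is a real CPU pointer.
  void CopyDeviceToHost(void* dev_src, void* host_dst, std::size_t nbytes) {
    std::vector<unsigned char>& src = Lookup(dev_src, nbytes);
    if (nbytes != 0) std::memcpy(host_dst, src.data(), nbytes);
  }

  // Device -> device on the same backend: both arguments are handles.
  void CopyDeviceToDevice(void* dev_src, void* dev_dst, std::size_t nbytes) {
    std::vector<unsigned char>& src = Lookup(dev_src, nbytes);
    std::vector<unsigned char>& dst = Lookup(dev_dst, nbytes);
    if (nbytes != 0) std::memmove(dst.data(), src.data(), nbytes);
  }

  std::size_t LiveBuffers() const { return buffers_.size(); }

 private:
  // Only the backend can interpret a handle; unknown handles throw.
  std::vector<unsigned char>& Lookup(void* handle, std::size_t nbytes) {
    std::vector<unsigned char>& buf =
        buffers_.at(reinterpret_cast<std::uintptr_t>(handle));
    if (nbytes > buf.size()) throw std::out_of_range("copy exceeds buffer size");
    return buf;
  }

  std::uintptr_t next_id_ = 1;
  std::unordered_map<std::uintptr_t, std::vector<unsigned char>> buffers_;
};
```

A caller would allocate two handles, copy a host array into the first, copy between the two handles, and read the result back to the host. At no point does the caller dereference a handle directly.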
To be usable by the TVM framework, the new DeviceAPI should then be registered with the following steps.
1. Create a function that instantiates the new DeviceAPI, and returns a pointer to it:
    FooDeviceAPI* FooDeviceAPI::Global() {
      static FooDeviceAPI inst;
      return &inst;
    }
2. Register the function to the tvm registry:
    TVM_REGISTER_GLOBAL("device_api.foo").set_body_typed(FooDeviceAPI::Global);
3. Add an entry for the new DeviceAPI to the TVMDeviceExtType enum in c_runtime_api.h. The value should be an unused value greater than DLDeviceType::kDLExtDev, but less than DeviceAPIManager::kMaxDeviceAPI.
4. Add a case in DeviceName in device_api.h to convert from the enum value to a string representation. This string representation should match the name given to TVM_REGISTER_GLOBAL.
5. Add entries to the MASK2STR and STR2MASK dictionaries of tvm.runtime.Device for the new enum value.
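The reason the DeviceName string has to match the registered name can be illustrated with a toy registry. This is a sketch under stated assumptions: every identifier below (ToyDeviceAPI, ToyRegister, ToyDeviceName, ToyGetDeviceAPI, the enum values) is hypothetical, and it assumes the framework finds a device API by building the key "device_api." plus the device name. It is not TVM's registry implementation.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical base class standing in for a device API.
struct ToyDeviceAPI {
  virtual ~ToyDeviceAPI() = default;
  virtual std::string Name() const = 0;
};

// Singleton accessor, mirroring the Global() pattern from the steps above.
struct FooToyDeviceAPI : ToyDeviceAPI {
  std::string Name() const override { return "foo"; }
  static FooToyDeviceAPI* Global() {
    static FooToyDeviceAPI inst;
    return &inst;
  }
};

// Hypothetical enum values, chosen only for this example.
enum ToyDeviceType { kToyCPU = 1, kToyFoo = 13 };

// Enum value -> string; must agree with the suffix of the registered name.
inline std::string ToyDeviceName(int type) {
  switch (type) {
    case kToyCPU: return "cpu";
    case kToyFoo: return "foo";
    default: throw std::invalid_argument("unknown device type");
  }
}

typedef std::function<ToyDeviceAPI*()> ToyFactory;

inline std::map<std::string, ToyFactory>& ToyRegistry() {
  static std::map<std::string, ToyFactory> registry;
  return registry;
}

inline void ToyRegister(const std::string& name, ToyFactory fn) {
  ToyRegistry()[name] = std::move(fn);
}

// Assumed lookup rule: key = "device_api." + device name.
inline ToyDeviceAPI* ToyGetDeviceAPI(int type) {
  const std::string key = "device_api." + ToyDeviceName(type);
  std::map<std::string, ToyFactory>::iterator it = ToyRegistry().find(key);
  if (it == ToyRegistry().end()) {
    throw std::runtime_error("no device API registered under " + key);
  }
  return it->second();
}
```

After calling ToyRegister("device_api.foo", FooToyDeviceAPI::Global), a lookup with kToyFoo returns the singleton. If the name returned by ToyDeviceName and the registered suffix disagreed, the lookup would fail, which is the failure mode that steps 2 and 4 are meant to prevent.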
Target Definition
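The Target object is a lookup table of properties about a physical device, its hardware/driver limits, and its capabilities. The Target is accessible during both the optimization and code generation stages. While the same Target class is used for all runtime targets, each runtime target may need to add target-specific options.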
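In target_kind.cc, add a new declaration of TVM_REGISTER_TARGET_KIND, passing a string name for the new target and the TVMDeviceExtType or DLDeviceType enum value for the device on which that target should run. Typically, the target name and the device name match. (For example, the "cuda" target runs on the kDLCUDA device.) There are exceptions, such as when multiple different code generation targets can run on the same physical device. (For example, the "llvm" and "c" targets both run on the kDLCPU device type.)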
All options for a specific target kind are added with the add_attr_option function, with optional default values. A Target parser can be added with set_target_parser to process any parameters that are computed dynamically from other parameters or queried from device properties.
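This set of attribute definitions also defines a parser that can unpack a string description of a target. This is done in the Target::Target(const String&) constructor in C++, which accepts a JSON-formatted string and is typically called through the tvm.target.Target Python object. For example, tvm.target.Target('{"kind": "cuda", "max_num_threads": 1024}') will create a cuda target while overriding the default maximum number of threads.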
In a code generator, the target properties can be accessed using target->GetAttr<T>(param_name) in C++, or with the target.attrs dictionary in Python.
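The interplay between per-kind attribute options, their default values, and user overrides can be sketched with a toy model. Everything below is hypothetical and simplified: ToyTargetKind and ToyTarget are not TVM classes, attributes are restricted to integers, and rejecting unknown keys is an assumption of this sketch. Only the method name add_attr_option is borrowed from the text above.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical model of a target kind: a named set of allowed attributes,
// each with an optional integer default.
class ToyTargetKind {
 public:
  struct Option {
    bool has_default = false;
    long default_value = 0;
  };

  explicit ToyTargetKind(const std::string& name) : name_(name) {}

  // Declare an attribute with no default value.
  ToyTargetKind& add_attr_option(const std::string& key) {
    options_[key] = Option();
    return *this;
  }

  // Declare an attribute with a default value.
  ToyTargetKind& add_attr_option(const std::string& key, long default_value) {
    Option opt;
    opt.has_default = true;
    opt.default_value = default_value;
    options_[key] = opt;
    return *this;
  }

  const std::string& name() const { return name_; }
  const std::map<std::string, Option>& options() const { return options_; }

 private:
  std::string name_;
  std::map<std::string, Option> options_;
};

// Hypothetical target instance: defaults first, then user overrides.
class ToyTarget {
 public:
  ToyTarget(const ToyTargetKind& kind, const std::map<std::string, long>& overrides)
      : kind_name_(kind.name()) {
    for (const auto& kv : kind.options()) {
      if (kv.second.has_default) attrs_[kv.first] = kv.second.default_value;
    }
    for (const auto& kv : overrides) {
      // Assumption of this sketch: undeclared attributes are an error.
      if (kind.options().count(kv.first) == 0) {
        throw std::invalid_argument("unknown attribute: " + kv.first);
      }
      attrs_[kv.first] = kv.second;
    }
  }

  // Returns a pointer to the value, or nullptr if the attribute is unset.
  const long* GetAttr(const std::string& key) const {
    std::map<std::string, long>::const_iterator it = attrs_.find(key);
    return it == attrs_.end() ? nullptr : &it->second;
  }

 private:
  std::string kind_name_;
  std::map<std::string, long> attrs_;
};
```

With a kind that declares "max_num_threads" with a default of 256, constructing a ToyTarget with the override {"max_num_threads": 1024} yields 1024 from GetAttr, while constructing it with no overrides yields the default. An attribute declared without a default and never overridden yields nullptr.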
Target Code Generators
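The code generators take an optimized IRModule and convert it into an executable representation. Each code generator must be registered in order to be used by the TVM framework. This is done by registering a function named "target.build.foo", where foo is the same name as was used in the TVM_REGISTER_TARGET_KIND definition above.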
tvm::runtime::Module GeneratorFooCode(IRModule mod, Target target);
TVM_REGISTER_GLOBAL("target.build.foo").set_body_typed(GeneratorFooCode);
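The code generator takes two arguments. The first is the IRModule to compile, and the second is the Target that describes the device on which the code should run. Because the environment performing the compilation is not necessarily the same as the environment that will execute the code, code generators should not perform any attribute lookups on the device itself, and should instead access parameters stored in the Target.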
Each function in the input IRModule should be accessible by name in the output runtime::Module.
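The requirement that every input function be reachable by name in the output can be sketched with a toy code generator. The types below (ToyFunc, ToyIRModule, ToyRuntimeModule) and the function ToyBuildFoo are hypothetical placeholders, not TVM's IRModule or runtime::Module. Each "IR function" is reduced to a linear expression so that the example stays self-contained.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical IR function: computes scale * x + offset.
struct ToyFunc {
  std::string name;
  long scale;
  long offset;
};

// Hypothetical input module: a list of named functions.
typedef std::vector<ToyFunc> ToyIRModule;

// Hypothetical output module: callables looked up by function name.
typedef std::map<std::string, std::function<long(long)>> ToyRuntimeModule;

// Toy "code generator": every function in the input module gets an
// entry with the same name in the output module.
inline ToyRuntimeModule ToyBuildFoo(const ToyIRModule& mod) {
  ToyRuntimeModule out;
  for (const ToyFunc& f : mod) {
    const long scale = f.scale;
    const long offset = f.offset;
    out[f.name] = [scale, offset](long x) { return scale * x + offset; };
  }
  return out;
}
```

Given an input module containing functions named "double" and "inc", the output module exposes callables under exactly those two names, so a caller can retrieve and invoke each one by name.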