Device/Target Interactions

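This document is intended for developers who want to understand how the TVM framework interacts with specific device APIs, or who want to implement support for a new API or new hardware.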

Three main aspects must be implemented for any new runtime environment.

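  • The DeviceAPI class provides a handle to a specific device and the API used to interact with it. It defines a common interface for querying device parameters (e.g. available memory, number of threads) and for performing simple actions (e.g. copying memory from the host, or between buffers on the device).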

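  • The Target class contains a description of the device on which a function will run. It is exposed both to the target code generators and to the optimization passes.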

  • The target code generators construct a Module, consisting of one or more PackedFunc, from an IRModule.

DeviceAPI

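The DeviceAPI represents a handle to a specific hardware device API. (e.g. CUDADeviceAPI handles all interactions through the CUDA framework.) Most DeviceAPI methods accept a device_id parameter to specify which device should be accessed. In Python, these are typically accessed using the tvm.runtime.device() function, which returns a handle to a specific device, accessed through a specific API. (e.g. tvm.runtime.device('cuda',0) gives access to physical device 0, accessed through the CUDA API.)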

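  • Attribute queries - GetAttr allows different device-specific parameters to be queried, such as the device name, number of threads, etc. The parameters that can be queried are defined in enum DeviceAttrKind in device_api.h. Not all queryable parameters are supported by all devices. If a parameter cannot be queried (e.g. kMaxClockRate on Vulkan), or if a parameter is not applicable (e.g. kWarpSize on CPU), then those queries should return nullptr.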

  • Setting active device - SetDevice should set a particular device as being active. If a PackedFunc generated by the target-specific code gen requires execution on a device, it should run on the active device.

  • Memory management - Utilities for allocating and deallocating memory on the device.

    • Allocate data space - AllocDataSpace and FreeDataSpace allocate and free space on the device. These allocations can be provided as inputs and outputs to an operator and make up the primary data flow of the operator graph. It must be possible to transfer data between the host and a data space. The return value is an opaque void*. While some implementations return a memory address, this is not required, and the void* may be an opaque handle that is interpretable only by the device backend that generated it. The void* is used as an argument to other backend-specific functions, such as CopyDataFromTo.

    • Allocate work space - AllocWorkspace and FreeWorkspace allocate and free space on the device. Unlike data space, these are used for storage of intermediate values within an operator definition, and are not required to be transferable to/from the host device. If a DeviceAPI subclass does not implement these methods, they will default to calling the corresponding DataSpace functions.

    • Copy data - CopyDataFromTo should copy data from one location to another. The type of copy is determined by the dev_from and dev_to parameters. Implementations should support copying memory from CPU to device, from device to CPU, and from one buffer to another on a single device. If the source or destination location is on the CPU, the corresponding void* points to a CPU address that can be passed into memcpy. If the source or destination location is on the device, the corresponding void* was previously generated by either AllocDataSpace or AllocWorkspace.

      These copies are queued to execute on a specific TVMStreamHandle. However, implementations should not assume that CPU buffers remain valid or accessible after the call to CopyDataFromTo completes.

  • Execution stream management - Utilities for handling TVMStreamHandle, which represents parallel streams of execution used to execute commands.

    • Create stream - CreateStream and FreeStream should allocate/free a handle to a stream of execution. If a device implements only a single queue of commands, then CreateStream should return nullptr.

    • Set active stream - SetStream should set a stream as being active. While active, if a PackedFunc generated by the target-specific code gen requires execution on a device, the work should be submitted to the active stream.

    • Synchronize to CPU - StreamSync should synchronize a stream of execution to the CPU. The call to StreamSync should return once all memory transfers and computations submitted prior to the StreamSync call have completed.

    • Synchronize between streams - SyncStreamFromTo should introduce a synchronization barrier between the source and destination stream. That is, the destination stream may not proceed beyond commands currently queued until the source stream has completed all commands that are currently queued.
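
The interface described in the list above can be pictured with a small standalone model. The sketch below is not TVM's DeviceAPI: ToyDeviceAPI, HostBackedDevice, AttrKind, and every signature in it are hypothetical simplifications, and device memory is simulated with host vectors. It only illustrates three points from the list: attribute queries that report "unsupported" instead of failing, allocations returned as opaque handles rather than addresses, and workspace allocation falling back to the data-space functions. Streams are left out to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <map>
#include <vector>

// Hypothetical attribute kinds, loosely modeled on the idea of DeviceAttrKind.
enum class AttrKind { kMaxThreads, kWarpSize };

// Hypothetical, simplified interface. This is NOT TVM's DeviceAPI.
class ToyDeviceAPI {
 public:
  virtual ~ToyDeviceAPI() {}

  // Returns true and writes *out if the attribute is supported; returns false
  // otherwise (standing in for the "return nullptr" convention).
  virtual bool GetAttr(AttrKind kind, int* out) const = 0;

  // Allocations return an opaque handle that only this backend can interpret.
  virtual void* AllocDataSpace(std::size_t nbytes) = 0;
  virtual void FreeDataSpace(void* handle) = 0;

  // Workspace calls fall back to the data-space functions unless overridden.
  virtual void* AllocWorkspace(std::size_t nbytes) { return AllocDataSpace(nbytes); }
  virtual void FreeWorkspace(void* handle) { FreeDataSpace(handle); }

  virtual void CopyFromHost(const void* host_src, void* dev_dst, std::size_t nbytes) = 0;
  virtual void CopyToHost(void* dev_src, void* host_dst, std::size_t nbytes) = 0;
};

// A fake "device" whose memory lives in host vectors keyed by an integer id.
class HostBackedDevice : public ToyDeviceAPI {
 public:
  bool GetAttr(AttrKind kind, int* out) const override {
    if (kind == AttrKind::kMaxThreads) {
      *out = 1;
      return true;
    }
    return false;  // kWarpSize does not apply to this CPU-like device.
  }

  void* AllocDataSpace(std::size_t nbytes) override {
    std::uintptr_t id = next_id_++;
    buffers_[id] = std::vector<unsigned char>(nbytes);
    // The handle encodes an id, not a dereferenceable address.
    return reinterpret_cast<void*>(id);
  }

  void FreeDataSpace(void* handle) override {
    buffers_.erase(reinterpret_cast<std::uintptr_t>(handle));
  }

  void CopyFromHost(const void* host_src, void* dev_dst, std::size_t nbytes) override {
    std::vector<unsigned char>& buf = Lookup(dev_dst);
    assert(nbytes <= buf.size());
    std::memcpy(buf.data(), host_src, nbytes);
  }

  void CopyToHost(void* dev_src, void* host_dst, std::size_t nbytes) override {
    std::vector<unsigned char>& buf = Lookup(dev_src);
    assert(nbytes <= buf.size());
    std::memcpy(host_dst, buf.data(), nbytes);
  }

  // Number of buffers currently allocated (data space and workspace combined).
  std::size_t LiveAllocations() const { return buffers_.size(); }

 private:
  // Translate an opaque handle back into the buffer it names.
  std::vector<unsigned char>& Lookup(void* handle) {
    return buffers_.at(reinterpret_cast<std::uintptr_t>(handle));
  }

  std::uintptr_t next_id_ = 1;
  std::map<std::uintptr_t, std::vector<unsigned char> > buffers_;
};
```

Because HostBackedDevice does not override AllocWorkspace, a call to it lands in AllocDataSpace, which mirrors the default behavior described for DeviceAPI subclasses. Callers never dereference the returned void*; they only hand it back to the same backend.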

In order to be usable by the TVM framework, the new DeviceAPI should then be registered with the following steps.

  1. Create a function that instantiates the new DeviceAPI, and returns a pointer to it:

    FooDeviceAPI* FooDeviceAPI::Global() {
      static FooDeviceAPI inst;
      return &inst;
    }
    
  2. Register the function to the tvm registry:

    TVM_REGISTER_GLOBAL("device_api.foo").set_body_typed(FooDeviceAPI::Global);
    
  3. Add an entry for the new DeviceAPI to the TVMDeviceExtType enum in c_runtime_api.h. The value should be an unused value greater than DLDeviceType::kDLExtDev, but less than DeviceAPIManager::kMaxDeviceAPI.

  4. Add a case in DeviceName in device_api.h to convert from the enum value to a string representation. This string representation should match the name given to TVM_REGISTER_GLOBAL.

  5. Add entries to the MASK2STR and STR2MASK dictionaries of tvm.runtime.Device for the new enum value.
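
The registration steps above hinge on one naming convention: the string produced for a device type must match the suffix of the registered global. The following sketch models that lookup with a plain map. It does not use TVM's registry; ToyRegistry, RegisteredApi, ToyDeviceName, GetToyDeviceAPI, kToyFoo, and the numeric enum values are hypothetical stand-ins chosen for illustration.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>

// Hypothetical device-type enum; the values are arbitrary and illustrative.
enum ToyDeviceType { kToyCPU = 1, kToyFoo = 13 };

// Hypothetical placeholder for a device API object.
struct RegisteredApi {
  std::string backend;
};

// Stand-in for a global function registry keyed by name.
std::map<std::string, std::function<RegisteredApi*()> >& ToyRegistry() {
  static std::map<std::string, std::function<RegisteredApi*()> > registry;
  return registry;
}

// A function that instantiates the device API once and returns a pointer to it.
RegisteredApi* FooGlobal() {
  static RegisteredApi inst{"foo"};
  return &inst;
}

// Register the factory under the conventional "device_api.<name>" key.
void RegisterFoo() { ToyRegistry()["device_api.foo"] = FooGlobal; }

// Map the enum value to the name that was used at registration time.
std::string ToyDeviceName(int type) {
  switch (type) {
    case kToyCPU:
      return "cpu";
    case kToyFoo:
      return "foo";
    default:
      return "";
  }
}

// Resolve a device type to its API by building the registry key from its name.
// Returns nullptr when nothing is registered under that key.
RegisteredApi* GetToyDeviceAPI(int type) {
  auto it = ToyRegistry().find("device_api." + ToyDeviceName(type));
  return it == ToyRegistry().end() ? nullptr : it->second();
}
```

If the name returned by ToyDeviceName and the registered key ever drift apart, the lookup silently finds nothing, which is why the steps insist that the two strings match.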

Target Definition

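The Target object is a lookup table of properties about a physical device, including its hardware/driver limits and its capabilities. The Target is accessible during both the optimization and code generation stages. While the same Target class is used for all runtime targets, each runtime target may need to add target-specific options.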

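In target_kind.cc, add a new declaration of TVM_REGISTER_TARGET_KIND, passing a string name for the new target and the TVMDeviceExtType or DLDeviceType enum value for the device on which that target should run. Typically, the target name and the device name match. (e.g. The "cuda" target runs on the kDLCUDA device.) There are exceptions, such as when multiple different code generation targets can run on the same physical device. (e.g. The "llvm" and "c" targets both run on the kDLCPU device type.)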

All options for a specific target kind are added with the add_attr_option function, with optional default values. A Target parser can be added with set_target_parser to process any parameters that are dynamically based on other parameters or queried from device properties.

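The argument definition defines a parser that can unpack a string description of a target. This is done in the Target::Target(const String&) constructor in C++, which accepts a JSON-formatted string and is typically called through the tvm.target.Target Python object. For example, tvm.target.Target('{"kind": "cuda", "max_num_threads": 1024}') will create a cuda target while overriding the default maximum number of threads.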

In a code generator, the target properties can be accessed using target->GetAttr<T>(param_name) in C++, or through the target.attrs dictionary in Python.
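
As a rough illustration of how declared options, defaults, and overrides interact, the sketch below models a target kind as a set of option names with optional defaults. It is a standalone approximation, not TVM's Target implementation: ToyTargetKind, ToyTarget, AddAttrOption, the option names, and the integer-only attribute values are assumptions made to keep the example short, and no JSON parsing is attempted.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical target kind: a name plus the options it accepts.
class ToyTargetKind {
 public:
  explicit ToyTargetKind(const std::string& name) : name_(name) {}

  // Declare an option that has no default value.
  ToyTargetKind& AddAttrOption(const std::string& key) {
    declared_.insert(key);
    return *this;
  }

  // Declare an option together with its default value.
  ToyTargetKind& AddAttrOption(const std::string& key, int default_value) {
    declared_.insert(key);
    defaults_[key] = default_value;
    return *this;
  }

  bool Accepts(const std::string& key) const { return declared_.count(key) != 0; }
  const std::map<std::string, int>& defaults() const { return defaults_; }
  const std::string& name() const { return name_; }

 private:
  std::string name_;
  std::set<std::string> declared_;
  std::map<std::string, int> defaults_;
};

// Hypothetical target: defaults from the kind, overlaid with user overrides.
class ToyTarget {
 public:
  ToyTarget(const ToyTargetKind& kind, const std::map<std::string, int>& overrides)
      : attrs_(kind.defaults()) {
    for (const auto& kv : overrides) {
      // Reject options that the target kind never declared.
      if (!kind.Accepts(kv.first)) {
        throw std::invalid_argument("unknown option for " + kind.name() + ": " + kv.first);
      }
      attrs_[kv.first] = kv.second;
    }
  }

  // Returns true and writes *out when the attribute has a value, whether it
  // came from a default or from an override.
  bool GetAttr(const std::string& key, int* out) const {
    auto it = attrs_.find(key);
    if (it == attrs_.end()) return false;
    *out = it->second;
    return true;
  }

 private:
  std::map<std::string, int> attrs_;
};
```

A declared option without a default and without an override simply has no value, so a consumer has to handle the "absent" case rather than assume every option is populated.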

Target Code Generators

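The code generators take an optimized IRModule and convert it into an executable representation. Each code generator must be registered in order to be used by the TVM framework. This is done by registering a function named "target.build.foo", where foo is the same name that was used in the TVM_REGISTER_TARGET_KIND definition above.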

    tvm::runtime::Module GeneratorFooCode(IRModule mod, Target target);
    TVM_REGISTER_GLOBAL("target.build.foo").set_body_typed(GeneratorFooCode);

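The code generator takes two arguments. The first is the IRModule to compile, and the second is the Target that describes the device on which the code should run. Because the environment performing the compilation is not necessarily the same as the environment that will execute the code, code generators should not perform any attribute lookups on the device itself, and should instead access parameters stored in the Target.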

Each function in the input IRModule should be accessible by name in the output runtime::Module.
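
To make this contract more concrete, here is a self-contained sketch of a build-function registry and a generator that reads its parameters from the target description rather than from a device. None of these types are TVM's: ToyIRModule, ToyTargetDesc, ToyRuntimeModule, BuildRegistry, GenerateFooCode, and Build are hypothetical, and "code generation" is reduced to producing one callable per input function.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-ins for IRModule, Target, and runtime::Module.
struct ToyIRModule {
  std::vector<std::string> function_names;
};

struct ToyTargetDesc {
  std::string kind;
  std::map<std::string, int> attrs;
};

struct ToyRuntimeModule {
  std::map<std::string, std::function<int()> > functions;

  // Look up a compiled function by the name it had in the input module.
  // Returns an empty std::function when the name is unknown.
  std::function<int()> GetFunction(const std::string& name) const {
    auto it = functions.find(name);
    return it == functions.end() ? std::function<int()>() : it->second;
  }
};

typedef std::function<ToyRuntimeModule(const ToyIRModule&, const ToyTargetDesc&)> BuildFn;

// Stand-in for the global registry of build functions.
std::map<std::string, BuildFn>& BuildRegistry() {
  static std::map<std::string, BuildFn> registry;
  return registry;
}

// A toy generator for target kind "foo". It consults only the target's
// attributes, never a physical device, so it can run on a machine without one.
ToyRuntimeModule GenerateFooCode(const ToyIRModule& mod, const ToyTargetDesc& target) {
  auto it = target.attrs.find("max_num_threads");
  int max_threads = (it == target.attrs.end()) ? 1 : it->second;
  ToyRuntimeModule out;
  for (const std::string& name : mod.function_names) {
    // Every input function must be reachable by name in the output module.
    out.functions[name] = [max_threads]() { return max_threads; };
  }
  return out;
}

// Dispatch on the target kind using the "target.build.<kind>" naming convention.
ToyRuntimeModule Build(const ToyIRModule& mod, const ToyTargetDesc& target) {
  auto it = BuildRegistry().find("target.build." + target.kind);
  if (it == BuildRegistry().end()) {
    throw std::runtime_error("no code generator registered for " + target.kind);
  }
  return it->second(mod, target);
}
```

In this model, registering the generator amounts to BuildRegistry()["target.build.foo"] = GenerateFooCode; after that, Build selects it for any target whose kind is "foo". Each generated callable returns the thread limit it was compiled with, which makes it easy to check that the value came from the target description.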