Device/Target Interactions

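This document is intended as a guide for developers who want to understand how the TVM framework interacts with specific device APIs, or who want to implement support for a new API or new hardware.

Three main aspects must be implemented for any new runtime environment.

  • The DeviceAPI class provides a handle to a specific device, and the API used to interact with it. It defines a common interface for querying device parameters (e.g. available memory, number of threads) and for performing simple operations (e.g. copying memory from the host, or between buffers on the device).

  • The Target class contains a description of the device on which a function will run. It is exposed both to the target code generators and to the optimization passes.

  • The target code generators construct a Module, consisting of one or more PackedFunc, from an IRModule.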

DeviceAPI

The DeviceAPI represents a handle to a specific hardware device API (e.g. CUDADeviceAPI handles all interactions through the CUDA framework). Most DeviceAPI methods accept a device_id parameter to specify which device should be accessed. In Python, these are typically accessed using the tvm.runtime.device() function, which returns a handle to a specific device, accessed through a specific API (e.g. tvm.runtime.device('cuda', 0) gives access to physical device 0, accessed through the CUDA API).

  • Attribute queries - GetAttr allows different device-specific parameters to be queried, such as the device name, number of threads, etc. The parameters that can be queried are defined in enum DeviceAttrKind in device_api.h. Not all query-able parameters are supported by all devices. If a parameter cannot be queried (e.g. kMaxClockRate on Vulkan), or if a parameter isn't applicable (e.g. kWarpSize on CPU), then those queries should return nullptr.

  • Setting active device - SetDevice should set a particular device as being active. If a PackedFunc generated by the target-specific code gen requires execution on a device, it should run on the active device.

  • Memory management - Utilities for allocating and deallocating memory on the device.

    • Allocate data space - AllocDataSpace and FreeDataSpace allocate and free space on the device. These allocations can be provided as inputs and outputs to an operator and make up the primary data flow of the operator graph. It must be possible to transfer data from the host to/from a data space. The return value is an opaque void*. While some implementations return a memory address, this is not required, and the void* may be an opaque handle that is interpretable only by the device backend that generated it. The void* is used as an argument to other backend-specific functions, such as CopyDataFromTo.

    • Allocate work space - AllocWorkspace and FreeWorkspace allocate and free space on the device. Unlike data space, these are used for storage of intermediate values within an operator definition, and are not required to be transferable to/from the host device. If a DeviceAPI subclass does not implement these methods, they will default to calling the corresponding DataSpace functions.

    • Copy data - CopyDataFromTo should copy data from one location to another. The type of copy is determined by the dev_from and dev_to parameters. Implementations should support copying memory from CPU to device, from device to CPU, and from one buffer to another on a single device. If the source or destination location is on the CPU, the corresponding void* points to a CPU address that can be passed into memcpy. If the source or destination location is on the device, the corresponding void* was previously generated by either AllocDataSpace or AllocWorkspace.

      These copies are queued to execute on a specific TVMStreamHandle. However, implementations should not assume that CPU buffers remain valid or accessible after the call to CopyDataFromTo completes.

  • Execution stream management - Utilities for handling TVMStreamHandle, which represents parallel streams of execution used to execute commands.

    • Create stream - CreateStream and FreeStream should allocate/free a handle to a stream of execution. If a device implements only a single queue of commands, then CreateStream should return nullptr.

    • Set active stream - SetStream should set a stream as being active. While active, if a PackedFunc generated by the target-specific code gen requires execution on a device, the work should be submitted to the active stream.

    • Synchronize to CPU - StreamSync should synchronize a stream of execution to the CPU. The call to StreamSync should return once all memory transfers and computations submitted prior to the StreamSync call have completed.

    • Synchronize between streams - SyncStreamFromTo should introduce a synchronization barrier between the source and destination stream. That is, the destination stream may not proceed beyond commands currently queued until the source stream has completed all commands that are currently queued.
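
The memory-management rules above can be illustrated with a small standalone model. This is only a sketch: ToyDevice, ToyDeviceKind, ToyDeviceAPI and the LiveBuffers helper are hypothetical names invented for this example, the method signatures are deliberately simplified and are not the ones declared in device_api.h, and streams are left out entirely. It shows a backend whose AllocDataSpace returns an opaque handle rather than an address, workspace calls that fall back to the data-space calls, and a CopyDataFromTo that treats CPU-side pointers as plain addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <unordered_map>
#include <vector>

// Hypothetical, simplified stand-ins. These are NOT TVM's real declarations.
enum class ToyDeviceKind { kCPU, kFoo };

struct ToyDevice {
  ToyDeviceKind kind;
  int device_id;
};

class ToyDeviceAPI {
 public:
  virtual ~ToyDeviceAPI() = default;
  virtual void* AllocDataSpace(ToyDevice dev, std::size_t nbytes) = 0;
  virtual void FreeDataSpace(ToyDevice dev, void* handle) = 0;
  // Workspace calls default to the data-space calls unless a subclass overrides them.
  virtual void* AllocWorkspace(ToyDevice dev, std::size_t nbytes) {
    return AllocDataSpace(dev, nbytes);
  }
  virtual void FreeWorkspace(ToyDevice dev, void* handle) { FreeDataSpace(dev, handle); }
  virtual void CopyDataFromTo(const void* from, ToyDevice dev_from, void* to,
                              ToyDevice dev_to, std::size_t nbytes) = 0;
};

// A fake "foo" device whose memory lives in host-side vectors keyed by an integer id.
class FooDeviceAPI final : public ToyDeviceAPI {
 public:
  void* AllocDataSpace(ToyDevice, std::size_t nbytes) override {
    const std::uintptr_t id = next_id_++;
    buffers_[id].resize(nbytes);
    // The returned void* is an opaque handle (the id), not a usable address.
    return reinterpret_cast<void*>(id);
  }
  void FreeDataSpace(ToyDevice, void* handle) override {
    buffers_.erase(reinterpret_cast<std::uintptr_t>(handle));
  }
  void CopyDataFromTo(const void* from, ToyDevice dev_from, void* to, ToyDevice dev_to,
                      std::size_t nbytes) override {
    const void* src = Resolve(const_cast<void*>(from), dev_from, nbytes);
    void* dst = Resolve(to, dev_to, nbytes);
    std::memmove(dst, src, nbytes);
  }
  // Helper for this example only: number of allocations not yet freed.
  std::size_t LiveBuffers() const { return buffers_.size(); }

 private:
  // CPU pointers are real addresses; device pointers are handles to look up.
  void* Resolve(void* ptr, ToyDevice dev, std::size_t nbytes) {
    if (dev.kind == ToyDeviceKind::kCPU) return ptr;
    auto it = buffers_.find(reinterpret_cast<std::uintptr_t>(ptr));
    assert(it != buffers_.end() && nbytes <= it->second.size());
    return it->second.data();
  }
  std::uintptr_t next_id_ = 1;  // start at 1 so a handle is never a null pointer
  std::unordered_map<std::uintptr_t, std::vector<unsigned char>> buffers_;
};
```

In this model a caller would copy host data into a handle obtained from AllocDataSpace, copy between two handles on the same device, and copy back to the host, without ever dereferencing a handle directly. Only the backend that issued a handle knows how to interpret it.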

In order to be usable by the TVM framework, the new DeviceAPI should then be registered with the following steps.

  1. Create a function that instantiates the new DeviceAPI, and returns a pointer to it:

    FooDeviceAPI* FooDeviceAPI::Global() {
      static FooDeviceAPI inst;
      return &inst;
    }
    
  2. Register the function to the tvm registry:

    TVM_REGISTER_GLOBAL("device_api.foo").set_body_typed(FooDeviceAPI::Global);
    
  3. Add an entry for the new DeviceAPI to the TVMDeviceExtType enum in c_runtime_api.h. The value should be an unused value greater than DLDeviceType::kDLExtDev, but less than DeviceAPIManager::kMaxDeviceAPI.

  4. Add a case in DeviceName in device_api.h to convert from the enum value to a string representation. This string representation should match the name given to TVM_REGISTER_GLOBAL.

  5. Add entries to the MASK2STR and STR2MASK dictionaries of tvm.runtime.Device for the new enum value.
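
To see why the string returned by DeviceName has to agree with the registered name, it may help to look at a toy version of the lookup. The sketch below is an assumption-laden model, not TVM code: ToyRegistry, ToyRegister, ToyDeviceName, ToyGetAPI and the enum values are all hypothetical, and it simply assumes that the runtime finds a device API by prefixing the device name with "device_api." and looking the result up in a global table.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for illustration; not TVM's real types or registry.
struct ToyDeviceAPI {
  virtual ~ToyDeviceAPI() = default;
};

// Arbitrary example values; real values come from the runtime's own enums.
enum ToyDeviceType : int { kToyCPU = 1, kToyFoo = 100 };

// Enum value -> string name (the role DeviceName plays in the steps above).
inline const char* ToyDeviceName(int type) {
  switch (type) {
    case kToyCPU: return "cpu";
    case kToyFoo: return "foo";
    default: return "unknown";
  }
}

using ToyFactory = std::function<ToyDeviceAPI*()>;

inline std::map<std::string, ToyFactory>& ToyRegistry() {
  static std::map<std::string, ToyFactory> registry;
  return registry;
}

inline bool ToyRegister(const std::string& name, ToyFactory factory) {
  assert(ToyRegistry().count(name) == 0 && "name registered twice");
  ToyRegistry()[name] = std::move(factory);
  return true;
}

class FooDeviceAPI final : public ToyDeviceAPI {
 public:
  // Singleton accessor, mirroring the Global() function in step 1.
  static FooDeviceAPI* Global() {
    static FooDeviceAPI inst;
    return &inst;
  }
};

// Registration at static-initialization time, similar in spirit to step 2.
static const bool kFooRegistered = ToyRegister("device_api.foo", &FooDeviceAPI::Global);

// Lookup by device type: the name built here must equal the registered name.
inline ToyDeviceAPI* ToyGetAPI(int type) {
  const std::string key = std::string("device_api.") + ToyDeviceName(type);
  auto it = ToyRegistry().find(key);
  if (it == ToyRegistry().end()) throw std::runtime_error("no device API registered as " + key);
  return it->second();
}
```

In this model, ToyGetAPI(kToyFoo) returns the FooDeviceAPI singleton, while a device type whose name was never registered (here kToyCPU) fails at lookup time. A mismatch between the two strings would produce the same failure.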

Target Definition

The Target object is a lookup table of properties about a physical device, its hardware/driver limits, and its capabilities. The Target is accessible both during optimization and code generation stages. While the same Target class is used for all runtime targets, each runtime target may need to add target-specific options.

In target_kind.cc, add a new declaration of TVM_REGISTER_TARGET_KIND, passing a string name of the new target, and the TVMDeviceExtType or DLDeviceType enum value for the device on which that target should run. Typically, the target name and the device name will match. (e.g. The "cuda" target runs on the kDLCUDA device.) There are exceptions, such as when multiple different code generation targets can run on the same physical device. (e.g. The "llvm" and "c" targets both run on the kDLCPU device type.)

All options for a specific target kind are added with the add_attr_option function, with optional default values. A Target parser can be added with set_target_parser to process any parameters that are dynamically based on other parameters or queried from device properties.

These option definitions also determine how a string description of a target is parsed. Parsing is done in the Target::Target(const String&) constructor in C++, which accepts a JSON-formatted string and is typically called through the tvm.target.Target Python object. For example, tvm.target.Target('{"kind": "cuda", "max_num_threads": 1024}') will create a cuda target, while overriding the default maximum number of threads.

In a code generator, the target properties can be accessed using target->GetAttr<T>(param_name) in C++, or with the target.attrs dictionary in Python.
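
The relationship between declared options, defaults, and per-target overrides can be modeled in a few lines. The following is a toy sketch, not TVM's TargetKind or Target implementation: ToyTargetKind and ToyTarget are hypothetical classes, attribute values are restricted to integers, overrides are passed as a map instead of a JSON string, and rejecting undeclared options is an assumption of this model.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical toy model; integer-valued attributes only.
class ToyTargetKind {
 public:
  explicit ToyTargetKind(std::string name) : name_(std::move(name)) {}

  // Declare an option with no default value.
  ToyTargetKind& add_attr_option(const std::string& key) {
    declared_.insert(key);
    return *this;
  }
  // Declare an option together with its default value.
  ToyTargetKind& add_attr_option(const std::string& key, long default_value) {
    declared_.insert(key);
    defaults_[key] = default_value;
    return *this;
  }

  const std::string& name() const { return name_; }
  bool is_declared(const std::string& key) const { return declared_.count(key) != 0; }
  const std::map<std::string, long>& defaults() const { return defaults_; }

 private:
  std::string name_;
  std::set<std::string> declared_;
  std::map<std::string, long> defaults_;
};

class ToyTarget {
 public:
  // Start from the kind's defaults, then apply the caller's overrides.
  ToyTarget(const ToyTargetKind& kind, const std::map<std::string, long>& overrides)
      : attrs_(kind.defaults()) {
    for (const auto& kv : overrides) {
      if (!kind.is_declared(kv.first)) {
        throw std::invalid_argument("unknown option: " + kv.first);
      }
      attrs_[kv.first] = kv.second;
    }
  }

  // Returns a pointer to the value, or nullptr when the attribute is unset.
  const long* GetAttr(const std::string& key) const {
    auto it = attrs_.find(key);
    return it == attrs_.end() ? nullptr : &it->second;
  }

 private:
  std::map<std::string, long> attrs_;
};
```

With a kind that declares max_num_threads with a default and thread_warp_size without one, a target built with no overrides reports the default for the first and nothing for the second, and a target built with an override for max_num_threads reports the overridden value. A code generator written against this model has to handle the unset case explicitly.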

Target Code Generators

The code generators take an optimized IRModule and convert it into an executable representation. Each code generator must be registered in order to be used by the TVM framework. This is done by registering a function named "target.build.foo", where foo is the same name as was used in the TVM_REGISTER_TARGET_KIND definition above.

    tvm::runtime::Module GeneratorFooCode(IRModule mod, Target target);
    TVM_REGISTER_GLOBAL("target.build.foo").set_body_typed(GeneratorFooCode);

The code generator takes two arguments. The first is the IRModule to compile, and the second is the Target that describes the device on which the code should run. Because the environment performing the compilation is not necessarily the same as the environment that will be executing the code, code generators should not perform any attribute lookups on the device itself, and should instead access parameters stored in the Target.

Each function in the input IRModule should be accessible by name in the output runtime::Module.
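
The contract described above (dispatch by target kind, one callable per input function, reachable by name) can be sketched with a standalone model. Everything here is hypothetical and heavily simplified: ToyIRModule stores a single integer scale factor per function in place of real IR, ToyRuntimeModule, ToyBuildRegistry, ToyBuild and GenerateFooCode are invented for this example, and the "generated code" is just a lambda.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins; not TVM's Target, IRModule, or runtime::Module.
struct ToyTarget {
  std::string kind;
  std::map<std::string, long> attrs;
};

struct ToyIRModule {
  // Function name -> scale factor; stands in for a real function body.
  std::map<std::string, long> functions;
};

class ToyRuntimeModule {
 public:
  using Fn = std::function<long(long)>;
  void Add(const std::string& name, Fn fn) { functions_[name] = std::move(fn); }
  // Returns an empty std::function when the name is not present.
  Fn GetFunction(const std::string& name) const {
    auto it = functions_.find(name);
    return it == functions_.end() ? Fn() : it->second;
  }

 private:
  std::map<std::string, Fn> functions_;
};

using ToyBuildFn = std::function<ToyRuntimeModule(const ToyIRModule&, const ToyTarget&)>;

inline std::map<std::string, ToyBuildFn>& ToyBuildRegistry() {
  static std::map<std::string, ToyBuildFn> registry;
  return registry;
}

// Toy "code generator": emits one callable per input function, under the same name.
ToyRuntimeModule GenerateFooCode(const ToyIRModule& mod, const ToyTarget& target) {
  // A real generator would read its options from target here, never from the device.
  assert(target.kind == "foo");
  ToyRuntimeModule out;
  for (const auto& kv : mod.functions) {
    const long scale = kv.second;
    out.Add(kv.first, [scale](long x) { return scale * x; });
  }
  return out;
}

// Registered under "target.build." followed by the target kind name.
static const bool kFooBuildRegistered =
    (ToyBuildRegistry()["target.build.foo"] = GenerateFooCode, true);

// Dispatch on the target kind, then hand the module to the matching generator.
inline ToyRuntimeModule ToyBuild(const ToyIRModule& mod, const ToyTarget& target) {
  auto it = ToyBuildRegistry().find("target.build." + target.kind);
  if (it == ToyBuildRegistry().end()) {
    throw std::runtime_error("no code generator for target kind " + target.kind);
  }
  return it->second(mod, target);
}
```

In this model, building a module that contains functions named double_it and triple_it for a "foo" target yields a runtime module from which both can be fetched by those same names, while a target kind with no registered generator fails at dispatch.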