Bring Your Own Codegen to TVM#

As the number of hardware devices targeted by deep learning workloads keeps increasing, so does the knowledge users need to achieve high performance on all of those devices. To free users from worrying about performance when developing new models, hardware backend providers either offer libraries of commonly used deep learning operators, such as DNNL (Intel oneDNN) or cuDNN, or offer frameworks such as TensorRT that let users describe their models in a certain way to achieve high performance. However, when users try a new library or device, they have to learn a new programming interface. As a result, the demand for a unified programming interface becomes increasingly important, in order to 1) let all users and hardware backend providers stand on the same page, and 2) provide a feasible solution that allows specialized hardware or libraries to support only the widely used operators on which they deliver extremely high performance, while falling back to general devices such as CPUs/GPUs for unsupported operators.

In this developer guide, we demonstrate how you, as a hardware backend provider, can easily implement your own codegen and register it as a Relay backend compiler to support your hardware device/library. This guide covers two types of codegen based on the graph representation you need:

1. You want to generate C code.

If your hardware already has a well-optimized C/C++ library, such as Intel CBLAS/MKL for CPUs or NVIDIA cuBLAS for GPUs, then this is what you are looking for. Fortunately, C source code modules are fully compatible with TVM runtime modules, which means the generated code can be compiled by any C/C++ compiler with proper compilation flags. The only tasks you have are to implement a codegen that generates C code for subgraphs and a C source module to integrate into the TVM runtime module. We will demonstrate how to implement a C codegen for your hardware in the following section.

2. You want to generate any other graph representation.

Your hardware may require other forms of graph representation, such as JSON. In this case, you need to implement not only a codegen but also a customized TVM runtime module to let the TVM runtime know how this graph representation should be executed. If your hardware already has a complete graph execution engine, such as TensorRT for GPUs, then this is a solution you can consider.

After you finish the codegen and runtime, you can then let your customers annotate their models with your customized tag to make use of them. A tutorial for end users on how to annotate and launch a specific codegen is here (TBA).

Implement a C Codegen#

In this part, we demonstrate how to implement a codegen that generates C code with pre-implemented operator functions. To keep it simple, our example codegen does not depend on third-party libraries. Instead, we manually implement two macros in C:

#define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)         \
    extern "C" void p_ID_(float* a, float* b, float* out) { \
        for (int64_t i = 0; i < p_DIM1_; ++i) {             \
            out[i] = a[i] p_OP_ b[i];                       \
        }                                                   \
    }

#define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
    extern "C" void p_ID_(float* a, float* b, float* out) {   \
        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
            for (int64_t j = 0; j < p_DIM2_; ++j) {           \
                int64_t k = i * p_DIM2_ + j;                  \
                out[k] = a[k] p_OP_ b[k];                     \
            }                                                 \
        }                                                     \
    }

With these two macros, we can generate binary operators for 1-D and 2-D tensors. For example, consider the following subgraph, assuming all inputs are 2-D tensors with shape (10, 10):

c_compiler_input0
       |
      add <-- c_compiler_input1
       |
    subtract <-- c_compiler_input2
       |
    multiply <-- c_compiler_input3
       |
      out

Our goal is to generate the following compilable code to execute the subgraph:

#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/packed_func.h>
#include <dlpack/dlpack.h>
#include <cstdint>
#include <cstring>
#include <iostream>

#define GCC_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)           \
  extern "C" void p_ID_(float* a, float* b, float* out) { \
    for (int64_t i = 0; i < p_DIM1_; ++i) {               \
      out[i] = a[i] p_OP_ b[i];                           \
    }                                                     \
  }

#define GCC_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
  extern "C" void p_ID_(float* a, float* b, float* out) { \
    for (int64_t i = 0; i < p_DIM1_; ++i) {               \
      for (int64_t j = 0; j < p_DIM2_; ++j) {             \
        int64_t k = i * p_DIM2_ + j;                      \
        out[k] = a[k] p_OP_ b[k];                         \
      }                                                   \
    }                                                     \
  }

// Note 1
GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);
GCC_BINARY_OP_2D(gcc_0_1, -, 10, 10);
GCC_BINARY_OP_2D(gcc_0_2, +, 10, 10);

// Note 2
extern "C" void gcc_0_(float* gcc_input0, float* gcc_input1,
                       float* gcc_input2, float* gcc_input3, float* out) {
  float* buf_0 = (float*)malloc(4 * 100);
  float* buf_1 = (float*)malloc(4 * 100);
  gcc_0_2(gcc_input0, gcc_input1, buf_0);
  gcc_0_1(buf_0, gcc_input2, buf_1);
  gcc_0_0(buf_1, gcc_input3, out);
  free(buf_0);
  free(buf_1);
}

// Note 3
extern "C" int gcc_0_wrapper(DLTensor* arg0, DLTensor* arg1, DLTensor* arg2,
                             DLTensor* arg3, DLTensor* out) {
  gcc_0_(static_cast<float*>(arg0->data), static_cast<float*>(arg1->data),
         static_cast<float*>(arg2->data), static_cast<float*>(arg3->data),
         static_cast<float*>(out->data));
  return 0;
}
TVM_DLL_EXPORT_TYPED_FUNC(gcc_0, gcc_0_wrapper);

Here we highlight the notes marked in the above code:

  • Note 1 is the function implementation for the three nodes in the subgraph.

  • Note 2 is a function that executes the subgraph by allocating intermediate buffers and invoking the corresponding functions.

  • Note 3 is a TVM runtime compatible wrapper function. It accepts a list of input tensors and one output tensor (the last argument), casts them to the right data type, and invokes the subgraph function described in Note 2. In addition, TVM_DLL_EXPORT_TYPED_FUNC is a TVM macro that generates another function, gcc_0, with unified function arguments by packing all tensors into TVMArgs. As a result, the TVM runtime can directly invoke gcc_0 to execute the subgraph without additional effort. With the above code generated, TVM is able to compile it along with the rest of the graph and export a single library for deployment. A small usage sketch follows below.
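
For illustration only, once such a library has been exported and loaded back, the gcc_0 symbol produced by TVM_DLL_EXPORT_TYPED_FUNC behaves like any other packed function. The file name deploy.so and the buffers below are hypothetical and just sketch the calling convention:

// Hypothetical usage sketch: load the exported library and fetch the packed
// function registered under the symbol "gcc_0".
tvm::runtime::Module lib = tvm::runtime::Module::LoadFromFile("deploy.so");
tvm::runtime::PackedFunc gcc0 = lib.GetFunction("gcc_0");

// Allocate (10, 10) float32 CPU buffers for the four inputs and the output.
auto alloc = []() {
  return tvm::runtime::NDArray::Empty({10, 10}, DLDataType{kDLFloat, 32, 1}, DLDevice{kDLCPU, 0});
};
tvm::runtime::NDArray in0 = alloc(), in1 = alloc(), in2 = alloc(), in3 = alloc(), out = alloc();

// The wrapper generated by TVM_DLL_EXPORT_TYPED_FUNC unpacks the arguments
// and forwards them to gcc_0_wrapper / gcc_0_.
gcc0(in0, in1, in2, in3, out);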

In the rest of this section, we will implement a codegen step by step to generate the above code. Your own codegen has to be located at src/relay/backend/contrib/<your-codegen-name>/. In our example, we name our codegen "codegen_c" and put it under src/relay/backend/contrib/codegen_c/.

Specifically, we are going to implement two classes in this file, and here is their relationship:

                     subgraph                                subgraph
TVM backend -----------------------------> CSourceCodegen -------------> CodegenC
       ^                                       |    ^                       |
       |                                       |    |                       |
       ----------------------------------------      ------------------------
          generated C source runtime module              generated C code

When the TVM backend finds a function (subgraph) in the Relay graph annotated with the registered compiler tag (ccompiler in this example), the TVM backend invokes CSourceCodegen and passes the subgraph to it. CSourceCodegen's member function CreateCSourceModule will 1) generate C code for the subgraph, and 2) wrap the generated C code into a C source runtime module for the TVM backend to compile and deploy. In particular, the C code generation is transparent to the CodegenC class because it provides many useful utilities to ease the code generation implementation. The following sections implement these two classes in bottom-up order.

Implement CodegenC#

In src/relay/backend/contrib/codegen_c/codegen.cc, we first create a codegen class skeleton under the tvm.relay.contrib namespace:

#include <tvm/relay/expr_functor.h>
#include <tvm/relay/transform.h>
#include <tvm/relay/type.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/object.h>

#include <fstream>
#include <sstream>

#include "codegen_c.h"

namespace tvm {
namespace relay {
namespace contrib {

class CodegenC : public ExprVisitor, public CodegenCBase {
  public:
    explicit CodegenC(const std::string& id) { this->ext_func_id_ = id; }

    void VisitExpr_(const VarNode* node) { ; }
    void VisitExpr_(const CallNode* call) final { ; }
    std::string JIT() { ; }

  private:
    /*! \brief The function id that represents a C source function. */
    std::string ext_func_id_ = "";
    /*! \brief The index of a wrapped C function. */
    int func_idx = 0;
    /*! \brief The index of allocated buffers. */
    int buf_idx_ = 0;
    /*! \brief The arguments of a C compiler compatible function. */
    std::vector<std::string> ext_func_args_;
    /*! \brief The statements of a C compiler compatible function. */
    std::vector<std::string> ext_func_body;
    /*! \brief The declaration statements of a C compiler compatible function. */
    std::vector<std::string> func_decl_;
    /*! \brief The declaration statements of buffers. */
    std::vector<std::string> buf_decl_;
    /*! \brief The name and index pairs for output. */
    std::vector<std::pair<std::string, int>> out_;
};

The CodegenC class inherits two classes: ExprVisitor provides the ability to traverse subgraphs and collect the required information, from which we generate subgraph functions such as gcc_0_; CodegenCBase provides the ability to generate wrapper functions such as gcc_0 in the above example. As can be seen, we only need to implement three functions in this codegen class to make it work.

Code Generation for Operators#

We first implement VisitExpr_(const CallNode* call). This function visits all call nodes when traversing the subgraph. Each call node contains an operator that we want to offload to your hardware, so we need to generate the corresponding C code with the correct operators in topological order. We implement this function step by step as follows.

1. Generate the function declaration

Example result: GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);

To generate the function declaration shown above, we need 1) a function name (e.g., gcc_0_0), 2) the type of operator (e.g., *), and 3) the input tensor shape (e.g., (10, 10)). Fortunately, this information can be obtained easily from the CallNode:

std::ostringstream macro_stream;
std::ostringstream decl_stream;
std::ostringstream buf_stream;

// Generate a unique function name you like.
std::string func_name = ext_func_id_ + "_" + std::to_string(func_idx++);

// Make function declaration string.
macro_stream << "CSOURCE_BINARY_OP_" << call->args.size() << "D(" << func_name << ", ";

// Check the operator type.
if (IsOp(call, "add")) {
  macro_stream << "+";
} else if (IsOp(call, "subtract")) {
  macro_stream << "-";
} else if (IsOp(call, "multiply")) {
  macro_stream << "*";
} else {
  LOG(FATAL) << "Unrecognized op";
}

// Extract the input tensor shape.
auto in_shape = GetShape(call->args[0]->checked_type());
for (size_t i = 0; i < in_shape.size(); ++i) {
  macro_stream << ", " << in_shape[i];
}
macro_stream << ");";
func_decl_.push_back(macro_stream.str());

As can be seen, we push the generated code into the class member variable func_decl_. It means that after we finish traversing the entire subgraph, we have collected all required function declarations, and the only thing left to do is to have GCC compile them. The rest of the implementation of VisitExpr_(const CallNode* call) also follows this concept.
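
For the example subgraph, once the whole traversal is done, func_decl_ would roughly contain three declarations such as the following (the CSOURCE_BINARY_OP_ prefix matches what the code above emits; the GCC_ prefix used in the goal code earlier plays the same role):

CSOURCE_BINARY_OP_2D(gcc_0_0, *, 10, 10);
CSOURCE_BINARY_OP_2D(gcc_0_1, -, 10, 10);
CSOURCE_BINARY_OP_2D(gcc_0_2, +, 10, 10);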

2. Generate the function call

Example result: gcc_0_0(buf_1, gcc_input3, out);

After generating the function declaration, we need to generate a function call with proper inputs and outputs. To know which inputs or buffers we should put when calling this function, we have to visit its arguments:

bool first = true;
decl_stream << func_name << "(";
for (size_t i = 0; i < call->args.size(); ++i) {
  VisitExpr(call->args[i]); // Note 1
  for (auto out : out_) {
    if (!first) {
      decl_stream << ", ";
    }
    first = false;
    decl_stream << out.first;
  }
}
// Note 2

Again, we highlight the notes in the above code:

Note 1VisitExpr(call->args[i]) 是对当前函数参数的递归调用。一个参数可能是另一个节点的输出或输入张量。在我们的示例实现中,我们确保在访问者离开之前每个节点都更新了一个类变量 out_。以下是一个例子:

  arg_node                 arg_node <- Visit arg (Note 1)       arg_node
     |                        |                                    |
 curr_node <- Process      curr_node                            curr_node <- Put "buf_0" as an input buffer

(a) out_ = {}            (b) out_ = {}                   (c) out_ = {("buf_0", 20)}

As can be seen in the figure above, the class variable out_ is empty before visiting the argument node, and it is filled with the output buffer name and size of arg_node afterwards. As a result, when we have finished visiting the argument node, we know the proper input buffer to use by looking at out_. You will find how we update out_ at the end of this section as well as in the next section.

Note 2: You may notice that we did not close the function call string in this step. The current function call string looks like gcc_0_0(buf_1, gcc_input3. This is because we have not put the last argument (i.e., the output) into this call yet. The output of a function call could be either an allocated temporary buffer or the subgraph output tensor. For simplification, in this example we allocate an output buffer for every call node (next step) and copy the result of the last buffer to the output tensor.

3. Generate the output buffer

Example result: float* buf_0 = (float*)malloc(4 * 100);

As mentioned in the previous step, in addition to the subgraph input and output tensors, we may also need buffers to keep intermediate results. To generate a buffer, we extract the shape information to determine the buffer type and size:

// This example only supports single output.
auto type_node = call->checked_type().as<TensorTypeNode>();
ICHECK(type_node != nullptr && runtime::TypeMatch(type_node->dtype, kDLFloat, 32))
      << "Only support single output tensor with float type";

// Generate a unique buffer name.
std::string out = "buf_" + std::to_string(buf_idx_++);

// Extract the shape to be the buffer size.
auto out_shape = GetShape(call->checked_type());
int out_size = 1;
for (size_t i = 0; i < out_shape.size(); ++i) {
  out_size *= out_shape[i];
}

// Make the buffer allocation and push to the buffer declarations.
buf_stream << "float* " << out << " = (float*)std::malloc(4 * " << out_size << ");";
buf_decl_.push_back(buf_stream.str());

After we have allocated the output buffer, we can now close the function call string and push the generated function call to the class variable ext_func_body:

decl_stream << ", " << out << ");";
ext_func_body.push_back(decl_stream.str());

4. Update the output buffer

To let the next node, which accepts the output of the current call node as its input, know which buffer it should take, we need to update the class variable out_ before leaving this visit function:

out_.clear();
out_.push_back({out, out_size});

Congratulations! We have finished the most difficult function in this class. In the next two sections, we only need to make up some minor missing parts in this class.
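
For reference, the four steps above can be stitched together into one function. The following is a condensed sketch that simply assembles the snippets from this section; it is not a drop-in replacement for the full implementation in the TVM source tree:

void VisitExpr_(const CallNode* call) final {
  std::ostringstream macro_stream;
  std::ostringstream decl_stream;
  std::ostringstream buf_stream;

  // Step 1: generate the function declaration macro.
  std::string func_name = ext_func_id_ + "_" + std::to_string(func_idx++);
  macro_stream << "CSOURCE_BINARY_OP_" << call->args.size() << "D(" << func_name << ", ";
  if (IsOp(call, "add")) {
    macro_stream << "+";
  } else if (IsOp(call, "subtract")) {
    macro_stream << "-";
  } else if (IsOp(call, "multiply")) {
    macro_stream << "*";
  } else {
    LOG(FATAL) << "Unrecognized op";
  }
  auto in_shape = GetShape(call->args[0]->checked_type());
  for (size_t i = 0; i < in_shape.size(); ++i) {
    macro_stream << ", " << in_shape[i];
  }
  macro_stream << ");";
  func_decl_.push_back(macro_stream.str());

  // Step 2: generate the function call, visiting arguments recursively so that
  // out_ tells us which buffer or input each argument corresponds to.
  bool first = true;
  decl_stream << func_name << "(";
  for (size_t i = 0; i < call->args.size(); ++i) {
    VisitExpr(call->args[i]);
    for (auto out : out_) {
      if (!first) {
        decl_stream << ", ";
      }
      first = false;
      decl_stream << out.first;
    }
  }

  // Step 3: allocate an output buffer for this call node.
  auto type_node = call->checked_type().as<TensorTypeNode>();
  ICHECK(type_node != nullptr && runtime::TypeMatch(type_node->dtype, kDLFloat, 32))
      << "Only support single output tensor with float type";
  std::string out = "buf_" + std::to_string(buf_idx_++);
  auto out_shape = GetShape(call->checked_type());
  int out_size = 1;
  for (size_t i = 0; i < out_shape.size(); ++i) {
    out_size *= out_shape[i];
  }
  buf_stream << "float* " << out << " = (float*)std::malloc(4 * " << out_size << ");";
  buf_decl_.push_back(buf_stream.str());

  // Close the call with the output buffer and record the statement.
  decl_stream << ", " << out << ");";
  ext_func_body.push_back(decl_stream.str());

  // Step 4: update out_ so that the parent node knows which buffer to consume.
  out_.clear();
  out_.push_back({out, out_size});
}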

Code Generation for Input Variables#

Recall that we collected the input buffer information by visiting the arguments of a call node (2nd step in the previous section), and handled the case when its argument is another call node (4th step). In this section, we demonstrate how to handle other nodes by taking VarNode as an example.

VarNode represents input tensors in a model. The only but important information it has is a name hint (e.g., data, weight, etc). When visiting a VarNode, we simply update class variable out_ to pass the name hint so that the descendant call nodes can generate the correct function call.

void VisitExpr_(const VarNode* node) {
  ext_func_args_.push_back(node->name_hint());
  out_.clear();
  out_.push_back({node->name_hint(), 0});
}

Note that in this example we assume the subgraph we are offloading has only call nodes and variable nodes. If your subgraphs contain other types of nodes, such as TupleNode, then you also need to visit them and bypass the output buffer information.
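
As a rough illustration, a bypass for TupleGetItemNode could look like the following. Treat it as a sketch only; how your codegen names and selects tuple fields is up to you:

void VisitExpr_(const TupleGetItemNode* op) final {
  // Visit the producer so that out_ holds its output buffer information.
  VisitExpr(op->tuple);
  // Keep only the field selected by op->index as the output seen by the parent
  // node. This assumes the producer pushed one out_ entry per tuple field.
  ICHECK_GT(out_.size(), static_cast<size_t>(op->index));
  auto selected = out_[op->index];
  out_.clear();
  out_.push_back(selected);
}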

Code Emitting#

The final part in this codegen class is a JIT function that emits a C function for the subgraph and uses the C code we just generated as the function body. Remember, in addition to the subgraph function we generated in the previous sections, we also need a wrapper function with a unified argument for TVM runtime to invoke and pass data. Fortunately, the base class we inherited already provides an implementation, JitImpl, to generate the function. For example, we can invoke JitImpl as follows:

JitImpl("gcc_0" /* Subgraph symbol (ID) */,
        {"gcc_input0", "gcc_input1", "gcc_input2", "gcc_input3"} /* Input arguments */,
        {"float *buf_0 = (float*)malloc(4 * 20)", ...} /* Buffer allocations */,
        {"gcc_0_2(gcc_input0, gcc_input1, buf_0);"} /* Function body */,
        {"out"} /* Output */);

The above call will generate three functions (one from the TVM wrapper macro):

  1. The subgraph function gcc_0_ (with one more underscore at the end of the function name) with all the C code we generated to execute the subgraph.

  2. The wrapper function gcc_0__wrapper_ with a list of DLTensor arguments that casts data to the right type and invokes gcc_0_.

  3. The TVM runtime compatible function gcc_0 with TVM unified function arguments that unpacks TVM packed tensors and invokes gcc_0__wrapper_.

Accordingly, the only thing we need in JIT implementation is passing all subgraph function code we generated to JitImpl:

std::string JIT() {
  // Write function macros
  for (auto decl : func_decl_) {
    code_stream_ << decl << "\n";
  }
  return JitImpl(ext_func_id_, ext_func_args_, buf_decl_, ext_func_body, out_);
}

All variables (ext_func_id, etc) we passed are class variables and were filled when we traversed the subgraph.

Implement CSourceCodegen#

Again, let’s create a class skeleton and implement the required functions. Note that it inherits CSourceModuleCodegenBase:

class CSourceCodegen : public CSourceModuleCodegenBase {
 public:
  // Pass a subgraph function, and generate the C code.
  void GenCFunc(const Function& func) { ; }

  // Use GenCFunc to generate the C code and wrap it as a C source module.
  runtime::Module CreateCSourceModule(const NodeRef& ref) override { ; }

 private:
  std::ostringstream code_stream_;
};

Implement GenCFunc#

GenCFunc simply uses the CodegenC we just implemented to traverse a Relay function (subgraph) and obtains the generated C code. The builtin function GetExtSymbol retrieves a unique symbol name (e.g., gcc_0) in the Relay function and we must use it as the C function name, because this symbol is going to be used for DSO runtime lookup.

void GenCFunc(const Function& func) {
  ICHECK(func.defined()) << "Input error: expect a Relay function.";

  // Record the external symbol for runtime lookup.
  auto sid = GetExtSymbol(func);

  CodegenC builder(sid);
  builder.VisitExpr(func->body);
  code_stream_ << builder.JIT();
}

Implement CreateCSourceModule#

This function creates a runtime module for the external library. In this example, we create a CSourceModule that can be directly compiled and linked together with a TVM generated DSOModule. After you have implemented CodegenC, implementing this function is relatively straightforward:

runtime::Module CreateCSourceModule(const NodeRef& ref) override {
  // Create headers
  code_stream_ << "#include <cstdint>\n";
  code_stream_ << "#include <iostream>\n";
  code_stream_ << "#include <cstdlib>\n";
  code_stream_ << "#include <stdio.h>\n";
  code_stream_ << "#include <cstring>\n";
  code_stream_ << "#include <tvm/runtime/c_runtime_api.h>\n";
  code_stream_ << "#include <dlpack/dlpack.h>\n";

  // Append some common macro for operator definition.
  const char* operator_macro = R"op_macro(
  #define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)       \
    extern "C" void p_ID_(float* a, float* b, float* out) { \
      for (int64_t i = 0; i < p_DIM1_; ++i) {               \
        out[i] = a[i] p_OP_ b[i];                           \
      }                                                     \
    }

  #define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
    extern "C" void p_ID_(float* a, float* b, float* out) {     \
      for (int64_t i = 0; i < p_DIM1_; ++i) {                   \
        for (int64_t j = 0; j < p_DIM2_; ++j) {                 \
          int64_t k = i * p_DIM2_ + j;                          \
          out[k] = a[k] p_OP_ b[k];                             \
        }                                                       \
      }                                                         \
    }
  )op_macro";

  code_stream_ << operator_macro << "\n\n";

  // Generate C code for the subgraph.
  if (ref->IsInstance<FunctionNode>()) {
    GenCFunc(Downcast<Function>(ref));
  } else if (ref->IsInstance<relay::ModuleNode>()) {
    relay::Module mod = Downcast<relay::Module>(ref);
    for (const auto& it : mod->functions) {
      GenCFunc(Downcast<Function>(it.second));
    }
  } else {
    LOG(FATAL) << "The input ref is expected to be a Relay function or module"
               << "\n";
  }

  // Create a CSourceModule
  const auto* pf = runtime::Registry::Get("module.csource_module_create");
  ICHECK(pf != nullptr) << "Cannot find csource module to create the external runtime module";
  return (*pf)(code_stream_.str(), "cc");
}

Register Your Codegen#

The last step is registering your codegen to the TVM backend. We first implement a simple function to invoke our codegen and generate a runtime module.

runtime::Module CCompiler(const NodeRef& ref) {
  CSourceCodegen csource;
  return csource.CreateCSourceModule(ref);
}

Then, we register this function to the TVM backend:

TVM_REGISTER_GLOBAL("relay.ext.ccompiler").set_body_typed(CCompiler);

where ccompiler is a customized tag to let TVM know this is the codegen it should use to generate and offload subgraphs when the subgraph is annotated with ccompiler.
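
Conceptually, this registration makes the codegen discoverable by name. A simplified view of what the TVM backend does for a subgraph annotated with ccompiler looks like the following sketch, where ref stands for the annotated Relay function:

// Simplified sketch: the backend looks up the registered external codegen by
// name and invokes it to obtain the external runtime module.
const auto* pf = tvm::runtime::Registry::Get("relay.ext.ccompiler");
ICHECK(pf != nullptr) << "Cannot find the registered external codegen";
runtime::Module ext_mod = (*pf)(ref);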

Finally, a good practice is to set up a CMake configuration flag to include your compiler only for your customers. We first create a cmake file: cmake/modules/contrib/CODEGENC.cmake:

if(USE_CODEGENC)
  file(GLOB CSOURCE_RELAY_CONTRIB_SRC src/relay/backend/contrib/codegen_c/codegen.cc)
  list(APPEND COMPILER_SRCS ${CSOURCE_RELAY_CONTRIB_SRC})
endif(USE_CODEGENC)

So that users can configure whether to include your compiler when configuring TVM using config.cmake:

set(USE_CODEGENC ON)

Implement a Codegen for Your Representation#

Although we have demonstrated how to implement a C codegen, your hardware may require other forms of graph representation, such as JSON. In this case, you could modify the CodegenC class we have implemented to generate your own graph representation, and implement a customized runtime module to let the TVM runtime know how this graph representation should be executed.

To simplify, we define a graph representation named “ExampleJSON” in this guide. ExampleJSON is not real JSON but just a simple representation of graphs without control flow. For example, assume we have the following subgraph named subgraph_0:

 input0
   |
  add <-- input1
   |
subtract <-- input2
   |
multiply <-- input3
   |
  out

Then the ExampleJSON of this subgraph looks like:

subgraph_0
  input 0 10 10
  input 1 10 10
  input 2 10 10
  input 3 10 10
  add 4 inputs: 0 1 shape: 10 10
  sub 5 inputs: 4 2 shape: 10 10
  mul 6 inputs: 5 3 shape: 10 10

The input keyword declares an input tensor with its ID and shape, while the other statements describe computations with the <op> <output ID> inputs: [input ID] shape: [shape] syntax.

In this section, our goal is to implement the following customized TVM runtime module to execute ExampleJSON graphs.

runtime::Module ExampleJsonCompiler(const NodeRef& ref) {
    ExampleJsonCodeGen codegen;
    std::string code = codegen.gen(ref); // Note 1
    const auto* pf = runtime::Registry::Get("module.examplejson_module_create"); // Note 2
    ICHECK(pf != nullptr) << "Cannot find ExampleJson module to create the external runtime module";
    return (*pf)(code);
}
TVM_REGISTER_GLOBAL("relay.ext.examplejsoncompiler").set_body_typed(ExampleJsonCompiler);

Note 1: We will implement a customized codegen later to generate an ExampleJSON code string from a subgraph.

Note 2: This line obtains a pointer to a function for creating the customized runtime module. You can see that it takes subgraph code in ExampleJSON format we just generated and initializes a runtime module.

In the following sections, we are going to introduce 1) how to implement ExampleJsonCodeGen and 2) how to implement and register examplejson_module_create.

Implement ExampleJsonCodeGen#

Similar to the C codegen, we also derive ExampleJsonCodeGen from ExprVisitor to make use of the visitor pattern for subgraph traversal. On the other hand, we do not have to inherit CodegenCBase because we do not need the TVM C++ wrappers. The codegen class is implemented as follows:

#include <tvm/relay/expr_functor.h>
#include <tvm/relay/transform.h>
#include <tvm/relay/type.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/object.h>

#include <fstream>
#include <sstream>

namespace tvm {
namespace relay {
namespace contrib {

class ExampleJsonCodeGen : public ExprVisitor {
  public:
    explicit ExampleJsonCodeGen();

    // Note 1
    void VisitExpr_(const VarNode* node) { /* Skip in this example. */ }
    void VisitExpr_(const CallNode* call) final { /* Skip in this example. */ }

    // Note 2
    std::string gen(NodeRef& ref) {
        this->code = "";
        if (ref->IsInstance<FunctionNode>()) {
            this->VisitExpr(Downcast<Function>(ref));
        } else if (ref->IsInstance<relay::ModuleNode>()) {
            relay::Module mod = Downcast<relay::Module>(ref);
            for (const auto& it : mod->functions) {
                this->VisitExpr(Downcast<Function>(it.second));
            }
        } else {
            LOG(FATAL) << "The input ref is expected to be a Relay function or module";
        }
        return this->code;
    }

  private:
    /*! \brief The ExampleJSON code of the subgraph. */
    std::string code;
};

Note 1: We again implement the corresponding visitor functions to generate ExampleJSON code and store it in the class variable code (we skip the visitor function implementation in this example as the concept is basically the same as for the C codegen). After finishing the graph visit, we should have an ExampleJSON graph in code.

Note 2: We define an internal API gen to take a subgraph and generate ExampleJSON code. This API can have any name you prefer.
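
Although the visitor bodies are skipped above, the following sketch shows one possible way the CallNode visitor could emit ExampleJSON lines. The members node_ids_ and next_id_ are hypothetical bookkeeping fields that are not part of the skeleton above, and a matching VarNode visitor would register IDs and emit input lines in the same fashion:

// Hypothetical sketch only: assign every node an integer ID and emit one line
// per call node in the "<op> <output ID> inputs: ... shape: ..." form.
void VisitExpr_(const CallNode* call) final {
  // Visit producers first so that their IDs exist before this node uses them.
  std::vector<int> arg_ids;
  for (const auto& arg : call->args) {
    VisitExpr(arg);
    arg_ids.push_back(node_ids_[arg.get()]);  // filled by the VarNode/CallNode visitors
  }

  // Map the Relay operator name to the ExampleJSON keyword.
  std::string op_name = call->op.as<OpNode>()->name;
  std::string op = (op_name == "add") ? "add" : (op_name == "subtract") ? "sub" : "mul";

  int id = next_id_++;
  node_ids_[call] = id;

  // Emit: "  <op> <id> inputs: <ids> shape: <dims>" (static shapes assumed).
  std::ostringstream line;
  line << "  " << op << " " << id << " inputs:";
  for (int in : arg_ids) line << " " << in;
  line << " shape:";
  for (const auto& dim : call->checked_type().as<TensorTypeNode>()->shape) {
    line << " " << dim.as<IntImmNode>()->value;
  }
  this->code += line.str() + "\n";
}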

The next step is to implement a customized runtime to make use of the output of ExampleJsonCodeGen.

Implement a Customized Runtime#

In this section, we will implement a customized TVM runtime step-by-step and register it to TVM runtime modules. The customized runtime should be located at src/runtime/contrib/<your-runtime-name>/. In our example, we name our runtime “example_ext_runtime”.

Again, we first define a customized runtime class as follows. The class has to be derived from TVM ModuleNode in order to be compatible with other TVM runtime modules.

#include <dmlc/logging.h>
#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/memory.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/object.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

#include <fstream>
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

namespace tvm {
namespace runtime {
class ExampleJsonModule : public ModuleNode {
 public:
  explicit ExampleJsonModule(std::string graph_json);

  PackedFunc GetFunction(const std::string& name,
                         const ObjectPtr<Object>& sptr_to_self) final;

  const char* type_key() const { return "examplejson"; }

  void SaveToBinary(dmlc::Stream* stream) final;

  static Module LoadFromBinary(void* strm);

  static Module Create(const std::string& path);

  std::string GetSource(const std::string& format = "");

  void Run(int id, const std::vector<int>& inputs, int output);

  void ParseJson(const std::string& json);

 private:
  /* \brief The json string that represents a computational graph. */
  std::string graph_json_;
  /* \brief The subgraph that being processed. */
  std::string curr_subgraph_;
  /*! \brief A simple graph from subgraph id to node entries. */
  std::map<std::string, std::vector<NodeEntry>> graph_;
  /* \brief A simple pool to contain the tensor for each node in the graph. */
  std::vector<NDArray> data_entry_;
  /* \brief A mapping from node id to op name. */
  std::vector<std::string> op_id_;
};

In particular, there are some functions derived from ModuleNode that we must implement in ExampleJsonModule:

  • Constructor: The constructor of this class should accept a subgraph (in your representation), process and store it in any format you like. The saved subgraph could be used by the following two functions.

  • GetFunction: This is the most important function in this class. When TVM runtime wants to execute a subgraph with your compiler tag, TVM runtime invokes this function from your customized runtime module. It provides the function name as well as runtime arguments, and GetFunction should return a packed function implementation for TVM runtime to execute.

  • SaveToBinary and LoadFromBinary: SaveToBinary serializes the runtime module to a binary format for later deployment. This function will be called by TVM when users use the export_library API. On the other hand, since we are now using our own graph representation, we have to make sure that LoadFromBinary is able to construct the same runtime module by taking the serialized binary generated by SaveToBinary.

  • GetSource (optional): If you would like to see the generated ExampleJSON code, you can implement this function to dump it; otherwise you can skip the implementation.

Other functions and class variables will be introduced along with the implementation of the above must-have functions.

Implement Constructor#

explicit ExampleJsonModule(std::string graph_json) {
  this->graph_json_ = graph_json;
  ParseJson(this->graph_json_);
}

Then, we implement ParseJson to parse a subgraph in ExampleJSON format and construct a graph in memory for later usage. Since we do not support subgraphs with branches in this example, we simply use an array to store every node of a subgraph in order.

void ParseJson(const std::string& json) {
  std::string line;
  std::string curr_subgraph;
  std::stringstream ss(json);

  while (std::getline(ss, line, '\n')) {
    std::stringstream ss2(line);
    std::string token;
    int id = 0;

    ss2 >> token;
    if (token.find("subgraph_") != std::string::npos) {
      curr_subgraph = token;
      continue;
    }

    ss2 >> id;
    if (op_id_.size() <= static_cast<size_t>(id)) {
      op_id_.resize(id + 1);
      data_entry_.resize(id + 1);
    }

    int64_t total_elements = 1;
    std::vector<int64_t> shape;
    if (token == "input") {
      int64_t size = 0;
      while (ss2 >> size) {
        total_elements *= size;
        shape.push_back(size);
      }
    } else {
      op_id_[id] = token; // Note 1
      bool shape_data = false;
      NodeEntry entry;
      while (ss2 >> token) {
        if (token == "shape:") {
          shape_data = true;
        } else if (shape_data) {
          total_elements *= std::stoll(token);
          shape.push_back(std::stoll(token));
        } else if (token != "inputs:") {
          entry.inputs.push_back(std::stoi(token));
        }
      }
      entry.id = id;
      entry.output = id;
      graph_[curr_subgraph].push_back(entry); // Note 2
    }
    DLDevice dev;
    dev.device_type = static_cast<DLDeviceType>(1);
    dev.device_id = 0;
    data_entry_[id] = NDArray::Empty(shape, DLDataType{kDLFloat, 32, 1}, dev); // Note 3
  }
}

Note 1: We use a class variable op_id_ to map from subgraph node ID to the operator name (e.g., add) so that we can invoke the corresponding operator function in runtime.

Note 2: We use a class variable graph_ to map from subgraph name to an array of nodes. GetFunction will query graph nodes by a subgraph ID in runtime.

Note 3: We use a class variable data_entry_ to map from a subgraph node ID to a tensor data placeholder. We will put inputs and outputs to the corresponding data entry in runtime.
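
To make the three notes concrete, after ParseJson processes the subgraph_0 example shown earlier, the class variables roughly hold the following (illustration only):

// op_id_      : indexed by node ID; IDs 0-3 are inputs, so their op names stay empty.
//               {"", "", "", "", "add", "sub", "mul"}
// graph_["subgraph_0"] : the node entries in execution order.
//               {id: 4, inputs: [0, 1]}, {id: 5, inputs: [4, 2]}, {id: 6, inputs: [5, 3]}
// data_entry_ : seven float32 NDArrays of shape (10, 10), one per node ID.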

Implement GetFunction#

After the construction, we should have the above class variables ready. We then implement GetFunction to provide executable subgraph functions to TVM runtime:

PackedFunc GetFunction(const std::string& name,
                       const ObjectPtr<Object>& sptr_to_self) final {
  if (this->graph_.find(name) != this->graph_.end()) {
    this->curr_subgraph_ = name;
    return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {

      // Copy input tensors to corresponding data entries.
      for (auto i = 0; i < args.size(); ++i) {
        ICHECK(args[i].type_code() == kNDArrayContainer || args[i].type_code() == kArrayHandle)
            << "Expect NDArray or DLTensor as inputs\n";
        if (args[i].type_code() == kArrayHandle) {
          DLTensor* arg = args[i];
          this->data_entry_[i].CopyFrom(arg);
        } else {
          NDArray arg = args[i];
          this->data_entry_[i].CopyFrom(arg);
        }
      }

      // Execute the subgraph.
      for (const auto& it : this->graph_[this->curr_subgraph_]) {
        this->Run(it.id, it.inputs, it.output);
      }
      ICHECK_GT(graph_.count(this->curr_subgraph_), 0U);

      // Copy the output from a data entry back to TVM runtime argument.
      auto out_idx = graph_[this->curr_subgraph_].back().output;
      if (args[args.size() - 1].type_code() == kArrayHandle) {
        DLTensor* arg = args[args.size() - 1];
        this->data_entry_[out_idx].CopyTo(arg);
      } else {
        NDArray arg = args[args.size() - 1];
        this->data_entry_[out_idx].CopyTo(arg);
      }
      *rv = data_entry_.back();
    });
  } else {
    LOG(FATAL) << "Unknown subgraph: " << name << "\n";
    return PackedFunc();
  }
}

As can be seen, GetFunction is composed of three major parts. The first part copies data from TVM runtime arguments to the corresponding data entries we assigned in the constructor. The second part executes the subgraph with the Run function (to be implemented later) and saves the results to another data entry. The third part copies the results from the output data entry back to the corresponding TVM runtime argument for output.

Implement Run#

Now let’s implement the Run function. This function accepts 1) a subgraph ID, 2) a list of input data entry indices, and 3) an output data entry index.

void Run(int id, const std::vector<int>& inputs, int output) {
  // Make a list of data entry indices.
  std::vector<int> args(inputs.begin(), inputs.end());
  args.push_back(output);

  // Initialize data holders.
  std::vector<TVMValue> values(args.size());
  std::vector<int> type_codes(args.size());

  // Initialize a TVM arg setter with TVMValue and its type code.
  TVMArgsSetter setter(values.data(), type_codes.data());

  // Set each argument to its corresponding data entry.
  if (op_id_[id] == "add" || op_id_[id] == "sub" || op_id_[id] == "mul") {
    for (size_t i = 0; i < args.size(); i++) {
      setter(i, data_entry_[args[i]]);
    }
  }

  // Invoke the corresponding operator function.
  if (op_id_[id] == "add") {
    Add(values.data(), type_codes.data(), args.size());
  } else if (op_id_[id] == "sub") {
    Sub(values.data(), type_codes.data(), args.size());
  } else if (op_id_[id] == "mul") {
    Mul(values.data(), type_codes.data(), args.size());
  } else {
    LOG(FATAL) << "Unknown op: " << op_id_[id] << "\n";
  }
}

The Run function mainly has two parts. The first part allocates a list of TVMValue and maps it to the corresponding data entry blocks; these become the arguments of our operator functions. The second part then invokes our operator functions. Although we use the same C functions as in the previous example, you can replace Add, Sub, and Mul with your own engine. You only need to make sure your engine stores the results to the last argument so that they can be transferred back to the TVM runtime.
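
As an illustration, here is a sketch of one such operator function that matches the calling convention used in Run. It is not the implementation shipped with TVM's example runtime; Sub and Mul would follow the same pattern, and the function is assumed to sit next to ExampleJsonModule inside the same namespaces:

int Add(TVMValue* values, int* type_codes, int nargs) {
  // Re-wrap the raw arrays so that TVM's typed accessors can be used.
  TVMArgs args(values, type_codes, nargs);
  NDArray a = args[0];
  NDArray b = args[1];
  NDArray out = args[2];  // Run() always appends the output entry last.

  // Element-wise addition over the flattened tensors.
  int64_t n = 1;
  for (int i = 0; i < a->ndim; ++i) n *= a->shape[i];
  const float* pa = static_cast<const float*>(a->data);
  const float* pb = static_cast<const float*>(b->data);
  float* pout = static_cast<float*>(out->data);
  for (int64_t i = 0; i < n; ++i) pout[i] = pa[i] + pb[i];
  return 0;
}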

With the above functions implemented, our customized codegen and runtime can now execute subgraphs. The last step is registering an API (examplejson_module_create) to create this module:

TVM_REGISTER_GLOBAL("module.examplejson_module_create")
.set_body_typed([](std::string code){
    auto n = make_object<ExampleJsonModule>(code);
    return runtime::Module(n);
});

Implement SaveToBinary and LoadFromBinary#

So far we have implemented the main features of a customized runtime so that it can be used like other TVM runtimes. However, when users want to save the built runtime to disk for deployment, TVM has no idea how to save it. This is the reason we want to implement SaveToBinary and LoadFromBinary, which tell TVM how this customized runtime should be persisted and restored.

We first implement the SaveToBinary function to allow users to save this module to disk.

void SaveToBinary(dmlc::Stream* stream) final {
    stream->Write(this->graph_json_);
}

As can be seen, this function is pretty simple. Recall that the only argument we took in the constructor is a subgraph representation, meaning that we only need a subgraph representation to construct/recover this customized runtime module. As a result, SaveToBinary simply writes the subgraph to an output DMLC stream. That is, when users use the export_library API to export the module, the customized module will be an ExampleJSON stream of a subgraph.

Similarly, LoadFromBinary reads the subgraph stream and reconstructs the customized runtime module:

static Module LoadFromBinary(void* strm) {
  dmlc::Stream* stream = static_cast<dmlc::Stream*>(strm);
  std::string graph_json;
  stream->Read(&graph_json);
  auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json);
  return Module(n);
}

We also need to register this function to enable the corresponding Python API:

TVM_REGISTER_GLOBAL("module.loadbinary_examplejson")
.set_body_typed(ExampleJsonModule::LoadFromBinary);

The above registration means when users call tvm.runtime.load_module(lib_path) API and the exported library has an ExampleJSON stream, our LoadFromBinary will be invoked to create the same customized runtime module.

In addition, if you want to support module creation directly from an ExampleJSON file, you can also implement a simple function and register a Python API as follows:

static Module Create(const std::string& path) {
    std::ifstream filep;
    filep.open(path, std::ios::in);
    std::string graph_json;
    std::string line;
    while (std::getline(filep, line)) {
        graph_json += line;
        graph_json += "\n";
    }
    filep.close();
    auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json);
    return Module(n);
}

TVM_REGISTER_GLOBAL("module.loadfile_examplejson")
.set_body([](TVMArgs args, TVMRetValue* rv) {
    *rv = ExampleJsonModule::Create(args[0]);
});

It means users can manually write/modify an ExampleJSON file, and use Python API tvm.runtime.load_module("mysubgraph.examplejson", "examplejson") to construct a customized module.

Summary#

In summary, here is a checklist for you to refer to:

  • A codegen class derived from ExprVisitor and CodegenCBase (only for C codegen) with the following functions.

    • VisitExpr_(const CallNode* call) to collect call node information.

    • Other visitor functions you need to collect subgraph information.

    • JIT to generate subgraph code.

    • Register codegen.

  • A function to create CSourceModule (for C codegen).

  • A runtime module class derived from ModuleNode with the following functions (for your graph representation).

    • Constructor.

    • GetFunction to generate a TVM runtime compatible PackedFunc.

    • Run to execute a subgraph.

    • Register a runtime creation API.

    • SaveToBinary and LoadFromBinary to serialize/deserialize customized runtime module.

    • Register LoadFromBinary API to support tvm.runtime.load_module(your_module_lib_path).

    • (optional) Create to support customized runtime module construction from subgraph file in your representation.

  • An annotator to annotate a user Relay program to make use of your compiler and runtime (TBA).