Adding an Operator to Relay#

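This document walks through the steps needed to register a new TVM operator in Relay, using the PR that adds a cumulative product operation as a running example. That PR itself builds on an earlier PR that adds a cumulative sum operation.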

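Registering a new operator requires the following steps:

Add an attribute node declaring fixed arguments that are known at compile time.
Write a type relation for your operation to integrate it into Relay's type system.
Use the RELAY_REGISTER_OP macro in C++ to register the operator's arity, type, and other hints for the compiler.
Write how the operator is computed.
Register the compute and schedule with the Relay operator.
Define a C++ function that produces a call node for the operator, and register a Python API hook for that function.
Wrap the above Python API hook in a neater interface.
Write tests for the new Relay operator.

1. Defining an Attribute Node#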

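Attributes are fixed arguments that are supposed to be known at compile time. The stride and dilation of a convolution operator are typical examples: such fields would belong to the convolution operator's attribute node.

Attributes should be defined in a file within the folder include/tvm/relay/attrs/.

Ultimately, we want to create an operator whose interface is clearly visible in the final Python interface: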

def cumprod(data, axis=None, dtype=None, exclusive=None):
    """Numpy style cumprod op. Return the cumulative inclusive product of the elements along
    a given axis.
    Parameters
    ----------
    data : relay.Expr
        The input data to the operator.
    axis : int, optional
        Axis along which the cumulative product is computed. The default (None) is to compute
        the cumprod over the flattened array.
    dtype : string, optional
        Type of the returned array and of the accumulator in which the elements are multiplied.
        If dtype is not specified, it defaults to the dtype of data.
    exclusive : bool, optional
        If true will return exclusive product in which the first element is not
        included. In other terms, if true, the j-th output element would be
        the product of the first (j-1) elements. Otherwise, it would be the product of
        the first j elements. The product of zero elements will be 1.
    Returns
    -------
    result : relay.Expr
        The result has the same size as data, and the same shape as data if axis is not None.
        If axis is None, the result is a 1-d array.
    """

cumsum() has a similar interface.
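To make the semantics in the docstring concrete, here is a minimal pure-Python sketch of a 1-D cumulative product with the inclusive and exclusive behaviors described above. The function name cumprod_reference is hypothetical and is not part of TVM; it only illustrates what the operator is expected to compute on a flattened input.

```python
def cumprod_reference(values, exclusive=False):
    """Pure-Python reference for a 1-D cumulative product (illustration only)."""
    result = []
    running = 1  # the product of zero elements is 1
    for v in values:
        if exclusive:
            result.append(running)  # product of the elements before v
            running *= v
        else:
            running *= v
            result.append(running)  # product up to and including v
    return result


print(cumprod_reference([1, 2, 3, 4]))                  # [1, 2, 6, 24]
print(cumprod_reference([1, 2, 3, 4], exclusive=True))  # [1, 1, 2, 6]
```

A reference like this can also serve as an oracle when testing the real operator on small inputs.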

Therefore, when defining the attributes in include/tvm/relay/attrs/transform.h, we choose the axis, the accumulation dtype, and the exclusivity of the operation as the fields of the struct.

/*! \brief Attributes used in cumsum and cumprod operator */
struct ScanopAttrs : public tvm::AttrsNode<ScanopAttrs> {
  Integer axis;
  DataType dtype;
  Bool exclusive = Bool(false);
  TVM_DECLARE_ATTRS(ScanopAttrs, "relay.attrs.ScanopAttrs") {
    TVM_ATTR_FIELD(axis).describe("The axis to operate over").set_default(NullValue<Integer>());
    TVM_ATTR_FIELD(dtype).describe("Output data type").set_default(NullValue<DataType>());
    TVM_ATTR_FIELD(exclusive)
        .describe("The first element is not included")
        .set_default(Bool(false));
  }
};
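As a rough analogy only, the same three fields and their defaults can be pictured as a plain Python dataclass. ScanopAttrsSketch is a hypothetical name used for illustration; it is not how TVM exposes attributes to Python, and None here merely stands in for the "undefined" defaults set by NullValue in the C++ struct.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanopAttrsSketch:
    """Plain-Python analogy of the attribute node (not a TVM class)."""

    axis: Optional[int] = None   # None stands in for "axis not defined"
    dtype: Optional[str] = None  # None stands in for "use the input dtype"
    exclusive: bool = False      # inclusive scan by default


default_attrs = ScanopAttrsSketch()
custom_attrs = ScanopAttrsSketch(axis=1, dtype="float32", exclusive=True)
print(default_attrs)
print(custom_attrs)
```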

2. Writing a Type Relation#

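To allow for flexibility in registering operators and greater expressivity and granularity in expressing types in Relay, operators are typed using relations between input and output types. These relations are represented as functions that take in a list of input types and output types (any of which may be incomplete) and return a list of input and output types that satisfies the relation. This includes shape information that can be determined statically at compile time. Essentially, a relation for an operator can enforce all the necessary typing rules (namely by inspecting the input types) in addition to computing the output type.

The type relation for the cumulative product and sum operators can be found in src/relay/op/tensor/transform.cc: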

TVM_REGISTER_NODE_TYPE(ScanopAttrs);
bool ScanopRel(const Array<Type>& types, int num_inputs, const Attrs& attrs, const TypeReporter& reporter) {
    // types: [data, output]
    ICHECK_EQ(types.size(), 2) << "Expects two types, one for the input and another for the output";
    const auto* data = types[0].as<TensorTypeNode>();
    if (data == nullptr) {
        ICHECK(types[0].as<IncompleteTypeNode>())
        << "Scanop: expect input type to be TensorType but get " << types[0];
        return false;
    }

    const auto* param = attrs.as<ScanopAttrs>();

    auto dtype = param->dtype;
    if (dtype.is_void()) {
        dtype = data->dtype;
    }

    if (param->axis.defined()) {
        reporter->Assign(types[1], TensorType(data->shape, dtype));
    } else {
        auto prod = data->shape[0];
        for (size_t i = 1; i < data->shape.size(); ++i) {
            prod = prod * data->shape[i];
        }
        reporter->Assign(types[1], TensorType({prod}, dtype));
    }

    return true;
}
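The output-type logic in ScanopRel can be summarized with a small plain-Python sketch. The function scanop_output_type and its arguments are hypothetical and only mirror the two branches above: when an axis is given the shape is preserved, otherwise the result is a 1-D tensor whose length is the product of the input dimensions. The output dtype falls back to the input dtype when none is specified.

```python
from math import prod


def scanop_output_type(shape, data_dtype, axis=None, dtype=None):
    """Mirror the shape/dtype rule of the type relation (illustration only)."""
    out_dtype = dtype if dtype is not None else data_dtype
    if axis is not None:
        # Scanning along an axis keeps the input shape.
        return tuple(shape), out_dtype
    # No axis: the input is treated as flattened, so the output is 1-D.
    return (prod(shape),), out_dtype


print(scanop_output_type((2, 3, 4), "float32", axis=1))
print(scanop_output_type((2, 3, 4), "float32", dtype="float64"))
```

Unlike the real relation, this sketch works on fully known shapes only and does not model incomplete types.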

3. Relating the Arity and Attributes to the Operation#

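We then register the name of the new operator and annotate it with its calling interface. The RELAY_REGISTER_OP macro in C++ lets a developer specify the following information about an operator in Relay:

Arity (number of arguments)
Names and descriptions for positional arguments
Support level (1 indicates an internal intrinsic; higher numbers indicate less integral or externally supported operators)
A type relation for the operator
Other annotations useful when optimizing the operation.

Once again, we add this to src/relay/op/tensor/transform.cc: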

RELAY_REGISTER_OP("cumsum")
    .describe(
        R"doc(Return the cumulative sum of the elements along a given axis.)doc" TVM_ADD_FILELINE)
    .set_num_inputs(1)
    .add_argument("data", "Tensor", "The input tensor.")
    .set_support_level(3)
    .add_type_rel("Cumsum", ScanopRel)
    .set_attr<TOpPattern>("TOpPattern", kOpaque);

RELAY_REGISTER_OP("cumprod")
    .describe(
        R"doc(Return the cumulative product of the elements along a given axis.)doc" TVM_ADD_FILELINE)
    .set_num_inputs(1)
    .add_argument("data", "Tensor", "The input tensor.")
    .set_support_level(3)
    .add_type_rel("Cumprod", ScanopRel)
    .set_attr<TOpPattern>("TOpPattern", kOpaque);

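Here, TOpPattern is a hint to the compiler about the operator's computation pattern, which can be useful for fusing operators. kOpaque tells TVM not to bother trying to fuse this operator.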

4. Defining the Compute of the Operation#

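While we've now defined the interface for our operations, we still need to define how to perform the actual calculations for cumulative sum and product.

Writing this code is outside the scope of this tutorial. For now, we assume we have a well-tested implementation of the operation's compute. For more details on how to do this, we recommend looking up the tutorials on tensor expressions and TVM's operator inventory (topi), and looking at the example cumulative sum and product implementations in python/tvm/topi/scan.py and the GPU versions in python/tvm/topi/cuda/scan.py.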

5. Hooking up Compute and Strategy with Relay#

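After you have implemented your compute function, it needs to be glued to the Relay operator. In TVM this means not only defining the computation but also the schedule for the operation. A strategy is a method that picks which computation and which schedule to use. For example, for 2D convolutions we might recognize that we are doing a depthwise convolution and dispatch to a more efficient computation and schedule as a result. In our case, however, no such strategy is needed beyond dispatching between the CPU and GPU implementations. In python/tvm/relay/op/strategy/generic.py and python/tvm/relay/op/strategy/cuda.py we add the following strategies: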

def wrap_compute_scanop(topi_compute):
    """Wrap scanop style topi compute"""

    def _compute_scanop(attrs, inputs, _):
        return [topi_compute(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]

    return _compute_scanop


@override_native_generic_func("cumsum_strategy")
def cumsum_strategy(attrs, inputs, out_type, target):
    """cumsum generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumsum),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumsum.generic",
    )
    return strategy


@override_native_generic_func("cumprod_strategy")
def cumprod_strategy(attrs, inputs, out_type, target):
    """cumprod generic strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cumprod),
        wrap_topi_schedule(topi.generic.schedule_extern),
        name="cumprod.generic",
    )
    return strategy

@cumsum_strategy.register(["cuda", "gpu"])
def cumsum_strategy_cuda(attrs, inputs, out_type, target):
    """cumsum cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumsum),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumsum.cuda",
    )
    return strategy


@cumprod_strategy.register(["cuda", "gpu"])
def cumprod_strategy_cuda(attrs, inputs, out_type, target):
    """cumprod cuda strategy"""
    strategy = _op.OpStrategy()
    strategy.add_implementation(
        wrap_compute_scanop(topi.cuda.cumprod),
        wrap_topi_schedule(topi.cuda.schedule_scan),
        name="cumprod.cuda",
    )
    return strategy

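In each strategy, we define the compute we wrote and the schedule to use within add_implementation(). Finally, we link the strategy and compute with the defined Relay operator in python/tvm/relay/op/_transform.py: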

# cumsum
@_reg.register_compute("cumsum")
def compute_cumsum(attrs, inputs, output_type):
    """Compute definition of cumsum"""
    return [topi.cumsum(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]


_reg.register_strategy("cumsum", strategy.cumsum_strategy)
_reg.register_shape_func("cumsum", False, elemwise_shape_func)

# cumprod
@_reg.register_compute("cumprod")
def compute_cumprod(attrs, inputs, output_type):
    """Compute definition of cumprod"""
    return [topi.cumprod(inputs[0], attrs.axis, attrs.dtype, attrs.exclusive)]


_reg.register_strategy("cumprod", strategy.cumprod_strategy)
_reg.register_shape_func("cumprod", False, elemwise_shape_func)

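The shape functions are used to determine the output shape given a dynamically shaped tensor. In this case we tell TVM that the output shape will be the same as the input shape.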
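The idea of a generic strategy that is overridden for specific targets can be sketched in plain Python. This is a toy model under stated assumptions, not the TVM implementation: the GenericStrategy class, its register method, and the way target keys are matched are all hypothetical, and the strategy functions simply return an implementation name instead of an OpStrategy.

```python
class GenericStrategy:
    """Toy stand-in for a target-dispatched strategy function (not the TVM API)."""

    def __init__(self, default_impl):
        self._default = default_impl
        self._by_key = {}

    def register(self, keys):
        """Decorator that overrides the default for the given target keys."""
        def decorator(impl):
            for key in keys:
                self._by_key[key] = impl
            return impl
        return decorator

    def __call__(self, target_keys):
        # Use the first registered override that matches; else the default.
        for key in target_keys:
            if key in self._by_key:
                return self._by_key[key]()
        return self._default()


def cumsum_generic():
    return "cumsum.generic"


cumsum_strategy = GenericStrategy(cumsum_generic)


@cumsum_strategy.register(["cuda", "gpu"])
def cumsum_cuda():
    return "cumsum.cuda"


print(cumsum_strategy(["llvm", "cpu"]))  # cumsum.generic
print(cumsum_strategy(["cuda", "gpu"]))  # cumsum.cuda
```

The sketch only shows why a single generic entry point plus per-target overrides is enough when the choice of implementation depends on nothing but the target.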

6. Creating a Relay Call Node and Exposing a Python Hook#

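We now have a working operation and only need to call it properly through a Relay call node. This step requires simply writing a function that takes the arguments to the operator (as Relay expressions) and returns a call node to the operator (i.e., the node that should be placed into the Relay AST where the call to the operator is intended).

At present, call attributes and type arguments (the last two fields) are not supported, so it suffices to use Op::Get to fetch the operator's information from the operator registry and pass in the arguments to the call node, as below. In src/relay/op/tensor/transform.cc: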

Expr MakeCumsum(Expr data, Integer axis, DataType dtype, Bool exclusive) {
    auto attrs = make_object<ScanopAttrs>();
    attrs->dtype = dtype;
    attrs->axis = axis;
    attrs->exclusive = exclusive;
    static const Op& op = Op::Get("cumsum");
    return Call(op, {data}, Attrs(attrs), {});
}

TVM_REGISTER_GLOBAL("relay.op._make.cumsum").set_body_typed(MakeCumsum);

Expr MakeCumprod(Expr data, Integer axis, DataType dtype, Bool exclusive) {
    auto attrs = make_object<ScanopAttrs>();
    attrs->dtype = dtype;
    attrs->axis = axis;
    attrs->exclusive = exclusive;
    static const Op& op = Op::Get("cumprod");
    return Call(op, {data}, Attrs(attrs), {});
}

TVM_REGISTER_GLOBAL("relay.op._make.cumprod").set_body_typed(MakeCumprod);

Here, TVM_REGISTER_GLOBAL exposes the MakeCumsum and MakeCumprod functions to Python via relay.op._make.cumsum(...) and relay.op._make.cumprod(...), respectively.
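Conceptually, this step combines two things: a name-to-function registry and a function that packs arguments and attributes into a call node. The following plain-Python sketch models both. Everything in it (register_global, get_global, CallSketch, make_cumsum) is a hypothetical stand-in for illustration and does not reproduce TVM's actual registry or Call class.

```python
from dataclasses import dataclass, field

_GLOBAL_FUNCS = {}  # hypothetical table: registered name -> callable


def register_global(name):
    """Decorator that stores a function under a global name."""
    def decorator(func):
        _GLOBAL_FUNCS[name] = func
        return func
    return decorator


def get_global(name):
    return _GLOBAL_FUNCS[name]


@dataclass
class CallSketch:
    """Toy call node: which operator, with which arguments and attributes."""

    op: str
    args: list
    attrs: dict = field(default_factory=dict)


@register_global("relay.op._make.cumsum")
def make_cumsum(data, axis, dtype, exclusive):
    attrs = {"axis": axis, "dtype": dtype, "exclusive": exclusive}
    return CallSketch("cumsum", [data], attrs)


# Look the function up by name, as a frontend would, and build a call node.
call = get_global("relay.op._make.cumsum")("x", 0, "int32", False)
print(call.op, call.args, call.attrs)
```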

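7. Including a Cleaner Python API Hook#

It is generally the convention in Relay that functions exported through TVM_REGISTER_GLOBAL should be wrapped in a separate Python function rather than called directly in Python. For our operators, we expose this cleaner interface in python/tvm/relay/op/transform.py: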

def cumsum(data, axis=None, dtype=None, exclusive=None):
    return _make.cumsum(data, axis, dtype, exclusive)

def cumprod(data, axis=None, dtype=None, exclusive=None):
    return _make.cumprod(data, axis, dtype, exclusive)

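Note that these Python wrappers can also be a good opportunity to provide an easier interface to the operator. For example, the concat operator is registered as taking only one argument, namely a tuple with the tensors to be concatenated, but the Python wrapper takes the tensors as arguments and combines them into a tuple before producing the call node: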

def concat(*args):
    """Concatenate the input tensors along the zero axis.

    Parameters
    ----------
    args: list of Tensor

    Returns
    -------
    tensor: The concatenated tensor.
    """
    tup = Tuple(list(args))
    return _make.concat(tup)

8. Writing Unit Tests#

This is fairly self-explanatory. Some example unit tests for the cumulative sum and product operators can be found in tests/python/relay/test_op_level3.py.

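Other Topics#

Gradient Operators#

Gradient operators are important for writing differentiable programs in Relay. While Relay's autodiff algorithm can differentiate first-class language constructs, operators are opaque. Because Relay can't look into the implementation, an explicit differentiation rule must be provided.

Both Python and C++ can be used to write gradient operators, but the examples here focus on Python, as it is more commonly used.

Adding a Gradient in Python#

A collection of Python gradient operators can be found in python/tvm/relay/op/_tensor_grad.py. We will walk through two representative examples: sigmoid and multiply.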

@register_gradient("sigmoid")
def sigmoid_grad(orig, grad):
    """Returns [grad * sigmoid(x) * (1 - sigmoid(x))]."""
    return [grad * orig * (ones_like(orig) - orig)]

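The inputs here are the original operator orig and a gradient grad to accumulate into. What we return is a list, where the element at the i'th index is the derivative of the operator with respect to the operator's i'th input. In general, the gradient will return a list with as many elements as there are inputs to the base operator.

Before analyzing this definition further, we should first recall the derivative of the sigmoid function: \(\frac{\partial \sigma}{\partial x} = \sigma(x)(1 - \sigma(x))\). The definition above looks similar to the mathematical definition, but there is one important addition, which we describe below.

The term orig * (ones_like(orig) - orig) directly matches the derivative, because orig here is the sigmoid function. But we're not just interested in how to compute the gradient of this function; we're also interested in composing this gradient with other gradients, so we can accumulate the gradient across an entire program. This is where the grad term comes in. In the expression grad * orig * (ones_like(orig) - orig), multiplying by grad specifies how to compose the derivative with the gradient thus far.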
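The formula can be sanity-checked numerically without TVM. The sketch below, written with hypothetical helper names, implements the same rule on plain floats and compares it against a central finite difference scaled by the incoming gradient.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x, grad):
    """Scalar analogue of the rule above: grad * s * (1 - s), with s = sigmoid(x)."""
    s = sigmoid(x)
    return grad * s * (1.0 - s)


def finite_difference(f, x, eps=1e-6):
    """Central-difference approximation of df/dx."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)


x, upstream = 0.3, 2.0
analytic = sigmoid_grad(x, upstream)
# Chain rule: the upstream gradient multiplies the local derivative.
numeric = upstream * finite_difference(sigmoid, x)
print(analytic, numeric)
```

The two printed values should agree to several decimal places, which is the same kind of check a gradient unit test would perform.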

Now, we consider multiply, a slightly more interesting example:

@register_gradient("multiply")
def multiply_grad(orig, grad):
    """Returns [grad * y, grad * x]"""
    x, y = orig.args
    return [collapse_sum_like(grad * y, x),
            collapse_sum_like(grad * x, y)]

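In this example, there are two elements in the returned list, because multiply is a binary operator. Recall that if \(f(x, y) = xy\), the partial derivatives are \(\frac{\partial f}{\partial x} = y\) and \(\frac{\partial f}{\partial y} = x\).

There is one required step for multiply that is not required for sigmoid, because multiply has broadcasting semantics. Since the shape of grad might not match the shape of the inputs, we use collapse_sum_like to take the contents of the grad * <var> terms and make the shape match the shape of the input we're differentiating with respect to.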
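To see why the collapse is needed, consider a small case where x has shape [2, 3] and y has shape [3] and is broadcast across the rows of x. The plain-Python sketch below uses nested lists and a hypothetical function name; it handles only this one broadcasting pattern and is not a general replacement for collapse_sum_like.

```python
def multiply_grad_sketch(x, y, grad):
    """x, grad: lists of rows (shape [m, n]); y: list (shape [n]), broadcast over rows."""
    # grad * y already has the shape of x, so nothing needs collapsing.
    grad_x = [[g * yj for g, yj in zip(grad_row, y)] for grad_row in grad]
    # grad * x has shape [m, n]; sum over the broadcast (row) axis to get shape [n].
    full = [[g * xv for g, xv in zip(grad_row, x_row)]
            for grad_row, x_row in zip(grad, x)]
    grad_y = [sum(col) for col in zip(*full)]
    return grad_x, grad_y


x = [[1, 2, 3], [4, 5, 6]]
y = [10, 20, 30]
grad = [[1, 1, 1], [1, 1, 1]]
gx, gy = multiply_grad_sketch(x, y, grad)
print(gx)  # [[10, 20, 30], [10, 20, 30]]
print(gy)  # [5, 7, 9]
```

Each element of y contributes to every row of the output, so its gradient is the sum of the per-row contributions.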

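Adding a Gradient in C++#

Adding a gradient in C++ is similar to adding one in Python, but the interface for registering is slightly different.

First, make sure src/relay/transforms/pattern_utils.h is included. It provides helper functions for creating nodes in the Relay AST. Then, define the gradient in a similar fashion as in the Python example: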

tvm::Array<Expr> MultiplyGrad(const Expr& orig_call, const Expr& output_grad) {
    // Downcast the generic expression to a call node to reach its arguments.
    const Call call = Downcast<Call>(orig_call);
    return { CollapseSumLike(Multiply(output_grad, call->args[1]), call->args[0]),
             CollapseSumLike(Multiply(output_grad, call->args[0]), call->args[1]) };
}

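Notice that in C++ we can't use the same operator overloading that we have in Python, and we need to downcast, so the implementation is more verbose. Even so, we can easily verify that this definition mirrors the earlier example in Python.

Now, instead of using a Python decorator, we need to tack a set_attr call for "FPrimalGradient" onto the end of the base operator's registration, in order to register the gradient.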

RELAY_REGISTER_OP("multiply")
    // ...
    // Set other attributes
    // ...
    .set_attr<FPrimalGradient>("FPrimalGradient", MultiplyGrad);