tvm.tir.schedule#

Namespace for the TensorIR schedule API.

exception tvm.tir.schedule.ScheduleError[source]#

Error that happens during TensorIR scheduling.

class tvm.tir.schedule.BlockRV[source]#

A random variable that refers to a block

__init__()[source]#

Construct a new BlockRV.

Return type:

None

class tvm.tir.schedule.BlockScope[source]#

An object corresponding to each block sref in the sref tree; it tracks the producer-consumer dependencies between blocks.

Glossary:

  • Block scope: A contiguous subtree of the sref tree, rooted at each block sref, whose components are:

    • scope root: a block sref

    • internal srefs: loop srefs

    • scope leaves: block srefs

  • Child block: The scope leaf blocks under the scope root or a specific internal sref

get_deps_by_dst(block)[source]#

Get all dependencies whose dst is the target block.

Parameters#

block: StmtSRef

The queried block

Returns#

blocks: List[Dependency]

The dependencies

Parameters:

block (StmtSRef)

Return type:

List[Dependency]

get_deps_by_src(block)[source]#

Get all dependencies whose src is the target block.

Parameters#

block: StmtSRef

The queried block

Returns#

blocks: List[Dependency]

The dependencies

Parameters:

block (StmtSRef)

Return type:

List[Dependency]
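
A minimal, hedged sketch of how these queries might be used. It assumes a two-producer/consumer workload like before_inline shown under compute_inline below, that the implicit root block is named "root", and that the block scope is reachable through the schedule state:

from tvm import tir

# Hedged sketch: query block-level dependencies through the schedule state.
sch = tir.Schedule(before_inline)
root_sref = sch.get_sref(sch.get_block("root"))
scope = sch.state.get_block_scope(root_sref)      # BlockScope rooted at the root block
c_sref = sch.get_sref(sch.get_block("C"))
for dep in scope.get_deps_by_dst(c_sref):         # dependencies whose dst is block "C"
    print(dep.src, dep.dst, dep.kind)             # e.g. a read-after-write dep from "B"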

class tvm.tir.schedule.DepKind(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Type of dependency.

Attributes#

RAW: int = 0

Read-after-write dependency

WAW: int = 1

Write-after-write dependency

WAR: int = 2

Write-after-read dependency. Not supported in TensorIR for now.

OPAQUE: int = 3

Opaque dependency

class tvm.tir.schedule.Dependency[source]#

A tuple (src, dst, kind) representing certain types of dependency. For example, (A, B, kRAW) means block B depends on block A, and the dependency kind is read-after-write, which means block B reads the result written by block A.

Parameters#

src: StmtSRef

The source of the dependency relation

dst: StmtSRef

The destination of the dependency relation

kind: DepKind

The dependency kind

The dependency kind

tvm.tir.schedule.ExprRV#

Alias of PrimExpr
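
As a hedged illustration, ExprRV values are typically produced by sampling primitives and evaluated back with get; sch is assumed to be an existing tir.Schedule:

# Hedged sketch: an ExprRV returned by a sampling primitive.
v = sch.sample_categorical(candidates=[1, 2, 4, 8], probs=[0.25, 0.25, 0.25, 0.25])
factor = sch.get(v)   # evaluate the ExprRV to a concrete integer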

class tvm.tir.schedule.Instruction(kind, inputs, attrs, outputs)[source]#

Each schedule instruction corresponds to a schedule primitive.

Attributes#

kind: InstructionKind

The kind of the instruction

inputs: List[INPUT_RV_TYPE]

The input random variables of the instruction. The type of each element can be one of the following: BlockRV, LoopRV, ExprRV, float, int, str, or None.

attrs: List[ATTR_TYPE]

The attributes of the instruction. Similar to attributes of an operator, attributes of an instruction are arbitrary constant metadata required by the instruction. For example, the name of the block to be retrieved in GetBlock.

outputs: List[OUTPUT_RV_TYPE]

The output random variables of the instruction. The type of each element can be one of the following: BlockRV, LoopRV, or ExprRV (atomic variables only, never constants or composite PrimExpr).

__init__(kind, inputs, attrs, outputs)[source]#

Constructor

Parameters#

kind: InstructionKind

The kind of the instruction

inputs: List[INPUT_RV_TYPE]

The input random variables of the instruction. The type of each element can be one of the following: BlockRV, LoopRV, ExprRV, float, int, str, or None.

attrs: List[ATTR_TYPE]

The attributes of the instruction. Similar to attributes of an operator, attributes of an instruction are arbitrary constant metadata required by the instruction. For example, the name of the block to be retrieved in GetBlock.

outputs: List[OUTPUT_RV_TYPE]

The output random variables of the instruction. The type of each element can be one of the following: BlockRV, LoopRV, or ExprRV (atomic variables only, never constants or composite PrimExpr).

Return type:

None

class tvm.tir.schedule.InstructionKind[source]#

Kind of an instruction, e.g. Split, Reorder, etc. Besides the name, every kind of instruction has its own properties, including:

  1. A boolean indicating if the instruction is pure, i.e. it changes nothing in the schedule state

  2. A functor that applies the instruction to a TensorIR schedule

  3. A functor that converts the instruction to a statement in python syntax

  4. A functor that serializes its attributes to JSON

  5. A functor that deserializes its attributes from JSON

Unlike tvm.ir.op, InstructionKind doesn't support unstructured properties, mainly because there is no use case yet that requires any other property.

Attributes#

name: str

The name of the kind of instruction

Note#

The functor properties are not exposed on the Python side at the moment.

static get(name)[source]#

Retrieve an InstructionKind using its name

Parameters#

name: str

The registered name of the InstructionKind

Returns#

kind: InstructionKind

The InstructionKind retrieved

Parameters:

name (str)

Return type:

InstructionKind

property is_pure: bool#

Indicates if the instruction is pure, i.e. removing it alone doesn't mutate the schedule state. For example, the instruction GetBlock is pure because it changes nothing, while ComputeInline is not because removing it leads to a different resulting schedule.

Returns#

pure: bool

The boolean flag indicating if the instruction is pure
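
As a hedged illustration, an InstructionKind can be looked up by its registered name and its purity inspected; the kind name "GetBlock" is assumed to be registered, as it is for the built-in primitives:

from tvm.tir.schedule import InstructionKind

# Hedged sketch: look up a registered instruction kind and check whether it is pure.
kind = InstructionKind.get("GetBlock")
print(kind.name, kind.is_pure)   # GetBlock is pure: it only queries the schedule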

class tvm.tir.schedule.LoopRV[source]#

A random variable that refers to a loop

__init__()[source]#

Construct a new LoopRV.

Return type:

None

class tvm.tir.schedule.Schedule(mod, *, seed=None, debug_mask='none', error_render_level='detail', enable_check=True)[source]#

The user-facing schedule class

A schedule is a set of transformations that change the order of computation but preserve the semantics of computation. Some examples of schedules: 1) Split a loop into two; 2) Reorder two loops; 3) Inline the computation of a specific buffer into its consumer

The schedule class stores auxiliary information to schedule correctly and efficiently.

Link to tutorial: https://tvm.apache.org/docs/tutorials/language/schedule_primitives.html

__init__(mod, *, seed=None, debug_mask='none', error_render_level='detail', enable_check=True)[source]#

Construct a TensorIR schedule class from an IRModule

Parameters#

mod: Union[PrimFunc, IRModule]

The IRModule or PrimFunc to be scheduled

seed: Optional[int]

The seed value for the schedule's random state. Note that None and -1 mean using device random; otherwise only integers between 1 and 2147483647 are allowed.

debug_mask: Union[str, int]

Do extra correctness checking after the class creation and each time after calling the Replace method. Possible choices of debug_mask: 1) "all" - Turn on all the checks 2) "none" - Turn off all the checks 3) An integer - Turn on checks according to the bitmasks provided in ScheduleDebugMask

error_render_level: str = "detail"

The level of error rendering. Choices: "detail", "fast", "none". - "detail": Render a detailed error message, with the TIR and error locations printed - "fast": Show a simple error message without rendering or string manipulation - "none": Do not show any error message.

enable_check: bool = True

The default schedule checks are too strict and might prevent us from performing some valid schedules. enable_check controls whether prerequisite checks are enabled for some schedule primitives: - true: perform prerequisite checks before applying some schedules. - false: do not perform some checks before applying schedules, but still raise an error if the schedule fails.

It is the user's duty to guarantee schedule correctness if enable_check is set to False.

Note#

The checks performed include: 1) VerifySRefTree 2) VerifyCachedFlags

Return type:

None
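
A minimal, hedged sketch of constructing a schedule; the element-wise PrimFunc below is an assumption made for illustration, not part of the API:

from tvm import tir
from tvm.script import tir as T

@T.prim_func
def elemwise(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

# Construct a schedule with all debug checks turned on; the PrimFunc is wrapped
# into an IRModule whose main function can be printed back as TVMScript.
sch = tir.Schedule(elemwise, debug_mask="all")
print(sch.mod["main"].script())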

static _create_non_traced(mod, *, seed=None, debug_mask='none', error_render_level='detail', enable_check=True)[source]#

Construct a non-traced TensorIR schedule class from an IRModule.

Return type:

Schedule

add_unit_loop(block_or_loop)[source]#

Create a new unit loop on top of the specific block or loop.

Parameters#

block_or_loop: Union[LoopRV, BlockRV]

The block or loop above which the new loop is created

Returns#

new_loop: LoopRV

The new unit loop

Examples#

Before add_unit_loop, in TensorIR, the IR is:

@T.prim_func
def before_add_unit_loop(
    A: T.Buffer((), "int32"),
    B: T.Buffer((), "int32"),
    C: T.Buffer((), "int32"),
) -> None:
    with T.block("C"):
        vi = T.axis.spatial(1, 0)
        C[()] = A[()] + B[()]

Create the schedule and do add-unit-loop:

sch = tir.Schedule(before_add_unit_loop)
sch.add_unit_loop(sch.get_block("C"))
print(sch.mod["main"].script())

After applying add-unit-loop, the IR becomes:

@T.prim_func
def after_add_unit_loop(
    A: T.Buffer((), "int32"),
    B: T.Buffer((), "int32"),
    C: T.Buffer((), "int32"),
) -> None:
    for u in T.serial(1):
        with T.block("C"):
            vi = T.axis.spatial(1, 0)
            C[()] = A[()] + B[()]
Parameters:

block_or_loop (LoopRV | BlockRV)

Return type:

LoopRV

annotate(block_or_loop, ann_key, ann_val)[source]#

Annotate a block/loop with a key value pair

Parameters#

block_or_loop: Union[BlockRV, LoopRV]

The block/loop to be annotated

ann_key: str

The annotation key

ann_val: AnnotationValueT

The annotation value

The annotation value

Examples#

Before annotate, in TensorIR, the IR is:

@T.prim_func
def before_annotate(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do annotate:

sch = tir.Schedule(before_annotate)
sch.annotate(sch.get_block("B"), "ann_key", "ann_value")
print(sch.mod["main"].script())

After applying annotate, the IR becomes:

@T.prim_func
def after_annotate(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            T.block_attr({"ann_key": "ann_value"})
            B[vi, vj] = A[vi, vj] * 2.0
Return type:

None

annotate_buffer_access(block, buffer_index, buf_type, gen_new_ranges)[source]#

Annotate the read or write region of a block

Parameters#

block: BlockRV

The block to be annotated

buffer_index: int

The index of the buffer in the block's read or write region

buf_type: str

The buffer type: "read" or "write"

gen_new_ranges: Callable

A function that takes the block's iter_vars and returns a Tuple[Union[PrimExpr, Tuple[PrimExpr, PrimExpr]], ...] which defines the new read or write region for the buffer. Each element in the tuple can be: - A single PrimExpr representing the iter_var itself - A tuple of two PrimExprs representing the range (begin, end)

Examples#

Annotate a 2D read region for a buffer. Before annotate_buffer_access, in TensorIR, the IR is:

@T.prim_func
def before_annotate_buffer_access(
    A: T.Buffer((128, 128), "float32"),
    C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), "float32")
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do annotate_buffer_access:

sch = tir.Schedule(before_annotate_buffer_access)
block = sch.get_block("B")
sch.annotate_buffer_access(block, 0, "read",
    lambda vi, vj: ((vi - 1, vi + 1), (vj - 1, vj + 1)))
print(sch.mod["main"].script())

After applying annotate_buffer_access, the IR becomes:

@T.prim_func
def after_annotate_buffer_access(
    A: T.Buffer((128, 128), "float32"),
    C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), "float32")
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            T.reads(A[vi - 1:vi + 1, vj - 1:vj + 1])
            T.writes(B[vi, vj])
            T.block_attr({"explicit_read_region": 0})
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

This annotates the read region for buffer A (index 0) in block "B" to be [vi-1:vi+1, vj-1:vj+1] for each (vi, vj) in the block's iteration domain.

Note#

This function allows manual specification of read or write regions, which can be useful in cases where the compiler cannot accurately infer the access pattern, such as complex data-dependent accesses. It overrides the automatically inferred region for the specified buffer. The function adds an annotation to the block, indicating that an explicit region has been provided for the buffer at the given index. This annotation is used in the CompactBufferAllocation pass to respect the manually specified region instead of relying on automatic inference.

Caution should be exercised when using this function, as incorrect annotations may lead to incorrect code generation or runtime errors. It's crucial to ensure that the specified region covers all actual reads or writes performed by the block for the given buffer.

Return type:

None

bind(loop, thread_axis)[source]#

Bind the input loop to the given thread axis. It requires: 1) The scope block that the loop is in should have the stage-pipeline property 2) All the blocks under the loop are complete blocks or reduction blocks, and have affine bindings 3) For each block under the loop, if the thread axis starts with "threadIdx", the loop can only be contained in data-parallel block iters' and reduction block iters' bindings. Otherwise the loop can only be contained in data-parallel block iters' bindings

Parameters#

loop: LoopRV

The loop to be bound to the thread axis

thread_axis: str

The thread axis to be bound to the loop. Possible candidates: - blockIdx.x/y/z - threadIdx.x/y/z - vthread.x/y/z - vthread (It is a legacy behavior that will be deprecated. Please use vthread.x/y/z instead.)

Examples#

Before bind, in TensorIR, the IR is:

@T.prim_func
def before_bind(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do bind:

sch = tir.Schedule(before_bind)
i, j = sch.get_loops(sch.get_block("B"))
sch.bind(i, "blockIdx.x")
sch.bind(j, "threadIdx.x")

After applying bind, the IR becomes:

@T.prim_func
def after_bind(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i in T.thread_binding(0, 128, thread = "blockIdx.x"):
        for j in T.thread_binding(0, 128, thread = "threadIdx.x"):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0
Return type:

None

blockize(target, preserve_unit_iters=True)[source]#

Convert multiple blocks or the subtree rooted at a specific loop into a block.

Parameters#

target: LoopRV or List[BlockRV]

The root of the subtree or the specified blocks.

preserve_unit_iters: bool

Whether or not to preserve unit iterators in block bindings

Returns#

result: BlockRV

The new block.

Examples#

Before blockize, in TensorIR, the IR is:

@T.prim_func
def before_blockize(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32")
) -> None:
    for i_0, j_0, i_1, j_1 in T.grid(8, 8, 16, 16):
        with T.block("B"):
            vi = T.axis.spatial(128, i_0 * 16 + i_1)
            vj = T.axis.spatial(128, j_0 * 16 + j_1)
            T.reads(A[vi, vj])
            T.writes(B[vi, vj])
            B[vi, vj] = A[vi, vj] * T.float32(2)

Create the schedule and do blockize:

sch = tir.Schedule(before_blockize)
B = sch.get_block("B")
_, _, i1, _ = sch.get_loops(B)
sch.blockize(i1)
print(sch.mod["main"].script())

After applying blockize, the IR becomes:

@T.prim_func
def after_blockize(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32")
)-> None:
    for i_0, j_0 in T.grid(8, 8):
        with T.block("B_o"):
            vio, vjo = T.axis.remap("SS", [i_0, j_0])
            T.reads(A[vio * 16 : vio * 16 + 16, vjo * 16 : vjo * 16 + 16])
            T.writes(B[vio * 16 : vio * 16 + 16, vjo * 16 : vjo * 16 + 16])
            for i_1, j_1 in T.grid(16, 16):
                with T.block("B"):
                    vi, vj = T.axis.remap("SS", [i_1, j_1])
                    T.reads(A[vio * 16 + vi, vjo * 16 + vj])
                    T.writes(B[vio * 16 + vi, vjo * 16 + vj])
                    B[vio * 16 + vi, vjo * 16 + vj] = A[vio * 16 + vi, vjo * 16 + vj] * T.float32(2)

Note#

blockize requires there is exactly one block under the given loop and the bindings of the block are divisible by the subspace represented by the loops starting at the given loop.

Return type:

BlockRV

cache_index(block, storage_scope, cse_thresh=0)[source]#

Create a block to cache precomputed indices for later use. If there is no index computation, the schedule is left unchanged.

Parameters#

block: Union[BlockRV, str]

The target block that operates on the target buffer.

storage_scope: str

The storage scope of the cached block.

cse_thresh: int

The repeat threshold that determines a common subexpression; the default 0 means caching all index computations.

Returns#

cached_blocks: List[BlockRV]

The blocks of the stage writing the cache buffers

Examples#

Before cache_index, in TensorIR, the IR is:

@T.prim_func
def resize(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (1, 3, 40, 40))
    B = T.match_buffer(b, (1, 3, 80, 80))
    for i0, i1, i2, i3 in T.grid(1, 3, 80, 80):
        with T.block("A"):
            n, c, vi, vj = T.axis.remap("SSSS", [i0, i1, i2, i3])
            B[n, c, vi, vj] = A[n, c, vi//4 + vj//4, vj//2]

Create the schedule and cache_index:

sch = tir.Schedule(resize)
block_a = sch.get_block("A")
sch.cache_index(block_a, "global", 1)
print(sch.mod["main"].script())

After applying cache_index, the IR becomes:

@T.prim_func
def resize_cache_index(
    A: T.Buffer((1, 3, 40, 40), "float32"), B: T.Buffer((1, 3, 80, 80), "float32")
) -> None:
    index_var_0 = T.alloc_buffer([80, 80], dtype="int32", strides=[1])
    index_var_1 = T.alloc_buffer([80], dtype="int32", strides=[1])
    for ax0, ax1 in T.grid(80, 80):
        with T.block("index_0"):
            v0 = T.axis.spatial(80, ax0)
            v1 = T.axis.spatial(80, ax1)
            T.reads()
            T.writes(index_var_0[v0, v1])
            index_var_0[v0, v1] = v0 // 4 + v1 // 4
    for ax0 in T.serial(80):
        with T.block("index_1"):
            v0 = T.axis.spatial(80, ax0)
            T.reads()
            T.writes(index_var_1[v0])
            index_var_1[v0] = v0 // 2
    for i0, i1, i2, i3 in T.grid(1, 3, 80, 80):
        with T.block("A"):
            n, c, vi, vj = T.axis.remap("SSSS", [i0, i1, i2, i3])
            T.reads(A[n, c, vi // 4 + vj // 4, vj // 2])
            T.writes(B[n, c, vi, vj])
            B[n, c, vi, vj] = A[n, c, index_var_0[vi, vj], index_var_1[vj]]
Return type:

List[BlockRV]

cache_inplace(block, read_buffer_index, storage_scope)[source]#

Create blocks that read and write a buffer region into a cache block. It requires the target block to both read and write the target buffer. Mainly for in-place operations.

Parameters#

block: Union[BlockRV, str]

The target block that operates on the target buffer.

read_buffer_index: int

The index of the buffer in the block's read region, the unique name of a read buffer in the block, or a Buffer object that is within the block's read region.

storage_scope: str

The target storage scope.

Returns#

cached_blocks: List[BlockRV]

The blocks of the cache stage, read cache first, write cache second

Examples#

Before cache_inplace, in TensorIR, the IR is:

@T.prim_func
def before_cache_inplace(data_io: T.Buffer((64), "int32")):
    for i0 in T.serial(1):
        with T.block("A"):
            T.reads(data_io[:64])
            T.writes(data_io[:64])
            T.evaluate(T.call_extern("call_impl", data_io.data, dtype=""))

Create the schedule and cache_inplace:

sch = tir.Schedule(before_cache_inplace)
block_a = sch.get_block("A")
sch.cache_inplace(block_a, 0, "local")
print(sch.mod["main"].script())

After applying cache_inplace, the IR becomes:

@T.prim_func
def cache_inplace(data_io: T.Buffer(64, "int32")) -> None:
    data_io_local = T.alloc_buffer([64], dtype="int32", scope="local")
    for i0 in T.serial(1):
        for ax0 in T.serial(64):
            with T.block("data_io_local"):
                v0 = T.axis.spatial(64, ax0)
                T.reads(data_io[v0])
                T.writes(data_io_local[v0])
                data_io_local[v0] = data_io[v0]
        with T.block("A"):
            T.reads(data_io_local[0 : 64])
            T.writes(data_io_local[0 : 64])
            T.evaluate(T.call_extern("call_impl", data_io_local.data, dtype=""))
        for ax0 in T.serial(64):
            with T.block("data_io_local"):
                v0 = T.axis.spatial(64, ax0)
                T.reads(data_io_local[v0])
                T.writes(data_io[v0])
                data_io[v0] = data_io_local[v0]
Return type:

List[BlockRV]

cache_read(block, read_buffer_index, storage_scope, consumer_blocks=None)[source]#

Create a block that reads a buffer region into a read cache. It requires:

  1. There is at most one block that writes the buffer in the scope.

  2. The scope block has the stage-pipeline property.

Parameters#

block: Union[BlockRV, str]

The consumer block of the target buffer.

read_buffer_index: Union[int, str, Buffer]

The index of the buffer in the block's read region, the unique name of a read buffer in the block, or a Buffer object that is within the block's read region.

storage_scope: str

The target storage scope.

consumer_blocks: Optional[List[Union[BlockRV, str]]]

An optional list of consumers that should read from the cache. If not specified, all consumers will use the cache.

Returns#

cached_block: BlockRV

The block of the cache stage

Examples#

Before cache_read, in TensorIR, the IR is:

@T.prim_func
def before_cache_read(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and cache_read:

sch = tir.Schedule(before_cache_read)
block_b = sch.get_block("B")
sch.cache_read(block_b, 0, "local")
print(sch.mod["main"].script())

After applying cache_read, the IR becomes:

@T.prim_func
def after_cache_read(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    A_local = T.alloc_buffer((128, 128), scope="local")
    for i, j in T.grid(128, 128):
        with T.block("A_local"):
            vi, vj = T.axis.remap("SS", [i, j])
            A_local[vi, vj] = A[vi, vj]
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A_local[vi, vj] * 2.0
Return type:

BlockRV

cache_write(block, write_buffer_index, storage_scope, consumer_blocks=None)[source]#

Create a block that reads a buffer region into a write cache. It requires:

  1. There is only one block that writes the buffer in the scope.

  2. The scope block has the stage-pipeline property.

Parameters#

block: Union[BlockRV, str]

The producer block of the target buffer.

write_buffer_index: Union[int, str, Buffer]

The index of the buffer in the block's write region, the unique name of a write buffer in the block, or a Buffer object that is within the block's write region.

storage_scope: str

The target storage scope.

consumer_blocks: Optional[List[Union[BlockRV, str]]]

An optional list of consumers that should read directly from the cache. If not specified, all consumers will read from the original buffer.

Returns#

cached_block: BlockRV

The block of the cache stage

Examples#

Before cache_write, in TensorIR, the IR is:

@T.prim_func
def before_cache_write(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and cache_write:

sch = tir.Schedule(before_cache_write)
block_b = sch.get_block("B")
sch.cache_write(block_b, 0, "local")
print(sch.mod["main"].script())

After applying cache_write, the IR becomes:

@T.prim_func
def after_cache_write(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    B_local = T.alloc_buffer((128, 128), scope="local")
    for i, j in T.grid(128, 128):
        with T.block("A_local"):
            vi, vj = T.axis.remap("SS", [i, j])
            B_local[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = B_local[vi, vj]
Return type:

BlockRV

can_decompose_padding(block, loop)[source]#

Check whether the block matches the padding pattern and can be decomposed.

Return type:

bool
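
A hedged sketch of guarding decompose_padding with this check; block and loop names follow the decompose_padding example below:

# Hedged sketch: only decompose when the block matches the padding pattern.
sch = tir.Schedule(before_decompose, debug_mask="all")   # before_decompose as shown below
block = sch.get_block("block")
loop = sch.get_loops(block)[0]
if sch.can_decompose_padding(block, loop):
    sch.decompose_padding(block, loop)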

compute_at(block, loop, preserve_unit_loops=False, index=-1)[source]#

Compute-At. Move a producer block under the specific loop, and regenerate the loops induced by the block so that the buffer region produced by the producer block could cover those regions consumed by its consumer blocks under the given loop. It requires:

  1. block and loop are under the same scope, and loop is not an ancestor of block

  2. The scope block has the stage-pipeline property

  3. The subtree of the scope block, where the given block is in, satisfies the compact dataflow condition, i.e. all the blocks in the scope block's subtree must be either complete blocks or reduction blocks

  4. The block is not an output block with regard to the scope block, i.e. the buffers written by the block are allocated under the scope block

  5. All the consumers of the block are under the given loop

Parameters#

block: Union[BlockRV, str]

The block to be moved

loop: LoopRV

The loop where the block to be moved under

preserve_unit_loops: bool

Whether to keep the trivial loops whose extents are 1

index: int

The block index of the loop body subtree blocks: - index = -1 means inserted into the last possible insertion point; - index = -2 means inserted into the first possible insertion point; - Otherwise, index is a nonnegative number that indicates the insertion point

Examples#

Before compute-at, in TensorIR, the IR is:

@T.prim_func
def before_compute_at(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.alloc_buffer((128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do compute-at:

sch = tir.Schedule(before_compute_at)
block = sch.get_block("B")
loop, _ = sch.get_loops(sch.get_block("C"))
sch.compute_at(block, loop, preserve_unit_loops=False)
print(sch.mod["main"].script())

After applying compute-at, the IR becomes:

@T.prim_func
def after_compute_at(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.alloc_buffer((128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i in T.serial(0, 128):
        for j in T.serial(0, 128):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0
        for j in T.serial(0, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                C[vi, vj] = B[vi, vj] + 1.0
Return type:

None

compute_inline(block)[source]#

Inline a block into its consumer(s). It requires:

  1. The block is a complete non-root block, which only produces one buffer

  2. The block must not be the only leaf in the scope.

  3. The body of the block must be a BufferStore statement of the form A[i, j, k, ...] = ..., where the indices on the LHS are all distinct atomic variables, and no variables other than those indexing variables are allowed in the statement.

Parameters#

block: Union[BlockRV, str]

The block to be inlined to its consumer(s)

Examples#

Before compute-inline, in TensorIR, the IR is:

@T.prim_func
def before_inline(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.alloc_buffer((128, 128))
    C = T.match_buffer(c, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do compute-inline:

sch = tir.Schedule(before_inline)
sch.compute_inline(sch.get_block("B"))
print(sch.mod["main"].script())

After applying compute-inline, the IR becomes:

@T.prim_func
def after_inline(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    C = T.match_buffer(c, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = A[vi, vj] * 2.0 + 1.0
Parameters:

block (BlockRV | str)

Return type:

None

copy()[source]#

Returns a copy of the schedule, including both the state and the symbol table, guaranteeing that: 1) the SRef tree is completely reconstructed; 2) the IRModule being scheduled is untouched; 3) all the random variables are valid in the copy, pointing to the corresponding reconstructed srefs.

Returns#

copy: Schedule

A new copy of the schedule

Return type:

Schedule
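
A hedged sketch of using copy to branch a schedule without affecting the original; before_inline refers to the workload shown under compute_inline above:

import tvm
from tvm import tir

# Hedged sketch: branch the schedule; edits to the copy leave the original intact.
sch = tir.Schedule(before_inline)
branch = sch.copy()
branch.compute_inline(branch.get_block("B"))   # only the copy is mutated
assert not tvm.ir.structural_equal(branch.mod, sch.mod)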

decompose_padding(block, loop)[source]#

Decompose a block of padding computation pattern into two separate blocks.

  1. The block which fills const pad values into the full write region;

  2. The block which fills in-bound values into the region where the pad predicate is true.

The pad value filling block is inserted right before the given loop.

The schedule primitive requires:

  1. The input block is a complete block.

  2. The input loop is the ancestor of the block.

  3. The input block is a block which matches the padding pattern.

Parameters#

block: Union[BlockRV, str]

The padding block to be decomposed.

loop: LoopRV

The loop above which the pad value filling block is inserted.

Returns#

pad_value_block: BlockRV

The block filling const pad values.

Examples#

Before decompose-padding, in TensorIR, the IR is:

@T.prim_func
def before_decompose(x: T.Buffer(128, "int32"), y: T.Buffer(140, "int32")):
    for i in range(140):
        with T.block("block"):
            vi = T.axis.remap("S", [i])
            y[vi] = T.if_then_else(vi >= 6 and vi < 134, x[vi - 6], 0, dtype="int32")

Create the schedule and do decompose-padding with specified loop:

sch = tir.Schedule(before_decompose, debug_mask="all")
block = sch.get_block("block")
sch.decompose_padding(block, sch.get_loops(block)[0])
print(sch.mod["main"].script())

After applying decompose-padding, the IR becomes:

@T.prim_func
def after_decompose(x: T.Buffer(128, "int32"), y: T.Buffer(140, "int32")):
    for i in T.serial(140):
        with T.block("block_pad_const"):
            vi = T.axis.spatial(140, i)
            y[vi] = 0
    for i in T.serial(128):
        with T.block("block"):
            vi = T.axis.spatial(128, i)
            y[vi + 6] = x[vi]
Return type:

BlockRV

decompose_reduction(block, loop)[source]#

Decompose a reduction block into two separate blocks.

  1. The init block, which is translated from the init statement of the reduction block;

  2. The update block, which is the original block without init statement.

The init block is inserted right before the given loop.

The schedule primitive requires:

  1. The input block is a reduction block.

  2. The input loop is the ancestor of the block.

  3. The input loop is not lower than all the loops related to reduce block var.

Parameters#

block: Union[BlockRV, str]

The reduction block to be decomposed

loop: LoopRV

The loop above which the init block is inserted.

Returns#

init_block: BlockRV

The init block

Examples#

Before decompose-reduction, in TensorIR, the IR is:

@T.prim_func
def before_decompose(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [128, 128])
    B = tir.match_buffer(b, [128, 128])
    C = tir.match_buffer(c, [128, 128])
    for i, j, k in tir.grid(128, 128, 128):
        with tir.block([128, 128, tir.reduce_axis(0, 128)], "C") as [vi, vj, vk]:
            with tir.init():
                C[vi, vj] = 0.0
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

Create the schedule and do decompose-reduction with specified loop:

sch = tir.Schedule(before_decompose)
C = sch.get_block("C")
i, j, k = sch.get_loops(C)
sch.decompose_reduction(C, i)
print(sch.mod["main"].script())

After applying decompose-reduction, the IR becomes:

@T.prim_func
def after_decompose(a: ty.handle, b: ty.handle, c: ty.handle) -> None:
    A = tir.match_buffer(a, [128, 128])
    B = tir.match_buffer(b, [128, 128])
    C = tir.match_buffer(c, [128, 128])
    for i in tir.serial(128):
        for j in tir.serial(128):
            with tir.block([128, 128]) as [vi, vj]:
                C[vi, vj] = 0.0
    for i, j, k in tir.grid(128, 128, 128):
        with tir.block([128, 128, tir.reduce_axis(0, 128)], "C") as [vi, vj, vk]:
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]
Return type:

BlockRV

enter_postproc()[source]#

A no-op that marks the start of the postprocessing phase of scheduling

Return type:

None

fork_seed()[source]#

Returns a forked random state as seed for new schedules

Returns#

seed: int

The forked random state, not the same as the current random state

Return type:

int
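
A hedged sketch of forking the random state to derive an independent schedule; before_fuse refers to the workload shown under fuse just below:

# Hedged sketch: derive a schedule whose random state is forked from an existing one.
sch = tir.Schedule(before_fuse, seed=42)
new_sch = tir.Schedule(before_fuse, seed=sch.fork_seed())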

fuse(*loops, preserve_unit_iters=True)[source]#

Fuse a list of consecutive loops into one. It requires: 1) The loops can't have annotations or thread bindings. 2) The (i+1)-th loop must be the only child of the i-th loop. 3) All loops must start with 0. 4) The domain of a loop to be fused cannot depend on another loop to be fused.

Parameters#

*loops: List[LoopRV]

The loops to be fused

Returns#

fused_loop: LoopRV

The new loop after fusion

Examples#

Before applying fuse, in TensorIR, the IR is:

@T.prim_func
def before_fuse(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do fuse:

sch = tir.Schedule(before_fuse)
i, j = sch.get_loops(sch.get_block("B"))
sch.fuse(i, j)
print(sch.mod["main"].script())

After applying fuse, the IR becomes:

@T.prim_func
def after_fuse(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    # the 2 loops are fused into 1
    for i_j_fused in T.serial(0, 16384):
        with T.block("B"):
            vi = T.axis.S(128, T.floordiv(i_j_fused, 128))
            vj = T.axis.S(128, T.floormod(i_j_fused, 128))
            B[vi, vj] = A[vi, vj] * 2.0
Return type:

LoopRV

get(rand_var_or_sref)[source]#

Returns: - the corresponding Block that a BlockRV evaluates to; - the corresponding For that a LoopRV evaluates to; - the corresponding integer that an ExprRV evaluates to; - the corresponding Block that a block sref points to; - the corresponding For that a loop sref points to

Parameters#

rand_var_or_sref: Union[ExprRV, BlockRV, LoopRV, StmtSRef]

The random variable / sref to be evaluated

Returns#

result: Optional[Union[int, Block, For]]

The corresponding result

Parameters:

rand_var_or_sref (PrimExpr | BlockRV | LoopRV | StmtSRef)

Return type:

int | Block | For | None
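
A hedged sketch showing how random variables are evaluated back to IR nodes; before_fuse refers to the workload shown under fuse above:

# Hedged sketch: evaluate random variables to the IR nodes they refer to.
sch = tir.Schedule(before_fuse)
block_rv = sch.get_block("B")
loop_rv = sch.get_loops(block_rv)[0]
print(type(sch.get(block_rv)))   # tvm.tir.Block
print(type(sch.get(loop_rv)))    # tvm.tir.For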

get_block(name, func_name=None)[source]#

Retrieve a block in a specific function with its name

By default, if func_name is not specified, the schedule will search for the block in the function that is currently being "worked on". To switch the function to be worked on, use work_on before calling this method.

Parameters#

name: str

The name of the block

func_name: Optional[str] = None

The name of the function

Returns#

block: BlockRV

The block retrieved. IndexError is raised if 0 or multiple blocks exist with the specific name.

Parameters:
  • name (str)

  • func_name (str | None)

Return type:

BlockRV
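
A hedged sketch of retrieving blocks from a multi-function module; the function name "add" and block name "B" are assumptions, and work_on is used as described above:

# Hedged sketch: select the PrimFunc to work on, then fetch a block by name.
sch.work_on("add")                # select the PrimFunc named "add" in the module
block = sch.get_block("B")        # same as sch.get_block("B", func_name="add")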

get_child_blocks(block_or_loop)[source]#

Get the leaf blocks of a specific block/loop

Parameters#

block_or_loop: Union[BlockRV, LoopRV]

The query block/loop

Returns#

blocks: List[BlockRV]

A list of leaf blocks inside a specific block/loop

Parameters:

block_or_loop (BlockRV | LoopRV)

Return type:

List[BlockRV]

get_consumers(block)[source]#

Get the consumers of a specific block

Parameters#

block: Union[BlockRV, str]

The block in the query

Returns#

consumers: List[BlockRV]

A list of consumers of the given block

Parameters:

block (BlockRV | str)

Return type:

List[BlockRV]

get_loops(block)[source]#

Get the parent loops of the block in its scope, from outer to inner

Parameters#

block: Union[BlockRV, str]

The query block

Returns#

loops: List[LoopRV]

A list of loops above the given block in its scope, from outer to inner

Parameters:

block (BlockRV | str)

Return type:

List[LoopRV]
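
A hedged sketch of the usual get_block/get_loops pattern used throughout the examples on this page; before_fuse refers to the workload shown under fuse above:

# Hedged sketch: fetch the loops above block "B" (outer to inner), then split the outer one.
sch = tir.Schedule(before_fuse)
i, j = sch.get_loops(sch.get_block("B"))
i0, i1 = sch.split(i, factors=[None, 32])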

get_output_blocks(scope_block)[source]#

Get the list of output blocks within the given scope. An output block is a block which writes to at least one buffer that is not allocated within the PrimFunc.

Parameters#

scope_block: Union[BlockRV, str]

The scope block from which output blocks are collected

Returns#

output_blocks: List[BlockRV]

A list of all blocks that write to some output buffer

Parameters:

scope_block (BlockRV | str)

Return type:

List[BlockRV]

get_producers(block)[source]#

Get the producers of a specific block

Parameters#

block: Union[BlockRV, str]

The block in the query

Returns#

producers: List[BlockRV]

A list of producers of the given block

Parameters:

block (BlockRV | str)

Return type:

List[BlockRV]
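
A hedged sketch of walking the dataflow around a block; before_inline refers to the two-block workload shown under compute_inline above:

# Hedged sketch: inspect the producer and consumer blocks around blocks "B" and "C".
sch = tir.Schedule(before_inline)
for p in sch.get_producers(sch.get_block("C")):
    print("producer:", sch.get(p).name_hint)   # block "B"
for q in sch.get_consumers(sch.get_block("B")):
    print("consumer:", sch.get(q).name_hint)   # block "C"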

get_sref(rand_var_or_stmt)[source]#

Returns the corresponding sref to the given 1) LoopRV 2) BlockRV 3) Block 4) For

Parameters#

rand_var_or_stmt: Union[BlockRV, LoopRV, Block, For]

The random variable / statement to be evaluated

Returns#

result: Optional[StmtSRef]

The corresponding result

Parameters:

rand_var_or_stmt (BlockRV | LoopRV | Block | For)

Return type:

StmtSRef | None
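
A hedged sketch relating random variables to srefs; before_fuse refers to the workload shown under fuse above:

# Hedged sketch: obtain the sref behind a BlockRV and inspect the statement it points to.
sch = tir.Schedule(before_fuse)
sref = sch.get_sref(sch.get_block("B"))
print(sref.stmt.name_hint)    # "B": the underlying tir.Block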

loop_partition(loop, factors, preserve_unit_iters=True)[source]#

Partition a loop into a list of consecutive loops. It requires: 1) The loop can't have an annotation or thread binding. Predicates may be added to ensure the total number of iterations stays unchanged. In factors, at most one of the factors can be None, which will be automatically inferred.

Parameters#

loop: LoopRV

The loop to be partitioned

factors: List[Union[int, ExprRV, None]]

The partitioning factors. Potential inputs are: - None - ExprRV - Positive constant integers

preserve_unit_iters: bool

Whether or not to preserve unit iterators in block bindings

Returns#

partition_loops: List[LoopRV]

The new loops after partition

Examples#

Before partition, in TensorIR, the IR is:

@T.prim_func
def before_partition(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do partition:

sch = tir.Schedule(before_partition)
i, j = sch.get_loops(sch.get_block("B"))
sch.loop_partition(i, factors=[2, 64])
print(sch.mod["main"].script())

After applying partition, the IR becomes:

def after_partition(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    # the original loop is partitioned into 3 loops
    with T.block("root"):
        T.reads()
        T.writes()
        with T.block("B_i_common"):
            T.reads()
            T.writes()
            with T.block("B_i0_partition"):
                T.reads()
                T.writes()
                for i0, j in T.grid(2, 128):
                    with T.block("B_i0"):
                        vi, vj = T.axis.remap("SS", [i0, j])
                        T.reads(A[0:2, 0:128])
                        T.writes(B[0:2, 0:128])
                        B[vi, vj] = A[vi, vj] * T.float32(2)
            with T.block("B_i1_partition"):
                T.reads()
                T.writes()
                for i1 in range(2, 66):
                    for j in range(128):
                        with T.block("B_i1"):
                            vi, vj = T.axis.remap("SS", [i1, j])
                            T.reads(A[2:66, 0:128])
                            T.writes(B[2:66, 0:128])
                            B[vi, vj] = A[vi, vj] * T.float32(2)
            with T.block("B_partition_2"):
                T.reads()
                T.writes()
                for i2 in range(66, 128):
                    for j in range(128):
                        with T.block("B_i2"):
                            vi, vj = T.axis.remap("SS", [i2, j])
                            T.reads(A[66:128, 0:128])
                            T.writes(B[66:128, 0:128])
                            B[vi, vj] = A[vi, vj] * T.float32(2)
Return type:

List[LoopRV]

merge(*loops)[source]#

Merge a list of loops into one. The loops under their LCA are required to: 1) be under the same scope; 2) have no annotations or thread bindings; 3) start at 0, and have the same extent and the same nesting depth; 4) from the target loops to their LCA, the inner loop must be the only child of the outer loop.

Parameters#

*loops: List[LoopRV]

The loops to be merged

Returns#

fused_loop: LoopRV

The new loop after merge

Examples#

Before applying merge, in TensorIR, the IR is:

@T.prim_func
def before_merge(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    C = T.match_buffer(c, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do merge:

sch = tir.Schedule(before_merge)
i1, _ = sch.get_loops(sch.get_block("B"))
i2, _ = sch.get_loops(sch.get_block("C"))
sch.merge(i1, i2)
print(sch.mod["main"].script())

After applying merge, the IR becomes:

@T.prim_func
def after_merge(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    C = T.match_buffer(c, (128, 128))
    # the 2 loops are merged into 1
    for i_m in range(128):
        for j in range(128):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i_m, j])
                T.reads(A[vi, vj])
                T.writes(B[vi, vj])
                B[vi, vj] = A[vi, vj] * T.float32(2)
        for j in range(128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i_m, j])
                T.reads(A[vi, vj])
                T.writes(C[vi, vj])
                C[vi, vj] = A[vi, vj] * T.float32(2)
Parameters:

loops (List[LoopRV])

Return type:

LoopRV

pad_einsum(block, padding)[source]#

Pad the computation of Einsum.

On a block with trivial binding, this primitive pads the iteration domain of the block by the given padding factors, for example, 127 -> 128, 132 -> 144 when padding factor is 16. Extra producer and consumer padding blocks will be generated to avoid out-of-bound buffer access.

Einsum pattern means all the indices on the buffer access are either by constants (e.g. B[0]) or by variables (e.g. B[i]), but not by composite expressions (e.g. B[i + 1]).

Parameters#

block: Union[BlockRV, str]

The block that matches the Einsum pattern.

padding: List[int]

The padding for each block iter.

Examples#

Before applying pad-einsum, in TensorIR, the IR is:

@T.prim_func
def before_pad_einsum(
    A: T.Buffer((127, 127), "float32"),
    B: T.Buffer((127, 127), "float32"),
    C: T.Buffer((127, 127), "float32"),
) -> None:
    for i0, i1, i2 in T.grid(127, 127, 127):
        with T.block("C_shared"):
            i, j, k = T.axis.remap("SSR", [i0, i1, i2])
            with T.init():
                C[i, j] = T.float32(0)
            C[i, j] = C[i, j] + A[i, k] * B[k, j]

Create the schedule and do pad-einsum with specified block:

sch = tir.Schedule(before_pad_einsum, debug_mask="all")
block = sch.get_block("C_shared")
sch.pad_einsum(block, [32, 32, 32])
print(sch.mod["main"].script())

After applying pad-einsum, the IR becomes:

@T.prim_func
def main(
    A: T.Buffer((127, 127), "float32"),
    B: T.Buffer((127, 127), "float32"),
    C: T.Buffer((127, 127), "float32"),
):
    # with T.block("root"):
    A_pad = T.alloc_buffer((128, 128))
    B_pad = T.alloc_buffer((128, 128))
    C_pad = T.alloc_buffer((128, 128))
    for i0, i1 in T.grid(128, 128):
        with T.block("A_pad"):
            v0, v1 = T.axis.remap("SS", [i0, i1])
            A_pad[v0, v1] = T.if_then_else(
                v0 < 127 and v1 < 127,
                A[v0, v1],
                T.float32(0),
            )
    for i0, i1 in T.grid(128, 128):
        with T.block("B_pad"):
            v0, v1 = T.axis.remap("SS", [i0, i1])
            B_pad[v0, v1] = T.if_then_else(
                v0 < 127 and v1 < 127,
                B[v0, v1],
                T.float32(0),
            )
    for i0, i1, i2 in T.grid(128, 128, 128):
        with T.block("C_shared"):
            i, j, k = T.axis.remap("SSR", [i0, i1, i2])
            with T.init():
                C_pad[i, j] = T.float32(0)
            C_pad[i, j] = C_pad[i, j] + A_pad[i, k] * B_pad[k, j]
    for i0, i1 in T.grid(127, 127):
        with T.block("C_pad"):
            v0, v1 = T.axis.remap("SS", [i0, i1])
            C[v0, v1] = C_pad[v0, v1]
Return type:

None

parallel(loop)[source]#

Parallelize the input loop. It requires: 1) The scope block that the loop is in should have the stage-pipeline property 2) All the blocks under the loop are complete blocks or reduction blocks, and have affine bindings 3) For each block under the loop, the loop can only be contained in data-parallel block iters' bindings

Parameters#

loop: LoopRV

The loop to be parallelized

Examples#

Before parallel, in TensorIR, the IR is:

@T.prim_func
def before_parallel(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do parallel:

sch = tir.Schedule(before_parallel)
i, j = sch.get_loops(sch.get_block("B"))
sch.parallel(i)

After applying parallel, the IR becomes:

@T.prim_func
def after_parallel(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i in T.parallel(0, 128):
        for j in T.serial(0, 128):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0
Parameters:

loop (LoopRV)

Return type:

None

reindex(block, buffer)[source]#

Create a block that reads/writes a buffer region into a read/write cache with reindexing. The layout of the cache will be determined by the iterators of the block that reads/writes the buffer. It requires: 1) There is only one block that reads/writes the target buffer 2) There is only one buffer load/store of this buffer in the block

Parameters#

block : Union[BlockRV, str]

The block that accesses the target buffer. If a string, this must uniquely identify a block.

buffer: Union[Tuple[str,int], Buffer, str]

The buffer to be transformed, or a specification of how to identify the buffer to be transformed.

If buffer is a tuple of (str, int), the first item should be either "read" or "write", and the second item is an index into the block's read or write regions.

If buffer is a string, it is the name of the buffer, which must exist within the reads/writes of the block. In addition, the reads/writes of the block may not contain more than one buffer with this name.

If buffer is a Buffer object, it must exist within the reads/writes of the block.

Returns#

reindex_block: BlockRV

The block of the reindex stage

Examples#

Before reindex, in TensorIR, the IR is:

@T.prim_func
def before_reindex(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32")
) -> None:
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vj, vi] * 2.0

Create the schedule and do reindex:

sch = tir.Schedule(before_reindex)
block = sch.get_block("B")
sch.reindex(block, ("read", 0))

After applying reindex, the IR becomes:

@T.prim_func
def after_reindex(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32")
) -> None:
    A_reindex = T.alloc_buffer((128, 128), "float32")
    for i, j in T.grid(128, 128):
        with T.block("A_reindex"):
            vi, vj = T.axis.remap("SS", [i, j])
            A_reindex[vi, vj] = A[vj, vi]
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A_reindex[vi, vj] * 2.0
Return type:

BlockRV

reindex_cache_read(block, read_buffer_index, storage_scope, index_map)[source]#

Create a block that reads a buffer region into a read cache using customized indices specified by index map. The read region of the buffer must be a single point.

The cache stage block follows the original order of loops and block itervars in the block. If a block itervar does not appear in the buffer access region, it and its corresponding loop variables will be omitted. Users can then use the transform_block_layout primitive to reorder the block itervars and the surrounding loops of the cache read/write block.

Unlike cache_read, reindex_cache_read only supports a single consumer; please use cache_read when there are multiple consumers.

Parameters#

block: BlockRV

The consumer block of the target buffer.

read_buffer_index: int

The index of the buffer in block's read region.

storage_scope: str

The target storage scope.

index_map: Union[IndexMap, Callable]

User defined indices to access allocated cache buffer, maps from block iter vars.

Returns#

cached_block: BlockRV

The block of the cache stage

Examples#

Before reindex_cache_read, in TensorIR, the IR is:

@T.prim_func
def before_reindex_cache_read(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and reindex_cache_read:

sch = tir.Schedule(before_reindex_cache_read)
block_b = sch.get_block("B")
sch.reindex_cache_read(block_b, 0, "local", lambda vi, vj: (vj, vi))
print(sch.mod["main"].script())

After applying reindex_cache_read, the IR becomes:

@T.prim_func
def after_reindex_cache_read(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    A_local = T.alloc_buffer((128, 128), scope="local")
    for i, j in T.grid(128, 128):
        with T.block("A_local"):
            vi, vj = T.axis.remap("SS", [i, j])
            A_local[vj, vi] = A[vi, vj]
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A_local[vj, vi] * 2.0

See Also#

reindex_cache_write, transform_block_layout, transform_layout, cache_read, reindex

Return type:

BlockRV

reindex_cache_write(block, write_buffer_index, storage_scope, index_map)[source]#

Create a block that reads a buffer region into a write cache using customized indices specified by index map. The write region of the buffer must be a single point.

The cache stage block follows the original order of loops and block itervars in the block. If a block itervar does not appear in the buffer access region, it and its corresponding loop variables will be omitted. Users can then use the transform_block_layout primitive to reorder the block itervars and the surrounding loops of the cache read/write block.

Unlike cache_write, reindex_cache_write only supports a single consumer; please use cache_write when there are multiple consumers.

Parameters#

block: Union[BlockRV, str]

The consumer block of the target buffer.

write_buffer_index: int

The index of the buffer in block's write region.

storage_scope: str

The target storage scope.

index_map: Union[Callable, IndexMap]

User defined indices to access allocated cache buffer, maps from block iter vars.

Returns#

cached_block: BlockRV

The block of the cache stage

Examples#

Before reindex_cache_write, in TensorIR, the IR is:

@T.prim_func
def before_reindex_cache_write(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and reindex_cache_write:

sch = tir.Schedule(before_reindex_cache_write)
block_b = sch.get_block("B")
sch.reindex_cache_write(block_b, 0, "local", lambda vi, vj: (vi // 2, vi % 2, vj))
print(sch.mod["main"].script())

After applying reindex_cache_write, the IR becomes:

@T.prim_func
def after_reindex_cache_write(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    B_local = T.alloc_buffer((64, 2, 128), scope="local")
    for i, j in T.grid(128, 128):
        with T.block("B_local"):
            vi, vj = T.axis.remap("SS", [i, j])
            B_local[vi // 2, vi % 2, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = B_local[vi // 2, vi % 2, vj]

See Also#

reindex_cache_read, transform_block_layout, transform_layout, cache_write, reindex

Return type:

BlockRV

remove_rv(rand_var)[source]#

Remove a random variable from the symbol table

Parameters#

rand_var: Union[BlockRV, LoopRV, ExprRV]

The random variable to be removed

Parameters:

rand_var (PrimExpr | BlockRV | LoopRV)

Return type:

None
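
A hedged sketch of discarding a random variable once it is no longer needed; before_fuse refers to the workload shown under fuse above:

# Hedged sketch: drop a random variable from the symbol table.
sch = tir.Schedule(before_fuse)
block = sch.get_block("B")
sch.remove_rv(block)          # `block` can no longer be passed to schedule primitives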

reorder(*ordered_loops)[source]#

Reorder a list of loops. It doesn't require the loops to be consecutive. It requires: 1) The loops are in the same chain. That means: the loops can be ordered to [l_1, l_2, ... , l_n] where l_i is an ancestor of l_{i+1} and there are only single-branch loops between l_1 and l_n (which also indicates they are under the same scope). 2) After reordering, the domain of an outer loop cannot depend on any of the inner loops. 3) For every block under the loop nests, its block binding must be affine, and the block variables must be either data parallel or reduction. 4) No duplicated loops are allowed in the arguments.

Parameters#

*ordered_loops: List[LoopRV]

The loops in the new order

Examples#

Before reorder, in TensorIR, the IR is:

@T.prim_func
def before_reorder(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do reorder:

sch = tir.Schedule(before_reorder)
i, j = sch.get_loops(sch.get_block("B"))
sch.reorder(j, i)
print(sch.mod["main"].script())

After applying reorder, the IR becomes:

@T.prim_func
def after_reorder(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    # Here j and i are reordered
    for j, i in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
Parameters:

ordered_loops (List[LoopRV])

Return type:

None

reorder_block_iter_var(block, new_order)[source]#

Reorder the itervars inside a given block.

Parameters#

block: BlockRV

The block to be transformed.

new_order: List[int]

The new block itervar order.

Examples#

Before reorder_block_iter_var, in TensorIR, the IR is:

@T.prim_func
def matmul(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32"),
    C: T.Buffer((128, 128), "float32"),
) -> None:
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = 0.0
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

Create the schedule and do reorder_block_iter_var:

sch = tir.Schedule(matmul)
C = sch.get_block("C")
sch.reorder_block_iter_var(C, [2, 1, 0])

After applying reorder_block_iter_var, the IR becomes:

@T.prim_func
def matmul_after_reorder_block_iter_var(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32"),
    C: T.Buffer((128, 128), "float32"),
):
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vk, vj, vi = T.axis.remap("RSS", [k, j, i])
            T.reads(A[vi, vk], B[vj, vk])
            T.writes(C[vi, vj])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

See Also#

reorder

参数:
返回类型:

None

reverse_compute_at(block, loop, preserve_unit_loops=False, index=-1)[源代码]#

Reverse-Compute-At. Move a consumer block under the specific loop, and regenerate the loops induced by the block so that the buffer region consumed by the consumer block covers the regions produced by its producer blocks under the given loop. It requires:

  1. block and loop are under the same scope, and loop is not an ancestor of block

  2. The scope block has stage-pipeline property

  3. The subtree of the scope block where the given block is located satisfies the compact dataflow condition, i.e. all the blocks in the scope block's subtree must be either complete blocks or reduction blocks

  4. All the producers of the block are under the given loop

Parameters#

blockUnion[BlockRV, str]

The block to be moved

loop: LoopRV

The loop where the block to be moved under

preserve_unit_loops: bool

Whether to keep the trivial loops whose extents are 1

index: int

The block index of the loop body subtree blocks:

  • index = -1 means inserted into the last possible insertion point;

  • index = -2 means inserted into the first possible insertion point;

  • Otherwise, index is a nonnegative number that indicates the insertion point

Examples#

Before reverse-compute-at, in TensorIR, the IR is:

@T.prim_func
def before_reverse_compute_at(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.alloc_buffer((128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do reverse-compute-at:

sch = tir.Schedule(before_reverse_compute_at)
block = sch.get_block("C")
loop, _ = sch.get_loops(sch.get_block("B"))
sch.reverse_compute_at(block, loop, preserve_unit_loops=False)
print(sch.mod["main"].script())

After applying reverse-compute-at, the IR becomes:

@T.prim_func
def after_reverse_compute_at(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.alloc_buffer((128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i in T.serial(0, 128):
        for j in T.serial(0, 128):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0
        for j in T.serial(0, 128):
            with T.block("C"):
                vi, vj = T.axis.remap("SS", [i, j])
                C[vi, vj] = B[vi, vj] + 1.0
参数:
返回类型:

None

reverse_compute_inline(block)[源代码]#

Inline a block into its only producer. It requires:

  1. The block is a complete non-root block, which only produces and consumes one buffer

  2. The block must not be the only leaf in the scope.

  3. The only producer of the block is a read-after-write producer and a complete non-root block

  4. The body of the block must be a BufferStore statement of the form B[f(i, j, k, ...)] = g(i, j, k, A[i, j, k, ...], ...), where the indices of each BufferLoad on the RHS are all distinct atomic variables, and no variables other than those indexing variables are allowed in the statement.

Parameters#

blockUnion[BlockRV, str]

The block to be inlined to its producer

Examples#

Before reverse-compute-inline, in TensorIR, the IR is:

@T.prim_func
def before_inline(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.alloc_buffer((128, 128))
    C = T.match_buffer(c, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do reverse-compute-inline:

sch = tir.Schedule(before_inline)
sch.reverse_compute_inline(sch.get_block("C"))
print(sch.mod["main"].script())

After applying reverse-compute-inline, the IR becomes:

@T.prim_func
def after_inline(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    C = T.match_buffer(c, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = A[vi, vj] * 2.0 + 1.0
参数:

block (BlockRV | str)

返回类型:

None

rfactor(loop, factor_axis)[源代码]#

Factorize an associative reduction block by the specified loop.

An associative reduction cannot be parallelized directly, because it leads to a potential race condition during accumulation. Alternatively, the reduction can be factorized on a loop with the following steps:

  • Step 1: evenly slice the reduction into n separate chunks, where n is the loop extent;

  • Step 2: compute the chunks separately and write the results into n intermediate buffers;

  • Step 3: accumulate the n separate buffers into the result buffer.

Note that Step 2 above introduces opportunities for parallelization.

RFactor is a schedule primitive that implements the transformation described above: Given a block that writes to buffer B, it factorizes a loop of extent n.

For example, the pseudocode below accumulates B[i] = sum(A[i, : , : ]):

for i in range(128):                    # loop i is a data parallel loop
    for j in range(128):                # loop j is a reduction loop
        for k in range(128):            # loop k is a reduction loop
            B[i] = B[i] + A[i, j, k]

Suppose RFactor is applied on the innermost loop k and factor_axis = 1. RFactor then creates an intermediate buffer and two blocks.

1. The intermediate buffer, or "rf-buffer", is a buffer of rank ndim(B) + 1 and size size(B) * n, whose shape expands from shape(B) by adding an axis of extent n at the position specified by factor_axis. For example,

  • shape(B) = [1, 2, 3], factor_axis = 0 => shape(B_rf) = [n, 1, 2, 3]

  • shape(B) = [1, 2, 3], factor_axis = 1 => shape(B_rf) = [1, n, 2, 3]

  • shape(B) = [1, 2, 3], factor_axis = 2 => shape(B_rf) = [1, 2, n, 3]

  • shape(B) = [1, 2, 3], factor_axis = 3 => shape(B_rf) = [1, 2, 3, n]

2. The rfactor block, or "rf-block", is a block that writes to the rf-buffer without accumulating over the loop k, i.e. the loop k is converted from a reduction loop to a data parallel loop. In our example, the rf-block is:

B_rf = np.zeros((128, 128))     # the rf-buffer
for k in range(128):            # loop k is converted to a data parallel loop
    for i in range(128):        # loop i is a data parallel loop (unchanged)
        for j in range(128):    # loop j is a reduction loop (unchanged)
            B_rf[i, k] = B_rf[i, k] + A[i, j, k]

3. The write-back block, or wb-block, is a block that accumulates the rf-buffer into the result buffer. All the reduction loops are removed except the loop k for accumulation. In our example, the wb-block is:

for i in range(128):            # loop i is a data parallel loop (unchanged)
                                # loop j is removed because it is a reduction loop
    for k in range(128):        # loop k is a reduction loop (unchanged)
        B[i] = B[i] + B_rf[i, k]

Parameters#

loopLoopRV

The loop outside block for which we want to do rfactor

factor_axisint

The position where the new dimension is placed in the new introduced rfactor buffer

Returns#

rf_blockBlockRV

The block which computes partial results over each slice (i.e., the first block as described in the above illustration)

Examples#

Before rfactor, in TensorIR, the IR is:

@T.prim_func
def before_rfactor(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128, 128))
    B = T.match_buffer(b, (128,))
    for ii, i, j in T.grid(128, 128, 128):
        with T.block("B"):
            vii, vi, vj = T.axis.remap("SRR", [ii, i, j])
            with T.init():
                B[vii] = 0.0
            B[vii] = B[vii] + A[vii, vi, vj]

Create the schedule and do rfactor:

sch = tir.Schedule(before_rfactor)
_, _, k = sch.get_loops(sch.get_block("B"))
sch.rfactor(k, 0)
print(sch.mod["main"].script())

After applying rfactor, the IR becomes:

@T.prim_func
def after_rfactor(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, [128, 128, 128])
    B = T.match_buffer(b, [128])
    B_rf = T.alloc_buffer([128, 128])
    for i2, ii, i in T.grid(128, 128, 128):
        with T.block("B_rf"):
            vi2, vii, vi = T.axis.remap("SSR", [i2, ii, i])
            with T.init():
                B_rf[vi2, vii] = 0.0
            B_rf[vi2, vii] = (B_rf[vi2, vii] + A[vii, vi, vi2])
    for ii, i2 in T.grid(128, 128):
        with T.block("B"):
            vii, vi2 = T.axis.remap("SR", [ii, i2])
            with T.init():
                B[vii] = 0.0
            B[vii] = B[vii] + B_rf[vi2, vii]

Note#

Rfactor requires:

  1. loop has only one child block, and it is a reduction block;

  2. loop is a reduction loop, i.e. the loop variable is bound to only reduction variables in the block binding;

  3. loop is not parallelized, vectorized, unrolled or bound to any thread axis;

  4. The block scope that loop is in is a stage-pipeline;

  5. The outermost loop outside the reduction block should have the reduction block as its first child block;

  6. The outermost reduction loop should have only one child block;

  7. A unary-extent loop (i.e. a loop with extent 1) that is not bound to any reduction or data parallel variables in the block binding should not appear under some reduction loop;

  8. The reduction block should write to only one buffer, its init and body should both be simple BufferStore statements, and the pattern should be registered as an associative reducer. The pre-defined patterns include: plus, multiplication, min and max;

  9. Each of the loops on top of the block cannot be bound to a data parallel and a reduction block binding at the same time;

  10. factor_axis should be in range [-ndim(B) - 1, ndim(B)], where B is the buffer that the reduction block writes to. Negative indexing is normalized according to the numpy convention.

参数:
返回类型:

BlockRV

rolling_buffer(block, write_buffer_index)[源代码]#

Compute the target buffer via rolling buffering: select, as the rolling axis, the outermost rollable axis with a positive bound overlap that appears in the block's ancestor loops; fold and circularize the buffer along the rolling dimension; and append a block predicate to avoid recomputing overlapping elements. It requires:

  1. The block is not an output block and has only RAW dependencies.

  2. The buffer is an intermediate buffer defined via alloc_buffer.

  3. The LCA of the producer and consumer of the buffer is a for loop; typically, the producer and consumer of the buffer are cascaded through compute_at.

  4. The access region of the buffer has at least one dimension that contains a positive bound overlap.

Parameters#

blockUnion[BlockRV, str]

The producer block of the buffer.

write_buffer_indexint

The index of the buffer in block's write region.

Examples#

Before rolling_buffer, in TensorIR, the IR is:

@T.prim_func
def before_rolling_buffer(
    A: T.Buffer((12, 12), "int8"), C: T.Buffer((8, 8), "int8")
) -> None:
    # body
    # with T.block("root")
    B = T.alloc_buffer([10, 10], dtype="int8")
    for i0, i1 in T.grid(2, 2):
        for ax0, ax1, ax2, ax3 in T.grid(6, 6, 3, 3):
            with T.block("B"):
                ax0_1 = T.axis.spatial(10, i0 * 4 + ax0)
                ax1_1 = T.axis.spatial(10, i1 * 4 + ax1)
                rv0, rv1 = T.axis.remap("RR", [ax2, ax3])
                B[ax0_1, ax1_1] = T.max(
                    B[ax0_1, ax1_1], A[ax0_1 + rv0, ax1_1 + rv1]
                )
        for ax0, ax1, ax2, ax3 in T.grid(4, 4, 3, 3):
            with T.block("C"):
                ax0_1 = T.axis.spatial(8, i0 * 4 + ax0)
                ax1_1 = T.axis.spatial(8, i1 * 4 + ax1)
                rv0, rv1 = T.axis.remap("RR", [ax2, ax3])
                C[ax0_1, ax1_1] = T.max(
                    C[ax0_1, ax1_1], B[ax0_1 + rv0, ax1_1 + rv1]
                )

Create the schedule and do rolling_buffer:

sch = tir.Schedule(before_rolling_buffer)
sch.rolling_buffer(sch.get_block("B"), write_buffer_index=0)
print(sch.mod["main"].script())

After applying rolling_buffer, the IR becomes:

@T.prim_func
def after_rolling_buffer(
    A: T.Buffer((12, 12), "int8"),
    C: T.Buffer((8, 8), "int8")
) -> None:
    # body
    # with T.block("root")
    B = T.alloc_buffer([6, 10], dtype="int8")
    for i0, i1 in T.grid(2, 2):
        for ax0, ax1, ax2, ax3 in T.grid(6, 6, 3, 3):
            with T.block("B"):
                T.where((i0 < 1 or 2 <= ax0) and (i1 < 1 or 2 <= ax1))
                ax0_1 = T.axis.spatial(10, i0 * 4 + ax0)
                ax1_1 = T.axis.spatial(10, i1 * 4 + ax1)
                rv0, rv1 = T.axis.remap("RR", [ax2, ax3])
                B[ax0_1 % 6, ax1_1] = T.max(
                    B[ax0_1 % 6, ax1_1], A[ax0_1 + rv0, ax1_1 + rv1]
                )
        for ax0, ax1, ax2, ax3 in T.grid(4, 4, 3, 3):
            with T.block("C"):
                ax0_1 = T.axis.spatial(8, i0 * 4 + ax0)
                ax1_1 = T.axis.spatial(8, i1 * 4 + ax1)
                rv0, rv1 = T.axis.remap("RR", [ax2, ax3])
                C[ax0_1, ax1_1] = T.max(
                    C[ax0_1, ax1_1], B[ax0_1 % 6 + rv0, ax1_1 + rv1]
                )

Note#

The region_cover property of the consumer block of the target buffer will become false.

参数:
返回类型:

None

sample_categorical(candidates, probs, decision=None)[源代码]#

Sample an integer given the probability distribution

Parameters#

candidatesList[int]

The candidates to be sampled from

probsList[float]

The probability of each candidate

decisionOptional[int]

The sampling decision, if any

Returns#

resultExprRV

The random variable sampled from candidates
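A minimal usage sketch (not part of the original documentation; it reuses the matmul PrimFunc from the reorder_block_iter_var example above, and the candidates and probabilities are made up):

sch = tir.Schedule(matmul)
v = sch.sample_categorical(candidates=[1, 2, 4, 8], probs=[0.1, 0.2, 0.3, 0.4])
print(sch.get(v))  # sch.get resolves the ExprRV to the concrete sampled integer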

参数:
返回类型:

PrimExpr

sample_compute_location(block, decision=None)[源代码]#

Sample a compute-at location of the given block

Parameters#

blockUnion[BlockRV, str]

The block whose compute-at location is to be sampled

decisionOptional[int]

The sampling decision

Returns#

resultLoopRV

The sampled loop where the input block is to be computed at
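A minimal usage sketch (assuming the two-block workload before_reverse_compute_at shown later on this page; the result depends on the randomly sampled location unless a decision is supplied):

sch = tir.Schedule(before_reverse_compute_at)
block = sch.get_block("B")
loc = sch.sample_compute_location(block)
sch.compute_at(block, loc, preserve_unit_loops=True)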

参数:
返回类型:

LoopRV

sample_partitioned_tile(loop, n, partition_pos=0, innerpart_factor=1, decision=None)[源代码]#

Sample the factors to a partitioned tile for a specific loop

Parameters#

loopLoopRV

The loop to be tiled

nint

The number of tiles to be sampled

partition_posint

The position to partition tiles to two parts

innerpart_factorint

The factor of the second part

decision: Optional[List[int]]

The sampling decision, if any

Returns#

resultList[ExprRV]

A list of length n, the random partitioned tile sizes sampled
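A minimal usage sketch (not from the upstream docstring; it reuses the before_split PrimFunc shown under split below, with illustrative partition settings):

sch = tir.Schedule(before_split)
i, _ = sch.get_loops(sch.get_block("B"))
factors = sch.sample_partitioned_tile(i, n=2, partition_pos=1, innerpart_factor=4)
sch.split(i, factors=factors)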

参数:
返回类型:

List[PrimExpr]

sample_perfect_tile(loop, n, max_innermost_factor=16, decision=None)[源代码]#

Sample the factors to perfect tile a specific loop

Parameters#

loopLoopRV

The loop to be tiled

nint

The number of tiles to be sampled

max_innermost_factorint

The maximum tile size allowed to be sampled in the innermost loop

decision: Optional[List[int]]

The sampling decision, if any

Returns#

resultList[ExprRV]

A list of length n, the random perfect tile sizes sampled
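A minimal usage sketch (reusing the before_split PrimFunc shown under split below; the sampled factors always multiply to the loop extent):

sch = tir.Schedule(before_split)
i, _ = sch.get_loops(sch.get_block("B"))
i_factors = sch.sample_perfect_tile(i, n=2, max_innermost_factor=64)
i0, i1 = sch.split(i, factors=i_factors)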

参数:
返回类型:

List[PrimExpr]

seed(seed)[源代码]#

Seed the randomness

Parameters#

seedint

The new random seed; -1 means using a device-generated random seed, otherwise the value must be non-negative
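A small sketch of how seeding makes the sampling primitives reproducible (assuming any PrimFunc, e.g. before_split shown under split below):

sch = tir.Schedule(before_split)
sch.seed(42)
v = sch.sample_categorical([1, 2, 4], probs=[0.3, 0.3, 0.4])
# constructing a fresh schedule and calling seed(42) again replays the same samples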

参数:

seed (int)

返回类型:

None

set_axis_separator(block, buffer, axis_separators)[源代码]#

Set the axis separator of a buffer, where the buffer is specified by a block and a read or write index.

Parameters#

block : Union[BlockRV, str]

The block that accesses the target buffer. If a string, this must uniquely identify a block.

buffer: Union[Tuple[str,int], Buffer, str]

The buffer to be transformed, or a specification of how to identify the buffer to be transformed.

If buffer is a tuple of (str, int), the first item should be either "read" or "write", and the second item is an index into the block's read or write regions.

If buffer is a string, it is the name of the buffer, which must exist within the reads/writes of the block. In addition, the reads/writes of the block may not contain more than one buffer with this name.

If buffer is a Buffer object, it must exist within the reads/writes of the block.

axis_separators : Optional[List[int]]

The axis separators.

Examples#

Before set_axis_separator, in TensorIR, the IR is:

@T.prim_func
def before_set_axis_separator(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), dtype="float32")

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do set_axis_separator:

sch = tir.Schedule(before_set_axis_separator)
sch.set_axis_separator(sch.get_block("B"), buffer=("write", 0),
                       axis_separators=[1])
print(sch.mod["main"].script())

After applying set_axis_separator, the IR becomes:

@T.prim_func
def after_set_axis_separators(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer([128, 128], dtype="float32", axis_separators=[1])

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * T.float32(2)
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + T.float32(1)
参数:
返回类型:

None

set_scope(block, buffer_index, storage_scope)[源代码]#

Set the storage scope of a buffer, where the buffer is specified by a block and a write index.

Parameters#

blockUnion[BlockRV, str]

The producer block of the buffer

buffer_indexint

The index of the buffer in block's write region

storage_scopestr

The storage scope to be set

Examples#

Before set_scope, in TensorIR, the IR is:

@T.prim_func
def before_set_scope(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), dtype="float32")

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do set_scope:

sch = tir.Schedule(before_set_scope)
sch.set_scope(sch.get_block("B"), buffer_index=0, storage_scope="shared")
print(sch.mod["main"].script())

After applying set_scope, the IR becomes:

@T.prim_func
def after_set_scope(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B_shared = T.alloc_buffer([128, 128], dtype="float32", scope="shared")

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B_shared[vi, vj] = A[vi, vj] * T.float32(2)
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B_shared[vi, vj] + T.float32(1)

Note#

set_scope requires the buffer to be an intermediate buffer defined via alloc_buffer.

参数:
返回类型:

None

show(*args, **kwargs)[源代码]#

A sugar for printing the highlighted TVM script.

All parameters are forwarded to the underlying Module.show and Trace.show methods.
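A minimal usage sketch (assuming the before_split PrimFunc defined under split below; keyword arguments such as black_format are simply forwarded):

sch = tir.Schedule(before_split)
i, _ = sch.get_loops(sch.get_block("B"))
sch.unroll(i)
sch.show()  # prints the highlighted TVMScript of the module together with the recorded trace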

返回类型:

None

split(loop, factors, preserve_unit_iters=True, disable_predication=False)[源代码]#

Split a loop into a list of consecutive loops. It requires: 1) The loop can't have an annotation or a thread binding. 2) The loop must start from 0. Predicates may be added to ensure that the total number of iterations remains unchanged. In factors, at most one of the factors can be None, and it will be inferred automatically.

Parameters#

loopLoopRV

The loop to be split

factors: List[Union[int, ExprRV, None]]

The splitting factors. Potential inputs are: None, an ExprRV, or positive constant integers.

preserve_unit_itersbool

Whether or not to preserve unit iterators in block bindings

disable_predicationbool

If enabled, don't create a predicate for guarding the loop. This can be useful when splitting with scalable factors that the schedule writer knows are divisible by the loop bound.

Warning: enabling this feature may result in incorrect code generation if not used carefully.

Returns#

split_loopsList[LoopRV]

The new loops after split

Examples#

Before split, in TensorIR, the IR is:

@T.prim_func
def before_split(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do split:

sch = tir.Schedule(before_split)
i, j = sch.get_loops(sch.get_block("B"))
sch.split(i, factors=[2, 64])
print(sch.mod["main"].script())

After applying split, the IR becomes:

@T.prim_func
def after_split(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    # the original loop is split into 2 loops
    for i0, i1, j in T.grid(2, 64, 128):
        with T.block("B"):
            vi = T.axis.S(128, i0 * 64 + i1)
            vj = T.axis.S(128, j)
            B[vi, vj] = A[vi, vj] * 2.0
参数:
  • loop (LoopRV)

  • factors (List[int | PrimExpr | None])

  • preserve_unit_iters (bool)

  • disable_predication (bool)

返回类型:

List[LoopRV]

storage_align(block, buffer_index, axis, factor, offset)[源代码]#

Set the alignment requirement for a specific dimension such that stride[axis] == k * factor + offset for some k. This is useful for setting the memory layout to obtain a more friendly memory access pattern. For example, we can set the alignment to factor=2, offset=1 to avoid bank conflicts for thread access on the higher dimension in GPU shared memory.

Parameters#

blockUnion[BlockRV, str]

The producer block of the buffer.

buffer_indexint

The index of the buffer in block's write region.

axisint

The dimension to be specified for alignment.

factorint

The factor multiple of alignment.

offsetint

The required offset factor.

Examples#

Before storage_align, in TensorIR, the IR is:

@T.prim_func
def before_storage_align(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.alloc_buffer((128, 128))
    C = T.match_buffer(c, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do storage_align:

sch = tir.Schedule(before_storage_align)
sch.storage_align(sch.get_block("B"), buffer_index=0, axis=0, factor=128, offset=1)
print(sch.mod["main"].script())

After applying storage_align, the IR becomes:

@T.prim_func
def after_storage_align(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.alloc_buffer((128, 128))
    C = T.match_buffer(c, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            T.block_attr({"buffer_dim_align": [[[0, 128, 1]]]})
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

After lowering passes, buffer B will have strides as [129, 1].

Note#

Storage_align requires the buffer to be an intermediate buffer defined via alloc_buffer.

参数:
返回类型:

None

tensorize(block_or_loop, tensor_intrin, preserve_unit_iters=True)[源代码]#

Tensorize the computation enclosed by loop with the tensor intrinsic.

Parameters#

block_or_loopUnion[BlockRV, LoopRV]

The loop to be tensorized.

tensor_intrinstr

The tensor intrin or the name of the tensor intrin.

preserve_unit_itersbool

Whether or not to preserve unit iterators in block bindings

Examples#

Before tensorize, in TensorIR, the IR is:

@T.prim_func
def before_tensorize(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32"),
    C: T.Buffer((128, 128), "float32"),
) -> None:
    # body
    # with T.block("root")
    for i_0, j_0, k_0, i_1, j_1, k_1 in T.grid(8, 8, 8, 16, 16, 16):
        with T.block("update"):
            vi = T.axis.spatial(128, i_0 * 16 + i_1)
            vj = T.axis.spatial(128, j_0 * 16 + j_1)
            vk = T.axis.reduce(128, k_0 * 16 + k_1)
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(C[vi, vj])
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]

Declare and register the tensor intrinsic:

@T.prim_func
def mma_desc(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (16, 16), align=128, offset_factor=1)
    B = T.match_buffer(b, (16, 16), align=128, offset_factor=1)
    C = T.match_buffer(c, (16, 16), align=128, offset_factor=1)

    with T.block("root"):
        T.reads(C[0 : 16, 0 : 16], A[0 : 16, 0 : 16], B[0 : 16, 0 : 16])
        T.writes(C[0 : 16, 0 : 16])
        for i, j, k in T.grid(16, 16, 16):
            with T.block("update"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vj, vk]


@T.prim_func
def mma_intrin(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (16, 16), align=128, offset_factor=1)
    B = T.match_buffer(b, (16, 16), align=128, offset_factor=1)
    C = T.match_buffer(c, (16, 16), align=128, offset_factor=1)

    with T.block("root"):
        T.reads(C[0 : 16, 0 : 16], A[0 : 16, 0 : 16], B[0 : 16, 0 : 16])
        T.writes(C[0 : 16, 0 : 16])
        T.evaluate(
            T.tvm_mma_sync(
                C.data,
                C.elem_offset // 256,
                A.data,
                A.elem_offset // 256,
                B.data,
                B.elem_offset // 256,
                C.data,
                C.elem_offset // 256,
                dtype="handle",
            )
        )

tir.TensorIntrin.register("test_mma_intrin", mma_desc, mma_intrin)

Create the schedule and do tensorize:

sch = tir.Schedule(before_tensorize)
update = sch.get_block("update")
_, _, _, i1, _, _ = sch.get_loops(update)
sch.tensorize(i1, "test_mma_intrin")
print(sch.mod["main"].script())

After applying tensorize, the IR becomes:

@T.prim_func
def after_tensorize(
    A: T.Buffer((128, 128), "float32"),
    B: T.Buffer((128, 128), "float32"),
    C: T.Buffer((128, 128), "float32"),
) -> None:
    # body
    # with T.block("root")
    for i_0, j_0, k_0 in T.grid(8, 8, 8):
        with T.block("update_o"):
            vio, vjo, vko = T.axis.remap("SSR", [i_0, j_0, k_0])
            T.reads(
                C[vio * 16 : vio * 16 + 16, vjo * 16 : vjo * 16 + 16],
                A[vio * 16 : vio * 16 + 16, vko * 16 : vko * 16 + 16],
                B[vjo * 16 : vjo * 16 + 16, vko * 16 : vko * 16 + 16],
            )
            T.writes(C[vio * 16 : vio * 16 + 16, vjo * 16 : vjo * 16 + 16])
            A_1 = T.match_buffer(
                A[vio * 16 : vio * 16 + 16, vko * 16 : vko * 16 + 16],
                [16, 16],
                dtype="float32",
                offset_factor=1,
            )
            B_1 = T.match_buffer(
                B[vjo * 16 : vjo * 16 + 16, vko * 16 : vko * 16 + 16],
                [16, 16],
                dtype="float32",
                offset_factor=1,
            )
            C_1 = T.match_buffer(
                C[vio * 16 : vio * 16 + 16, vjo * 16 : vjo * 16 + 16],
                [16, 16],
                dtype="float32",
                offset_factor=1,
            )
            T.evaluate(
                T.tvm_mma_sync(
                    C_1.data,
                    C_1.elem_offset // 256,
                    A_1.data,
                    A_1.elem_offset // 256,
                    B_1.data,
                    B_1.elem_offset // 256,
                    C_1.data,
                    C_1.elem_offset // 256,
                    dtype="handle",
                )
            )
参数:
返回类型:

None

transform_block_layout(block, index_map)[源代码]#

Apply a transformation represented by IndexMap to block

Parameters#

blockUnion[BlockRV, str]

The block to be transformed

index_mapUnion[IndexMap, Callable]

The transformation to apply.

Examples#

Before transform_block_layout, in TensorIR, the IR is:

@T.prim_func
def before_transform_block_layout(
    A: T.Buffer((16, 16), "float32"),
    B: T.Buffer((16, 16), "float32")
) -> None:
    for i, j in T.grid(16, 16):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do transform_block_layout:

sch = tir.Schedule(before_transform_block_layout)
sch.transform_block_layout(sch.get_block("B"), lambda i, j: (i * 16 + j,))
print(sch.mod["main"].script())

After applying transform_block_layout, the IR becomes:

@T.prim_func
def after_transform_block_layout(
    A: T.Buffer((16, 16), "float32"),
    B: T.Buffer((16, 16), "float32")
) -> None:
    for i in range(256):
        with T.block("B"):
            vi, = T.axis.remap("S", [i])
            B[vi // 16, vi % 16] = A[vi // 16, vi % 16] * 2.0
参数:
返回类型:

None

transform_layout(block, buffer, index_map, pad_value=None, *, assume_injective_transform=False)[源代码]#

Apply a transformation represented by IndexMap to buffer

Parameters#

block : Union[BlockRV, str]

The block that accesses the target buffer. If a string, this must uniquely identify a block.

buffer: Union[Tuple[str,int], Buffer, str]

The buffer to be transformed, or a specification of how to identify the buffer to be transformed.

If buffer is a tuple of (str, int), the first item should be either "read" or "write", and the second item is an index into the block's read or write regions.

If buffer is a string, it is the name of the buffer, which must exist within the reads/writes of the block. In addition, the reads/writes of the block may not contain more than one buffer with this name.

If buffer is a Buffer object, it must exist within the reads/writes of the block.

index_map : Union[IndexMap, Callable]

The transformation to apply.

If index_map is a callable, and the returned list contains IndexMap.AXIS_SEPARATOR, the SetAxisSeparators primitive will be called in addition to the TransformLayout primitive.

pad_value: Optional[Union[int, float, PrimExpr, IndexMap, Callable]]

The value to be used for any padding introduced by the transformation. If the schedule contains a producer block for the specified buffer, the pad value will be written as part of the producer block if possible, or after the producer block otherwise. Otherwise, if the buffer is an input, will insert an annotation block to state that the padding contains the known value.

The pad value may not contain instances of BufferLoad, except where it loads a value from the buffer being transformed (e.g. to create a circular buffer with padding that consists of repeated elements).

Note: If applied to an input buffer, the calling scope is responsible for ensuring that the pad_value is present. Algebraic simplifications, branch elimination, and other optimizations may assume that this precondition is met, and may result in incorrect results being returned.

If None, the transformation may not introduce padding.

If an int, float or PrimExpr, the transformation is the specific value to be present in the padding.

If an IndexMap or Callable, the transformation is the value to be present in the padding in terms of the transformed index.

assume_injective_transform : bool

If set to true, the schedule primitive will assume the index_map is injective and skip checking overlapping of the mapped indices. This can be useful for complicated index_map that the analysis does not cover. It is the callers' responsibility to ensure the index map is injective, otherwise, the correctness of the schedule is not guaranteed.

Examples#

Before transform_layout, in TensorIR, the IR is:

@T.prim_func
def before_transform_layout(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.alloc_buffer((128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do transform_layout:

sch = tir.Schedule(before_transform_layout)
sch.transform_layout(sch.get_block("B"), buffer=("write",0),
                     index_map=lambda m, n: (m // 16, n // 16, m % 16, n % 16))
print(sch.mod["main"].script())

After applying transform_layout, the IR becomes:

@T.prim_func
def two_elementwise_transformed_intermediate_buffer(a: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.alloc_buffer((8, 8, 16, 16), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi // 16, vj // 16, vi % 16, vj % 16] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j])
            C[vi, vj] = B[vi // 16, vj // 16, vi % 16, vj % 16] + 1.0
参数:
返回类型:

None

unannotate(block_or_loop, ann_key)[源代码]#

Unannotate a block/loop's annotation with key ann_key

Parameters#

block_or_loop: Union[BlockRV, LoopRV]

The block/loop to be unannotated

ann_keystr

The annotation key

Examples#

Before unannotate, in TensorIR, the IR is:

@T.prim_func
def before_unannotate(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            T.block_attr({"ann_key", "ann_value"})
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do unannotate:

sch = tir.Schedule(before_unannotate)
sch.unannotate(sch.get_block("B"), "ann_key")
print(sch.mod["main"].script())

After applying unannotate, the IR becomes:

@T.prim_func
def after_unannotate(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
参数:
返回类型:

None

unroll(loop)[源代码]#

Unroll the input loop. It requires nothing

Parameters#

loopLoopRV

The loop to be unrolled

Examples#

Before unroll, in TensorIR, the IR is:

@T.prim_func
def before_unroll(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do unroll:

sch = tir.Schedule(before_unroll)
i, j = sch.get_loops(sch.get_block("B"))
sch.unroll(i)

After applying unroll, the IR becomes:

@T.prim_func
def after_unroll(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i in T.unroll(0, 128):
        for j in T.serial(0, 128):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0
参数:

loop (LoopRV)

返回类型:

None

unsafe_hide_buffer_access(block, buf_type, buf_index_array)[源代码]#

Hide some buffer access in a given block. This is an unsafe schedule primitive.

Parameters#

blockBlockRV

The block whose buffer accesses are to be hidden.

buf_typestr

The buffer type: "read"/"write".

buf_index_arrayList[int]

The array of buffer indices we hide access.

Note#

This schedule primitive is unsafe and may fail dependency analysis. One use case of unsafe_hide_buffer_access is to hide the accesses to indices buffers (e.g. in sparse computation) so that we can further tensorize the block (the indices buffers appearing in read/write regions may fail the pattern matching in the tensorize primitive, and hiding the accesses to these buffers can address the issue).
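A minimal, hypothetical sketch (the module, block name, and buffer indices below are made up; in practice they come from the sparse workload being scheduled):

sch = tir.Schedule(sparse_mod)            # hypothetical IRModule containing a sparse block
block = sch.get_block("sparse_compute")   # hypothetical block name
sch.unsafe_hide_buffer_access(block, "read", [1, 2])  # hide the 2nd and 3rd read buffers
# the hidden indices buffers no longer appear in the block's read regions,
# so a matching tensor intrinsic can now be pattern-matched by tensorize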

参数:
返回类型:

None

unsafe_set_dtype(block, buffer_index, dtype)[源代码]#

Set the data type of a buffer, where the buffer is specified by a block and a write index.

This schedule primitive is unsafe and may change the correctness of the program because of type conversion; please use it with caution.

Parameters#

blockUnion[BlockRV, str]

The producer block of the buffer

buffer_indexint

The index of the buffer in block's write region

dtypestr

The data type to be set

Examples#

Before unsafe_set_dtype, in TensorIR, the IR is:

@T.prim_func
def before_set_dtype(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), dtype="float32")

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j]
            C[vi, vj] = B[vi, vj] + 1.0

Create the schedule and do unsafe_set_dtype:

sch = tir.Schedule(before_set_dtype)
sch.unsafe_set_dtype("B", buffer_index=0, dtype="float16")
print(sch.mod["main"].script())

After applying set_dtype, the IR becomes:

@T.prim_func
def after_set_dtype(
    A: T.Buffer((128, 128), "float32"), C: T.Buffer((128, 128), "float32")
) -> None:
    B = T.alloc_buffer((128, 128), dtype="float16")

    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = T.cast(A[vi, vj] * 2.0, "float16")
    for i, j in T.grid(128, 128):
        with T.block("C"):
            vi, vj = T.axis.remap("SS", [i, j]
            C[vi, vj] = T.cast(B[vi, vj], "float32") + 1.0

Note#

unsafe_set_dtype requires the buffer to be an intermediate buffer defined via alloc_buffer.

参数:
返回类型:

None

vectorize(loop)[源代码]#

Vectorize the input loop. It requires: 1) The scope block that the loop is in should have stage-pipeline property 2) All the blocks under the loop are complete blocks or reduction blocks, and have affine bindings 3) For each block under the loop, the loop can only be contained in data-parallel block iters' bindings

Parameters#

loopLoopRV

The loop to be vectorized

Examples#

Before vectorize, in TensorIR, the IR is:

@T.prim_func
def before_vectorize(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i, j in T.grid(128, 128):
        with T.block("B"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj] * 2.0

Create the schedule and do vectorize:

sch = tir.Schedule(before_vectorize)
i, j = sch.get_loops(sch.get_block("B"))
sch.vectorize(j)

After applying vectorize, the IR becomes:

@T.prim_func
def after_vectorize(a: T.handle, b: T.handle) -> None:
    A = T.match_buffer(a, (128, 128))
    B = T.match_buffer(b, (128, 128))
    for i in T.serial(0, 128):
        for j in T.vectorized(0, 128):
            with T.block("B"):
                vi, vj = T.axis.remap("SS", [i, j])
                B[vi, vj] = A[vi, vj] * 2.0
参数:

loop (LoopRV)

返回类型:

None

work_on(func_name)[源代码]#

Instruct the schedule to work on a function in the IRModule.

By default, the schedule works on the function with the name "main", or the only function in the IRModule if there is only one. If there are multiple functions in the IRModule and none of them is named "main", users will have to call this method to explicitly specify which function to work on.

This sugar function will guide the GetBlock method if its func_name is not specified.

Parameters#

func_namestr

The name of the function to work on.
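A minimal sketch (the module and the function name "add" are hypothetical):

sch = tir.Schedule(mod)      # an IRModule containing several PrimFuncs, none named "main"
sch.work_on("add")           # hypothetical function name
block = sch.get_block("B")   # resolved inside "add" without passing func_name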

参数:

func_name (str)

返回类型:

None

property func_working_on: GlobalVar | None#

Returns the GlobalVar of the func that the schedule is currently working on

property mod: IRModule#

Returns the AST of the module being scheduled

property state: ScheduleState#

Returns the ScheduleState in the current schedule class

property trace: Trace | None#

Returns the internally maintained trace of scheduling program execution

class tvm.tir.schedule.ScheduleDebugMask(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[源代码]#

The bitmask of the debug_mask flag in the ScheduleState class.

If the debug_mask flag has a certain bit on, then the corresponding verification pass will be conducted. For example, if (debug_mask & VERIFY_SREF_TREE) != 0, then the correctness of the sref tree will be verified after each schedule instruction.

Attributes#

VERIFY_SREF_TREEint = 1

Verify the correctness of the sref tree

VERIFY_CACHED_FLAGSint = 2

Verify the correctness of affine_binding, region_cover and stage_pipeline
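A small sketch of combining the bitmask values (mod stands for any IRModule to be scheduled; passing the combined mask is equivalent to debug_mask="all"):

from tvm.tir.schedule import ScheduleDebugMask
mask = ScheduleDebugMask.VERIFY_SREF_TREE | ScheduleDebugMask.VERIFY_CACHED_FLAGS
sch = tir.Schedule(mod, debug_mask=mask)  # extra verification runs after every primitive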

class tvm.tir.schedule.ScheduleState(mod, *, debug_mask='none', enable_check=True)[源代码]#

The state of scheduling, which exposes a Replace method as the primary resort for all the scheduling primitives to manipulate the TensorIR.

The data structure contains the following information:

  1. The AST being scheduled (mod)

  2. The sref tree of schedulable statements (indicated by the srefs)

  3. The dependency information of each block scope (block_info)

  4. A reverse mapping from the AST nodes to the corresponding nodes in the sref tree (get_sref)

  5. A debug flag; if set, extra checking is enabled (debug_mask)

  6. An enable-check flag; if False, some prerequisite checks are disabled (enable_check)

Parameters#

modIRModule

The AST of the module being scheduled

debug_maskint

Do extra correctness checking after the object construction and each time after calling the Replace method.

enable_checkbool

Indicates whether we enable prerequisite checks for some schedule primitives or not, defaults to True.

__init__(mod, *, debug_mask='none', enable_check=True)[源代码]#

Construct a schedule state from an IRModule or a PrimFunc

Parameters#

modUnion[PrimFunc, IRModule]

The IRModule or PrimFunc to be scheduled

debug_maskUnion[str, int]

Do extra correctness checking after the class creation and each time after calling the Replace method. Possible choices of debug_mask:

  • "all" - Turn on all the checks

  • "none" - Turn off all the checks

  • An integer - Turn on checks according to the bitmasks provided in ScheduleDebugMask
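A minimal construction sketch (mod stands for the IRModule or PrimFunc to be scheduled):

from tvm.tir.schedule import ScheduleState
state = ScheduleState(mod, debug_mask="all")  # run every verification pass after each Replace
# state.mod holds the AST being scheduled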

参数:
返回类型:

None

_get_cached_flags(block_sref)[源代码]#

Get the cached flags of the corresponding block

Parameters#

block_srefStmtSRef

The block sref to be retrieved

Returns#

flagsCachedFlags

Three flags: affine_binding, region_cover, stage_pipeline

Note#

It is an API intended for internal testing use.

参数:

block_sref (StmtSRef)

返回类型:

CachedFlags

get_block_scope(block_sref)[源代码]#

Get the BlockScope corresponding to the block sref

Parameters#

block_srefStmtSRef

The block sref to be retrieved

Returns#

scope: BlockScope

The corresponding block scope

参数:

block_sref (StmtSRef)

返回类型:

BlockScope

get_sref(stmt)[源代码]#

Return the corresponding sref that points to the stmt

Parameters#

stmtUnion[Block, For]

The schedulable statement in the TensorIR to be retrieved for its sref

Returns#

srefStmtSRef

The corresponding sref

参数:

stmt (Block | For)

返回类型:

StmtSRef | None

replace(src_sref, tgt_stmt, block_sref_reuse=None)[源代码]#

Replace the part of the AST, as being pointed to by src_sref, with a specific statement tgt_stmt, and maintain the sref tree accordingly. Replace will try to perform copy on write as much as possible when the ScheduleState holds the only copy to the IRModule and IR nodes.

Only 3 types of replacements are allowed: from src_sref->stmt to tgt_stmt. 1) Block -> Block 2) Loop -> Loop 3) Loop -> BlockRealize

Parameters#

src_srefStmtSRef

The sref to the statement to be replaced in the TensorIR AST

tgt_stmtUnion[Block, For, BlockRealize]

The statement to be replaced to

block_sref_reuseOptional[Dict[Block, Block]] = None

Maps an old block (to be replaced in the subtree under src_sref->stmt) to a new block (replaced to, in the subtree under tgt_stmt), and enforces reuse of srefs between them (rather than create new srefs) i.e. after being replaced, the sref that points to the old block will point to the new one

Note#

The reuse of loop srefs is detected automatically according to the reuse of loop vars.

参数:
返回类型:

None

参数:
  • mod (IRModule)

  • debug_mask (int)

  • enable_check (bool)

class tvm.tir.schedule.StmtSRef[源代码]#

An object that refers to schedulable elements in the TensorIR, aka "sref".

Glossary:

  • Block sref: An StmtSRef that points to a TensorIR block.

  • Loop sref: An StmtSRef that points to a TensorIR for loop.

  • Parent sref: The parent sref of an sref is the block/loop sref that points to its closest schedulable ancestor statement on the TensorIR AST.

  • Root sref: The sref that points to the root block. Every sref has exactly one parent sref except for the root sref.

  • Sref tree: The parent-child relationship of srefs forms a tree, uniquely determined by the TensorIR AST.

static inline_mark()[源代码]#

A special StmtSRef, which doesn't point to any stmt in the AST, only serving as a "mark" to hint compute-at to do the work of compute-inline

返回类型:

StmtSRef

static root_mark()[源代码]#

A special StmtSRef, which doesn't point to any stmt in the AST, only serving as a "mark" to hint compute-at to do nothing

返回类型:

StmtSRef

property parent: StmtSRef | None#

The parent sref

property stmt: Block | For | None#

The block/for stmt the object refers to

class tvm.tir.schedule.Trace(insts, decisions)[源代码]#

An execution trace of a scheduling program.

A trace has two parts: 1) The instructions invoked so far 2) The random decisions made upon those instructions, if any

A trace can be serialized to: 1) Roundtrippable JSON format: can be saved to file and loaded back 2) Python syntax: allows users to copy-paste the trace to reproduce the scheduling process

A trace can be applied to a TensorIR schedule by re-applying all its instructions, together with their decisions if any. Re-sampling is invoked if a sampling instruction doesn't have a corresponding decision; otherwise, the existing decision is reused.
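A small round-trip sketch (assuming sch is a Schedule with a recorded trace and mod is the original IRModule it was created from):

from tvm.tir.schedule import Trace
json_obj = sch.trace.as_json()                    # serialize the recorded trace
new_sch = tir.Schedule(mod)
Trace.apply_json_to_schedule(json_obj, new_sch)   # replay the instructions and decisions
# new_sch now reproduces the same transformations as sch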

Attributes#

instsList[Instruction]

The instructions invoked so far in the program execution

decisionsDict[Instruction, DECISION_TYPE]

The random decisions made upon those instructions

__init__(insts, decisions)[源代码]#

Constructor

Parameters#

instsList[Instruction]

The instructions invoked so far in the program execution

decisionsDict[Instruction, DECISION_TYPE]

The random decisions made upon those instructions

参数:
返回类型:

None

append(inst, decision=None)[源代码]#

Append a new instruction to the trace

Parameters#

instInstruction

The new instruction to be appended

decisionOptional[DECISION_TYPE] = None

The random decision made on this instruction

参数:
返回类型:

None

static apply_json_to_schedule(json_obj, sch)[源代码]#

Apply a JSON-serialized trace to a TensorIR schedule

Parameters#

json_objJSON_TYPE

The JSON-serialized trace

schSchedule

The TensorIR schedule

参数:
返回类型:

None

apply_to_schedule(sch, remove_postproc, decision_provider=None)[源代码]#

Apply the trace to a TensorIR schedule

Parameters#

schSchedule

The schedule to be applied onto

remove_postprocbool

If postprocessing instructions are removed

decision_provider: Optional[Callable] = None

A callback that allows users to mutate decisions on the fly when applying instructions. The signature of the callback is:

  • The 1st argument: the instruction

  • The 2nd argument: the input random variables

  • The 3rd argument: the attributes

  • The 4th argument: the decision

  • Return: a new decision

参数:
返回类型:

None

as_json(remove_postproc=False)[源代码]#

Serialize the trace as a JSON-style object

Parameters#

remove_postprocbool = False

If postprocessing instructions are removed

Returns#

json: JSON_TYPE

The JSON-style object

参数:

remove_postproc (bool)

返回类型:

Any

as_python(remove_postproc=False)[源代码]#

Serialize the trace as a sequence of python statements

Parameters#

remove_postprocbool = False

If postprocessing instructions are removed

Returns#

py_stmts: List[str]

A sequence of python statements

参数:

remove_postproc (bool)

返回类型:

List[str]

get_decision(inst)[源代码]#

Retrieve the decision made on a specific instruction

Parameters#

instInstruction

The instruction whose decision is to be retrieved

Returns#

decisionOptional[DECISION_TYPE]

The corresponding decision; None if there is no decision made on the instruction

参数:

inst (Instruction)

返回类型:

Any | None

pop()[源代码]#

Remove the last instruction, along with the decision made on that instruction, if any

Returns#

popped_instInstruction

Returns the instruction removed; None if the trace is empty

返回类型:

Instruction | None

show(style=None, black_format=False)[源代码]#

A sugar for printing the highlighted TVM script.

Parameters#

style : str, optional

Pygmentize printing style, auto-detected if None. See tvm.script.highlight.cprint for more details.

black_format: bool

If true, use the formatter Black to format the TVMScript. If None, determine based on the "TVM_BLACK_FORMAT" environment variable.

参数:
  • style (str | None)

  • black_format (bool)

返回类型:

None

simplified(remove_postproc)[源代码]#

Simplify the trace with dead-code elimination

Parameters#

remove_postprocbool

If postprocessing instructions are removed

Returns#

trace: Trace

A simplified trace

参数:

remove_postproc (bool)

返回类型:

Trace

with_decision(inst, decision, remove_postproc)[源代码]#

Create a new trace with an instruction whose decision is changed, assuming this instruction exists in the resulting trace

Parameters#

instInstruction

The instruction whose decision is to be changed

decisionDECISION_TYPE

The decision to be changed to

remove_postprocbool

If postprocessing instructions are removed

Returns#

trace: Trace

The new trace with the decision changed

参数:
返回类型:

Trace

参数: