tilelang.cuda.intrinsics.macro.wgmma_sp_macro_generator¶

Classes¶

WGSparseTensorCoreIntrinEmitter

To eliminate Python syntax within TIR Macro.

Module Contents¶

class tilelang.cuda.intrinsics.macro.wgmma_sp_macro_generator.WGSparseTensorCoreIntrinEmitter(a_dtype=T.float16, e_dtype=T.uint8, b_dtype=T.float16, accum_dtype=T.float16, a_transposed=False, b_transposed=False, e_transposed=False, block_row_warps=2, block_col_warps=2, warp_row_tiles=8, warp_col_tiles=8, warp_k=16, reduce_k=1, num_elems_per_byte=1, is_m_first=False, thread_var=None)¶

Bases: tilelang.cuda.intrinsics.macro.mma_sp_macro_generator.SparseTensorCoreIntrinEmitter

To eliminate Python syntax within TIR Macro.

Parameters:

a_dtype (str)
e_dtype (str)
b_dtype (str)
accum_dtype (str)
a_transposed (bool)
b_transposed (bool)
e_transposed (bool)
block_row_warps (int)
block_col_warps (int)
warp_row_tiles (int)
warp_col_tiles (int)
warp_k (int)
reduce_k (int)
num_elems_per_byte (int)
is_m_first (bool | None)
thread_var (tvm.tirx.Var | None)

wgmma_prefix: str¶

wgmma_inst_m: int¶

wgmma_inst_n: int¶

a_shared_layout: tilelang.layout.Layout = None¶

b_shared_layout: tilelang.layout.Layout = None¶

compute_wgmma_a_desc_params(A_region)¶

Compute A descriptor parameters from the A shared buffer region (SS variant).

A is sparse: its physical (compressed) K extent is micro_size_k // SPARSE_FACTOR, passed as micro_size_k for the K-atom accounting. Pure-Python helper (no TIR emitted); the returned WGMMADescriptorParams is consumed by wgmma_ss().

Parameters:: A_region (tvm.tirx.BufferRegion)
Return type:: tilelang.cuda.intrinsics.macro.wgmma_macro_generator.WGMMADescriptorParams

compute_wgmma_b_desc_params(B_region)¶

Compute B descriptor parameters from the B shared buffer region.

Pure-Python helper (no TIR emitted); the returned WGMMADescriptorParams is consumed by wgmma_ss() and wgmma_rs().

Parameters:: B_region (tvm.tirx.BufferRegion)
Return type:: tilelang.cuda.intrinsics.macro.wgmma_macro_generator.WGMMADescriptorParams

wgmma_ss(A_region, E_region, B_region, C_region, clear_accum=False, wg_wait=0)¶

Parameters:

A_region (tvm.tirx.BufferRegion)
E_region (tvm.tirx.BufferRegion)
B_region (tvm.tirx.BufferRegion)
C_region (tvm.tirx.BufferRegion)
clear_accum (tvm.tirx.PrimExpr)
wg_wait (int)

wgmma_rs(A_region, E_region, B_region, C_region, clear_accum=False, wg_wait=0)¶

Parameters:

A_region (tvm.tirx.BufferRegion)
E_region (tvm.tirx.BufferRegion)
B_region (tvm.tirx.BufferRegion)
C_region (tvm.tirx.BufferRegion)
clear_accum (tvm.tirx.PrimExpr)
wg_wait (int)

ldmatrix_e(E_local_buf, E_shared_buf, inst_i, warp_m, ki, ki_slot)¶

Parameters:

E_local_buf (tvm.tirx.Buffer)
E_shared_buf (tvm.tirx.Buffer)
inst_i (tvm.tirx.PrimExpr)
warp_m (tvm.tirx.PrimExpr)
ki (tvm.tirx.PrimExpr)
ki_slot (tvm.tirx.PrimExpr)

make_mma_load_layout(local_buf, matrix='A')¶

Create a layout function for storing MMA results into a fragment buffer. This layout is used in conjunction with inverse_mma_store_layout to map fragment indices to threads and local indices.

Parameters:

local_buf (tirx.Buffer) – The local buffer representing a fragment of a matrix.
matrix (str)

Returns:

A fragment object that describes how threads and indices in local_buf are laid out.

Return type:

T.Fragment

Raises:

AssertionError – If local_buf is not detected to be a fragment buffer.

make_mma_store_layout(local_buf)¶

Create a layout function for storing MMA results into a fragment buffer. This layout is used in conjunction with inverse_mma_store_layout to map fragment indices to threads and local indices.

Parameters:: local_buf (tirx.Buffer) – The local buffer representing a fragment of a matrix.
Returns:: A fragment object that describes how threads and indices in local_buf are laid out.
Return type:: T.Fragment
Raises:: AssertionError – If local_buf is not detected to be a fragment buffer.