tilelang.cuda.intrinsics.macro.tcgen05_macro_generator¶

Attributes¶

Classes¶

TCGEN05DescriptorParams

Pre-computed parameters for TCGEN05 descriptor initialization and atom offset computation.

SwizzleMode

Enum where members are also (and must be) ints

TensorCoreIntrinEmitter

Intrinsic emitter for Blackwell (SM100) TCGEN5MMA instructions.

Module Contents¶

tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.lift¶
class tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.TCGEN05DescriptorParams¶

Pre-computed parameters for TCGEN05 descriptor initialization and atom offset computation.

Returned by compute_tcgen05_*_desc_params() and consumed by init_tcgen05_*_desc() and tcgen05_*_atom() methods.

swizzle_mode: int¶

SwizzleMode enum value (passed directly to T.initialize_tcgen05_descriptor).

leading_byte_offset: int¶

LBO >> 4, ready to pass to T.initialize_tcgen05_descriptor.

stride_byte_offset: int¶

SBO >> 4, ready to pass to T.initialize_tcgen05_descriptor.

swizzle_atom_elems: int¶

Number of elements per swizzle atom along the non-K dimension.

k_atom_size: int¶

max(swizzle_atom_elems // micro_size_k, 1).

elem_bits: int¶

Bit width of a single logical element.

is_k_major: bool¶

Whether the matrix is stored in K-major order (affects offset formula branching).

class tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.SwizzleMode¶

Bases: enum.IntEnum

Enum where members are also (and must be) ints

NONE = 0¶
SWIZZLE_128B = 2¶
SWIZZLE_64B = 4¶
SWIZZLE_32B = 6¶
is_none()¶
Return type:

bool

is_swizzle_32b()¶
Return type:

bool

is_swizzle_64b()¶
Return type:

bool

is_swizzle_128b()¶
Return type:

bool

swizzle_byte_size()¶
Return type:

int

swizzle_atom_size()¶
Return type:

int

class tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.TensorCoreIntrinEmitter(a_dtype=T.float16, b_dtype=T.float16, accum_dtype=T.float16, a_transposed=False, b_transposed=False, block_row_warps=2, block_col_warps=2, warp_row_tiles=8, warp_col_tiles=8, chunk=16, reduce_k=1, num_elems_per_byte=1, is_m_first=False, thread_var=None)¶

Bases: tilelang.cuda.intrinsics.macro.mma_macro_generator.TensorCoreIntrinEmitter

Intrinsic emitter for Blackwell (SM100) TCGEN5MMA instructions.

Generates TIR macros that lower to tcgen05.mma PTX instructions for both the SS (Shared-Shared) and TS (TensorMemory-Shared) GEMM variants. Also provides layout helpers for tensor-memory (TMEM) buffers.

Parameters:
  • a_dtype (str)

  • b_dtype (str)

  • accum_dtype (str)

  • a_transposed (bool)

  • b_transposed (bool)

  • block_row_warps (int)

  • block_col_warps (int)

  • warp_row_tiles (int)

  • warp_col_tiles (int)

  • chunk (int)

  • reduce_k (int)

  • num_elems_per_byte (int)

  • is_m_first (bool)

  • thread_var (tvm.tirx.Var | None)

tcgen05_prefix: str¶
a_shared_layout: tilelang.layout.Layout = None¶
b_shared_layout: tilelang.layout.Layout = None¶
tcgen05mma(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶

Emit a TCGEN5MMA operation, dispatching to SS or TS variant based on A’s memory scope.

If A_buf resides in tensor memory (shared.tmem), the TS variant is emitted; otherwise the SS variant is used (both A and B from shared memory).

Parameters:
  • A_buf (Buffer) – Operand A — either in shared memory (SS) or tensor memory (TS).

  • B_buf (Buffer) – Operand B in shared memory.

  • C_local_buf (Buffer) – Accumulator buffer in tensor memory.

  • mbar (PrimExpr) – Memory barrier used for MMA completion signalling.

  • clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.

tcgen05mma_ss(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶

Emit the SS (Shared-Shared) variant of TCGEN5MMA.

Reads operand A and B from shared memory via a descriptor.

Parameters:
  • A_buf (Buffer) – Operand A in shared memory.

  • B_buf (Buffer) – Operand B in shared memory.

  • C_local_buf (Buffer) – Accumulator buffer in tensor memory.

  • mbar (PrimExpr) – Memory barrier for MMA completion signalling.

  • clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.

tcgen05mma_ts(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶

Emit the TS (TensorMemory-Shared) variant of TCGEN5MMA.

Reads operand A directly from tensor memory (TMEM) and operand B from shared memory via a descriptor. The TMEM column offset for A is computed assuming packed storage (e.g. two bfloat16 values per uint32 column) to match the output of tcgen05.st.

Parameters:
  • A_buf (Buffer) – Operand A residing in tensor memory (shared.tmem).

  • B_buf (Buffer) – Operand B in shared memory.

  • C_local_buf (Buffer) – Accumulator buffer in tensor memory.

  • mbar (PrimExpr) – Memory barrier for MMA completion signalling.

  • clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.

tcgen05mma_blockscaled(A_buf, B_buf, C_local_buf, SFA_tmem, SFB_tmem, mbar, sf_k_start, sf_a_granularity_k, sf_b_granularity_k, clear_accum=False)¶

Emit a block-scaled TCGEN5MMA (SS variant with TMEM scale factors).

Uses tcgen05.mma.cta_group::1|2.kind::mxf8f6f4.block_scale PTX instruction. Scale factors must already reside in tensor memory.

Parameters:
  • A_buf (tvm.tirx.Buffer)

  • B_buf (tvm.tirx.Buffer)

  • C_local_buf (tvm.tirx.Buffer)

  • sf_k_start (tvm.tirx.PrimExpr)

  • sf_a_granularity_k (int)

  • sf_b_granularity_k (int)

  • clear_accum (tvm.tirx.PrimExpr)

get_tcgen5_blockscaled_instr_desc(atom_m, atom_n, a_is_k_major, b_is_k_major, scale_in_a, scale_in_b, a_sf_id, b_sf_id)¶

Build the block-scaled instruction descriptor via FFI.

Parameters:
  • atom_m (int)

  • atom_n (int)

  • a_is_k_major (bool)

  • b_is_k_major (bool)

  • scale_in_a (int)

  • scale_in_b (int)

  • a_sf_id (int)

  • b_sf_id (int)

Return type:

tvm.tirx.PrimExpr

abstract make_mma_load_layout(local_buf, matrix='A')¶

Create a layout function for storing MMA results into a fragment buffer. This layout is used in conjunction with inverse_mma_store_layout to map fragment indices to threads and local indices.

Parameters:
  • local_buf (tirx.Buffer) – The local buffer representing a fragment of a matrix.

  • matrix (str)

Returns:

A fragment object that describes how threads and indices in local_buf are laid out.

Return type:

T.Fragment

Raises:

AssertionError – If local_buf is not detected to be a fragment buffer.

make_mma_store_layout(tmem_buf)¶

Create the TCGEN5 tensor-memory layout used to store MMA accumulators.

Parameters:

tmem_buf (tir.Buffer) – The local buffer representing tensormemory of a mma’s output

Returns:

Layout object describing how logical (i, j) coordinates map to the swizzled tensor-memory offsets required by TCGEN5MMA.

Return type:

Layout

Raises:

AssertionError – If tmem_buf is not detected to be a tensor-memory buffer.

get_tcgen5_mma_meta(m, n, k, disable_2cta, disable_ws=False)¶

Query the FFI for TCGEN5MMA atom metadata (atom_m, atom_n, atom_k, enable_ws, enable_2cta), and record them in self.meta.

Parameters:
  • m (int)

  • n (int)

  • k (int)

  • disable_2cta (bool)

  • disable_ws (bool)

get_tcgen5_instr_desc(atom_m, atom_n, atom_k, a_is_k_major, b_is_k_major, scale_in_a, scale_in_b)¶

Build the 64-bit instruction descriptor for a tcgen05.mma PTX call.

Parameters:
  • atom_m (int)

  • atom_n (int)

  • atom_k (int)

  • a_is_k_major (bool)

  • b_is_k_major (bool)

  • scale_in_a (int)

  • scale_in_b (int)

Return type:

tvm.tirx.PrimExpr

property tcgen05_meta_unpacked: tuple¶

Return (atom_m, atom_n, atom_k, enable_ws, enable_2cta) as ints.

Requires self.meta to have been set via get_tcgen5_mma_meta().

Return type:

tuple

property tcgen05_num_inst_m: int¶

Number of TCGEN05MMA instruction atoms along M (SS variant).

Return type:

int

property tcgen05_num_inst_n: int¶

Number of TCGEN05MMA instruction atoms along N.

Return type:

int

property tcgen05_num_k_atoms: int¶

Number of K-dimension micro-steps (chunk // micro_size_k).

Return type:

int

compute_tcgen05_b_desc_params(B_buf)¶

Compute B descriptor parameters from the B shared buffer.

This is a pure-Python helper – no TIR code is emitted. The returned TCGEN05DescriptorParams is passed to init_tcgen05_b_desc() and tcgen05_*_atom() methods.

Parameters:

B_buf (Buffer or BufferRegion) – The B operand in shared memory.

Return type:

TCGEN05DescriptorParams

compute_tcgen05_a_desc_params(A_buf)¶

Compute A descriptor parameters from the A shared buffer (SS variant).

This is a pure-Python helper – no TIR code is emitted.

Parameters:

A_buf (Buffer or BufferRegion) – The A operand in shared memory.

Return type:

TCGEN05DescriptorParams

init_tcgen05_b_desc(desc_b, B_buf, b_params)¶

Emit TIR to initialize a pre-allocated TCGEN05 B descriptor.

Parameters:
  • desc_b (Buffer) – A descriptor buffer allocated via T.alloc_tcgen05_smem_desc().

  • B_buf (Buffer or BufferRegion) – The B operand in shared memory.

  • b_params (TCGEN05DescriptorParams) – Pre-computed parameters from compute_tcgen05_b_desc_params().

init_tcgen05_a_desc(desc_a, A_buf, a_params)¶

Emit TIR to initialize a pre-allocated TCGEN05 A descriptor (SS variant).

Parameters:
  • desc_a (Buffer) – A descriptor buffer allocated via T.alloc_tcgen05_smem_desc().

  • A_buf (Buffer or BufferRegion) – The A operand in shared memory.

  • a_params (TCGEN05DescriptorParams) – Pre-computed parameters from compute_tcgen05_a_desc_params().

compute_tcgen05_instr_desc()¶

Compute the 64-bit instruction descriptor using current meta.

Requires self.meta to have been set via get_tcgen5_mma_meta().

Return type:

tvm.tirx.PrimExpr

tcgen05_atom_arrive(mbar)¶

Emit tcgen05_mma_arrive(mbar).

tcgen05_ss_atom(desc_a, desc_b, C_local_buf, inst_m_idx, inst_n_idx, ki, a_params, b_params, instr_desc, clear_accum=False)¶

Emit a single TCGEN05MMA SS instruction for atom (inst_m_idx, inst_n_idx, ki).

Must be called after descriptor initialization and before tcgen05_atom_arrive().

Parameters:
  • desc_a (Buffer) – Initialized A and B descriptors.

  • desc_b (Buffer) – Initialized A and B descriptors.

  • C_local_buf (Buffer) – Accumulator buffer in tensor memory.

  • inst_m_idx (int) – M-dimension atom index (0 .. tcgen05_num_inst_m - 1).

  • inst_n_idx (int) – N-dimension atom index (0 .. tcgen05_num_inst_n - 1).

  • ki (int) – K-dimension atom index (0 .. tcgen05_num_k_atoms - 1).

  • a_params (TCGEN05DescriptorParams) – Pre-computed A descriptor parameters.

  • b_params (TCGEN05DescriptorParams) – Pre-computed B descriptor parameters.

  • instr_desc (PrimExpr) – Instruction descriptor from compute_tcgen05_instr_desc().

  • clear_accum (PrimExpr) – Whether to zero the accumulator on the first K atom.

tcgen05_ts_atom(a_tmem_data, desc_b, C_local_buf, inst_m_idx, inst_n_idx, ki, b_params, instr_desc, clear_accum=False)¶

Emit a single TCGEN05MMA TS instruction for atom (inst_m_idx, inst_n_idx, ki).

A resides in tensor memory; B in shared memory.

Parameters:
  • a_tmem_data (Var) – Data pointer for the A operand in tensor memory (e.g., A_buf.data).

  • desc_b (Buffer) – Initialized B descriptor.

  • C_local_buf (Buffer) – Accumulator buffer in tensor memory.

  • inst_m_idx (int) – M-dimension atom index.

  • inst_n_idx (int) – N-dimension atom index.

  • ki (int) – K-dimension atom index.

  • b_params (TCGEN05DescriptorParams) – Pre-computed B descriptor parameters.

  • instr_desc (PrimExpr) – Instruction descriptor from compute_tcgen05_instr_desc().

  • clear_accum (PrimExpr) – Whether to zero the accumulator on the first K atom.

tcgen05_blockscaled_atom(desc_a, desc_b, C_local_buf, sfa_data, sfb_data, inst_m_idx, inst_n_idx, ki, a_params, b_params, instr_desc, clear_accum=False)¶

Emit a single TCGEN05MMA block-scaled SS instruction.

Parameters:
  • desc_a (Buffer) – Initialized A and B descriptors.

  • desc_b (Buffer) – Initialized A and B descriptors.

  • C_local_buf (Buffer) – Accumulator buffer in tensor memory.

  • sfa_data (Var) – Scale factor data pointers in tensor memory.

  • sfb_data (Var) – Scale factor data pointers in tensor memory.

  • inst_m_idx (int) – Atom indices.

  • inst_n_idx (int) – Atom indices.

  • ki (int) – Atom indices.

  • a_params (TCGEN05DescriptorParams) – Pre-computed descriptor parameters.

  • b_params (TCGEN05DescriptorParams) – Pre-computed descriptor parameters.

  • instr_desc (PrimExpr) – Block-scaled instruction descriptor (with SF IDs already encoded).

  • clear_accum (PrimExpr) – Whether to zero the accumulator on the first K atom.