tilelang.cuda.intrinsics.macro.tcgen05_macro_generator¶
Attributes¶
Classes¶
Pre-computed parameters for TCGEN05 descriptor initialization and atom offset computation. |
|
Enum where members are also (and must be) ints |
|
Intrinsic emitter for Blackwell (SM100) TCGEN5MMA instructions. |
Module Contents¶
- tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.lift¶
- class tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.TCGEN05DescriptorParams¶
Pre-computed parameters for TCGEN05 descriptor initialization and atom offset computation.
Returned by
compute_tcgen05_*_desc_params()and consumed byinit_tcgen05_*_desc()andtcgen05_*_atom()methods.- swizzle_mode: int¶
SwizzleMode enum value (passed directly to
T.initialize_tcgen05_descriptor).
- leading_byte_offset: int¶
LBO >> 4, ready to pass to
T.initialize_tcgen05_descriptor.
- stride_byte_offset: int¶
SBO >> 4, ready to pass to
T.initialize_tcgen05_descriptor.
- swizzle_atom_elems: int¶
Number of elements per swizzle atom along the non-K dimension.
- k_atom_size: int¶
max(swizzle_atom_elems // micro_size_k, 1).
- elem_bits: int¶
Bit width of a single logical element.
- class tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.SwizzleMode¶
Bases:
enum.IntEnumEnum where members are also (and must be) ints
- NONE = 0¶
- SWIZZLE_128B = 2¶
- SWIZZLE_64B = 4¶
- SWIZZLE_32B = 6¶
- swizzle_byte_size()¶
- Return type:
int
- swizzle_atom_size()¶
- Return type:
int
- class tilelang.cuda.intrinsics.macro.tcgen05_macro_generator.TensorCoreIntrinEmitter(a_dtype=T.float16, b_dtype=T.float16, accum_dtype=T.float16, a_transposed=False, b_transposed=False, block_row_warps=2, block_col_warps=2, warp_row_tiles=8, warp_col_tiles=8, chunk=16, reduce_k=1, num_elems_per_byte=1, is_m_first=False, thread_var=None)¶
Bases:
tilelang.cuda.intrinsics.macro.mma_macro_generator.TensorCoreIntrinEmitterIntrinsic emitter for Blackwell (SM100) TCGEN5MMA instructions.
Generates TIR macros that lower to
tcgen05.mmaPTX instructions for both the SS (Shared-Shared) and TS (TensorMemory-Shared) GEMM variants. Also provides layout helpers for tensor-memory (TMEM) buffers.- Parameters:
- tcgen05_prefix: str¶
- tcgen05mma(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶
Emit a TCGEN5MMA operation, dispatching to SS or TS variant based on A’s memory scope.
If A_buf resides in tensor memory (
shared.tmem), the TS variant is emitted; otherwise the SS variant is used (both A and B from shared memory).- Parameters:
A_buf (Buffer) – Operand A — either in shared memory (SS) or tensor memory (TS).
B_buf (Buffer) – Operand B in shared memory.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
mbar (PrimExpr) – Memory barrier used for MMA completion signalling.
clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.
- tcgen05mma_ss(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶
Emit the SS (Shared-Shared) variant of TCGEN5MMA.
Reads operand A and B from shared memory via a descriptor.
- Parameters:
A_buf (Buffer) – Operand A in shared memory.
B_buf (Buffer) – Operand B in shared memory.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
mbar (PrimExpr) – Memory barrier for MMA completion signalling.
clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.
- tcgen05mma_ts(A_buf, B_buf, C_local_buf, mbar, clear_accum=False)¶
Emit the TS (TensorMemory-Shared) variant of TCGEN5MMA.
Reads operand A directly from tensor memory (TMEM) and operand B from shared memory via a descriptor. The TMEM column offset for A is computed assuming packed storage (e.g. two
bfloat16values peruint32column) to match the output oftcgen05.st.- Parameters:
A_buf (Buffer) – Operand A residing in tensor memory (
shared.tmem).B_buf (Buffer) – Operand B in shared memory.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
mbar (PrimExpr) – Memory barrier for MMA completion signalling.
clear_accum (PrimExpr) – Whether to zero the accumulator before the first MMA.
- tcgen05mma_blockscaled(A_buf, B_buf, C_local_buf, SFA_tmem, SFB_tmem, mbar, sf_k_start, sf_a_granularity_k, sf_b_granularity_k, clear_accum=False)¶
Emit a block-scaled TCGEN5MMA (SS variant with TMEM scale factors).
Uses
tcgen05.mma.cta_group::1|2.kind::mxf8f6f4.block_scalePTX instruction. Scale factors must already reside in tensor memory.- Parameters:
A_buf (tvm.tirx.Buffer)
B_buf (tvm.tirx.Buffer)
C_local_buf (tvm.tirx.Buffer)
sf_k_start (tvm.tirx.PrimExpr)
sf_a_granularity_k (int)
sf_b_granularity_k (int)
clear_accum (tvm.tirx.PrimExpr)
- get_tcgen5_blockscaled_instr_desc(atom_m, atom_n, a_is_k_major, b_is_k_major, scale_in_a, scale_in_b, a_sf_id, b_sf_id)¶
Build the block-scaled instruction descriptor via FFI.
- abstract make_mma_load_layout(local_buf, matrix='A')¶
Create a layout function for storing MMA results into a fragment buffer. This layout is used in conjunction with inverse_mma_store_layout to map fragment indices to threads and local indices.
- Parameters:
local_buf (tirx.Buffer) – The local buffer representing a fragment of a matrix.
matrix (str)
- Returns:
A fragment object that describes how threads and indices in local_buf are laid out.
- Return type:
T.Fragment
- Raises:
AssertionError – If local_buf is not detected to be a fragment buffer.
- make_mma_store_layout(tmem_buf)¶
Create the TCGEN5 tensor-memory layout used to store MMA accumulators.
- Parameters:
tmem_buf (tir.Buffer) – The local buffer representing tensormemory of a mma’s output
- Returns:
Layout object describing how logical (i, j) coordinates map to the swizzled tensor-memory offsets required by TCGEN5MMA.
- Return type:
- Raises:
AssertionError – If tmem_buf is not detected to be a tensor-memory buffer.
- get_tcgen5_mma_meta(m, n, k, disable_2cta, disable_ws=False)¶
Query the FFI for TCGEN5MMA atom metadata (atom_m, atom_n, atom_k, enable_ws, enable_2cta), and record them in self.meta.
- get_tcgen5_instr_desc(atom_m, atom_n, atom_k, a_is_k_major, b_is_k_major, scale_in_a, scale_in_b)¶
Build the 64-bit instruction descriptor for a
tcgen05.mmaPTX call.
- property tcgen05_meta_unpacked: tuple¶
Return
(atom_m, atom_n, atom_k, enable_ws, enable_2cta)as ints.Requires
self.metato have been set viaget_tcgen5_mma_meta().- Return type:
tuple
- property tcgen05_num_inst_m: int¶
Number of TCGEN05MMA instruction atoms along M (SS variant).
- Return type:
int
- property tcgen05_num_inst_n: int¶
Number of TCGEN05MMA instruction atoms along N.
- Return type:
int
- property tcgen05_num_k_atoms: int¶
Number of K-dimension micro-steps (
chunk // micro_size_k).- Return type:
int
- compute_tcgen05_b_desc_params(B_buf)¶
Compute B descriptor parameters from the B shared buffer.
This is a pure-Python helper – no TIR code is emitted. The returned
TCGEN05DescriptorParamsis passed toinit_tcgen05_b_desc()andtcgen05_*_atom()methods.- Parameters:
B_buf (Buffer or BufferRegion) – The B operand in shared memory.
- Return type:
- compute_tcgen05_a_desc_params(A_buf)¶
Compute A descriptor parameters from the A shared buffer (SS variant).
This is a pure-Python helper – no TIR code is emitted.
- Parameters:
A_buf (Buffer or BufferRegion) – The A operand in shared memory.
- Return type:
- init_tcgen05_b_desc(desc_b, B_buf, b_params)¶
Emit TIR to initialize a pre-allocated TCGEN05 B descriptor.
- Parameters:
desc_b (Buffer) – A descriptor buffer allocated via
T.alloc_tcgen05_smem_desc().B_buf (Buffer or BufferRegion) – The B operand in shared memory.
b_params (TCGEN05DescriptorParams) – Pre-computed parameters from
compute_tcgen05_b_desc_params().
- init_tcgen05_a_desc(desc_a, A_buf, a_params)¶
Emit TIR to initialize a pre-allocated TCGEN05 A descriptor (SS variant).
- Parameters:
desc_a (Buffer) – A descriptor buffer allocated via
T.alloc_tcgen05_smem_desc().A_buf (Buffer or BufferRegion) – The A operand in shared memory.
a_params (TCGEN05DescriptorParams) – Pre-computed parameters from
compute_tcgen05_a_desc_params().
- compute_tcgen05_instr_desc()¶
Compute the 64-bit instruction descriptor using current meta.
Requires
self.metato have been set viaget_tcgen5_mma_meta().- Return type:
tvm.tirx.PrimExpr
- tcgen05_atom_arrive(mbar)¶
Emit
tcgen05_mma_arrive(mbar).
- tcgen05_ss_atom(desc_a, desc_b, C_local_buf, inst_m_idx, inst_n_idx, ki, a_params, b_params, instr_desc, clear_accum=False)¶
Emit a single TCGEN05MMA SS instruction for atom
(inst_m_idx, inst_n_idx, ki).Must be called after descriptor initialization and before
tcgen05_atom_arrive().- Parameters:
desc_a (Buffer) – Initialized A and B descriptors.
desc_b (Buffer) – Initialized A and B descriptors.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
inst_m_idx (int) – M-dimension atom index (0 .. tcgen05_num_inst_m - 1).
inst_n_idx (int) – N-dimension atom index (0 .. tcgen05_num_inst_n - 1).
ki (int) – K-dimension atom index (0 .. tcgen05_num_k_atoms - 1).
a_params (TCGEN05DescriptorParams) – Pre-computed A descriptor parameters.
b_params (TCGEN05DescriptorParams) – Pre-computed B descriptor parameters.
instr_desc (PrimExpr) – Instruction descriptor from
compute_tcgen05_instr_desc().clear_accum (PrimExpr) – Whether to zero the accumulator on the first K atom.
- tcgen05_ts_atom(a_tmem_data, desc_b, C_local_buf, inst_m_idx, inst_n_idx, ki, b_params, instr_desc, clear_accum=False)¶
Emit a single TCGEN05MMA TS instruction for atom
(inst_m_idx, inst_n_idx, ki).A resides in tensor memory; B in shared memory.
- Parameters:
a_tmem_data (Var) – Data pointer for the A operand in tensor memory (e.g.,
A_buf.data).desc_b (Buffer) – Initialized B descriptor.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
inst_m_idx (int) – M-dimension atom index.
inst_n_idx (int) – N-dimension atom index.
ki (int) – K-dimension atom index.
b_params (TCGEN05DescriptorParams) – Pre-computed B descriptor parameters.
instr_desc (PrimExpr) – Instruction descriptor from
compute_tcgen05_instr_desc().clear_accum (PrimExpr) – Whether to zero the accumulator on the first K atom.
- tcgen05_blockscaled_atom(desc_a, desc_b, C_local_buf, sfa_data, sfb_data, inst_m_idx, inst_n_idx, ki, a_params, b_params, instr_desc, clear_accum=False)¶
Emit a single TCGEN05MMA block-scaled SS instruction.
- Parameters:
desc_a (Buffer) – Initialized A and B descriptors.
desc_b (Buffer) – Initialized A and B descriptors.
C_local_buf (Buffer) – Accumulator buffer in tensor memory.
sfa_data (Var) – Scale factor data pointers in tensor memory.
sfb_data (Var) – Scale factor data pointers in tensor memory.
inst_m_idx (int) – Atom indices.
inst_n_idx (int) – Atom indices.
ki (int) – Atom indices.
a_params (TCGEN05DescriptorParams) – Pre-computed descriptor parameters.
b_params (TCGEN05DescriptorParams) – Pre-computed descriptor parameters.
instr_desc (PrimExpr) – Block-scaled instruction descriptor (with SF IDs already encoded).
clear_accum (PrimExpr) – Whether to zero the accumulator on the first K atom.