tilelang.cuda.transform¶

CUDA-specific transformation frontends.

Functions¶

ProducerConsumerWarpSpecialized()

Producer-consumer warp specialization at the tile-op level.

LowerBlackwell2SM()

Lower 2SM TCGEN5MMA and related on Blackwell target

LowerHopperIntrin()

LowerHopperIntrin

LowerL2Persistent()

LowerL2Persistent

LowerSharedTmem()

LowerSharedTmem

LowerSharedBarrier()

LowerSharedBarrier

FuseMBarrierArriveExpectTx()

Fuse simple expect_tx -> TMA issue -> arrive back into arrive_and_expect_tx.

LowerLDGSTG()

Lower Ramp-based global memory load/store to ldg/stg intrinsics.

LowerPTXAsyncCopy()

Lower eligible global->shared copies into PTX cp.async on CUDA.

MarkCudaSyncCalls([have_pdl])

MarkCudaSyncCalls

InjectFenceProxy()

InjectFenceProxy

InjectTcgen05Fence()

Inject tcgen05.fence::before_thread_sync / after_thread_sync at

AnnotateWarpGroupRegAlloc()

Inject set_max_nreg calls into warp-specialized functions.

PersistThreadblock()

PersistThreadblock

Package Contents¶

tilelang.cuda.transform.ProducerConsumerWarpSpecialized()¶

Producer-consumer warp specialization at the tile-op level.

This pass runs before LayoutInference and LowerTileOp. It rewrites eligible pipelined tile-op loops into warp-specialized producer and consumer branches with explicit barrier synchronization.

Returns:

fpass – The result pass

Return type:

tvm.transform.Pass

tilelang.cuda.transform.LowerBlackwell2SM()¶

Lower 2SM TCGEN5MMA and related on Blackwell target

Returns:

tvm.transform.Pass

The result pass

Return type:

fpass

tilelang.cuda.transform.LowerHopperIntrin()¶

LowerHopperIntrin

tilelang.cuda.transform.LowerL2Persistent()¶

LowerL2Persistent

tilelang.cuda.transform.LowerSharedTmem()¶

LowerSharedTmem

tilelang.cuda.transform.LowerSharedBarrier()¶

LowerSharedBarrier

tilelang.cuda.transform.FuseMBarrierArriveExpectTx()¶

Fuse simple expect_tx -> TMA issue -> arrive back into arrive_and_expect_tx.

tilelang.cuda.transform.LowerLDGSTG()¶

Lower Ramp-based global memory load/store to ldg/stg intrinsics.

This pass transforms vectorized global memory loads and stores (using Ramp indices) into explicit ldg32/64/128/256 and stg32/64/128/256 intrinsics for better codegen.

Key behaviors: - Converts Ramp-based global BufferLoad to ldg intrinsics - Converts Ramp-based global BufferStore to stg intrinsics - Supports predicated loads (if_then_else with else=0) - Supports predicated stores (if in then case) - Skips loads in async scope (will be lowered to cp.async) - Only enabled for CUDA targets

Returns:

fpass – The result pass

Return type:

tvm.transform.Pass

tilelang.cuda.transform.LowerPTXAsyncCopy()¶

Lower eligible global->shared copies into PTX cp.async on CUDA.

When enabled (pass config tl.enable_async_copy, default True), this pass may rewrite plain user-written global->shared BufferStore patterns (e.g. SIMT copies in T.Parallel) into tir.ptx_cp_async, and insert tir.ptx_commit_group + tir.ptx_wait_group(0) to preserve synchronous semantics for normal stores. If explicit commit/wait intrinsics already exist, the pass avoids duplicating them (and may insert a missing commit immediately before an existing wait to cover injected cp.async).

Returns:

fpass – The result pass

Return type:

tvm.transform.Pass

tilelang.cuda.transform.MarkCudaSyncCalls(have_pdl=False)¶

MarkCudaSyncCalls

Parameters:

have_pdl (bool)

tilelang.cuda.transform.InjectFenceProxy()¶

InjectFenceProxy

Returns:

fpass – The result pass

Return type:

tvm.transform.Pass

tilelang.cuda.transform.InjectTcgen05Fence()¶

Inject tcgen05.fence::before_thread_sync / after_thread_sync at conservative TCGEN05/TMEM synchronization boundaries on Blackwell (SM100+) targets.

The current pass wraps CTA-wide shared-memory syncs and also inserts fences around linear mbarrier wait/use and use/arrive handoff patterns. It is intentionally conservative and does not try to infer arbitrary barrier protocols.

Returns:

fpass – The result pass

Return type:

tvm.transform.Pass

tilelang.cuda.transform.AnnotateWarpGroupRegAlloc()¶

Inject set_max_nreg calls into warp-specialized functions.

This pass analyzes the function to collect register hints from set_max_nreg and no_set_max_nreg calls, then injects appropriate set_max_nreg calls into producer and consumer branches of warp-specialized code.

Returns:

fpass – The result pass

Return type:

tvm.transform.Pass

tilelang.cuda.transform.PersistThreadblock()¶

PersistThreadblock