tilelang.cuda.transform¶

CUDA-specific transformation frontends.

Functions¶

`ProducerConsumerWarpSpecialized`()	Producer-consumer warp specialization at the tile-op level.
`LowerBlackwell2SM`()	Lower 2SM TCGEN5MMA and related on Blackwell target
`LowerHopperIntrin`()	LowerHopperIntrin
`LowerL2Persistent`()	LowerL2Persistent
`LowerSharedTmem`()	LowerSharedTmem
`LowerSharedBarrier`()	LowerSharedBarrier
`FuseMBarrierArriveExpectTx`()	Fuse simple expect_tx -> TMA issue -> arrive back into arrive_and_expect_tx.
`LowerLDGSTG`()	Lower Ramp-based global memory load/store to ldg/stg intrinsics.
`LowerPTXAsyncCopy`()	Lower eligible global->shared copies into PTX cp.async on CUDA.
`MarkCudaSyncCalls`([have_pdl])	MarkCudaSyncCalls
`InjectFenceProxy`()	InjectFenceProxy
`InjectTcgen05Fence`()	Inject tcgen05.fence::before_thread_sync / after_thread_sync at
`AnnotateWarpGroupRegAlloc`()	Inject set_max_nreg calls into warp-specialized functions.
`PersistThreadblock`()	PersistThreadblock

Package Contents¶

tilelang.cuda.transform.ProducerConsumerWarpSpecialized()¶

Producer-consumer warp specialization at the tile-op level.

This pass runs before LayoutInference and LowerTileOp. It rewrites eligible pipelined tile-op loops into warp-specialized producer and consumer branches with explicit barrier synchronization.

Returns:: fpass – The result pass
Return type:: tvm.transform.Pass

tilelang.cuda.transform.LowerBlackwell2SM()¶

Lower 2SM TCGEN5MMA and related on Blackwell target

Returns:

tvm.transform.Pass: The result pass

Return type:

fpass

tilelang.cuda.transform.LowerHopperIntrin()¶: LowerHopperIntrin

tilelang.cuda.transform.LowerL2Persistent()¶: LowerL2Persistent

tilelang.cuda.transform.LowerSharedTmem()¶: LowerSharedTmem

tilelang.cuda.transform.LowerSharedBarrier()¶: LowerSharedBarrier

tilelang.cuda.transform.FuseMBarrierArriveExpectTx()¶: Fuse simple expect_tx -> TMA issue -> arrive back into arrive_and_expect_tx.

tilelang.cuda.transform.LowerLDGSTG()¶

Lower Ramp-based global memory load/store to ldg/stg intrinsics.

This pass transforms vectorized global memory loads and stores (using Ramp indices) into explicit ldg32/64/128/256 and stg32/64/128/256 intrinsics for better codegen.

Key behaviors: - Converts Ramp-based global BufferLoad to ldg intrinsics - Converts Ramp-based global BufferStore to stg intrinsics - Supports predicated loads (if_then_else with else=0) - Supports predicated stores (if in then case) - Skips loads in async scope (will be lowered to cp.async) - Only enabled for CUDA targets

Returns:: fpass – The result pass
Return type:: tvm.transform.Pass

tilelang.cuda.transform.LowerPTXAsyncCopy()¶

Lower eligible global->shared copies into PTX cp.async on CUDA.

When enabled (pass config tl.enable_async_copy, default True), this pass may rewrite plain user-written global->shared BufferStore patterns (e.g. SIMT copies in T.Parallel) into tir.ptx_cp_async, and insert tir.ptx_commit_group + tir.ptx_wait_group(0) to preserve synchronous semantics for normal stores. If explicit commit/wait intrinsics already exist, the pass avoids duplicating them (and may insert a missing commit immediately before an existing wait to cover injected cp.async).

Returns:: fpass – The result pass
Return type:: tvm.transform.Pass

tilelang.cuda.transform.MarkCudaSyncCalls(have_pdl=False)¶

MarkCudaSyncCalls

Parameters:: have_pdl (bool)

tilelang.cuda.transform.InjectFenceProxy()¶

InjectFenceProxy

Returns:: fpass – The result pass
Return type:: tvm.transform.Pass

tilelang.cuda.transform.InjectTcgen05Fence()¶

Inject tcgen05.fence::before_thread_sync / after_thread_sync at conservative TCGEN05/TMEM synchronization boundaries on Blackwell (SM100+) targets.

The current pass wraps CTA-wide shared-memory syncs and also inserts fences around linear mbarrier wait/use and use/arrive handoff patterns. It is intentionally conservative and does not try to infer arbitrary barrier protocols.

Returns:: fpass – The result pass
Return type:: tvm.transform.Pass

tilelang.cuda.transform.AnnotateWarpGroupRegAlloc()¶

Inject set_max_nreg calls into warp-specialized functions.

This pass analyzes the function to collect register hints from set_max_nreg and no_set_max_nreg calls, then injects appropriate set_max_nreg calls into producer and consumer branches of warp-specialized code.

Returns:: fpass – The result pass
Return type:: tvm.transform.Pass

tilelang.cuda.transform.PersistThreadblock()¶: PersistThreadblock