tilelang.cuda.transform¶
CUDA-specific transformation frontends.
Functions¶
Producer-consumer warp specialization at the tile-op level. |
|
Lower 2SM TCGEN5MMA and related on Blackwell target |
|
LowerHopperIntrin |
|
LowerL2Persistent |
|
LowerSharedTmem |
|
LowerSharedBarrier |
|
Fuse simple expect_tx -> TMA issue -> arrive back into arrive_and_expect_tx. |
|
Lower Ramp-based global memory load/store to ldg/stg intrinsics. |
|
Lower eligible global->shared copies into PTX cp.async on CUDA. |
|
|
MarkCudaSyncCalls |
InjectFenceProxy |
|
Inject tcgen05.fence::before_thread_sync / after_thread_sync at |
|
Inject set_max_nreg calls into warp-specialized functions. |
|
PersistThreadblock |
Package Contents¶
- tilelang.cuda.transform.ProducerConsumerWarpSpecialized()¶
Producer-consumer warp specialization at the tile-op level.
This pass runs before LayoutInference and LowerTileOp. It rewrites eligible pipelined tile-op loops into warp-specialized producer and consumer branches with explicit barrier synchronization.
- Returns:
fpass – The result pass
- Return type:
tvm.transform.Pass
- tilelang.cuda.transform.LowerBlackwell2SM()¶
Lower 2SM TCGEN5MMA and related on Blackwell target
- Returns:
- tvm.transform.Pass
The result pass
- Return type:
fpass
- tilelang.cuda.transform.LowerHopperIntrin()¶
LowerHopperIntrin
- tilelang.cuda.transform.LowerL2Persistent()¶
LowerL2Persistent
LowerSharedTmem
LowerSharedBarrier
- tilelang.cuda.transform.FuseMBarrierArriveExpectTx()¶
Fuse simple expect_tx -> TMA issue -> arrive back into arrive_and_expect_tx.
- tilelang.cuda.transform.LowerLDGSTG()¶
Lower Ramp-based global memory load/store to ldg/stg intrinsics.
This pass transforms vectorized global memory loads and stores (using Ramp indices) into explicit ldg32/64/128/256 and stg32/64/128/256 intrinsics for better codegen.
Key behaviors: - Converts Ramp-based global BufferLoad to ldg intrinsics - Converts Ramp-based global BufferStore to stg intrinsics - Supports predicated loads (if_then_else with else=0) - Supports predicated stores (if in then case) - Skips loads in async scope (will be lowered to cp.async) - Only enabled for CUDA targets
- Returns:
fpass – The result pass
- Return type:
tvm.transform.Pass
- tilelang.cuda.transform.LowerPTXAsyncCopy()¶
Lower eligible global->shared copies into PTX cp.async on CUDA.
When enabled (pass config tl.enable_async_copy, default True), this pass may rewrite plain user-written global->shared BufferStore patterns (e.g. SIMT copies in T.Parallel) into tir.ptx_cp_async, and insert tir.ptx_commit_group + tir.ptx_wait_group(0) to preserve synchronous semantics for normal stores. If explicit commit/wait intrinsics already exist, the pass avoids duplicating them (and may insert a missing commit immediately before an existing wait to cover injected cp.async).
- Returns:
fpass – The result pass
- Return type:
tvm.transform.Pass
- tilelang.cuda.transform.MarkCudaSyncCalls(have_pdl=False)¶
MarkCudaSyncCalls
- Parameters:
have_pdl (bool)
- tilelang.cuda.transform.InjectFenceProxy()¶
InjectFenceProxy
- Returns:
fpass – The result pass
- Return type:
tvm.transform.Pass
- tilelang.cuda.transform.InjectTcgen05Fence()¶
Inject tcgen05.fence::before_thread_sync / after_thread_sync at conservative TCGEN05/TMEM synchronization boundaries on Blackwell (SM100+) targets.
The current pass wraps CTA-wide shared-memory syncs and also inserts fences around linear mbarrier wait/use and use/arrive handoff patterns. It is intentionally conservative and does not try to infer arbitrary barrier protocols.
- Returns:
fpass – The result pass
- Return type:
tvm.transform.Pass
- tilelang.cuda.transform.AnnotateWarpGroupRegAlloc()¶
Inject set_max_nreg calls into warp-specialized functions.
This pass analyzes the function to collect register hints from set_max_nreg and no_set_max_nreg calls, then injects appropriate set_max_nreg calls into producer and consumer branches of warp-specialized code.
- Returns:
fpass – The result pass
- Return type:
tvm.transform.Pass
- tilelang.cuda.transform.PersistThreadblock()¶
PersistThreadblock