InjectFenceProxy PassΒΆ
tl.InjectFenceProxy is a TIR-level transform that keeps the GPU proxy state consistent on NVIDIA Hopper (SM90+) by inserting fence.proxy.async instructions when execution switches from generic proxy memory operations to async proxy operations.
Why Fences Are NeededΒΆ
Hopper separates memory instructions into generic and asynchronous proxy paths. When an asynchronous instruction (for example, wgmma, tma.load, or cp.async.bulk) issues after generic traffic (like ldmatrix, cp.async, or shared-memory buffer stores), the hardware requires a fence.proxy.async to guarantee ordering. Missing fences can lead to race conditions or undefined behavior.
What the Pass DoesΒΆ
Walks statements in execution order while tracking a (may-)state of the last proxy kind (generic, async, or none/reset). Control-flow joins (e.g.
if) merge states conservatively.Normalizes
tma_storeby ensuring the requiredtma_store_arrive/tma_store_waithandshake exists immediately after the store.Injects
fence.proxy.asyncright before an async-proxy instruction whenever the preceding state can be generic.
By default, unknown/external calls do not affect proxy state. Opaque calls that may write into shared memory (e.g. via tvm_access_ptr / address_of) are treated as generic proxy traffic so a later async-proxy op will still be fenced.
Timeline ViewΒΆ
generic shared-store (or ldmatrix/stmatrix/cp.async) β async op (wgmma / tma / cp.async.bulk)
β β
ββ generic proxy ββ async proxy
β fence inserted here β
ββββββββββββββββββββββββββββββββ
The proxy tracker effectively scans the program in execution order. The moment it detects a possible transition from generic to async (between the store and the async op above), it synthesizes a fence.proxy.async to reset the hardware proxy state before the async path runs.
Coverage of IntrinsicsΒΆ
The tracker understands the TileLang intrinsics for TMA load/store, shared-memory MMA (wgmma), and TVM/PTX SM90 async copy intrinsics (cp.async.bulk family). Generic operations currently include ldmatrix, stmatrix, cp.async, and shared-memory BufferStore statements. Structured control flow (loops, blocks, branches) is handled by propagating and conservatively merging proxy state.
UsageΒΆ
The pass is part of the default TileLang lowering pipeline. To apply it manually:
from tilelang import tl
from tvm import IRModule
mod = IRModule({"main": prim_func})
with tvm.transform.PassContext():
mod = tl.transform.InjectFenceProxy()(mod)
End-to-End ExampleΒΆ
Before the pass:
@T.prim_func
def kernel():
with T.Kernel(1):
desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
smem = T.decl_buffer((128,), "float16", scope="shared")
T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
smem[0] = T.float16(0)
T.ptx_wgmma_ss(
"float16",
"m64n64k16",
T.bool(True),
T.bool(True),
"fp16",
"fp16",
"fp16",
desc.data,
T.int32(0),
desc.data,
T.int32(0),
smem.data,
T.int32(0),
T.bool(True),
1,
1,
)
After tl.transform.InjectFenceProxy:
@T.prim_func
def kernel():
with T.Kernel(1):
desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
smem = T.decl_buffer((128,), "float16", scope="shared")
T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
smem[0] = T.float16(0)
T.fence_proxy_async()
T.ptx_wgmma_ss(
"float16",
"m64n64k16",
T.bool(True),
T.bool(True),
"fp16",
"fp16",
"fp16",
desc.data,
T.int32(0),
desc.data,
T.int32(0),
smem.data,
T.int32(0),
T.bool(True),
1,
1,
)
The only change is the fence_proxy_async between the generic descriptor setup / shared-memory write and the async wgmma. In larger kernels the pass performs the same operation across nested blocks, loops, and conditional branches.
Extending the PassΒΆ
If you introduce a new intrinsic that behaves like an async proxy, add it to IsAsyncIntrinsic in src/transform/inject_fence_proxy.cc. Likewise, extend IsKnownGeneric for additional generic operations.
Most calls default to "none" (no proxy-state effect). IsNonProxyIntrinsic exists for well-known synchronization / scheduling helpers and to document intent, but it is not required for correctness if an op is neither generic nor async.
For custom/opaque ops, you must lower them into known intrinsics (or manually insert fence_proxy_async) if they participate in proxy switching.