InjectFenceProxy PassΒΆ

tl.InjectFenceProxy is a TIR-level transform that keeps the GPU proxy state consistent on NVIDIA Hopper (SM90+) by inserting fence.proxy.async instructions when execution switches from generic proxy memory operations to async proxy operations.

Why Fences Are NeededΒΆ

Hopper separates memory instructions into generic and asynchronous proxy paths. When an asynchronous instruction (for example, wgmma, tma.load, or cp.async.bulk) issues after generic traffic (like ldmatrix, cp.async, or shared-memory buffer stores), the hardware requires a fence.proxy.async to guarantee ordering. Missing fences can lead to race conditions or undefined behavior.

What the Pass DoesΒΆ

  • Walks statements in execution order while tracking a (may-)state of the last proxy kind (generic, async, or none/reset). Control-flow joins (e.g. if) merge states conservatively.

  • Normalizes tma_store by ensuring the required tma_store_arrive / tma_store_wait handshake exists immediately after the store.

  • Injects fence.proxy.async right before an async-proxy instruction whenever the preceding state can be generic.

By default, unknown/external calls do not affect proxy state. Opaque calls that may write into shared memory (e.g. via tvm_access_ptr / address_of) are treated as generic proxy traffic so a later async-proxy op will still be fenced.

Timeline ViewΒΆ

generic shared-store (or ldmatrix/stmatrix/cp.async) β†’ async op (wgmma / tma / cp.async.bulk)
                 β”‚                                  β”‚
                 └─ generic proxy                   └─ async proxy
                             β”‚        fence inserted here   ↑
                             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

The proxy tracker effectively scans the program in execution order. The moment it detects a possible transition from generic to async (between the store and the async op above), it synthesizes a fence.proxy.async to reset the hardware proxy state before the async path runs.

Coverage of IntrinsicsΒΆ

The tracker understands the TileLang intrinsics for TMA load/store, shared-memory MMA (wgmma), and TVM/PTX SM90 async copy intrinsics (cp.async.bulk family). Generic operations currently include ldmatrix, stmatrix, cp.async, and shared-memory BufferStore statements. Structured control flow (loops, blocks, branches) is handled by propagating and conservatively merging proxy state.

UsageΒΆ

The pass is part of the default TileLang lowering pipeline. To apply it manually:

from tilelang import tl
from tvm import IRModule

mod = IRModule({"main": prim_func})
with tvm.transform.PassContext():
    mod = tl.transform.InjectFenceProxy()(mod)

End-to-End ExampleΒΆ

Before the pass:

@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )

After tl.transform.InjectFenceProxy:

@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_wgmma_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.fence_proxy_async()
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )

The only change is the fence_proxy_async between the generic descriptor setup / shared-memory write and the async wgmma. In larger kernels the pass performs the same operation across nested blocks, loops, and conditional branches.

Extending the PassΒΆ

If you introduce a new intrinsic that behaves like an async proxy, add it to IsAsyncIntrinsic in src/transform/inject_fence_proxy.cc. Likewise, extend IsKnownGeneric for additional generic operations.

Most calls default to "none" (no proxy-state effect). IsNonProxyIntrinsic exists for well-known synchronization / scheduling helpers and to document intent, but it is not required for correctness if an op is neither generic nor async. For custom/opaque ops, you must lower them into known intrinsics (or manually insert fence_proxy_async) if they participate in proxy switching.