tilelang.jit.adapter.nvrtc.wrapper¶

NVRTC Source Wrapper for TileLang.

Generates Python runtime code for launching CUDA kernels compiled via NVRTC.

Why this exists:

  • NVRTC compiles kernels at runtime, so launch code must be Python (not C++)

  • TMA descriptors must be initialized once per unique buffer, not once per kernel

  • L2 cache policies require explicit CUDA Driver API setup/teardown

Key design:

  • Two-pass generation: collect all descriptors first, then generate launches

  • Dict-based deduplication ensures each TMA descriptor is created only once

  • Generates pure Python using cuda.bindings.driver, for zero C++ dependency

Attributes¶

PREDEF_HOST_FUNC_PY

TMA_DESC_INIT_FUNC_PY

TMA_IM2COL_DESC_INIT_FUNC_PY

L2_PERSISTENT_MAP_CREATE_HANDLE_PY

L2_PERSISTENT_MAP_INIT_FUNC_PY

L2_PERSISTENT_MAP_RESET_HANDLE_PY

KERNEL_LAUNCH_FUNC_PY

Classes¶

TLNVRTCSourceWrapper

NVRTC backend wrapper: generates Python kernel launch code.

Module Contents¶

tilelang.jit.adapter.nvrtc.wrapper.PREDEF_HOST_FUNC_PY = Multiline-String¶
"""
from cuda.bindings.driver import (
    CUtensorMapDataType,
    CUtensorMapInterleave,
    CUtensorMapSwizzle,
    CUtensorMapL2promotion,
    CUtensorMapFloatOOBfill,
    cuTensorMapEncodeTiled,
    cuTensorMapEncodeIm2col,
    CUresult,
    cuKernelSetAttribute,
    CUfunction_attribute,
    CUdevice,
    CUlaunchConfig,
    cuLaunchKernelEx,
    cuuint64_t,
    cuuint32_t,
    CUkernel,
)
import ctypes

_function_names = {}

def call({}):
    {}
"""
tilelang.jit.adapter.nvrtc.wrapper.TMA_DESC_INIT_FUNC_PY = Multiline-String¶
"""
    {0}_type = CUtensorMapDataType({1})
    {0}_tensorRank = {2}
    {0}_globalAddress = {3}.data_ptr()
    {0}_globalDim = [{4}]
    {0}_globalStride = [{5}][1:]
    {0}_boxDim = [{6}]
    {0}_elementStrides = [{7}]
    {0}_interleave = CUtensorMapInterleave({8})
    {0}_swizzle = CUtensorMapSwizzle({9})
    {0}_l2Promotion = CUtensorMapL2promotion({10})
    {0}_oobFill = CUtensorMapFloatOOBfill({11})

    res, {0} = cuTensorMapEncodeTiled(
        {0}_type,
        {0}_tensorRank,
        {0}_globalAddress,
        {0}_globalDim,
        {0}_globalStride,
        {0}_boxDim,
        {0}_elementStrides,
        {0}_interleave,
        {0}_swizzle,
        {0}_l2Promotion,
        {0}_oobFill,
    )

    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to initialize the TMA descriptor {0}: {{res}}")
"""
tilelang.jit.adapter.nvrtc.wrapper.TMA_IM2COL_DESC_INIT_FUNC_PY = Multiline-String¶
"""
    {0}_type = CUtensorMapDataType({1})
    {0}_tensorRank = {2}
    {0}_globalAddress = {3}.data_ptr()
    {0}_globalDim = [{4}]
    {0}_globalStride = [{5}][1:]
    {0}_elementStrides = [{6}]
    {0}_lowerCorner = [{7}]
    {0}_upperCorner = [{8}]
    {0}_channelsPerPixel = {9}
    {0}_pixelsPerColumn = {10}
    {0}_interleave = CUtensorMapInterleave({11})
    {0}_swizzle = CUtensorMapSwizzle({12})
    {0}_l2Promotion = CUtensorMapL2promotion({13})
    {0}_oobFill = CUtensorMapFloatOOBfill({14})

    res, {0} = cuTensorMapEncodeIm2col(
        {0}_type,
        {0}_tensorRank,
        {0}_globalAddress,
        {0}_globalDim,
        {0}_globalStride,
        {0}_lowerCorner,
        {0}_upperCorner,
        {0}_channelsPerPixel,
        {0}_pixelsPerColumn,
        {0}_elementStrides,
        {0}_interleave,
        {0}_swizzle,
        {0}_l2Promotion,
        {0}_oobFill,
    )

    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to initialize the TMA descriptor {0}: {{res}}")
"""
tilelang.jit.adapter.nvrtc.wrapper.L2_PERSISTENT_MAP_CREATE_HANDLE_PY = Multiline-String¶
"""
    from cuda.bindings.driver import (
        CUstreamAttrValue,
        CUstreamAttrID,
        CUlimit,
        CUaccessProperty,
        cuCtxGetLimit,
        cuCtxSetLimit,
        cuStreamSetAttribute,
        cuCtxResetPersistingL2Cache,
    )

    stream_attribute = CUstreamAttrValue()
    res, init_persisting_l2_cache_size = cuCtxGetLimit(CUlimit.CU_LIMIT_PERSISTING_L2_CACHE_SIZE)
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to get L2 cache size limit: {{res}}")
"""
tilelang.jit.adapter.nvrtc.wrapper.L2_PERSISTENT_MAP_INIT_FUNC_PY = Multiline-String¶
"""
    stream_attribute.accessPolicyWindow.hitRatio = {1}
    stream_attribute.accessPolicyWindow.hitProp = CUaccessProperty.CU_ACCESS_PROPERTY_PERSISTING
    stream_attribute.accessPolicyWindow.missProp = CUaccessProperty.CU_ACCESS_PROPERTY_STREAMING

    res = cuCtxSetLimit(CUlimit.CU_LIMIT_PERSISTING_L2_CACHE_SIZE, {2})[0]
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to set L2 cache size limit: {{res}}")

    stream_attribute.accessPolicyWindow.base_ptr = {0}.data_ptr()
    stream_attribute.accessPolicyWindow.num_bytes = {2}

    res = cuStreamSetAttribute(stream, CUstreamAttrID.CU_LAUNCH_ATTRIBUTE_ACCESS_POLICY_WINDOW, stream_attribute)[0]
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to set stream L2 access policy: {{res}}")
"""
tilelang.jit.adapter.nvrtc.wrapper.L2_PERSISTENT_MAP_RESET_HANDLE_PY = Multiline-String¶
"""
    stream_attribute.accessPolicyWindow.num_bytes = 0
    res = cuStreamSetAttribute(stream, CUstreamAttrID.CU_LAUNCH_ATTRIBUTE_ACCESS_POLICY_WINDOW, stream_attribute)[0]
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to reset stream L2 access policy: {{res}}")

    res = cuCtxResetPersistingL2Cache()[0]
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to reset L2 cache: {{res}}")

    res = cuCtxSetLimit(CUlimit.CU_LIMIT_PERSISTING_L2_CACHE_SIZE, init_persisting_l2_cache_size)[0]
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to restore L2 cache size limit: {{res}}")
"""
tilelang.jit.adapter.nvrtc.wrapper.KERNEL_LAUNCH_FUNC_PY = Multiline-String¶
"""
    res = cuKernelSetAttribute(
        CUfunction_attribute.CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
        {7},
        kernels["{0}"],
        CUdevice({10})
    )[0]
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to set max dynamic shared memory size to {7} for kernel {0}: {{res}}")

    config = CUlaunchConfig()
    config.gridDimX = {1}
    config.gridDimY = {2}
    config.gridDimZ = {3}
    config.blockDimX = {4}
    config.blockDimY = {5}
    config.blockDimZ = {6}
    config.sharedMemBytes = {7}
    config.hStream = stream

    arg_values = {8}
    arg_types = {9}

    res = cuLaunchKernelEx(config, kernels["{0}"], (arg_values, arg_types), 0)[0]
    if res != CUresult.CUDA_SUCCESS:
        raise RuntimeError(f"Failed to launch kernel {0}: {{res}}")
"""
class tilelang.jit.adapter.nvrtc.wrapper.TLNVRTCSourceWrapper(scheduled_ir_module, source, target, device_mod=None, host_mod=None, pass_configs=None)¶

Bases: tilelang.jit.adapter.wrapper.TLCUDASourceWrapper

NVRTC backend wrapper: generates Python kernel launch code.

Core responsibility: transform a TVM IRModule into an executable Python function that initializes resources (TMA descriptors, L2 cache) and launches kernels via the CUDA Driver API.

Data flow:

IRModule → collect kernel metadata → deduplicate resources → generate Python code → executable function

Why Python generation instead of C++:

The NVRTC workflow requires runtime compilation, so Python is the natural host. Using cuda.bindings.driver eliminates C++ wrapper complexity.

Parameters:
  • scheduled_ir_module (tvm.IRModule)

  • source (str)

  • target (tvm.target.Target)

  • device_mod (tvm.IRModule | None)

  • host_mod (tvm.IRModule | None)

  • pass_configs (dict[str, Any] | None)

property host_func¶

Override parent’s host_func to return generated Python code.

create_dispatch_func(code, function_informations)¶

Generate Python dispatch function that launches multiple CUDA kernels.

Why two-pass design:

Pass 1: Collect TMA descriptors from all kernels into shared dicts.

Pass 2: Generate code: descriptors first (deduplicated), then launches.

A single pass would create duplicate descriptors for each kernel; the shared dict naturally deduplicates by descriptor name.
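A minimal sketch of the two-pass structure, with hypothetical helper names standing in for the real collection and emission logic:

desc_name_map = {}                                    # shared across all kernels
for name, info in function_informations.items():      # pass 1: collect
    collect_descriptors(info, desc_name_map)          # hypothetical collector

code = emit_descriptor_inits(desc_name_map)           # emitted once, deduplicated
for name, info in function_informations.items():      # pass 2: generate launches
    code += emit_kernel_launch(name, info)            # hypothetical emitter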

Parameters:
  • code – CUDA C++ source containing kernel declarations

  • function_informations – Dict mapping kernel names to metadata (grid/block dims, params, shared memory size)

Returns:

Python source code defining a call() function that:

  1. Initializes L2 cache policies (if needed)

  2. Creates TMA descriptors once per unique buffer

  3. Launches each kernel with cuLaunchKernelEx

  4. Resets L2 cache policies (if needed)

Return type:

str

generate_l2_persistent_map(function_name)¶

Generate Python code to configure L2 cache persistence for a kernel.

L2 persistence pins frequently-accessed data in L2 cache to reduce memory bandwidth. Requires explicit setup via CUDA stream attributes.

Parameters:

function_name (str) – Kernel name to check for L2 persistence config

Returns:

Python code that sets stream access policy window, or empty string if no L2 persistence configured for this kernel.

Return type:

str

generate_tma_descriptor_args(desc_name_map, desc_name_var_map)¶

Generate Python code to initialize TMA descriptors.

TMA (Tensor Memory Accelerator) descriptors are opaque CUDA objects that describe memory layout for async copies. They must be created on the host before kernel launch.

Parameters:
  • desc_name_map (dict[str, str]) – Maps descriptor variable names to buffer names

  • desc_name_var_map (dict[str, tilelang.tvm.tir.Var]) – Maps descriptor names to TVM variables

Returns:

Python code that calls cuTensorMapEncodeTiled/Im2col for each unique descriptor. Empty string if no TMA descriptors needed.

Return type:

str
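Purely illustrative shapes for the two inputs (names made up):

desc_name_map = {"A_desc": "A"}               # descriptor variable -> source buffer name
desc_name_var_map = {"A_desc": a_desc_var}    # descriptor variable -> tir.Var handle
init_code = wrapper.generate_tma_descriptor_args(desc_name_map, desc_name_var_map)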

update_lib_code(code)¶

Update library code and generate host dispatch function.

Entry point for code generation. Walks the host IR to extract kernel call sites, matches them with device kernels, then generates Python dispatch code via create_dispatch_func().

Parameters:

code (str) – CUDA C++ source code containing compiled kernels

Returns:

The same code string (stored in self.lib_code). Side effect: sets self.host_func to the generated Python dispatcher.

Return type:

str

get_stream_type()¶

Return stream parameter spec for Python signature.

The NVRTC backend uses a raw int for the stream handle (not a cudaStream_t pointer), defaulting to 0 (the NULL stream) for convenience.

Return type:

dict[str, str]
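Because the stream is a raw integer handle, an existing PyTorch stream can be passed by its handle; a purely illustrative use of the generated call() (signature assumed):

import torch

s = torch.cuda.current_stream().cuda_stream   # raw int handle of the current stream
call(A, B, stream=s)                          # omit stream to use the NULL stream (0)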