tilelang.jit.adapter.nvrtc.wrapper
==================================
.. py:module:: tilelang.jit.adapter.nvrtc.wrapper
.. autoapi-nested-parse::
NVRTC Source Wrapper for TileLang.
Generates Python runtime code for launching CUDA kernels compiled via NVRTC.
Why this exists:
- NVRTC compiles kernels at runtime and therefore needs Python launch code (not C++)
- TMA descriptors must be initialized once per unique buffer, not per kernel
- L2 cache policies require explicit CUDA Driver API setup/teardown
Key design:
- Two-pass generation: collect all descriptors first, then generate launches
- Dict-based deduplication ensures TMA descriptors are created only once
- Generates pure Python using cuda.bindings.driver for zero C++ dependency
Attributes
----------
.. autoapisummary::
tilelang.jit.adapter.nvrtc.wrapper.PREDEF_HOST_FUNC_PY
tilelang.jit.adapter.nvrtc.wrapper.TMA_DESC_INIT_FUNC_PY
tilelang.jit.adapter.nvrtc.wrapper.TMA_IM2COL_DESC_INIT_FUNC_PY
tilelang.jit.adapter.nvrtc.wrapper.L2_PERSISTENT_MAP_CREATE_HANDLE_PY
tilelang.jit.adapter.nvrtc.wrapper.L2_PERSISTENT_MAP_INIT_FUNC_PY
tilelang.jit.adapter.nvrtc.wrapper.L2_PERSISTENT_MAP_RESET_HANDLE_PY
tilelang.jit.adapter.nvrtc.wrapper.KERNEL_LAUNCH_FUNC_PY
Classes
-------
.. autoapisummary::
tilelang.jit.adapter.nvrtc.wrapper.TLNVRTCSourceWrapper
Module Contents
---------------
.. py:data:: PREDEF_HOST_FUNC_PY
:value: Multiline-String
.. code-block:: python
"""
from cuda.bindings.driver import (
    CUtensorMapDataType,
    CUtensorMapInterleave,
    CUtensorMapSwizzle,
    CUtensorMapL2promotion,
    CUtensorMapFloatOOBfill,
    cuTensorMapEncodeTiled,
    cuTensorMapEncodeIm2col,
    CUresult,
    cuKernelSetAttribute,
    CUfunction_attribute,
    CUdevice,
    CUlaunchConfig,
    cuLaunchKernelEx,
    cuuint64_t,
    cuuint32_t,
    CUkernel,
)
import ctypes
_function_names = {}
def call({}):
{}
"""
.. py:data:: TMA_DESC_INIT_FUNC_PY
:value: Multiline-String
.. code-block:: python
"""
{0}_type = CUtensorMapDataType({1})
{0}_tensorRank = {2}
{0}_globalAddress = {3}.data_ptr()
{0}_globalDim = [{4}]
{0}_globalStride = [{5}][1:]
{0}_boxDim = [{6}]
{0}_elementStrides = [{7}]
{0}_interleave = CUtensorMapInterleave({8})
{0}_swizzle = CUtensorMapSwizzle({9})
{0}_l2Promotion = CUtensorMapL2promotion({10})
{0}_oobFill = CUtensorMapFloatOOBfill({11})
res, {0} = cuTensorMapEncodeTiled(
    {0}_type,
    {0}_tensorRank,
    {0}_globalAddress,
    {0}_globalDim,
    {0}_globalStride,
    {0}_boxDim,
    {0}_elementStrides,
    {0}_interleave,
    {0}_swizzle,
    {0}_l2Promotion,
    {0}_oobFill,
)
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to initialize the TMA descriptor {0}: {{res}}")
"""
.. py:data:: TMA_IM2COL_DESC_INIT_FUNC_PY
:value: Multiline-String
.. code-block:: python
"""
{0}_type = CUtensorMapDataType({1})
{0}_tensorRank = {2}
{0}_globalAddress = {3}.data_ptr()
{0}_globalDim = [{4}]
{0}_globalStride = [{5}][1:]
{0}_elementStrides = [{6}]
{0}_lowerCorner = [{7}]
{0}_upperCorner = [{8}]
{0}_channelsPerPixel = {9}
{0}_pixelsPerColumn = {10}
{0}_interleave = CUtensorMapInterleave({11})
{0}_swizzle = CUtensorMapSwizzle({12})
{0}_l2Promotion = CUtensorMapL2promotion({13})
{0}_oobFill = CUtensorMapFloatOOBfill({14})
res, {0} = cuTensorMapEncodeIm2col(
    {0}_type,
    {0}_tensorRank,
    {0}_globalAddress,
    {0}_globalDim,
    {0}_globalStride,
    {0}_lowerCorner,
    {0}_upperCorner,
    {0}_channelsPerPixel,
    {0}_pixelsPerColumn,
    {0}_elementStrides,
    {0}_interleave,
    {0}_swizzle,
    {0}_l2Promotion,
    {0}_oobFill,
)
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to initialize the TMA descriptor {0}: {{res}}")
"""
.. py:data:: L2_PERSISTENT_MAP_CREATE_HANDLE_PY
:value: Multiline-String
.. code-block:: python
"""
from cuda.bindings.driver import (
CUstreamAttrValue,
CUstreamAttrID,
CUlimit,
CUaccessProperty,
cuCtxGetLimit,
cuCtxSetLimit,
cuStreamSetAttribute,
cuCtxResetPersistingL2Cache,
)
stream_attribute = CUstreamAttrValue()
res, init_persisting_l2_cache_size = cuCtxGetLimit(CUlimit.CU_LIMIT_PERSISTING_L2_CACHE_SIZE)
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to get L2 cache size limit: {{res}}")
"""
.. py:data:: L2_PERSISTENT_MAP_INIT_FUNC_PY
:value: Multiline-String
.. code-block:: python
"""
stream_attribute.accessPolicyWindow.hitRatio = {1}
stream_attribute.accessPolicyWindow.hitProp = CUaccessProperty.CU_ACCESS_PROPERTY_PERSISTING
stream_attribute.accessPolicyWindow.missProp = CUaccessProperty.CU_ACCESS_PROPERTY_STREAMING
res = cuCtxSetLimit(CUlimit.CU_LIMIT_PERSISTING_L2_CACHE_SIZE, {2})[0]
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to set L2 cache size limit: {{res}}")
stream_attribute.accessPolicyWindow.base_ptr = {0}.data_ptr()
stream_attribute.accessPolicyWindow.num_bytes = {2}
res = cuStreamSetAttribute(stream, CUstreamAttrID.CU_LAUNCH_ATTRIBUTE_ACCESS_POLICY_WINDOW, stream_attribute)[0]
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to set stream L2 access policy: {{res}}")
"""
.. py:data:: L2_PERSISTENT_MAP_RESET_HANDLE_PY
:value: Multiline-String
.. code-block:: python
"""
stream_attribute.accessPolicyWindow.num_bytes = 0
res = cuStreamSetAttribute(stream, CUstreamAttrID.CU_LAUNCH_ATTRIBUTE_ACCESS_POLICY_WINDOW, stream_attribute)[0]
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to reset stream L2 access policy: {{res}}")
res = cuCtxResetPersistingL2Cache()[0]
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to reset L2 cache: {{res}}")
res = cuCtxSetLimit(CUlimit.CU_LIMIT_PERSISTING_L2_CACHE_SIZE, init_persisting_l2_cache_size)[0]
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to restore L2 cache size limit: {{res}}")
"""
.. py:data:: KERNEL_LAUNCH_FUNC_PY
:value: Multiline-String
.. code-block:: python
"""
res = cuKernelSetAttribute(
    CUfunction_attribute.CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
    {7},
    kernels["{0}"],
    CUdevice({10})
)[0]
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to set max dynamic shared memory size to {7} for kernel {0}: {{res}}")
config = CUlaunchConfig()
config.gridDimX = {1}
config.gridDimY = {2}
config.gridDimZ = {3}
config.blockDimX = {4}
config.blockDimY = {5}
config.blockDimZ = {6}
config.sharedMemBytes = {7}
config.hStream = stream
arg_values = {8}
arg_types = {9}
res = cuLaunchKernelEx(config, kernels["{0}"], (arg_values, arg_types), 0)[0]
if res != CUresult.CUDA_SUCCESS:
    raise RuntimeError(f"Failed to launch kernel {0}: {{res}}")
"""
.. py:class:: TLNVRTCSourceWrapper(scheduled_ir_module, source, target, device_mod = None, host_mod = None, pass_configs = None)
Bases: :py:obj:`tilelang.jit.adapter.wrapper.TLCUDASourceWrapper`
NVRTC backend wrapper: generates Python kernel launch code.
Core responsibility: transform a TVM IRModule into an executable Python function
that initializes resources (TMA descriptors, L2 cache) and launches kernels
via the CUDA Driver API.
Data flow:
IRModule → collect kernel metadata → deduplicate resources →
generate Python code → executable function
Why Python generation instead of C++:
The NVRTC workflow requires runtime compilation, so Python is the natural host.
Using cuda.bindings.driver eliminates C++ wrapper complexity.
.. py:property:: host_func
Override parent's host_func to return generated Python code.
.. py:method:: create_dispatch_func(code, function_informations)
Generate Python dispatch function that launches multiple CUDA kernels.
Why two-pass design:
Pass 1: Collect TMA descriptors from all kernels into shared dicts
Pass 2: Generate code - descriptors first (deduplicated), then launches
A single pass would create duplicate descriptors for each kernel; the dict
naturally deduplicates by descriptor name (see the sketch after the field list below).
:param code: CUDA C++ source containing kernel declarations
:param function_informations: Dict mapping kernel names to metadata
(grid/block dims, params, shared memory size)
:returns: Python source code defining a call() function that:
    1. Initializes L2 cache policies (if needed)
    2. Creates TMA descriptors once per unique buffer
    3. Launches each kernel with cuLaunchKernelEx
    4. Resets L2 cache policies (if needed)
:rtype: str
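A minimal sketch of this two-pass idea, with hypothetical kernel and descriptor names (not the actual implementation):

.. code-block:: python

# Pass 1: gather descriptors from every kernel into one dict; duplicate
# keys collapse, so a buffer shared by two kernels appears only once.
kernels = {
    "kernel_a": ["A_desc", "B_desc"],  # hypothetical descriptor names
    "kernel_b": ["A_desc"],            # A_desc shared with kernel_a
}
descriptors = {}
for name, descs in kernels.items():
    for d in descs:
        descriptors[d] = name          # dict keys deduplicate

# Pass 2: emit descriptor inits once, then one launch per kernel.
lines = [f"init {d}" for d in descriptors]
lines += [f"launch {name}" for name in kernels]
print("\n".join(lines))                # A_desc is initialized exactly once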
.. py:method:: generate_l2_persistent_map(function_name)
Generate Python code to configure L2 cache persistence for a kernel.
L2 persistence pins frequently-accessed data in L2 cache to reduce
memory bandwidth. Requires explicit setup via CUDA stream attributes.
:param function_name: Kernel name to check for L2 persistence config
:returns: Python code that sets the stream access policy window, or an
    empty string if no L2 persistence is configured for this kernel.
.. py:method:: generate_tma_descriptor_args(desc_name_map, desc_name_var_map)
Generate Python code to initialize TMA descriptors.
TMA (Tensor Memory Accelerator) descriptors are opaque CUDA objects
that describe memory layout for async copies. They must be created on the
host before kernel launch.
:param desc_name_map: Maps descriptor variable names to buffer names
:param desc_name_var_map: Maps descriptor names to TVM variables
:returns: Python code that calls cuTensorMapEncodeTiled/Im2col for each
    unique descriptor, or an empty string if no TMA descriptors are needed.
.. py:method:: update_lib_code(code)
Update library code and generate host dispatch function.
Entry point for code generation. Walks the host IR to extract kernel
call sites, matches them with device kernels, then generates Python
dispatch code via create_dispatch_func().
:param code: CUDA C++ source code containing compiled kernels
:returns: The same code string (stored in self.lib_code). Side effect:
    sets self.host_func to the generated Python dispatcher.
.. py:method:: get_stream_type()
Return stream parameter spec for Python signature.
The NVRTC backend uses a raw int for the stream handle (not a cudaStream_t
pointer). It defaults to 0 (the NULL stream) for convenience.
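For example (hedged, assuming PyTorch as the tensor frontend), the raw handle of the current torch stream can be passed directly:

.. code-block:: python

import torch

# .cuda_stream is the underlying driver handle as a plain int; 0 selects
# the NULL/default stream.
stream_handle = torch.cuda.current_stream().cuda_stream if torch.cuda.is_available() else 0
# call(A, B, stream=stream_handle)   # hypothetical generated signature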