tilelang.language.copy_op

Copy operations exposed on the TileLang language surface.

Functions

copy(src, dst, *[, coalesced_width, disable_tma, ...])

Copy data between memory regions.

copy_cluster(src, dst, *[, dst_block, cluster_mask, ...])

Cluster-aware copy for TMA multicast or SM-to-SM shared-memory copy.

async_copy(src, dst, *[, coalesced_width, ...])

Asynchronous copy primitive lowered through cp.async.

tma_copy(src, dst, *[, barrier, leader_scope_threads, ...])

TMA copy with user-managed synchronization.

tma_gather4(src, dst, col, rows, *, barrier[, ...])

Issue a TMA tile::gather4 load (sm_100a, Blackwell).

tma_gather4_bytes(K_box, dtype)

Transaction byte count for a 4-row gather4 of width K_box. Pass

tma_scatter4(src, dst, col, rows, *[, swizzle, ...])

Issue a TMA tile::scatter4 store (sm_100a, Blackwell).

transpose(src, dst)

Transpose a 2D buffer in shared memory: dst[j, i] = src[i, j].

im2col(img, col, nhw_step, c_step, kernel, stride, ...)

Perform im2col transformation for 2D convolution.

c2d_im2col(img, col, nhw_step, c_step, kernel, stride, ...)

Deprecated alias for im2col().

Module Contents

tilelang.language.copy_op.copy(src, dst, *, coalesced_width=None, disable_tma=False, eviction_policy=None, prefer_instruction=None, annotations=None, loop_layout=None)

Copy data between memory regions.

Parameters:
  • src (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Source memory region

  • dst (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Destination memory region

  • coalesced_width (Optional[int], keyword-only) – Width for coalesced memory access. Defaults to None.

  • disable_tma (bool, keyword-only) – Whether to disable TMA acceleration. Defaults to False.

  • eviction_policy (Optional[str], keyword-only) – Cache eviction policy. Defaults to None.

  • prefer_instruction (Optional[str], keyword-only) – Backend-specific preferred lowering instruction category. For CUDA, recognized values include “tma”, “cp_async”, and “sync”. For “tma”, T.copy keeps synchronous copy semantics; global -> shared copies lower through TMA with an automatically allocated barrier and wait when constraints are satisfied.

  • annotations (Optional[dict], keyword-only) – Additional annotations dict. If provided, coalesced_width, disable_tma, eviction_policy, and prefer_instruction can also be specified here. Values in annotations take precedence over individual arguments.

  • loop_layout (Optional[Fragment], keyword-only) – A parallel loop layout hint for the SIMT copy (only valid for normal SIMT copy; incompatible with TMA/LDSM/STSM/TMem). When provided, it is attached to the outermost parallel loop generated by this copy.

Raises:

TypeError – If copy extents cannot be deduced from arguments

Returns:

A handle to the copy operation

Return type:

tirx.Call

Range handling notes: - Accepts Buffer/BufferRegion/BufferLoad on either side. Extents are

derived as follows: Buffer -> shape, BufferRegion -> [r.extent], BufferLoad -> extents from its inferred/encoded region.

  • Normally, we require the extents of both sides to be the same. If they differ, the copy instruction follows an internal rule to select one side as the base range and create iteration space. This may generate unexpected code. And if some dimensions are 1, unexpected errors may happen.

  • Small Optimization: If both src and dst are scalar BufferLoad without region extents, lowers to a direct store: dst[…] = src[…].

  • Syntactic Sugar: TileLang supports passing the head address of a buffer to represent the whole buffer if there are no ambiguity. For example, T.copy(A, A_shared[i, j]). To support this, we need some special shape checking. But remember currently we don’t support something like “broadcast”.

  • The finalized extents are encoded with tl.region via to_buffer_region and passed through to the backend; low-level loop construction and any scope-specific decisions happen during lowering.

tilelang.language.copy_op.copy_cluster(src, dst, *, dst_block=None, cluster_mask=None, remote_barrier=None, eviction_policy=None, coalesced_width=None, loop_layout=None)

Cluster-aware copy for TMA multicast or SM-to-SM shared-memory copy.

Parameters:
  • src (tilelang._typing.BufferLikeType) – Source memory region.

  • dst (tilelang._typing.BufferLikeType) – Destination memory region.

  • dst_block (int | tvm.tirx.PrimExpr | None) – Destination CTA rank in the cluster for SM-to-SM copy.

  • cluster_mask (int | None) – Bitmask of CTAs that participate in TMA multicast.

  • remote_barrier (tvm.tirx.BufferLoad | None) – Shared-memory mbarrier for asynchronous SM-to-SM copy completion signalling. The destination CTA should wait on its local copy of this barrier.

  • eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None) – Cache eviction hint passed to the TMA instruction. Only relevant for the TMA multicast path (cluster_mask set).

  • coalesced_width (int | None) – Vectorization width (in elements) for the SIMT loop used on the SM-to-SM fallback path (dst_block set, no fast bulk-async route available).

  • loop_layout (Any | None) – Parallel loop layout hint (Fragment) for the SIMT loop on the SM-to-SM fallback path. Incompatible with the TMA multicast path (cluster_mask set).

Returns:

A handle to the copy operation.

Return type:

tirx.Call

tilelang.language.copy_op.async_copy(src, dst, *, coalesced_width=None, annotations=None, loop_layout=None)

Asynchronous copy primitive lowered through cp.async.

This operator is intended for explicitly asynchronous global->shared copy. The backend enforces cp.async constraints and emits:

ptx_cp_async(…) + ptx_commit_group().

No wait is auto-inserted for T.async_copy; synchronization is explicit.

Parameters:
  • src (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Source memory region

  • dst (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Destination memory region

  • coalesced_width (Optional[int], keyword-only) – Width for coalesced memory access. Defaults to None.

  • annotations (Optional[dict], keyword-only) – Additional annotations dict.

  • loop_layout (Optional[Fragment], keyword-only) – A parallel loop layout hint for the SIMT copy loop.

Returns:

A handle to the async copy operation

Return type:

tirx.Call

tilelang.language.copy_op.tma_copy(src, dst, *, barrier=None, leader_scope_threads=None, eviction_policy=None, annotations=None)

TMA copy with user-managed synchronization.

For loads (global -> shared): issues expect_tx + tma_load (no wait). Unlike T.copy() which emits a full synchronous TMA sequence (arrive + load + wait), T.tma_copy() emits only the producer part (expect_tx + tma_load). The user manages synchronization explicitly via T.barrier_arrive() and T.mbarrier_wait_parity(). barrier is required for loads.

For stores (shared -> global): issues tma_store + tma_store_arrive (no wait). Unlike T.copy() which emits tma_store + tma_store_arrive + tma_store_wait, T.tma_copy() omits the wait so the user can batch multiple stores before calling T.tma_store_wait() explicitly. barrier is not needed for stores. FP4 unpacked shared-memory storage is load-only for TMA: packed global float4_e2m1fn may be loaded into unpacked shared float4_e2m1_unpacked, but the reverse TMA store is not supported.

Parameters:
  • src (tilelang._typing.BufferLikeType) – Source memory region (global or shared)

  • dst (tilelang._typing.BufferLikeType) – Destination memory region (shared or global)

  • barrier – Mbarrier (from T.alloc_barrier()) for TMA load synchronization. Required for loads (global -> shared). Not needed for stores. The TMA load will arrive at this barrier with expected byte count. The user must wait on the same barrier via T.mbarrier_wait_parity().

  • leader_scope_threads (int | None) – Number of threads in each TMA leader-election scope (e.g., 32 for per-warp). Defaults to the thread extend in the current context if not specified.

  • eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None) – Cache eviction policy. Defaults to None.

  • annotations (dict | None) – Additional annotations dict. Values in annotations take precedence over individual arguments.

Returns:

A handle to the tma_copy operation

Return type:

tirx.Call

tilelang.language.copy_op.tma_gather4(src, dst, col, rows, *, barrier, swizzle=None, eviction_policy=None)

Issue a TMA tile::gather4 load (sm_100a, Blackwell).

Loads four arbitrary rows of a 2D global tensor src into a 2D shared tile dst of shape (4, K_box). The CUtensorMap descriptor (dtype + swizzle) is built by the compiler from buffer + layout info.

Caller must wrap this with T.shuffle_elect and pair it with T.mbarrier_expect_tx (use tma_gather4_bytes()) before, and barrier_arrive / mbarrier_wait_parity after.

The swizzle kwarg is deprecated; mark the shared tile via T.annotate_layout for non-default swizzle.

Parameters:
  • src (tvm.tirx.Buffer)

  • dst (tvm.tirx.Buffer)

  • col (tvm.tirx.PrimExpr)

  • eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)

tilelang.language.copy_op.tma_gather4_bytes(K_box, dtype)

Transaction byte count for a 4-row gather4 of width K_box. Pass to T.mbarrier_expect_tx immediately before T.tma_gather4.

Parameters:

dtype (str)

Return type:

int

tilelang.language.copy_op.tma_scatter4(src, dst, col, rows, *, swizzle=None, eviction_policy=None)

Issue a TMA tile::scatter4 store (sm_100a, Blackwell).

Stores a 2D shared tile of shape (4, K_box) to four arbitrary rows of a 2D global tensor dst. Caller is responsible for tma_store_arrive / tma_store_wait and the T.shuffle_elect guard. See tma_gather4() for descriptor / swizzle inference details.

Parameters:
  • src (tvm.tirx.Buffer)

  • dst (tvm.tirx.Buffer)

  • col (tvm.tirx.PrimExpr)

  • eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)

tilelang.language.copy_op.transpose(src, dst)

Transpose a 2D buffer in shared memory: dst[j, i] = src[i, j].

Both src and dst should be shared memory buffers. If src has shape (M, N), dst should have shape (N, M).

Parameters:
  • src (tilelang._typing.BufferLikeType) – Source buffer or region of shape (…, M, N).

  • dst (tilelang._typing.BufferLikeType) – Destination buffer or region of shape (…, N, M).

Returns:

A handle to the transpose operation.

Return type:

tirx.Call

tilelang.language.copy_op.im2col(img, col, nhw_step, c_step, kernel, stride, dilation, pad, eviction_policy=None)

Perform im2col transformation for 2D convolution.

Parameters:
  • img (tirx.Buffer) – Input image buffer

  • col (tirx.Buffer) – Output column buffer

  • nhw_step (tirx.PrimExpr) – Step size for batch and spatial dimensions

  • c_step (tirx.PrimExpr) – Step size for channel dimension

  • kernel (int) – Kernel size

  • stride (int) – Stride of the convolution

  • dilation (int) – Dilation rate

  • pad (int) – Padding size

  • eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)

Returns:

A handle to the im2col operation

Return type:

tirx.Call

tilelang.language.copy_op.c2d_im2col(img, col, nhw_step, c_step, kernel, stride, dilation, pad, eviction_policy=None)

Deprecated alias for im2col().

Deprecated:

Use im2col() instead. This alias is scheduled for removal in TileLang 0.14.0.

Parameters:
  • img (tilelang._typing.BufferLikeType)

  • col (tilelang._typing.BufferLikeType)

  • nhw_step (tvm.tirx.PrimExpr)

  • c_step (tvm.tirx.PrimExpr)

  • kernel (int)

  • stride (int)

  • dilation (int)

  • pad (int)

  • eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)

Return type:

tvm.tirx.PrimExpr