tilelang.language.copy_op

Copy operations exposed on the TileLang language surface.

Functions

`copy`(src, dst, *[, coalesced_width, disable_tma, ...])	Copy data between memory regions.
`async_copy`(src, dst, *[, coalesced_width, ...])	Asynchronous copy primitive lowered through cp.async.
`tma_copy`(src, dst, *[, barrier, eviction_policy, ...])	TMA copy with user-managed synchronization.
`transpose`(src, dst)	Transpose a 2D buffer in shared memory: dst[j, i] = src[i, j].
`c2d_im2col`(img, col, nhw_step, c_step, kernel, stride, ...)	Perform im2col transformation for 2D convolution.

Module Contents

tilelang.language.copy_op.copy(src, dst, *, coalesced_width=None, disable_tma=False, eviction_policy=None, annotations=None, loop_layout=None)

Copy data between memory regions.

Parameters:

src (Union[tir.Buffer, tir.BufferLoad, tir.BufferRegion]) – Source memory region
dst (Union[tir.Buffer, tir.BufferLoad, tir.BufferRegion]) – Destination memory region
coalesced_width (Optional[int], keyword-only) – Width for coalesced memory access. Defaults to None.
disable_tma (bool, keyword-only) – Whether to disable TMA acceleration. Defaults to False.
eviction_policy (Optional[str], keyword-only) – Cache eviction policy. Defaults to None.
annotations (Optional[dict], keyword-only) – Additional annotations dict. If provided, coalesced_width, disable_tma, and eviction_policy can also be specified here. Values in annotations take precedence over individual arguments.
loop_layout (Optional[Fragment], keyword-only) – A parallel loop layout hint for the SIMT copy (only valid for normal SIMT copy; incompatible with TMA/LDSM/STSM/TMem). When provided, it is attached to the outermost parallel loop generated by this copy.

Raises:

TypeError – If copy extents cannot be deduced from arguments

Returns:

A handle to the copy operation

Return type:

tir.Call
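The precedence rule described for annotations can be illustrated with a minimal pure-Python sketch. The helper resolve_copy_options is hypothetical and not part of TileLang; it assumes the annotation keys are spelled exactly like the keyword arguments, which this page implies but does not state outright.

```python
def resolve_copy_options(coalesced_width=None, disable_tma=False,
                         eviction_policy=None, annotations=None):
    """Hypothetical helper: merge keyword arguments with an annotations dict.

    Values found in `annotations` override the individual arguments,
    mirroring the precedence rule documented for T.copy.
    """
    options = {
        "coalesced_width": coalesced_width,
        "disable_tma": disable_tma,
        "eviction_policy": eviction_policy,
    }
    # Annotations are applied last, so they win on conflicting keys.
    options.update(annotations or {})
    return options


merged = resolve_copy_options(coalesced_width=4,
                              annotations={"coalesced_width": 8})
print(merged["coalesced_width"])
```

In this sketch the value from annotations replaces the keyword argument, and keys that are absent from annotations keep the value passed individually.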

Range handling notes:

- Accepts Buffer/BufferRegion/BufferLoad on either side. Extents are derived as follows: Buffer -> shape, BufferRegion -> [r.extent] for each range r in the region, BufferLoad -> extents from its inferred/encoded region.
- Normally, the extents of both sides must be the same. If they differ, the copy follows an internal rule to select one side as the base range and builds the iteration space from it, which may generate unexpected code. If some dimensions are 1, unexpected errors may occur.
- Small optimization: if both src and dst are scalar BufferLoad without region extents, the copy lowers to a direct store: dst[…] = src[…].
- Syntactic sugar: TileLang supports passing the head address of a buffer to represent the whole buffer if there is no ambiguity, for example T.copy(A, A_shared[i, j]). Supporting this requires some special shape checking. Note that broadcast-style copies are currently not supported.
- The finalized extents are encoded with tl.region via to_buffer_region and passed through to the backend; low-level loop construction and any scope-specific decisions happen during lowering.
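The extent-derivation rules above can be sketched with toy stand-ins. ToyBuffer, ToyRegion, extents_of, and extents_match are hypothetical names introduced only for illustration; they are not the tir classes or TileLang's actual implementation, and the BufferLoad case is omitted because its inferred region is not specified here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ToyBuffer:
    """Stand-in for a buffer: only its shape matters here."""
    shape: Tuple[int, ...]


@dataclass
class ToyRegion:
    """Stand-in for a buffer region: one (min, extent) pair per dimension."""
    buffer: ToyBuffer
    ranges: List[Tuple[int, int]]


def extents_of(arg) -> List[int]:
    """Buffer -> shape, region -> the extent of each range."""
    if isinstance(arg, ToyBuffer):
        return list(arg.shape)
    if isinstance(arg, ToyRegion):
        return [extent for _, extent in arg.ranges]
    # Mirrors the documented TypeError when extents cannot be deduced.
    raise TypeError(f"cannot deduce copy extents from {type(arg).__name__}")


def extents_match(src, dst) -> bool:
    """The normal case: both sides describe the same iteration space."""
    return extents_of(src) == extents_of(dst)


A = ToyBuffer((1024, 1024))
A_shared = ToyBuffer((128, 64))
tile = ToyRegion(A, [(256, 128), (0, 64)])  # a 128 x 64 window of A
print(extents_of(tile), extents_match(tile, A_shared))
```

Passing a region whose extents equal the destination shape corresponds to the case the notes call normal; any mismatch is where the internal base-range rule would apply.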

tilelang.language.copy_op.async_copy(src, dst, *, coalesced_width=None, annotations=None, loop_layout=None)

Asynchronous copy primitive lowered through cp.async.

This operator is intended for explicitly asynchronous global->shared copy. The backend enforces cp.async constraints and emits:

ptx_cp_async(…) + ptx_commit_group().

No wait is inserted automatically for T.async_copy; the caller must synchronize explicitly.

Parameters:

src (Union[tir.Buffer, tir.BufferLoad, tir.BufferRegion]) – Source memory region
dst (Union[tir.Buffer, tir.BufferLoad, tir.BufferRegion]) – Destination memory region
coalesced_width (Optional[int], keyword-only) – Width for coalesced memory access. Defaults to None.
annotations (Optional[dict], keyword-only) – Additional annotations dict.
loop_layout (Optional[Fragment], keyword-only) – A parallel loop layout hint for the SIMT copy loop.

Returns:

A handle to the async copy operation

Return type:

tir.Call
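Why the explicit wait matters for async_copy can be shown with a toy model of the PTX commit-group mechanism. CpAsyncModel is a hypothetical class, not a TileLang or CUDA API; it assumes the standard PTX meaning of cp.async.wait_group N (block until at most N of the most recently committed groups are still pending) and, as a simplification, makes data visible only at the wait, whereas real hardware may complete copies earlier without guaranteeing it.

```python
from collections import deque


class CpAsyncModel:
    """Toy model: copies become visible only after their group is waited on."""

    def __init__(self):
        self.shared = {}         # models the shared-memory destination
        self._open = []          # copies issued since the last commit
        self._pending = deque()  # committed groups, oldest first

    def cp_async(self, dst_key, value):
        # Issue a copy; nothing is guaranteed visible yet.
        self._open.append((dst_key, value))

    def commit_group(self):
        # Bundle all copies issued so far into one pending group.
        self._pending.append(self._open)
        self._open = []

    def wait_group(self, n):
        # Complete the oldest groups until at most n remain pending.
        while len(self._pending) > n:
            for dst_key, value in self._pending.popleft():
                self.shared[dst_key] = value


model = CpAsyncModel()
model.cp_async("tile0", 1)
model.commit_group()          # what an async copy emits, per this page
print("tile0" in model.shared)
model.wait_group(0)           # the explicit wait the caller must add
print(model.shared["tile0"])
```

Waiting with a non-zero count leaves the newest groups in flight, which is the usual way to overlap the next load with compute on the previous tile.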

tilelang.language.copy_op.tma_copy(src, dst, *, barrier=None, eviction_policy=None, annotations=None)

TMA copy with user-managed synchronization.

For loads (global -> shared): issues expect_tx + tma_load (no wait). Unlike T.copy(), which emits a full synchronous TMA sequence (arrive + load + wait), T.tma_copy() emits only the producer part. The user manages synchronization explicitly via T.barrier_arrive() and T.mbarrier_wait_parity(). barrier is required for loads.

For stores (shared -> global): issues tma_store + tma_store_arrive (no wait). Unlike T.copy(), which emits tma_store + tma_store_arrive + tma_store_wait, T.tma_copy() omits the wait so that the user can batch multiple stores before calling T.tma_store_wait() explicitly. barrier is not needed for stores.

Parameters:

src (tilelang._typing.BufferLikeType) – Source memory region (global or shared)
dst (tilelang._typing.BufferLikeType) – Destination memory region (shared or global)
barrier – Mbarrier (from T.alloc_barrier()) for TMA load synchronization. Required for loads (global -> shared); not needed for stores. The TMA load arrives at this barrier with the expected byte count, and the user must wait on the same barrier via T.mbarrier_wait_parity().
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None) – Cache eviction policy. Defaults to None.
annotations (dict | None) – Additional annotations dict. Values in annotations take precedence over individual arguments.

Returns:

A handle to the tma_copy operation

Return type:

tir.Call

tilelang.language.copy_op.transpose(src, dst)

Transpose a 2D buffer in shared memory: dst[j, i] = src[i, j].

Both src and dst should be shared memory buffers. If src has shape (M, N), dst should have shape (N, M).

Parameters:

src (tilelang._typing.BufferLikeType) – Source buffer or region of shape (…, M, N).
dst (tilelang._typing.BufferLikeType) – Destination buffer or region of shape (…, N, M).

Returns:

A handle to the transpose operation.

Return type:

tir.Call
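The index mapping dst[j, i] = src[i, j] documented for transpose can be checked with a small reference model on nested lists. transpose_2d is a hypothetical helper for illustration only; it covers the plain 2D case and does not model shared memory or leading batch dimensions.

```python
def transpose_2d(src):
    """Reference model of dst[j][i] = src[i][j] for an M x N nested list."""
    m, n = len(src), len(src[0])
    dst = [[None] * m for _ in range(n)]  # destination has shape N x M
    for i in range(m):
        for j in range(n):
            dst[j][i] = src[i][j]
    return dst


print(transpose_2d([[1, 2, 3],
                    [4, 5, 6]]))
```

A (2, 3) input yields a (3, 2) output, matching the stated shape requirement that a source of shape (M, N) needs a destination of shape (N, M).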

tilelang.language.copy_op.c2d_im2col(img, col, nhw_step, c_step, kernel, stride, dilation, pad, eviction_policy=None)

Perform im2col transformation for 2D convolution.

Parameters:

img (tir.Buffer) – Input image buffer
col (tir.Buffer) – Output column buffer
nhw_step (tir.PrimExpr) – Step size for batch and spatial dimensions
c_step (tir.PrimExpr) – Step size for channel dimension
kernel (int) – Kernel size
stride (int) – Stride of the convolution
dilation (int) – Dilation rate
pad (int) – Padding size
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None) – Cache eviction policy. Defaults to None.

Returns:

A handle to the im2col operation

Return type:

tir.Call
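As a reference for what an im2col transformation computes, the sketch below builds the full column matrix in pure Python. Several things are assumptions not confirmed by this page: an NHWC image layout, a square kernel with the same stride, dilation, and padding in both spatial dimensions, rows ordered as (n, oh, ow), and columns ordered as (kh, kw, c). The function im2col_nhwc is hypothetical, and it produces the whole matrix rather than the single tile that nhw_step and c_step would select.

```python
def im2col_nhwc(img, kernel, stride, dilation, pad):
    """img[n][h][w][c] -> col[m][k], zero-filled outside the image."""
    n_, h_, w_, c_ = len(img), len(img[0]), len(img[0][0]), len(img[0][0][0])
    # Standard convolution output size for each spatial dimension.
    oh = (h_ + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1
    ow = (w_ + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1
    rows, cols = n_ * oh * ow, kernel * kernel * c_
    col = [[0] * cols for _ in range(rows)]
    for m in range(rows):          # m encodes (n, oh, ow)
        for k in range(cols):      # k encodes (kh, kw, c)
            h = (m % (oh * ow)) // ow * stride + k // (kernel * c_) * dilation - pad
            w = (m % ow) * stride + (k // c_) % kernel * dilation - pad
            if 0 <= h < h_ and 0 <= w < w_:
                col[m][k] = img[m // (oh * ow)][h][w][k % c_]
    return col


# One 3 x 3 single-channel image holding the values 1..9.
image = [[[[1], [2], [3]],
          [[4], [5], [6]],
          [[7], [8], [9]]]]
for row in im2col_nhwc(image, kernel=2, stride=1, dilation=1, pad=0):
    print(row)
```

Each output row holds one flattened receptive field, so multiplying the column matrix by a flattened weight matrix reproduces the convolution as a GEMM.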