tilelang.language.copy_op¶
Copy operations exposed on the TileLang language surface.
Functions¶
|
Copy data between memory regions. |
|
Cluster-aware copy for TMA multicast or SM-to-SM shared-memory copy. |
|
Asynchronous copy primitive lowered through cp.async. |
|
TMA copy with user-managed synchronization. |
|
Issue a TMA tile::gather4 load (sm_100a, Blackwell). |
|
Transaction byte count for a 4-row gather4 of width |
|
Issue a TMA tile::scatter4 store (sm_100a, Blackwell). |
|
Transpose a 2D buffer in shared memory: dst[j, i] = src[i, j]. |
|
Perform im2col transformation for 2D convolution. |
|
Deprecated alias for |
Module Contents¶
- tilelang.language.copy_op.copy(src, dst, *, coalesced_width=None, disable_tma=False, eviction_policy=None, prefer_instruction=None, annotations=None, loop_layout=None)¶
Copy data between memory regions.
- Parameters:
src (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Source memory region
dst (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Destination memory region
coalesced_width (Optional[int], keyword-only) – Width for coalesced memory access. Defaults to None.
disable_tma (bool, keyword-only) – Whether to disable TMA acceleration. Defaults to False.
eviction_policy (Optional[str], keyword-only) – Cache eviction policy. Defaults to None.
prefer_instruction (Optional[str], keyword-only) – Backend-specific preferred lowering instruction category. For CUDA, recognized values include “tma”, “cp_async”, and “sync”. For “tma”, T.copy keeps synchronous copy semantics; global -> shared copies lower through TMA with an automatically allocated barrier and wait when constraints are satisfied.
annotations (Optional[dict], keyword-only) – Additional annotations dict. If provided, coalesced_width, disable_tma, eviction_policy, and prefer_instruction can also be specified here. Values in annotations take precedence over individual arguments.
loop_layout (Optional[Fragment], keyword-only) – A parallel loop layout hint for the SIMT copy (only valid for normal SIMT copy; incompatible with TMA/LDSM/STSM/TMem). When provided, it is attached to the outermost parallel loop generated by this copy.
- Raises:
TypeError – If copy extents cannot be deduced from arguments
- Returns:
A handle to the copy operation
- Return type:
tirx.Call
Range handling notes: - Accepts Buffer/BufferRegion/BufferLoad on either side. Extents are
derived as follows: Buffer -> shape, BufferRegion -> [r.extent], BufferLoad -> extents from its inferred/encoded region.
Normally, we require the extents of both sides to be the same. If they differ, the copy instruction follows an internal rule to select one side as the base range and create iteration space. This may generate unexpected code. And if some dimensions are 1, unexpected errors may happen.
Small Optimization: If both src and dst are scalar BufferLoad without region extents, lowers to a direct store: dst[…] = src[…].
Syntactic Sugar: TileLang supports passing the head address of a buffer to represent the whole buffer if there are no ambiguity. For example, T.copy(A, A_shared[i, j]). To support this, we need some special shape checking. But remember currently we don’t support something like “broadcast”.
The finalized extents are encoded with tl.region via to_buffer_region and passed through to the backend; low-level loop construction and any scope-specific decisions happen during lowering.
- tilelang.language.copy_op.copy_cluster(src, dst, *, dst_block=None, cluster_mask=None, remote_barrier=None, eviction_policy=None, coalesced_width=None, loop_layout=None)¶
Cluster-aware copy for TMA multicast or SM-to-SM shared-memory copy.
- Parameters:
src (tilelang._typing.BufferLikeType) – Source memory region.
dst (tilelang._typing.BufferLikeType) – Destination memory region.
dst_block (int | tvm.tirx.PrimExpr | None) – Destination CTA rank in the cluster for SM-to-SM copy.
cluster_mask (int | None) – Bitmask of CTAs that participate in TMA multicast.
remote_barrier (tvm.tirx.BufferLoad | None) – Shared-memory mbarrier for asynchronous SM-to-SM copy completion signalling. The destination CTA should wait on its local copy of this barrier.
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None) – Cache eviction hint passed to the TMA instruction. Only relevant for the TMA multicast path (
cluster_maskset).coalesced_width (int | None) – Vectorization width (in elements) for the SIMT loop used on the SM-to-SM fallback path (
dst_blockset, no fast bulk-async route available).loop_layout (Any | None) – Parallel loop layout hint (Fragment) for the SIMT loop on the SM-to-SM fallback path. Incompatible with the TMA multicast path (
cluster_maskset).
- Returns:
A handle to the copy operation.
- Return type:
tirx.Call
- tilelang.language.copy_op.async_copy(src, dst, *, coalesced_width=None, annotations=None, loop_layout=None)¶
Asynchronous copy primitive lowered through cp.async.
This operator is intended for explicitly asynchronous global->shared copy. The backend enforces cp.async constraints and emits:
ptx_cp_async(…) + ptx_commit_group().
No wait is auto-inserted for T.async_copy; synchronization is explicit.
- Parameters:
src (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Source memory region
dst (Union[tirx.Buffer, tirx.BufferLoad, tirx.BufferRegion]) – Destination memory region
coalesced_width (Optional[int], keyword-only) – Width for coalesced memory access. Defaults to None.
annotations (Optional[dict], keyword-only) – Additional annotations dict.
loop_layout (Optional[Fragment], keyword-only) – A parallel loop layout hint for the SIMT copy loop.
- Returns:
A handle to the async copy operation
- Return type:
tirx.Call
- tilelang.language.copy_op.tma_copy(src, dst, *, barrier=None, leader_scope_threads=None, eviction_policy=None, annotations=None)¶
TMA copy with user-managed synchronization.
For loads (global -> shared): issues expect_tx + tma_load (no wait). Unlike T.copy() which emits a full synchronous TMA sequence (arrive + load + wait), T.tma_copy() emits only the producer part (expect_tx + tma_load). The user manages synchronization explicitly via T.barrier_arrive() and T.mbarrier_wait_parity().
barrieris required for loads.For stores (shared -> global): issues tma_store + tma_store_arrive (no wait). Unlike T.copy() which emits tma_store + tma_store_arrive + tma_store_wait, T.tma_copy() omits the wait so the user can batch multiple stores before calling T.tma_store_wait() explicitly.
barrieris not needed for stores. FP4 unpacked shared-memory storage is load-only for TMA: packed globalfloat4_e2m1fnmay be loaded into unpacked sharedfloat4_e2m1_unpacked, but the reverse TMA store is not supported.- Parameters:
src (tilelang._typing.BufferLikeType) – Source memory region (global or shared)
dst (tilelang._typing.BufferLikeType) – Destination memory region (shared or global)
barrier – Mbarrier (from T.alloc_barrier()) for TMA load synchronization. Required for loads (global -> shared). Not needed for stores. The TMA load will arrive at this barrier with expected byte count. The user must wait on the same barrier via T.mbarrier_wait_parity().
leader_scope_threads (int | None) – Number of threads in each TMA leader-election scope (e.g., 32 for per-warp). Defaults to the thread extend in the current context if not specified.
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None) – Cache eviction policy. Defaults to None.
annotations (dict | None) – Additional annotations dict. Values in annotations take precedence over individual arguments.
- Returns:
A handle to the tma_copy operation
- Return type:
tirx.Call
- tilelang.language.copy_op.tma_gather4(src, dst, col, rows, *, barrier, swizzle=None, eviction_policy=None)¶
Issue a TMA tile::gather4 load (sm_100a, Blackwell).
Loads four arbitrary rows of a 2D global tensor
srcinto a 2D shared tiledstof shape(4, K_box). The CUtensorMap descriptor (dtype + swizzle) is built by the compiler from buffer + layout info.Caller must wrap this with
T.shuffle_electand pair it withT.mbarrier_expect_tx(usetma_gather4_bytes()) before, andbarrier_arrive/mbarrier_wait_parityafter.The
swizzlekwarg is deprecated; mark the shared tile viaT.annotate_layoutfor non-default swizzle.- Parameters:
src (tvm.tirx.Buffer)
dst (tvm.tirx.Buffer)
col (tvm.tirx.PrimExpr)
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)
- tilelang.language.copy_op.tma_gather4_bytes(K_box, dtype)¶
Transaction byte count for a 4-row gather4 of width
K_box. Pass toT.mbarrier_expect_tximmediately beforeT.tma_gather4.- Parameters:
dtype (str)
- Return type:
int
- tilelang.language.copy_op.tma_scatter4(src, dst, col, rows, *, swizzle=None, eviction_policy=None)¶
Issue a TMA tile::scatter4 store (sm_100a, Blackwell).
Stores a 2D shared tile of shape
(4, K_box)to four arbitrary rows of a 2D global tensordst. Caller is responsible fortma_store_arrive/tma_store_waitand theT.shuffle_electguard. Seetma_gather4()for descriptor / swizzle inference details.- Parameters:
src (tvm.tirx.Buffer)
dst (tvm.tirx.Buffer)
col (tvm.tirx.PrimExpr)
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)
- tilelang.language.copy_op.transpose(src, dst)¶
Transpose a 2D buffer in shared memory: dst[j, i] = src[i, j].
Both src and dst should be shared memory buffers. If src has shape (M, N), dst should have shape (N, M).
- Parameters:
src (tilelang._typing.BufferLikeType) – Source buffer or region of shape (…, M, N).
dst (tilelang._typing.BufferLikeType) – Destination buffer or region of shape (…, N, M).
- Returns:
A handle to the transpose operation.
- Return type:
tirx.Call
- tilelang.language.copy_op.im2col(img, col, nhw_step, c_step, kernel, stride, dilation, pad, eviction_policy=None)¶
Perform im2col transformation for 2D convolution.
- Parameters:
img (tirx.Buffer) – Input image buffer
col (tirx.Buffer) – Output column buffer
nhw_step (tirx.PrimExpr) – Step size for batch and spatial dimensions
c_step (tirx.PrimExpr) – Step size for channel dimension
kernel (int) – Kernel size
stride (int) – Stride of the convolution
dilation (int) – Dilation rate
pad (int) – Padding size
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)
- Returns:
A handle to the im2col operation
- Return type:
tirx.Call
- tilelang.language.copy_op.c2d_im2col(img, col, nhw_step, c_step, kernel, stride, dilation, pad, eviction_policy=None)¶
Deprecated alias for
im2col().- Deprecated:
Use
im2col()instead. This alias is scheduled for removal in TileLang 0.14.0.
- Parameters:
img (tilelang._typing.BufferLikeType)
col (tilelang._typing.BufferLikeType)
nhw_step (tvm.tirx.PrimExpr)
c_step (tvm.tirx.PrimExpr)
kernel (int)
stride (int)
dilation (int)
pad (int)
eviction_policy (Literal['evict_normal', 'evict_first', 'evict_last'] | None)
- Return type:
tvm.tirx.PrimExpr