tilelang.contrib.cutedsl.cpasync¶
Functions¶
|
|
|
|
|
extract the tensormap pointer from a TMA Copy Atom. |
|
Load data from global memory to shared memory using TMA (Tensor Memory Access). |
|
Store data from shared memory to global memory using TMA (Tensor Memory Access). |
|
Reduce data from shared memory to global memory using TMA with atomic ADD reduction. |
|
Indicate arrival of warp issuing TMA_STORE. |
|
Wait for TMA_STORE operations to complete. |
|
Asynchronously copy data from global memory to shared memory. |
|
Prefetch a TMA descriptor. |
|
Waits on a mbarrier with a specified phase (blocking loop). |
|
|
Module Contents¶
- tilelang.contrib.cutedsl.cpasync.cp_async_gs(size, dst, src)¶
- tilelang.contrib.cutedsl.cpasync.cp_async_gs_conditional(size, dst, src, cond)¶
- tilelang.contrib.cutedsl.cpasync.extract_tensormap_ptr(tma_atom, *, loc=None, ip=None)¶
extract the tensormap pointer from a TMA Copy Atom. :param tma_atom: The TMA Copy Atom :type tma_atom: CopyAtom
- Parameters:
tma_atom (cutlass.cute.CopyAtom)
- Return type:
cutlass.cute.Pointer
- tilelang.contrib.cutedsl.cpasync.tma_load(tma_desc, mbar, smem_ptr, crd, *, loc=None, ip=None)¶
Load data from global memory to shared memory using TMA (Tensor Memory Access).
- Parameters:
tma_desc (CopyAtom or tensormap_ptr or Tensor of tensormap_ptr) – TMA descriptor for the tensor
mbar (Pointer) – Mbarrier pointer in shared memory
smem_ptr (Pointer) – Destination pointer in shared memory
crd (tuple[Int, ...]) – Coordinates tuple for the tensor access
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.tma_store(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶
Store data from shared memory to global memory using TMA (Tensor Memory Access).
- Parameters:
tma_desc (TMA descriptor) – TMA descriptor for the tensor
smem_ptr (Pointer) – Source pointer in shared memory
crd (tuple[Int, ...]) – Coordinates tuple for the tensor access
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.tma_reduce(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶
Reduce data from shared memory to global memory using TMA with atomic ADD reduction.
This performs an atomic add of shared memory data to global memory using the TMA unit’s reduce capability.
- Parameters:
tma_desc (TMA descriptor) – TMA descriptor for the tensor
smem_ptr (Pointer) – Source pointer in shared memory
crd (tuple[Int, ...]) – Coordinates tuple for the tensor access
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.tma_store_arrive(*, loc=None, ip=None)¶
Indicate arrival of warp issuing TMA_STORE. Corresponds to PTX instruction: cp.async.bulk.commit_group;
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.tma_store_wait(count, *, read=None, loc=None, ip=None)¶
Wait for TMA_STORE operations to complete. Corresponds to PTX instruction: cp.async.bulk.wait_group.read <count>;
- Parameters:
count (Int) – The number of outstanding bulk async groups to wait for
- Return type:
None
Asynchronously copy data from global memory to shared memory.
- Parameters:
dst (Pointer) – Destination pointer in shared memory
src (Pointer) – Source pointer in global memory
size (Int) – Size of the copy in bytes
modifier (Int) – Cache modifier
cp_size (Int) – Optional copy size override
src_size (cutlass.cute.typing.Int)
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.prefetch_tma_descriptor(tma_desc, *, loc=None, ip=None)¶
Prefetch a TMA descriptor. Corresponds to PTX instruction: prefetch.tensormap;
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.mbarrier_wait(mbar_ptr, phase, timeout_ns=10000000, *, loc=None, ip=None)¶
Waits on a mbarrier with a specified phase (blocking loop).
Uses inline PTX to loop until the try_wait succeeds. The CUDA backend does: while (!mbar.try_wait(parity)) {}
- Parameters:
mbar_ptr (cutlass.cute.typing.Pointer)
phase (cutlass.cute.typing.Int)
timeout_ns (cutlass.cute.typing.Int)
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.mbarrier_cp_async_arrive(mbar_ptr, *, loc=None, ip=None)¶
- Parameters:
mbar_ptr (cutlass.cute.typing.Pointer)
- Return type:
None
- tilelang.contrib.cutedsl.cpasync.fence_proxy_async()¶
- tilelang.contrib.cutedsl.cpasync.fence_barrier_init()¶