tilelang.contrib.cutedsl.cpasync¶

Functions¶

cp_async_gs(size, dst, src)

cp_async_gs_conditional(size, dst, src, cond)

extract_tensormap_ptr(tma_atom, *[, loc, ip])

Extract the tensormap pointer from a TMA Copy Atom.

tma_load(tma_desc, mbar, smem_ptr, crd, *[, loc, ip])

Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).

tma_store(tma_desc, smem_ptr, crd, *[, loc, ip])

Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).

tma_reduce(tma_desc, smem_ptr, crd, *[, loc, ip])

Reduce data from shared memory to global memory using TMA with atomic ADD reduction.

tma_store_arrive(*[, loc, ip])

Indicate arrival of the warp issuing a TMA_STORE.

tma_store_wait(count, *[, read, loc, ip])

Wait for TMA_STORE operations to complete.

cp_async_shared_global(dst, src, cp_size, modifier, *)

Asynchronously copy data from global memory to shared memory.

prefetch_tma_descriptor(tma_desc, *[, loc, ip])

Prefetch a TMA descriptor.

mbarrier_wait(mbar_ptr, phase[, timeout_ns, loc, ip])

Waits on an mbarrier with the specified phase (blocking loop).

mbarrier_cp_async_arrive(mbar_ptr, *[, loc, ip])

fence_proxy_async()

fence_barrier_init()

Module Contents¶

tilelang.contrib.cutedsl.cpasync.cp_async_gs(size, dst, src)¶
tilelang.contrib.cutedsl.cpasync.cp_async_gs_conditional(size, dst, src, cond)¶
tilelang.contrib.cutedsl.cpasync.extract_tensormap_ptr(tma_atom, *, loc=None, ip=None)¶

Extract the tensormap pointer from a TMA Copy Atom.

Parameters:

tma_atom (cutlass.cute.CopyAtom)

Return type:

cutlass.cute.Pointer

tilelang.contrib.cutedsl.cpasync.tma_load(tma_desc, mbar, smem_ptr, crd, *, loc=None, ip=None)¶

Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).

Parameters:
  • tma_desc (CopyAtom or tensormap_ptr or Tensor of tensormap_ptr) – TMA descriptor for the tensor

  • mbar (Pointer) – Mbarrier pointer in shared memory

  • smem_ptr (Pointer) – Destination pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None
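
A TMA load completes asynchronously and signals the mbarrier passed as `mbar` once the data has landed in shared memory. To illustrate the transaction-count protocol behind that handshake, here is a pure-Python model; the class and method names are illustrative, not this module's API, and the real barrier lives in shared memory and is driven by the TMA hardware unit.

```python
# Pure-Python model of the mbarrier transaction-count protocol that a
# TMA load participates in. Illustrative only -- not GPU code.

class MbarrierModel:
    def __init__(self):
        self.phase = 0          # parity bit observed by waiters
        self.expected_tx = 0    # bytes still expected in this phase

    def expect_tx(self, nbytes):
        # Producer announces how many bytes the TMA load will deliver.
        self.expected_tx += nbytes

    def complete_tx(self, nbytes):
        # The TMA unit signals completion as data lands in shared memory.
        self.expected_tx -= nbytes
        if self.expected_tx == 0:
            self.phase ^= 1     # barrier flips phase; waiters may proceed

    def try_wait(self, parity):
        # A consumer's wait succeeds once the phase has moved past `parity`.
        return self.phase != parity

mbar = MbarrierModel()
mbar.expect_tx(4096)            # e.g. bytes of one tile to be loaded
assert not mbar.try_wait(0)     # data has not arrived yet
mbar.complete_tx(4096)          # the load finished
assert mbar.try_wait(0)         # consumer unblocks
```

The producer side (announce expected bytes, issue the load) and the consumer side (wait on the parity) map onto `tma_load` plus `mbarrier_wait` in this module.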

tilelang.contrib.cutedsl.cpasync.tma_store(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶

Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).

Parameters:
  • tma_desc (TMA descriptor) – TMA descriptor for the tensor

  • smem_ptr (Pointer) – Source pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_reduce(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶

Reduce data from shared memory to global memory using TMA with atomic ADD reduction.

This performs an atomic add of shared memory data to global memory using the TMA unit’s reduce capability.

Parameters:
  • tma_desc (TMA descriptor) – TMA descriptor for the tensor

  • smem_ptr (Pointer) – Source pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None
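
The difference from `tma_store` is that each element is accumulated into global memory rather than overwriting it. A pure-Python sketch of the ADD semantics, with an illustrative function name (the hardware performs the adds atomically; this model does not):

```python
# Pure-Python model of tma_reduce's atomic-ADD semantics: each
# shared-memory element is added into the corresponding global-memory
# element instead of replacing it. Illustrative only.

def tma_reduce_add_model(gmem, smem, offset):
    for i, v in enumerate(smem):
        gmem[offset + i] += v   # the hardware performs this add atomically

gmem = [10.0, 10.0, 10.0, 10.0]
tma_reduce_add_model(gmem, [1.0, 2.0], offset=1)
assert gmem == [10.0, 11.0, 12.0, 10.0]
```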

tilelang.contrib.cutedsl.cpasync.tma_store_arrive(*, loc=None, ip=None)¶

Indicate arrival of the warp issuing a TMA_STORE. Corresponds to PTX instruction: cp.async.bulk.commit_group;

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_store_wait(count, *, read=None, loc=None, ip=None)¶

Wait for TMA_STORE operations to complete. Corresponds to PTX instruction: cp.async.bulk.wait_group.read <count>;

Parameters:

count (Int) – Wait until at most this many bulk async groups remain outstanding

Return type:

None
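
`tma_store_arrive` and `tma_store_wait` implement the PTX bulk async-group discipline: each commit_group closes off the stores issued so far into one group, and wait_group(N) blocks until at most N committed groups are still in flight. A pure-Python model of that bookkeeping (class and method names are illustrative; the hardware tracks groups per thread and retires them as stores actually complete):

```python
# Pure-Python model of the PTX bulk async-group semantics behind
# tma_store_arrive (cp.async.bulk.commit_group) and
# tma_store_wait (cp.async.bulk.wait_group.read <count>).

class BulkAsyncGroups:
    def __init__(self):
        self.pending = []       # committed but incomplete groups, oldest first
        self.uncommitted = 0    # stores issued since the last commit_group

    def issue_store(self):      # e.g. one tma_store call
        self.uncommitted += 1

    def commit_group(self):     # tma_store_arrive
        self.pending.append(self.uncommitted)
        self.uncommitted = 0

    def wait_group(self, count):  # tma_store_wait(count)
        # Block until at most `count` groups remain outstanding; the model
        # "completes" the oldest groups to represent hardware retirement.
        while len(self.pending) > count:
            self.pending.pop(0)

g = BulkAsyncGroups()
g.issue_store(); g.commit_group()   # group 0
g.issue_store(); g.commit_group()   # group 1
g.wait_group(1)                     # the oldest group must have completed
assert len(g.pending) == 1
g.wait_group(0)                     # drain everything
assert len(g.pending) == 0
```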

tilelang.contrib.cutedsl.cpasync.cp_async_shared_global(dst, src, cp_size, modifier, *, src_size=None, loc=None, ip=None)¶

Asynchronously copy data from global memory to shared memory.

Parameters:
  • dst (Pointer) – Destination pointer in shared memory

  • src (Pointer) – Source pointer in global memory

  • cp_size (Int) – Size of the copy in bytes

  • modifier (Int) – Cache modifier

  • src_size (cutlass.cute.typing.Int) – Optional number of valid source bytes; per the PTX cp.async semantics, destination bytes beyond src_size are zero-filled

Return type:

None
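
The `cp_size`/`src_size` pair follows the PTX cp.async rule: src-size must not exceed cp-size, and when it is smaller the trailing destination bytes are filled with zeros. A pure-Python model of that rule (the function name is illustrative; the real copy is issued by the cp.async hardware path):

```python
# Pure-Python model of the cp.async cp-size/src-size rule exposed by
# cp_async_shared_global: when src_size < cp_size, the tail of the
# destination is zero-filled. Illustrative only.

def cp_async_model(src, cp_size, src_size=None):
    if src_size is None:
        src_size = cp_size
    assert src_size <= cp_size, "PTX requires src-size <= cp-size"
    return src[:src_size] + b"\x00" * (cp_size - src_size)

data = bytes(range(16))
assert cp_async_model(data, 16) == data            # full 16-byte copy
out = cp_async_model(data, 16, src_size=4)         # partial copy
assert out[:4] == data[:4] and out[4:] == b"\x00" * 12
```

This zero-fill behavior is what makes `src_size` useful for out-of-bounds tiles: the in-bounds prefix is copied and the remainder of the shared-memory destination is cleared.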

tilelang.contrib.cutedsl.cpasync.prefetch_tma_descriptor(tma_desc, *, loc=None, ip=None)¶

Prefetch a TMA descriptor. Corresponds to PTX instruction: prefetch.tensormap;

Return type:

None

tilelang.contrib.cutedsl.cpasync.mbarrier_wait(mbar_ptr, phase, timeout_ns=10000000, *, loc=None, ip=None)¶

Waits on an mbarrier with the specified phase (blocking loop).

Uses inline PTX to loop until the try_wait succeeds. The CUDA backend does: while (!mbar.try_wait(parity)) {}

Parameters:
  • mbar_ptr (cutlass.cute.typing.Pointer)

  • phase (cutlass.cute.typing.Int)

  • timeout_ns (cutlass.cute.typing.Int)

Return type:

None
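
The emitted loop has the same shape as the CUDA backend's `while (!mbar.try_wait(parity)) {}`. A pure-Python model of that spin loop, with a spin cap standing in for the `timeout_ns` escape hatch (the function name, the cap, and the raised exception are all illustrative, not this module's behavior):

```python
# Pure-Python model of the blocking loop mbarrier_wait emits: spin until
# try_wait succeeds for the given phase parity. Illustrative only.

def mbarrier_wait_model(try_wait, phase, max_spins=1_000_000):
    spins = 0
    while not try_wait(phase):      # same shape as the inline-PTX loop
        spins += 1
        if spins >= max_spins:      # illustrative stand-in for timeout_ns
            raise TimeoutError("mbarrier wait timed out")
    return spins

# A fake barrier whose wait succeeds on the fourth poll.
polls = {"n": 0}
def fake_try_wait(phase):
    polls["n"] += 1
    return polls["n"] > 3

assert mbarrier_wait_model(fake_try_wait, phase=0) == 3   # three failed spins
```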

tilelang.contrib.cutedsl.cpasync.mbarrier_cp_async_arrive(mbar_ptr, *, loc=None, ip=None)¶
Parameters:

mbar_ptr (cutlass.cute.typing.Pointer)

Return type:

None

tilelang.contrib.cutedsl.cpasync.fence_proxy_async()¶
tilelang.contrib.cutedsl.cpasync.fence_barrier_init()¶