tilelang.contrib.cutedsl.cpasync¶

Attributes¶

Functions¶

cp_async_gs(size, dst, dst_offset, src, src_offset)

cp_async_gs_conditional(size, dst, dst_offset, src, ...)

extract_tensormap_ptr(tma_atom, *[, loc, ip])

extract the tensormap pointer from a TMA Copy Atom.

tma_load(tma_desc, mbar, smem_ptr, crd, *[, loc, ip])

Load data from global memory to shared memory using TMA (Tensor Memory Access).

tma_store(tma_desc, smem_ptr, crd, *[, loc, ip])

Store data from shared memory to global memory using TMA (Tensor Memory Access).

tma_store_arrive(*[, loc, ip])

Indicate arrival of warp issuing TMA_STORE.

tma_store_wait(count, *[, read, loc, ip])

Wait for TMA_STORE operations to complete.

cp_async_shared_global(dst, src, cp_size, modifier, *)

Asynchronously copy data from global memory to shared memory.

prefetch_tma_descriptor(tma_desc, *[, loc, ip])

Prefetch a TMA descriptor.

Module Contents¶

tilelang.contrib.cutedsl.cpasync.BYTES_PER_TENSORMAP = 128¶
tilelang.contrib.cutedsl.cpasync.BYTES_PER_POINTER = 8¶
tilelang.contrib.cutedsl.cpasync.cp_async_gs(size, dst, dst_offset, src, src_offset)¶
tilelang.contrib.cutedsl.cpasync.cp_async_gs_conditional(size, dst, dst_offset, src, src_offset, cond)¶
tilelang.contrib.cutedsl.cpasync.extract_tensormap_ptr(tma_atom, *, loc=None, ip=None)¶

extract the tensormap pointer from a TMA Copy Atom. :param tma_atom: The TMA Copy Atom :type tma_atom: CopyAtom

Parameters:

tma_atom (cutlass.cute.CopyAtom)

Return type:

cutlass.cute.Pointer

tilelang.contrib.cutedsl.cpasync.tma_load(tma_desc, mbar, smem_ptr, crd, *, loc=None, ip=None)¶

Load data from global memory to shared memory using TMA (Tensor Memory Access).

Parameters:
  • tma_desc (CopyAtom or tensormap_ptr or Tensor of tensormap_ptr) – TMA descriptor for the tensor

  • mbar (Pointer) – Mbarrier pointer in shared memory

  • smem_ptr (Pointer) – Destination pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_store(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶

Store data from shared memory to global memory using TMA (Tensor Memory Access).

Parameters:
  • tma_desc (TMA descriptor) – TMA descriptor for the tensor

  • smem_ptr (Pointer) – Source pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_store_arrive(*, loc=None, ip=None)¶

Indicate arrival of warp issuing TMA_STORE. Corresponds to PTX instruction: cp.async.bulk.commit_group;

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_store_wait(count, *, read=None, loc=None, ip=None)¶

Wait for TMA_STORE operations to complete. Corresponds to PTX instruction: cp.async.bulk.wait_group.read <count>;

Parameters:

count (Int) – The number of outstanding bulk async groups to wait for

Return type:

None

tilelang.contrib.cutedsl.cpasync.cp_async_shared_global(dst, src, cp_size, modifier, *, src_size=None, loc=None, ip=None)¶

Asynchronously copy data from global memory to shared memory.

Parameters:
  • dst (Pointer) – Destination pointer in shared memory

  • src (Pointer) – Source pointer in global memory

  • size (Int) – Size of the copy in bytes

  • modifier (Int) – Cache modifier

  • cp_size (Int) – Optional copy size override

  • src_size (cutlass.cute.typing.Int)

Return type:

None

tilelang.contrib.cutedsl.cpasync.prefetch_tma_descriptor(tma_desc, *, loc=None, ip=None)¶

Prefetch a TMA descriptor. Corresponds to PTX instruction: prefetch.tensormap;

Return type:

None