tilelang.contrib.cutedsl.cpasync¶

Functions¶

cp_async_gs(size, dst, src)

cp_async_gs_conditional(size, dst, src, cond)

extract_tensormap_ptr(tma_atom, *[, loc, ip])

Extract the tensormap pointer from a TMA Copy Atom.

tma_load(tma_desc, mbar, smem_ptr, crd, *[, loc, ip])

Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).

tma_store(tma_desc, smem_ptr, crd, *[, loc, ip])

Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).

tma_reduce(tma_desc, smem_ptr, crd, *[, loc, ip])

Reduce data from shared memory to global memory using TMA with atomic ADD reduction.

tma_store_arrive(*[, loc, ip])

Indicate arrival of the warp issuing a TMA_STORE.

tma_store_wait(count, *[, read, loc, ip])

Wait for TMA_STORE operations to complete.

cp_async_shared_global(dst, src, cp_size, modifier, *)

Asynchronously copy data from global memory to shared memory.

prefetch_tma_descriptor(tma_desc, *[, loc, ip])

Prefetch a TMA descriptor.

mbarrier_wait(mbar_ptr, phase[, timeout_ns, loc, ip])

Waits on an mbarrier with the specified phase (blocking loop).

mbarrier_cp_async_arrive(mbar_ptr, *[, loc, ip])

fence_proxy_async()

fence_barrier_init()

Module Contents¶

tilelang.contrib.cutedsl.cpasync.cp_async_gs(size, dst, src)¶
tilelang.contrib.cutedsl.cpasync.cp_async_gs_conditional(size, dst, src, cond)¶
tilelang.contrib.cutedsl.cpasync.extract_tensormap_ptr(tma_atom, *, loc=None, ip=None)¶

Extract the tensormap pointer from a TMA Copy Atom.

Parameters:

tma_atom (cutlass.cute.CopyAtom)

Return type:

cutlass.cute.Pointer

tilelang.contrib.cutedsl.cpasync.tma_load(tma_desc, mbar, smem_ptr, crd, *, loc=None, ip=None)¶

Load data from global memory to shared memory using TMA (Tensor Memory Accelerator).

Parameters:
  • tma_desc (CopyAtom or tensormap_ptr or Tensor of tensormap_ptr) – TMA descriptor for the tensor

  • mbar (Pointer) – Mbarrier pointer in shared memory

  • smem_ptr (Pointer) – Destination pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None
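
A TMA load completes asynchronously and signals the mbarrier passed as `mbar` once the data has landed in shared memory. To illustrate the transaction-count protocol behind that handshake, here is a pure-Python model; the class and method names are illustrative, not this module's API, and the real barrier lives in shared memory and is driven by the TMA hardware unit.

```python
# Pure-Python model of the mbarrier transaction-count protocol that a
# TMA load participates in. Illustrative only -- not GPU code.

class MbarrierModel:
    def __init__(self):
        self.phase = 0          # parity bit observed by waiters
        self.expected_tx = 0    # bytes still expected in this phase

    def expect_tx(self, nbytes):
        # Producer announces how many bytes the TMA load will deliver.
        self.expected_tx += nbytes

    def complete_tx(self, nbytes):
        # The TMA unit signals completion as data lands in shared memory.
        self.expected_tx -= nbytes
        if self.expected_tx == 0:
            self.phase ^= 1     # barrier flips phase; waiters may proceed

    def try_wait(self, parity):
        # A consumer's wait succeeds once the phase has moved past `parity`.
        return self.phase != parity

mbar = MbarrierModel()
mbar.expect_tx(4096)            # e.g. bytes of one tile to be loaded
assert not mbar.try_wait(0)     # data has not arrived yet
mbar.complete_tx(4096)          # the load finished
assert mbar.try_wait(0)         # consumer unblocks
```

The producer side (announce expected bytes, issue the load) and the consumer side (wait on the parity) map onto `tma_load` plus `mbarrier_wait` in this module.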

tilelang.contrib.cutedsl.cpasync.tma_store(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶

Store data from shared memory to global memory using TMA (Tensor Memory Accelerator).

Parameters:
  • tma_desc (TMA descriptor) – TMA descriptor for the tensor

  • smem_ptr (Pointer) – Source pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_reduce(tma_desc, smem_ptr, crd, *, loc=None, ip=None)¶

Reduce data from shared memory to global memory using TMA with atomic ADD reduction.

This performs an atomic add of shared memory data to global memory using the TMA unit’s reduce capability.

Parameters:
  • tma_desc (TMA descriptor) – TMA descriptor for the tensor

  • smem_ptr (Pointer) – Source pointer in shared memory

  • crd (tuple[Int, ...]) – Coordinates tuple for the tensor access

Return type:

None
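
The difference from `tma_store` is that each element is accumulated into global memory rather than overwriting it. A pure-Python sketch of the ADD semantics, with an illustrative function name (the hardware performs the adds atomically; this model does not):

```python
# Pure-Python model of tma_reduce's atomic-ADD semantics: each
# shared-memory element is added into the corresponding global-memory
# element instead of replacing it. Illustrative only.

def tma_reduce_add_model(gmem, smem, offset):
    for i, v in enumerate(smem):
        gmem[offset + i] += v   # the hardware performs this add atomically

gmem = [10.0, 10.0, 10.0, 10.0]
tma_reduce_add_model(gmem, [1.0, 2.0], offset=1)
assert gmem == [10.0, 11.0, 12.0, 10.0]
```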

tilelang.contrib.cutedsl.cpasync.tma_store_arrive(*, loc=None, ip=None)¶

Indicate arrival of the warp issuing a TMA_STORE. Corresponds to PTX instruction: cp.async.bulk.commit_group;

Return type:

None

tilelang.contrib.cutedsl.cpasync.tma_store_wait(count, *, read=None, loc=None, ip=None)¶

Wait for TMA_STORE operations to complete. Corresponds to PTX instruction: cp.async.bulk.wait_group.read <count>;

Parameters:

count (Int) – Wait until at most this many bulk async groups remain outstanding

Return type:

None
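
`tma_store_arrive` and `tma_store_wait` implement the PTX bulk async-group discipline: each commit_group closes off the stores issued so far into one group, and wait_group(N) blocks until at most N committed groups are still in flight. A pure-Python model of that bookkeeping (class and method names are illustrative; the hardware tracks groups per thread and retires them as stores actually complete):

```python
# Pure-Python model of the PTX bulk async-group semantics behind
# tma_store_arrive (cp.async.bulk.commit_group) and
# tma_store_wait (cp.async.bulk.wait_group.read <count>).

class BulkAsyncGroups:
    def __init__(self):
        self.pending = []       # committed but incomplete groups, oldest first
        self.uncommitted = 0    # stores issued since the last commit_group

    def issue_store(self):      # e.g. one tma_store call
        self.uncommitted += 1

    def commit_group(self):     # tma_store_arrive
        self.pending.append(self.uncommitted)
        self.uncommitted = 0

    def wait_group(self, count):  # tma_store_wait(count)
        # Block until at most `count` groups remain outstanding; the model
        # "completes" the oldest groups to represent hardware retirement.
        while len(self.pending) > count:
            self.pending.pop(0)

g = BulkAsyncGroups()
g.issue_store(); g.commit_group()   # group 0
g.issue_store(); g.commit_group()   # group 1
g.wait_group(1)                     # the oldest group must have completed
assert len(g.pending) == 1
g.wait_group(0)                     # drain everything
assert len(g.pending) == 0
```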

tilelang.contrib.cutedsl.cpasync.cp_async_shared_global(dst, src, cp_size, modifier, *, src_size=None, loc=None, ip=None)¶

Asynchronously copy data from global memory to shared memory.

Parameters:
  • dst (Pointer) – Destination pointer in shared memory

  • src (Pointer) – Source pointer in global memory

  • cp_size (Int) – Size of the copy in bytes

  • modifier (Int) – Cache modifier

  • src_size (cutlass.cute.typing.Int) – Optional number of valid source bytes; per the PTX cp.async semantics, destination bytes beyond src_size are zero-filled

Return type:

None
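
The `cp_size`/`src_size` pair follows the PTX cp.async rule: src-size must not exceed cp-size, and when it is smaller the trailing destination bytes are filled with zeros. A pure-Python model of that rule (the function name is illustrative; the real copy is issued by the cp.async hardware path):

```python
# Pure-Python model of the cp.async cp-size/src-size rule exposed by
# cp_async_shared_global: when src_size < cp_size, the tail of the
# destination is zero-filled. Illustrative only.

def cp_async_model(src, cp_size, src_size=None):
    if src_size is None:
        src_size = cp_size
    assert src_size <= cp_size, "PTX requires src-size <= cp-size"
    return src[:src_size] + b"\x00" * (cp_size - src_size)

data = bytes(range(16))
assert cp_async_model(data, 16) == data            # full 16-byte copy
out = cp_async_model(data, 16, src_size=4)         # partial copy
assert out[:4] == data[:4] and out[4:] == b"\x00" * 12
```

This zero-fill behavior is what makes `src_size` useful for out-of-bounds tiles: the in-bounds prefix is copied and the remainder of the shared-memory destination is cleared.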

tilelang.contrib.cutedsl.cpasync.prefetch_tma_descriptor(tma_desc, *, loc=None, ip=None)¶

Prefetch a TMA descriptor. Corresponds to PTX instruction: prefetch.tensormap;

Return type:

None

tilelang.contrib.cutedsl.cpasync.mbarrier_wait(mbar_ptr, phase, timeout_ns=10000000, *, loc=None, ip=None)¶

Waits on an mbarrier with the specified phase (blocking loop).

Uses inline PTX to loop until the try_wait succeeds. The CUDA backend does: while (!mbar.try_wait(parity)) {}

Parameters:
  • mbar_ptr (cutlass.cute.typing.Pointer)

  • phase (cutlass.cute.typing.Int)

  • timeout_ns (cutlass.cute.typing.Int)

Return type:

None
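
The emitted loop has the same shape as the CUDA backend's `while (!mbar.try_wait(parity)) {}`. A pure-Python model of that spin loop, with a spin cap standing in for the `timeout_ns` escape hatch (the function name, the cap, and the raised exception are all illustrative, not this module's behavior):

```python
# Pure-Python model of the blocking loop mbarrier_wait emits: spin until
# try_wait succeeds for the given phase parity. Illustrative only.

def mbarrier_wait_model(try_wait, phase, max_spins=1_000_000):
    spins = 0
    while not try_wait(phase):      # same shape as the inline-PTX loop
        spins += 1
        if spins >= max_spins:      # illustrative stand-in for timeout_ns
            raise TimeoutError("mbarrier wait timed out")
    return spins

# A fake barrier whose wait succeeds on the fourth poll.
polls = {"n": 0}
def fake_try_wait(phase):
    polls["n"] += 1
    return polls["n"] > 3

assert mbarrier_wait_model(fake_try_wait, phase=0) == 3   # three failed spins
```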

tilelang.contrib.cutedsl.cpasync.mbarrier_cp_async_arrive(mbar_ptr, *, loc=None, ip=None)¶
Parameters:

mbar_ptr (cutlass.cute.typing.Pointer)

Return type:

None

tilelang.contrib.cutedsl.cpasync.fence_proxy_async()¶
tilelang.contrib.cutedsl.cpasync.fence_barrier_init()¶