tilelang.contrib.cutedsl.reduce¶
Reduce operations for CuTeDSL backend. Based on tl_templates/cuda/reduce.h
Classes¶
Sum reduction operator |
|
Max reduction operator |
|
Min reduction operator |
|
Bitwise AND reduction operator |
|
Bitwise OR reduction operator |
|
Bitwise XOR reduction operator |
|
1D cumulative sum operation. |
|
2D cumulative sum operation. |
|
Named barrier policy for AllReduce, uses bar.sync instead of __syncthreads. |
Functions¶
|
Type-aware min: uses arith.minsi for integers, nvvm.fmin for floats. |
|
Type-aware max: uses arith.maxsi for integers, nvvm.fmax for floats. |
|
|
|
|
|
AllReduce operation implementing warp/block-level reduction. |
Module Contents¶
- tilelang.contrib.cutedsl.reduce.min(a, b, c=None)¶
Type-aware min: uses arith.minsi for integers, nvvm.fmin for floats. Falls back to integer path if float conversion fails (signless int types).
- tilelang.contrib.cutedsl.reduce.max(a, b, c=None)¶
Type-aware max: uses arith.maxsi for integers, nvvm.fmax for floats. Falls back to integer path if float conversion fails (signless int types).
- class tilelang.contrib.cutedsl.reduce.BitAndOp¶
Bitwise AND reduction operator
- static __call__(x, y)¶
- class tilelang.contrib.cutedsl.reduce.BitOrOp¶
Bitwise OR reduction operator
- static __call__(x, y)¶
- class tilelang.contrib.cutedsl.reduce.BitXorOp¶
Bitwise XOR reduction operator
- static __call__(x, y)¶
- tilelang.contrib.cutedsl.reduce.bar_sync(barrier_id, number_of_threads)¶
- tilelang.contrib.cutedsl.reduce.bar_sync_ptx(barrier_id, number_of_threads)¶
- class tilelang.contrib.cutedsl.reduce.CumSum1D(threads, reverse)¶
1D cumulative sum operation. Based on tl::CumSum1D from reduce.h
- Template params:
threads: Number of threads reverse: Whether to cumsum in reverse order
- Parameters:
threads (cutlass.Constexpr[int])
reverse (cutlass.Constexpr[bool])
- threads¶
- reverse¶
- SEG = 32¶
- run(src, dst, N)¶
Perform 1D cumulative sum.
- Parameters:
src (cutlass.cute.Pointer) – Source pointer
dst (cutlass.cute.Pointer) – Destination pointer
N – Number of elements (must be compile-time constant or small)
- class tilelang.contrib.cutedsl.reduce.CumSum2D(threads, dim, reverse)¶
2D cumulative sum operation. Based on tl::CumSum2D from reduce.h
- Template params:
threads: Number of threads (must be power of 2, 32-1024) dim: Axis along which to cumsum (0 or 1) reverse: Whether to cumsum in reverse order
- Parameters:
threads (cutlass.Constexpr[int])
dim (cutlass.Constexpr[int])
reverse (cutlass.Constexpr[bool])
- threads¶
- dim¶
- reverse¶
- SEG = 32¶
- TILE_H¶
- run(src, dst, H, W)¶
Perform 2D cumulative sum.
- Parameters:
src (cutlass.cute.Pointer) – Source pointer
dst (cutlass.cute.Pointer) – Destination pointer
H – Number of rows
W – Number of columns (should be <= 32 for single-segment case)
- class tilelang.contrib.cutedsl.reduce.NamedBarrier(all_threads)¶
Named barrier policy for AllReduce, uses bar.sync instead of __syncthreads. Based on tl::NamedBarrier<all_threads> from reduce.h
- all_threads¶
- tilelang.contrib.cutedsl.reduce.AllReduce(reducer, threads, scale, thread_offset, all_threads=None)¶
AllReduce operation implementing warp/block-level reduction. Based on tl::AllReduce from reduce.h
- Parameters:
reducer – Reducer operator class (SumOp, MaxOp, etc.)
threads – Number of threads participating in reduction
scale – Reduction scale factor
thread_offset – Thread ID offset
all_threads – Total number of threads in block
- Returns:
A callable object with run() and run_hopper() methods