tilelang.contrib.cutedsl.reduce¶

Reduce operations for CuTeDSL backend. Based on tl_templates/cuda/reduce.h

Classes¶

SumOp

Sum reduction operator

MaxOp

Max reduction operator

MinOp

Min reduction operator

BitAndOp

Bitwise AND reduction operator

BitOrOp

Bitwise OR reduction operator

BitXorOp

Bitwise XOR reduction operator

CumSum1D

1D cumulative sum operation.

CumSum2D

2D cumulative sum operation.

NamedBarrier

Named barrier policy for AllReduce, uses bar.sync instead of __syncthreads.

Functions¶

min(a, b[, c])

Type-aware min: uses arith.minsi for integers, nvvm.fmin for floats.

max(a, b[, c])

Type-aware max: uses arith.maxsi for integers, nvvm.fmax for floats.

bar_sync(barrier_id, number_of_threads)

bar_sync_ptx(barrier_id, number_of_threads)

AllReduce(reducer, threads, scale, thread_offset[, ...])

AllReduce operation implementing warp/block-level reduction.

Module Contents¶

tilelang.contrib.cutedsl.reduce.min(a, b, c=None)¶

Type-aware min: uses arith.minsi for integers, nvvm.fmin for floats. Falls back to integer path if float conversion fails (signless int types).

tilelang.contrib.cutedsl.reduce.max(a, b, c=None)¶

Type-aware max: uses arith.maxsi for integers, nvvm.fmax for floats. Falls back to integer path if float conversion fails (signless int types).

class tilelang.contrib.cutedsl.reduce.SumOp¶

Sum reduction operator

static __call__(x, y)¶
class tilelang.contrib.cutedsl.reduce.MaxOp¶

Max reduction operator

static __call__(x, y)¶
class tilelang.contrib.cutedsl.reduce.MinOp¶

Min reduction operator

static __call__(x, y)¶
class tilelang.contrib.cutedsl.reduce.BitAndOp¶

Bitwise AND reduction operator

static __call__(x, y)¶
class tilelang.contrib.cutedsl.reduce.BitOrOp¶

Bitwise OR reduction operator

static __call__(x, y)¶
class tilelang.contrib.cutedsl.reduce.BitXorOp¶

Bitwise XOR reduction operator

static __call__(x, y)¶
tilelang.contrib.cutedsl.reduce.bar_sync(barrier_id, number_of_threads)¶
tilelang.contrib.cutedsl.reduce.bar_sync_ptx(barrier_id, number_of_threads)¶
class tilelang.contrib.cutedsl.reduce.CumSum1D(threads, reverse)¶

1D cumulative sum operation. Based on tl::CumSum1D from reduce.h

Template params:

threads: Number of threads reverse: Whether to cumsum in reverse order

Parameters:
  • threads (cutlass.Constexpr[int])

  • reverse (cutlass.Constexpr[bool])

threads¶
reverse¶
SEG = 32¶
run(src, dst, N)¶

Perform 1D cumulative sum.

Parameters:
  • src (cutlass.cute.Pointer) – Source pointer

  • dst (cutlass.cute.Pointer) – Destination pointer

  • N – Number of elements (must be compile-time constant or small)

class tilelang.contrib.cutedsl.reduce.CumSum2D(threads, dim, reverse)¶

2D cumulative sum operation. Based on tl::CumSum2D from reduce.h

Template params:

threads: Number of threads (must be power of 2, 32-1024) dim: Axis along which to cumsum (0 or 1) reverse: Whether to cumsum in reverse order

Parameters:
  • threads (cutlass.Constexpr[int])

  • dim (cutlass.Constexpr[int])

  • reverse (cutlass.Constexpr[bool])

threads¶
dim¶
reverse¶
SEG = 32¶
TILE_H¶
run(src, dst, H, W)¶

Perform 2D cumulative sum.

Parameters:
  • src (cutlass.cute.Pointer) – Source pointer

  • dst (cutlass.cute.Pointer) – Destination pointer

  • H – Number of rows

  • W – Number of columns (should be <= 32 for single-segment case)

class tilelang.contrib.cutedsl.reduce.NamedBarrier(all_threads)¶

Named barrier policy for AllReduce, uses bar.sync instead of __syncthreads. Based on tl::NamedBarrier<all_threads> from reduce.h

all_threads¶
tilelang.contrib.cutedsl.reduce.AllReduce(reducer, threads, scale, thread_offset, all_threads=None)¶

AllReduce operation implementing warp/block-level reduction. Based on tl::AllReduce from reduce.h

Parameters:
  • reducer – Reducer operator class (SumOp, MaxOp, etc.)

  • threads – Number of threads participating in reduction

  • scale – Reduction scale factor

  • thread_offset – Thread ID offset

  • all_threads – Total number of threads in block

Returns:

A callable object with run() and run_hopper() methods