tilelang.contrib.cutedsl.quantize¶

Quantization/dequantization functions for CuTeDSL backend. These implement the same functionality as the CUDA templates in tilelang/quantize/lop3.py using inline PTX via llvm.inline_asm.

Attributes¶

Functions¶

decode_i4u_to_f16(src_ptr, dst_ptr[, N])

Decode unsigned INT4 to FP16.

decode_i4s_to_f16(src_ptr, dst_ptr[, N])

Decode signed INT4 to FP16. N must be even.

decode_fp4_to_bf16_twiddling(src_ptr, dst_ptr[, N])

Decode FP4 to BF16 using twiddling technique.

Module Contents¶

tilelang.contrib.cutedsl.quantize.BOTTOM_MASK = 983055¶
tilelang.contrib.cutedsl.quantize.FP16_TOP_MAGIC_NUM = 1677747200¶
tilelang.contrib.cutedsl.quantize.IMMLUT = 234¶
tilelang.contrib.cutedsl.quantize.MEDIAN_NUM_UNSIGNED = 1677747200¶
tilelang.contrib.cutedsl.quantize.MEDIAN_NUM_SIGNED = 1678271496¶
tilelang.contrib.cutedsl.quantize.decode_i4u_to_f16(src_ptr, dst_ptr, N=8)¶

Decode unsigned INT4 to FP16.

Equivalent to CUDA template:

decode_i4b_to_f16<T1, T2, false>(_i4u, B_local_decode, N);

Parameters:
  • src_ptr – Pointer to packed INT4 data (4 bytes for 8 elements)

  • dst_ptr – Pointer to FP16 output (16 bytes for 8 elements)

  • N (int) – Number of elements to decode (default 8, must be even)

tilelang.contrib.cutedsl.quantize.decode_i4s_to_f16(src_ptr, dst_ptr, N=8)¶

Decode signed INT4 to FP16. N must be even.

Parameters:

N (int)

tilelang.contrib.cutedsl.quantize.decode_fp4_to_bf16_twiddling(src_ptr, dst_ptr, N=8)¶

Decode FP4 to BF16 using twiddling technique.

Reference: triton/tensor_details/layout_details/hopper_value.py

For each iteration:
  • Input: 4 bytes (uint32) = 8 FP4 values

  • Output: 8 BF16 values (16 bytes)

C code output layout:

B_local_decode[(i << 3) + j] = vec[j].high (j=0..3) B_local_decode[(i << 3) + j + 4] = vec[j].low (j=0..3)

So output as uint32:

dst[i*4 + 0] = {r1.high, r0.high} dst[i*4 + 1] = {r3.high, r2.high} dst[i*4 + 2] = {r1.low, r0.low} dst[i*4 + 3] = {r3.low, r2.low}

Parameters:
  • src_ptr – Pointer to packed FP4 data

  • dst_ptr – Pointer to BF16 output

  • N (int) – Number of iterations (default 8, processing 64 FP4 -> 64 BF16)