tilelang.contrib.cutedsl.quantize¶
Quantization/dequantization functions for CuTeDSL backend. These implement the same functionality as the CUDA templates in tilelang/quantize/lop3.py using inline PTX via llvm.inline_asm.
Attributes¶
Functions¶
|
Decode unsigned INT4 to FP16. |
|
Decode signed INT4 to FP16. N must be even. |
|
Decode FP4 to BF16 using twiddling technique. |
Module Contents¶
- tilelang.contrib.cutedsl.quantize.BOTTOM_MASK = 983055¶
- tilelang.contrib.cutedsl.quantize.FP16_TOP_MAGIC_NUM = 1677747200¶
- tilelang.contrib.cutedsl.quantize.IMMLUT = 234¶
- tilelang.contrib.cutedsl.quantize.MEDIAN_NUM_UNSIGNED = 1677747200¶
- tilelang.contrib.cutedsl.quantize.MEDIAN_NUM_SIGNED = 1678271496¶
- tilelang.contrib.cutedsl.quantize.decode_i4u_to_f16(src_ptr, dst_ptr, N=8)¶
Decode unsigned INT4 to FP16.
- Equivalent to CUDA template:
decode_i4b_to_f16<T1, T2, false>(_i4u, B_local_decode, N);
- Parameters:
src_ptr – Pointer to packed INT4 data (4 bytes for 8 elements)
dst_ptr – Pointer to FP16 output (16 bytes for 8 elements)
N (int) – Number of elements to decode (default 8, must be even)
- tilelang.contrib.cutedsl.quantize.decode_i4s_to_f16(src_ptr, dst_ptr, N=8)¶
Decode signed INT4 to FP16. N must be even.
- Parameters:
N (int)
- tilelang.contrib.cutedsl.quantize.decode_fp4_to_bf16_twiddling(src_ptr, dst_ptr, N=8)¶
Decode FP4 to BF16 using twiddling technique.
Reference: triton/tensor_details/layout_details/hopper_value.py
- For each iteration:
Input: 4 bytes (uint32) = 8 FP4 values
Output: 8 BF16 values (16 bytes)
- C code output layout:
B_local_decode[(i << 3) + j] = vec[j].high (j=0..3) B_local_decode[(i << 3) + j + 4] = vec[j].low (j=0..3)
- So output as uint32:
dst[i*4 + 0] = {r1.high, r0.high} dst[i*4 + 1] = {r3.high, r2.high} dst[i*4 + 2] = {r1.low, r0.low} dst[i*4 + 3] = {r3.low, r2.low}
- Parameters:
src_ptr – Pointer to packed FP4 data
dst_ptr – Pointer to BF16 output
N (int) – Number of iterations (default 8, processing 64 FP4 -> 64 BF16)