tilelang.contrib.cutedsl.ldsm
=============================

.. py:module:: tilelang.contrib.cutedsl.ldsm

.. autoapi-nested-parse::

   LDMATRIX and STMATRIX operations for CuTeDSL backend.
   Based on tl_templates/cuda/ldsm.h

   These functions provide wrappers around PTX ldmatrix/stmatrix instructions
   for loading/storing 8x8 matrix fragments between shared memory and registers.


Functions
---------

.. autoapisummary::

   tilelang.contrib.cutedsl.ldsm.ptx_ldmatrix_x1
   tilelang.contrib.cutedsl.ldsm.ptx_ldmatrix_x2
   tilelang.contrib.cutedsl.ldsm.ptx_ldmatrix_x4
   tilelang.contrib.cutedsl.ldsm.ptx_ldmatrix_x1_trans
   tilelang.contrib.cutedsl.ldsm.ptx_ldmatrix_x2_trans
   tilelang.contrib.cutedsl.ldsm.ptx_ldmatrix_x4_trans
   tilelang.contrib.cutedsl.ldsm.ptx_stmatrix_x1
   tilelang.contrib.cutedsl.ldsm.ptx_stmatrix_x2
   tilelang.contrib.cutedsl.ldsm.ptx_stmatrix_x4
   tilelang.contrib.cutedsl.ldsm.ptx_stmatrix_x1_trans
   tilelang.contrib.cutedsl.ldsm.ptx_stmatrix_x2_trans
   tilelang.contrib.cutedsl.ldsm.ptx_stmatrix_x4_trans


Module Contents
---------------

.. py:function:: ptx_ldmatrix_x1(smem_ptr, local_ptr, *, loc=None, ip=None)

   Load 1 matrix (8x8) from shared memory


.. py:function:: ptx_ldmatrix_x2(smem_ptr, local_ptr, *, loc=None, ip=None)

   Load 2 matrices (8x8 each) from shared memory


.. py:function:: ptx_ldmatrix_x4(smem_ptr, local_ptr, *, loc=None, ip=None)

   Load 4 matrices (8x8 each) from shared memory


.. py:function:: ptx_ldmatrix_x1_trans(smem_ptr, local_ptr, *, loc=None, ip=None)

   Load 1 matrix (8x8) with transpose from shared memory


.. py:function:: ptx_ldmatrix_x2_trans(smem_ptr, local_ptr, *, loc=None, ip=None)

   Load 2 matrices (8x8 each) with transpose from shared memory


.. py:function:: ptx_ldmatrix_x4_trans(smem_ptr, local_ptr, *, loc=None, ip=None)

   Load 4 matrices (8x8 each) with transpose from shared memory


.. py:function:: ptx_stmatrix_x1(smem_ptr, value0, *, loc=None, ip=None)

   Store 1 matrix (8x8) to shared memory


.. py:function:: ptx_stmatrix_x2(smem_ptr, value0, value1, *, loc=None, ip=None)

   Store 2 matrices (8x8 each) to shared memory


.. py:function:: ptx_stmatrix_x4(smem_ptr, value0, value1, value2, value3, *, loc=None, ip=None)

   Store 4 matrices (8x8 each) to shared memory


.. py:function:: ptx_stmatrix_x1_trans(smem_ptr, value0, *, loc=None, ip=None)

   Store 1 matrix (8x8) with transpose to shared memory


.. py:function:: ptx_stmatrix_x2_trans(smem_ptr, value0, value1, *, loc=None, ip=None)

   Store 2 matrices (8x8 each) with transpose to shared memory


.. py:function:: ptx_stmatrix_x4_trans(smem_ptr, value0, value1, value2, value3, *, loc=None, ip=None)

   Store 4 matrices (8x8 each) with transpose to shared memory
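

To make the "8x8 matrix fragment" wording above more concrete, the following is a minimal host-side sketch of how the underlying PTX ``ldmatrix`` instruction distributes one 8x8 tile of 16-bit elements across the 32 lanes of a warp, with and without transpose. It does not call any function from this module. The helper name ``ldmatrix_x1_fragment`` is hypothetical, and the layout is taken from the PTX ISA description of ``ldmatrix``, on the assumption that these wrappers pass the instruction's behaviour through unchanged.

```python
def ldmatrix_x1_fragment(tile, lane, trans=False):
    """Emulate which two elements of an 8x8 tile one lane receives.

    Hypothetical helper for illustration only; it models the PTX
    ldmatrix fragment layout and is not part of tilelang.
    """
    if not 0 <= lane < 32:
        raise ValueError("lane must be in [0, 32)")
    major = lane // 4        # row owned by this lane (column when trans=True)
    minor = 2 * (lane % 4)   # first of two adjacent positions along it
    if trans:
        # Transposed load: the lane reads two vertically adjacent elements.
        return (tile[minor][major], tile[minor + 1][major])
    # Plain load: the lane reads two horizontally adjacent elements.
    return (tile[major][minor], tile[major][minor + 1])


# An 8x8 tile whose entries encode their own position as 8 * row + col.
tile = [[8 * r + c for c in range(8)] for r in range(8)]

# One 32-bit register (two 16-bit elements) per lane.
frag = [ldmatrix_x1_fragment(tile, lane) for lane in range(32)]
frag_t = [ldmatrix_x1_fragment(tile, lane, trans=True) for lane in range(32)]

print(frag[5])    # lane 5 holds row 1, columns 2 and 3
print(frag_t[5])  # lane 5 holds column 1, rows 2 and 3
```

Under this layout the ``_trans`` variants behave as if the tile had been transposed in shared memory before a plain load. The ``x2`` and ``x4`` variants repeat the same per-lane pattern for two or four tiles, yielding one additional 32-bit register per lane for each extra tile.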