Writing High-Performance Kernels with Thread Primitives