Bielik Anatomy 🧠Ep.2 - How to Beat PyTorch? Fast MatMul Kernel in Triton
14 Feb 2026

Second episode of the Bielik Anatomy series. Matrix multiplication is the heart of every Transformer model - if it's slow, your model is slow. This episode builds a custom MatMul kernel in OpenAI Triton from scratch and aggressively optimizes it to match PyTorch's performance on the GPU.
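The core structure of such a kernel - computing the output one tile at a time while masking out-of-bounds elements - can be sketched in plain NumPy. This is a stand-in for the actual Triton kernel (which requires a GPU to run); the function name and block sizes here are illustrative, not the episode's tuned values:

```python
import numpy as np

def tiled_matmul(A, B, BLOCK_M=32, BLOCK_N=32, BLOCK_K=32):
    """Blocked matmul mirroring the tile-plus-mask structure of a
    Triton kernel. Block sizes are illustrative, not tuned."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    # Each (i, j) pair plays the role of one Triton program instance,
    # responsible for one BLOCK_M x BLOCK_N output tile.
    for i in range(0, M, BLOCK_M):
        for j in range(0, N, BLOCK_N):
            acc = np.zeros((min(BLOCK_M, M - i), min(BLOCK_N, N - j)),
                           dtype=A.dtype)
            # Walk the K dimension one tile at a time, accumulating.
            for k in range(0, K, BLOCK_K):
                # NumPy slicing silently clips at array bounds; this is
                # the analogue of Triton's load mask, which zeroes
                # elements that fall outside the matrix.
                a = A[i:i + BLOCK_M, k:k + BLOCK_K]
                b = B[k:k + BLOCK_K, j:j + BLOCK_N]
                acc += a @ b
            C[i:i + BLOCK_M, j:j + BLOCK_N] = acc
    return C
```

The masking matters whenever the matrix dimensions are not multiples of the block sizes, which is the common case in practice.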
The episode covers writing a basic tiled kernel with masking, then applies four layers of optimization: Grouped Block Ordering to maximize L2 cache hits, switching to FP16 to unlock Tensor Cores, Auto-Tuning to search the kernel-parameter space, and software pipelining with warp-level control - ending with a benchmark that matches PyTorch.
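Of the four optimizations, grouped block ordering is the easiest to show on the CPU: it remaps the flat program id so that consecutive programs compute tiles from a small group of output rows, so the tiles of A and B they load are more likely to still be resident in L2. Below is a pure-Python sketch of the remapping popularized by the Triton matmul tutorial (the names `num_pid_m`, `num_pid_n`, and `GROUP_M` follow that convention; they are assumptions, not the episode's code):

```python
def grouped_block_order(num_pid_m, num_pid_n, GROUP_M):
    """Return the (pid_m, pid_n) tile coordinates in grouped order.

    Instead of sweeping an entire row of output tiles before moving on,
    programs are launched in column-major order within groups of GROUP_M
    rows, improving reuse of A and B tiles in the L2 cache.
    """
    order = []
    num_pid_in_group = GROUP_M * num_pid_n
    for pid in range(num_pid_m * num_pid_n):
        group_id = pid // num_pid_in_group
        first_pid_m = group_id * GROUP_M
        # The last group may contain fewer than GROUP_M rows.
        group_size_m = min(num_pid_m - first_pid_m, GROUP_M)
        pid_m = first_pid_m + (pid % group_size_m)
        pid_n = (pid % num_pid_in_group) // group_size_m
        order.append((pid_m, pid_n))
    return order

# With a 4x4 grid of tiles and GROUP_M=2, the first 8 programs all stay
# within output rows 0-1, touching only 2 row-tiles of A instead of 4.
print(grouped_block_order(4, 4, 2)[:4])
```

The remap changes nothing about what each program computes - only the launch order - which is why it is essentially free to apply.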
Resources: