Bielik Anatomy 🧠 Ep.2 - How to Beat PyTorch? Fast MatMul Kernel in Triton

Second episode of the Bielik Anatomy series. Matrix multiplication is the heart of every Transformer model - if it’s slow, your model is slow. This episode builds a custom MatMul kernel in OpenAI Triton from scratch and aggressively optimizes it to match PyTorch’s performance on the GPU.

The episode starts with a basic tiled kernel with masking, then applies four layers of optimization: Grouped Block Ordering to maximize L2 cache hits, switching to FP16 to unlock the Tensor Cores, Auto-Tuning to search the launch-parameter space, and pipelining with warp-level control - ending with a benchmark that matches PyTorch.
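For a concrete picture of what that looks like, below is a minimal sketch of such a kernel, closely following the structure of the public Triton matmul tutorial: tiling with masking, grouped block ordering for L2 reuse, FP16 inputs with FP32 accumulation for the Tensor Cores, and an `@triton.autotune` decorator whose `num_stages` / `num_warps` settings expose the pipelining and warp-level knobs. The block sizes and configs here are illustrative assumptions, not the episode's tuned values.

```python
import torch
import triton
import triton.language as tl


# Two illustrative autotune configs (the episode searches a wider space);
# num_stages controls pipelining depth, num_warps the warp-level parallelism.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32, "GROUP_M": 8},
                      num_stages=3, num_warps=8),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32, "GROUP_M": 8},
                      num_stages=4, num_warps=4),
    ],
    key=["M", "N", "K"],  # re-tune whenever the problem shape changes
)
@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr, GROUP_M: tl.constexpr,
):
    # Grouped block ordering: remap the flat program id so that GROUP_M
    # row-blocks are processed together, reusing tiles of A and B from L2.
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    num_pid_in_group = GROUP_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_M)
    pid_m = first_pid_m + (pid % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    # Wrap row/col offsets with % so loads never go out of bounds;
    # the store mask at the end discards the duplicated results.
    offs_am = (pid_m * BLOCK_M + tl.arange(0, BLOCK_M)) % M
    offs_bn = (pid_n * BLOCK_N + tl.arange(0, BLOCK_N)) % N
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn

    # FP16 inputs feed the Tensor Cores; accumulate in FP32 for accuracy.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        # Mask the trailing K-tile when K is not a multiple of BLOCK_K.
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_K, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.shape[1] == b.shape[0] and a.dtype == b.dtype == torch.float16
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    # 1D launch grid; the kernel itself derives the 2D block layout.
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]) * triton.cdiv(N, meta["BLOCK_N"]),)
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )
    return c


if __name__ == "__main__":
    # Deliberately non-multiple-of-block shapes to exercise the masking.
    a = torch.randn(513, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 768, device="cuda", dtype=torch.float16)
    torch.testing.assert_close(matmul(a, b), a @ b, atol=1e-2, rtol=1e-2)
```

One practical note: `@triton.autotune` tunes lazily, so the first call for a given (M, N, K) benchmarks every config and caches the winner. Any fair comparison against PyTorch should therefore warm the kernel up before timing.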

Resources: