Bielik Anatomy 🧠 Ep.2 - How to Beat PyTorch? Fast MatMul Kernel in Triton

Second episode of the Bielik Anatomy series. Matrix multiplication is the heart of every Transformer model - if it’s slow, your model is slow. This episode builds a custom MatMul kernel in OpenAI Triton from scratch and aggressively optimizes it to match PyTorch’s performance on the GPU.

The episode starts with a basic tiled kernel with masking, then applies four layers of optimization: Grouped Block Ordering to maximize L2 cache hits, switching to FP16 to unlock the Tensor Cores, Auto-Tuning to search the launch-parameter space, and pipelining with warp-level control - ending with a benchmark that matches PyTorch.
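For a concrete picture of what that looks like, below is a minimal sketch of such a kernel, closely following the structure of the public Triton matmul tutorial: tiling with masking, grouped block ordering for L2 reuse, FP16 inputs with FP32 accumulation for the Tensor Cores, and an `@triton.autotune` decorator whose `num_stages` / `num_warps` settings expose the pipelining and warp-level knobs. The block sizes and configs here are illustrative assumptions, not the episode's tuned values.

```python
import torch
import triton
import triton.language as tl


# Two illustrative autotune configs (the episode searches a wider space);
# num_stages controls pipelining depth, num_warps the warp-level parallelism.
@triton.autotune(
    configs=[
        triton.Config({"BLOCK_M": 128, "BLOCK_N": 128, "BLOCK_K": 32, "GROUP_M": 8},
                      num_stages=3, num_warps=8),
        triton.Config({"BLOCK_M": 64, "BLOCK_N": 64, "BLOCK_K": 32, "GROUP_M": 8},
                      num_stages=4, num_warps=4),
    ],
    key=["M", "N", "K"],  # re-tune whenever the problem shape changes
)
@triton.jit
def matmul_kernel(
    a_ptr, b_ptr, c_ptr,
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr,
    BLOCK_K: tl.constexpr, GROUP_M: tl.constexpr,
):
    # Grouped block ordering: remap the flat program id so that GROUP_M
    # row-blocks are processed together, reusing tiles of A and B from L2.
    pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_M)
    num_pid_n = tl.cdiv(N, BLOCK_N)
    num_pid_in_group = GROUP_M * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * GROUP_M
    group_size_m = min(num_pid_m - first_pid_m, GROUP_M)
    pid_m = first_pid_m + (pid % group_size_m)
    pid_n = (pid % num_pid_in_group) // group_size_m

    # Wrap row/col offsets with % so loads never go out of bounds;
    # the store mask at the end discards the duplicated results.
    offs_am = (pid_m * BLOCK_M + tl.arange(0, BLOCK_M)) % M
    offs_bn = (pid_n * BLOCK_N + tl.arange(0, BLOCK_N)) % N
    offs_k = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + offs_am[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_bn[None, :] * stride_bn

    # FP16 inputs feed the Tensor Cores; accumulate in FP32 for accuracy.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, tl.cdiv(K, BLOCK_K)):
        # Mask the trailing K-tile when K is not a multiple of BLOCK_K.
        a = tl.load(a_ptrs, mask=offs_k[None, :] < K - k * BLOCK_K, other=0.0)
        b = tl.load(b_ptrs, mask=offs_k[:, None] < K - k * BLOCK_K, other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    offs_cm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_cn = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    c_ptrs = c_ptr + offs_cm[:, None] * stride_cm + offs_cn[None, :] * stride_cn
    c_mask = (offs_cm[:, None] < M) & (offs_cn[None, :] < N)
    tl.store(c_ptrs, acc.to(tl.float16), mask=c_mask)


def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.shape[1] == b.shape[0] and a.dtype == b.dtype == torch.float16
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float16)
    # 1D launch grid; the kernel itself derives the 2D block layout.
    grid = lambda meta: (triton.cdiv(M, meta["BLOCK_M"]) * triton.cdiv(N, meta["BLOCK_N"]),)
    matmul_kernel[grid](
        a, b, c, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1),
        c.stride(0), c.stride(1),
    )
    return c


if __name__ == "__main__":
    # Deliberately non-multiple-of-block shapes to exercise the masking.
    a = torch.randn(513, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 768, device="cuda", dtype=torch.float16)
    torch.testing.assert_close(matmul(a, b), a @ b, atol=1e-2, rtol=1e-2)
```

One practical note: `@triton.autotune` tunes lazily, so the first call for a given (M, N, K) benchmarks every config and caches the winner. Any fair comparison against PyTorch should therefore warm the kernel up before timing.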

Resources: