Bielik Anatomy 🧠 Ep.3 - JUST FUSE IT: Fixing GPU Memory Bottlenecks with Kernel Fusion

10 Mar 2026

The third episode of the Bielik Anatomy series. GPU memory bottlenecks are the silent killer of LLM performance: during memory-bound operations like RMSNorm and Softmax, your GPU sits idle 98% of the time, waiting for data to arrive from memory. Kernel fusion is the fix.
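To see why an operation like RMSNorm leaves the GPU mostly idle, a back-of-the-envelope roofline estimate helps. The sketch below is illustrative only: the helper name `compute_utilization`, the FLOP and byte counts per element, and the hardware figures (100 TFLOP/s, 1 TB/s) are all assumptions, not measurements from the episode. It gives an order of magnitude and is not meant to reproduce the 98% figure.

```python
def compute_utilization(flops_per_elem, bytes_per_elem, peak_flops, peak_bandwidth):
    """Upper bound on compute utilization under a simple roofline model."""
    # FLOPs performed per byte moved to or from VRAM
    intensity = flops_per_elem / bytes_per_elem
    # FLOPs per byte the hardware needs to keep its compute units busy
    machine_balance = peak_flops / peak_bandwidth
    return min(1.0, intensity / machine_balance)


# Assumed workload: fp16 RMSNorm that reads and writes each element once
# (2 + 2 bytes) and spends roughly 4 FLOPs per element.
# Assumed hardware: an imaginary GPU with 100 TFLOP/s and 1 TB/s.
util = compute_utilization(4, 4, 100e12, 1e12)
print(f"compute utilization upper bound: {util:.1%}")
```

With these assumed numbers the compute units can be busy at most about 1% of the time, because the kernel is limited by how fast bytes move rather than by arithmetic. An unfused implementation that makes several trips to VRAM per element moves even more bytes for the same FLOPs.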

This episode builds custom single-pass Triton kernels for RMSNorm and causal Softmax that process data directly in fast on-chip SRAM instead of making repeated round trips to VRAM. The fused causal-mask Softmax implementation scales cleanly and benchmarks 2x faster than PyTorch, demonstrating that optimizing data movement, not just compute, is essential for modern AI models.
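As a rough illustration of what "single-pass" means here, the following is a plain-Python CPU reference, not the episode's Triton code. Each function takes one row, does all of its work on that row while it is "loaded", and returns the result once, which mirrors the load-once, store-once pattern of a fused kernel. The function names, the `eps` default, and the choice to derive the causal mask from the row index are assumptions made for this sketch.

```python
import math


def rmsnorm_row(x, weight, eps=1e-6):
    """RMSNorm for one row: reduce, scale, and apply weights in one go."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def causal_softmax_row(scores, row_idx):
    """Softmax over one row of attention scores with a causal mask.

    Column j is visible only if j <= row_idx. The mask is derived from
    indices, so no separate mask tensor is ever materialized.
    """
    visible = scores[: row_idx + 1]
    row_max = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - row_max) for s in visible]
    total = sum(exps)
    masked_tail = [0.0] * (len(scores) - row_idx - 1)
    return [e / total for e in exps] + masked_tail


normed = rmsnorm_row([3.0, 4.0], [1.0, 1.0], eps=0.0)
probs = causal_softmax_row([1.0, 2.0, 3.0], row_idx=1)
```

An unfused version would typically write the masked scores, the exponentials, and the row sums back to memory as separate intermediate tensors; keeping them in local variables is the CPU analogue of keeping them in SRAM.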

Resources:

GitHub Repo
Previous Episode (MatMul from scratch)

Qooba

Related Posts

Bielik Anatomy 🧠 Ep.7 - Assembling a Full LLM from Custom Triton Kernels 27 Jun 2026

Bielik Anatomy 🧠 Ep.6 - SwiGLU Kernel Fusion: 30% Faster with Triton on RTX 4060 16 May 2026

Bielik Anatomy 🧠 Ep.5 - Flash Attention vs Standard Attention | 20x Faster in Triton 02 May 2026