Bielik Anatomy 🧠Ep.3 - JUST FUSE IT: Fixing GPU Memory Bottlenecks with Kernel Fusion
10 Mar 2026
Third episode of the Bielik Anatomy series. GPU memory bottlenecks are the silent killer of LLM performance - your GPU sits idle 98% of the time, waiting on memory, during operations like RMSNorm and Softmax. Kernel fusion is the fix.
This episode builds custom single-pass Triton kernels for RMSNorm and causal Softmax that keep data in fast on-chip SRAM instead of making repeated round trips to VRAM. The fused causal-mask Softmax implementation scales cleanly and benchmarks 2x faster than PyTorch - demonstrating that optimizing data movement, not just compute, is essential for modern AI models.
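To make the single-pass idea concrete, here is a minimal pure-Python sketch of the per-row math such fused kernels perform. It is not the episode's Triton code: the function names `rmsnorm_row` and `causal_softmax_row` and the `eps` default are assumptions chosen for illustration. The point is that each row is read once, all intermediate values stay local, and the causal mask is derived from indices rather than loaded as a separate tensor.

```python
import math


def rmsnorm_row(x, weight, eps=1e-6):
    """RMSNorm for one row: read x once, keep the statistic local, write once.

    eps is an assumed default, not a value taken from the episode.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def causal_softmax_row(scores, row_idx):
    """Softmax over one row of attention scores with the causal mask fused in.

    Columns j > row_idx are masked by index comparison, so no separate
    mask tensor is ever built or read.
    """
    visible = scores[: row_idx + 1]
    row_max = max(visible)  # subtract the max for numerical stability
    exps = [math.exp(s - row_max) for s in visible]
    total = sum(exps)
    masked_tail = [0.0] * (len(scores) - row_idx - 1)
    return [e / total for e in exps] + masked_tail


if __name__ == "__main__":
    print(rmsnorm_row([3.0, 4.0], [1.0, 1.0]))
    print(causal_softmax_row([1.0, 2.0, 3.0], 1))
```

In a GPU kernel the same structure would map to one program per row, with the row held in registers or shared memory between the reduction and the final write; an unfused version would instead write and re-read intermediates (squares, masked scores, exponentials) through VRAM.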
Resources: