Bielik Anatomy 🧠 Ep.3 - JUST FUSE IT: Fixing GPU Memory Bottlenecks with Kernel Fusion

Third episode of the Bielik Anatomy series. GPU memory bottlenecks are the silent killer of LLM performance: during memory-bound operations like RMSNorm and Softmax, your GPU sits idle 98% of the time, waiting on data instead of computing. Kernel fusion is the fix.

This episode builds custom single-pass Triton kernels for RMSNorm and causal Softmax that process data directly in ultra-fast on-chip SRAM instead of repeatedly round-tripping to VRAM. The fused causal-mask Softmax implementation scales cleanly and benchmarks 2x faster than PyTorch, demonstrating that optimizing data movement, not just compute, is essential for modern AI models.
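
For readers who want the per-row math before watching, here is a minimal plain-Python sketch of what a fused kernel computes for one row. It is a CPU reference for illustration only, not the episode's Triton code: the function names `rmsnorm_row` and `causal_softmax_row`, the `eps` default, and the list-based layout are assumptions made for this example. The point it illustrates is that each row is read once, fully processed while it is held locally, and written once.

```python
import math

def rmsnorm_row(row, weight, eps=1e-6):
    # Hypothetical reference: the whole row is held locally, so the mean of
    # squares and the rescale both happen before a single write-back.
    mean_sq = sum(x * x for x in row) / len(row)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms * w for x, w in zip(row, weight)]

def causal_softmax_row(scores, query_idx):
    # Causal mask: the query at position i may attend only to keys 0..i.
    masked = [s if j <= query_idx else float("-inf")
              for j, s in enumerate(scores)]
    # Subtract the row maximum for numerical stability.
    row_max = max(masked)
    # exp(-inf) is 0.0, so masked positions drop out of the sum.
    exps = [math.exp(s - row_max) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]
```

An unfused version would run the mask, max, exp, sum, and divide as separate operations, each reading and writing the full tensor in VRAM. Fusing them keeps the intermediate values on-chip, which is where the speedup described above comes from.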

Resources: