Seventh episode of the Bielik Anatomy series. This is the moment of truth: every custom Triton kernel built throughout the series (RMSNorm, MatMul, RoPE, Flash Attention, and SwiGLU) is now wired together into a fully functional Bielik 1.5B Instruct model.
The episode walks through constructing the complete decoder layer step by step, tackles the unexpected challenge of handling bias in linear and activation layers, loads the official pretrained weights from HuggingFace safetensors, and spins up an interactive chat interface to talk with Bielik live.
Key engineering highlights covered:
Bielik 1.5B architecture overview and why biases forced a change in kernel strategy
Custom 2D-grid Embedding kernel vs. native PyTorch
Zero-cost MatMul Bias fusion via tl.constexpr compile-time flags - no branching overhead
Fused SwiGLU with Bias computed entirely in GPU registers
Decoder Layer assembly integrating GQA, RoPE, and Flash Attention
Autoregressive sampling mechanics and full model construction
Benchmarks on RTX 4060 Ti: a 28% speedup in Time To First Token (TTFT) - plus a deep dive into the memory bandwidth bottleneck that limits throughput during generation without a KV Cache
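To illustrate why generation without a KV cache is costly, here is a minimal pure-Python sketch. It is not the episode's code: `generate` and `toy_model` are hypothetical names, and greedy argmax is swapped in for real sampling to keep the example deterministic. The point is only that each new token re-runs the model over the whole prefix, so the work grows roughly quadratically with the number of generated tokens.

```python
def generate(model, prompt_ids, max_new_tokens):
    """Greedy decoding without a KV cache: every step re-processes the full prefix."""
    ids = list(prompt_ids)
    tokens_processed = 0  # total tokens pushed through the model across all steps
    for _ in range(max_new_tokens):
        logits = model(ids)              # next-token logits given the whole prefix
        tokens_processed += len(ids)     # no cache, so the whole prefix is recomputed
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy argmax
        ids.append(next_id)
    return ids, tokens_processed


def toy_model(ids, vocab_size=8):
    # Hypothetical stand-in for the real network: favours (last token + 1) mod vocab_size.
    target = (ids[-1] + 1) % vocab_size
    return [1.0 if i == target else 0.0 for i in range(vocab_size)]


ids, work = generate(toy_model, [0, 1, 2], max_new_tokens=4)
print(ids, work)
```

With a 3-token prompt and 4 new tokens, the model sees 3 + 4 + 5 + 6 tokens in total. A KV cache would reduce each decoding step to a single new token.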
Sixth episode of the Bielik Anatomy series. Eliminating redundant read/write cycles to GPU global memory can unlock massive performance gains. Benchmarked on an RTX 4060, a custom fused SwiGLU Triton kernel delivers 15-30% higher throughput (in TFLOPS) than standard PyTorch and torch.compile.
The episode walks through the Triton code line by line: managing L2 cache swizzling for maximum block reuse, designing dual accumulators in registers for parallel matrix multiplications, and applying the SiLU (Swish) activation directly on-chip before writing back to HBM - fusing the entire SwiGLU operation into a single kernel pass.
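The dual-accumulator idea can be sketched in plain Python as a reference for what the fused kernel computes. This is an assumption-laden illustration, not the episode's Triton code: `swiglu_fused`, `w_gate`, and `w_up` are hypothetical names, and the loop over `k` stands in for the tiled reduction a real kernel would perform.

```python
import math


def silu(x):
    # SiLU (Swish): x * sigmoid(x)
    return x / (1.0 + math.exp(-x))


def swiglu_fused(x, w_gate, w_up):
    """x: [K]; w_gate, w_up: [K][N]. One pass over K feeds two accumulators."""
    n = len(w_gate[0])
    out = []
    for j in range(n):
        acc_gate = 0.0  # accumulator for the gate projection
        acc_up = 0.0    # accumulator for the up projection
        for k, xk in enumerate(x):
            # x[k] is read once and reused for both matrix products
            acc_gate += xk * w_gate[k][j]
            acc_up += xk * w_up[k][j]
        # The activation and the elementwise product happen before anything is stored
        out.append(silu(acc_gate) * acc_up)
    return out


result = swiglu_fused([1.0, 2.0],
                      [[0.0, 1.0], [0.0, 0.0]],
                      [[1.0, 0.0], [1.0, 1.0]])
print(result)
```

Because both projections share the same input, the intermediate gate and up tensors never need to be written out. Only the final product is stored.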
Fifth episode of the Bielik Anatomy series. Why does your GPU run out of memory when running large language models? This episode breaks down the math behind Standard Self-Attention, exposes the HBM memory bottleneck, and fixes it by implementing Flash Attention from scratch in OpenAI Triton.
By keeping matrix calculations inside fast SRAM and applying the online softmax algorithm, the standard O(N²) memory trap is bypassed entirely. The result: VRAM usage drops from 7.8 GB down to just 65 MB for an 8k context window, reaching 65 TFLOPS - a 20x speedup over standard PyTorch.
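The online softmax trick can be shown for a single query in a few lines of pure Python. This is a simplified sketch rather than the episode's kernel: the function names, the scalar `values`, and the `block` size are assumptions made for illustration. Scores are consumed block by block while a running maximum, a running normalizer, and a running weighted sum are kept, so the full row of scores never has to be stored.

```python
import math


def online_softmax_weighted_sum(scores, values, block=2):
    """Computes sum_i softmax(scores)_i * values_i one block at a time."""
    m = float("-inf")  # running maximum of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running weighted sum of values
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescales earlier partial results (0.0 on the first block)
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l


def reference(scores, values):
    # Standard two-pass softmax that materializes every exponentiated score
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    return sum(ei * vi for ei, vi in zip(e, values)) / sum(e)


scores = [0.5, 3.0, -1.0, 2.0, 4.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(online_softmax_weighted_sum(scores, values), reference(scores, values))
```

Both functions return the same number up to floating-point rounding. The blocked version only ever holds one block of scores at a time, which is what lets the N x N score matrix stay out of HBM.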
Fourth episode of the Bielik Anatomy series. Rotary Position Embeddings (RoPE) give Transformers the positional awareness needed to distinguish “Dog bites man” from “Man bites dog” - and this episode implements them from scratch in OpenAI Triton.
The episode starts with the core RoPE math, then builds a custom zero-allocation Triton kernel handling sequence data efficiently. But it goes deeper than a coding tutorial: benchmarking reveals a surprising 670 GB/s bandwidth anomaly, which leads to a real hardware profiling session with NVIDIA Nsight Compute (ncu). The investigation uncovers the “L2 Cache Illusion”, wave quantization, and kernel launch overhead - a practical lesson in why software benchmarks alone can’t be trusted without understanding GPU memory architecture.
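As a rough reference for the core RoPE math, the sketch below rotates adjacent pairs of a vector by position-dependent angles. It assumes the interleaved-pair layout and a base of 10000; the episode's kernel and the Bielik checkpoint may use a different layout (for example, rotating the two halves of the vector), and the `rope` and `dot` helpers are hypothetical.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotates each adjacent pair (vec[i], vec[i+1]) by the angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # lower pair indices rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]

# The attention score depends only on the relative offset (2 in both cases)
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
print(near, far)
```

The two scores match because rotating the query and key by their absolute positions leaves a dot product that depends only on the difference between those positions. The rotation also preserves vector length.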
Third episode of the Bielik Anatomy series. GPU memory bottlenecks are the silent killer of LLM performance - your GPU sits idle 98% of the time during operations like RMSNorm and Softmax. Kernel fusion is the fix.
This episode builds custom single-pass Triton kernels for RMSNorm and causal Softmax, processing data directly in ultra-fast SRAM instead of repeatedly hitting VRAM. The fused causal-mask Softmax implementation scales cleanly and benchmarks 2x faster than PyTorch, demonstrating that optimizing data movement, not just compute, is essential for modern AI models.
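For reference, the two operations can be written in pure Python as below. This is a sketch of the math only, not the fused Triton kernels from the episode; the function names and the `eps` default are assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scales x by the reciprocal of its root mean square, then by a learned weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]


def causal_softmax(scores):
    """Row i of a square score matrix may only attend to columns 0..i."""
    out = []
    for i, row in enumerate(scores):
        visible = row[:i + 1]              # masked (future) positions are skipped entirely
        m = max(visible)                   # subtract the max for numerical stability
        e = [math.exp(s - m) for s in visible]
        total = sum(e)
        out.append([v / total for v in e] + [0.0] * (len(row) - i - 1))
    return out


normed = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
probs = causal_softmax([[1.0, 2.0], [0.0, 0.0]])
print(normed, probs)
```

Each function reads a row once, does all arithmetic on it, and produces the result in one go. A fused kernel exploits the same structure by keeping the row on-chip between the reduction and the final scaling.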