Bielik Anatomy 🧠Ep.7 - Assembling a Full LLM from Custom Triton Kernels
27 Jun 2026
Seventh episode of the Bielik Anatomy series. It's the moment of truth: every custom Triton kernel built throughout the series (RMSNorm, MatMul, RoPE, Flash Attention, and SwiGLU) is now wired together into a fully functional Bielik 1.5B Instruct model.
The episode walks through constructing the complete decoder layer step by step, tackles the unexpected challenge of handling bias in linear and activation layers, loads the official pretrained weights from HuggingFace safetensors, and spins up an interactive chat interface to talk with Bielik live.
Key engineering highlights covered:
- Bielik 1.5B architecture overview and why biases forced a change in kernel strategy
- Custom 2D-grid Embedding kernel vs. native PyTorch
- Zero-cost MatMul Bias fusion via `tl.constexpr` compile-time flags - no branching overhead
- Fused SwiGLU with Bias computed entirely in GPU registers
- Decoder Layer assembly integrating GQA, RoPE, and Flash Attention
- Autoregressive sampling mechanics and full model construction
- Benchmarks on RTX 4060 Ti: a 28% speedup in Time To First Token (TTFT) - plus a deep dive into the memory bandwidth bottleneck that limits throughput during generation without a KV Cache
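
To make the autoregressive sampling and the no-KV-cache cost from the list above more concrete, here is a minimal pure-Python sketch. It is not the episode's code: `toy_model`, `sample_token`, and `generate` are hypothetical names, and `toy_model` is a stand-in for the full model forward pass. The sketch only counts how many token positions get re-processed when every generation step re-runs the model over the whole sequence.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=1.0, rng=random):
    """Pick the next token id: greedy if temperature is 0, else sample."""
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def toy_model(tokens, vocab_size=8):
    """Hypothetical stand-in for a full forward pass.

    Returns next-token logits that favor (last token + 1) % vocab_size.
    A real model would run every decoder layer over all of `tokens`.
    """
    favored = (tokens[-1] + 1) % vocab_size
    return [5.0 if i == favored else 0.0 for i in range(vocab_size)]


def generate(prompt, max_new_tokens, temperature=0.0):
    """Autoregressive loop WITHOUT a KV cache.

    Each step feeds the entire sequence back into the model, so the
    number of positions processed grows with every generated token.
    """
    tokens = list(prompt)
    positions_processed = 0
    for _ in range(max_new_tokens):
        logits = toy_model(tokens)          # full sequence, every step
        positions_processed += len(tokens)  # work done in this step
        tokens.append(sample_token(logits, temperature))
    return tokens, positions_processed


if __name__ == "__main__":
    out, work = generate([1, 2, 3], max_new_tokens=4)
    print(out, work)
```

With a 3-token prompt and 4 new tokens, the loop processes 3 + 4 + 5 + 6 positions in total. A KV cache would instead process the prompt once and then a single new position per step, which is why generation without one tends to be limited by repeatedly re-reading weights and activations rather than by raw compute.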
Resources: