Bielik Anatomy 🧠 Ep.7 - Assembling a Full LLM from Custom Triton Kernels

27 Jun 2026

This is the seventh episode of the Bielik Anatomy series, and the moment of truth: every custom Triton kernel built throughout the series (RMSNorm, MatMul, RoPE, Flash Attention, and SwiGLU) is now wired together into a fully functional Bielik 1.5B Instruct model.

The episode walks through constructing the complete decoder layer step by step, tackles the unexpected challenge of handling bias in linear and activation layers, loads the official pretrained weights from HuggingFace safetensors, and spins up an interactive chat interface to talk with Bielik live.
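To make the bias issue concrete, here is a minimal NumPy reference for a SwiGLU block whose gate and up projections both carry a bias. It is not the Triton kernel from the episode. It assumes the common SwiGLU form, SiLU(x W_gate + b_gate) * (x W_up + b_up), and every name and shape in it (`swiglu_with_bias`, `w_gate`, `b_up`, the 8 and 16 dimensions) is hypothetical. A reference like this is mainly useful as a ground truth to compare a fused kernel against.

```python
import numpy as np

def silu(z):
    # SiLU (swish) activation: z * sigmoid(z)
    return z / (1.0 + np.exp(-z))

def swiglu_with_bias(x, w_gate, b_gate, w_up, b_up):
    # Assumed layout: x is (tokens, hidden), weights are (hidden, intermediate),
    # biases are (intermediate,) and broadcast over the token dimension.
    gate = x @ w_gate + b_gate
    up = x @ w_up + b_up
    # A fused kernel would compute the same expression without writing
    # `gate` and `up` back to global memory in between.
    return silu(gate) * up

# Tiny illustrative shapes, not the real Bielik dimensions.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
w_gate = rng.standard_normal((8, 16))
w_up = rng.standard_normal((8, 16))
b_gate = rng.standard_normal(16)
b_up = rng.standard_normal(16)

out = swiglu_with_bias(x, w_gate, b_gate, w_up, b_up)
```

With an all-zero input the matrix products vanish and the result depends only on the biases, which is a quick way to confirm that a kernel actually applies them.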

Key engineering highlights covered:

Bielik 1.5B architecture overview and why biases forced a change in kernel strategy
Custom 2D-grid Embedding kernel vs. native PyTorch
Zero-cost MatMul Bias fusion via tl.constexpr compile-time flags - no branching overhead
Fused SwiGLU with Bias computed entirely in GPU registers
Decoder Layer assembly integrating GQA, RoPE, and Flash Attention
Autoregressive sampling mechanics and full model construction
Benchmarks on RTX 4060 Ti: a 28% speedup in Time To First Token (TTFT) - plus a deep dive into the memory bandwidth bottleneck that limits throughput during generation without a KV Cache
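The cost of generating without a KV Cache can be sketched with a toy decoding loop. The snippet below is an illustration under stated assumptions, not the episode's code: `toy_logits` is a hypothetical stand-in for a full model forward pass, and greedy argmax is used in place of real sampling to keep the result deterministic. The point is the counter: without a cache, every step re-runs the whole prefix.

```python
def toy_logits(tokens, vocab_size=5):
    # Hypothetical stand-in for a model forward pass over the full sequence.
    # It simply favours the token after the last one, modulo the vocabulary.
    nxt = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == nxt else 0.0 for i in range(vocab_size)]

def generate_no_cache(prompt, max_new_tokens, forward=toy_logits):
    tokens = list(prompt)
    positions_processed = 0
    for _ in range(max_new_tokens):
        # No KV cache: the whole prefix is fed through the model again.
        logits = forward(tokens)
        positions_processed += len(tokens)
        # Greedy choice (argmax) instead of sampling, for determinism.
        next_token = max(range(len(logits)), key=logits.__getitem__)
        tokens.append(next_token)
    return tokens, positions_processed

tokens, positions = generate_no_cache([0, 1], max_new_tokens=3)
```

Here a 2-token prompt plus 3 generated tokens costs 2 + 3 + 4 = 9 processed positions, whereas a cached loop would process the prompt once and then one new position per step. The gap grows quadratically with sequence length, on top of the model weights being read from memory again at every step.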

Resources:

GitHub Repo

Qooba

