Bielik Anatomy 🧠 Ep.4 - Implementing RoPE: From Mathematical Formula to Triton Code

Fourth episode of the Bielik Anatomy series. Rotary Position Embeddings (RoPE) give Transformers the positional awareness needed to distinguish “Dog bites man” from “Man bites dog” - and this episode implements them from scratch in OpenAI Triton.

The episode starts with the core RoPE math, then builds a custom zero-allocation Triton kernel that handles sequence data efficiently. But it goes deeper than a coding tutorial: benchmarking reveals a surprising 670 GB/s bandwidth anomaly, which leads to a real hardware profiling session with NVIDIA Nsight Compute (ncu). The investigation uncovers the "L2 Cache Illusion", wave quantization, and kernel launch overhead - a practical lesson in why software benchmarks alone can't be trusted without understanding GPU memory architecture.
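The core rotation the episode implements can be sketched in plain NumPy (this is a reference sketch of the standard RoPE formula, not the episode's Triton kernel; names are illustrative): each consecutive feature pair is rotated by a position-dependent angle θᵢ = base^(−2i/d).

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply rotary position embeddings to x of shape (seq_len, dim).

    Each feature pair (x[2i], x[2i+1]) at position pos is rotated by the
    angle pos * theta_i, where theta_i = base**(-2*i/dim)."""
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)      # theta_i, one per pair
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                   # even / odd components
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair undergoes a pure rotation, vector norms are preserved and position 0 is left unchanged - two properties worth asserting in any RoPE implementation before benchmarking it.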

Resources:

Bielik Anatomy 🧠 Ep.3 - JUST FUSE IT: Fixing GPU Memory Bottlenecks with Kernel Fusion

Third episode of the Bielik Anatomy series. GPU memory bottlenecks are the silent killer of LLM performance - your GPU sits idle 98% of the time during operations like RMSNorm and Softmax. Kernel fusion is the fix.

This episode builds custom single-pass Triton kernels for RMSNorm and causal Softmax, processing data directly in ultra-fast SRAM instead of repeatedly hitting VRAM. The fused causal mask Softmax implementation scales cleanly and benchmarks at 2x faster than PyTorch - demonstrating that optimizing data movement, not just compute, is essential for modern AI models.
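The math that the fused RMSNorm kernel computes in a single pass through each row can be sketched in NumPy (a reference formula, not the episode's Triton code; `eps` is the usual numerical-stability constant):

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    """RMSNorm over the last axis. A fused kernel computes the mean of
    squares and the normalized, scaled output in one trip through the row,
    instead of writing intermediates back to VRAM between passes."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight
```

The fusion win is that `rms` never materializes as a separate tensor in global memory - it lives in SRAM for the lifetime of the row.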

Resources:

Bielik Anatomy 🧠 Ep.2 - How to Beat PyTorch? Fast MatMul Kernel in Triton

Second episode of the Bielik Anatomy series. Matrix multiplication is the heart of every Transformer model - if it’s slow, your model is slow. This episode builds a custom MatMul kernel in OpenAI Triton from scratch and aggressively optimizes it to match PyTorch’s performance on the GPU.

The episode covers writing a basic tiled kernel with masking, then applies four layers of optimization: Grouped Block Ordering to maximize L2 cache hits, switching to FP16 to unlock Tensor Cores, auto-tuning for parameter search, and pipelining with warp-level control - ending with a benchmark that matches PyTorch.
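The Grouped Block Ordering idea, as popularized by the standard Triton matmul tutorial, can be sketched as a plain index-remapping function (a sketch of the scheduling logic only, not the kernel itself): instead of visiting output tiles row by row, the launch grid is walked in groups of `group_size_m` rows so that consecutive program IDs reuse the same tiles of the input matrices while they are still hot in L2.

```python
def grouped_pid(pid, num_pid_m, num_pid_n, group_size_m):
    """Map a linear program id to a (pid_m, pid_n) tile coordinate,
    visiting tiles in column-major order within groups of group_size_m
    rows - consecutive pids then share input tiles, improving L2 reuse."""
    num_pid_in_group = group_size_m * num_pid_n
    group_id = pid // num_pid_in_group
    first_pid_m = group_id * group_size_m
    group_size = min(num_pid_m - first_pid_m, group_size_m)  # last group may be short
    pid_m = first_pid_m + (pid % group_size)
    pid_n = (pid % num_pid_in_group) // group_size
    return pid_m, pid_n
```

A quick sanity check is that the mapping is a bijection: every output tile is visited exactly once.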

Resources:

Bielik Anatomy 🧠 Ep.1 - Bielik LM in Triton - Can I Actually Pull This Off?

First episode of the Bielik Anatomy series - implementing the Polish language model Bielik 1.5 (1.6B parameters) from scratch using GPU kernels in OpenAI Triton.

This episode covers the Bielik 1.5 Instruct architecture, Grouped Query Attention (GQA) vs. Multi-Head Attention, SwiGLU activation and RMSNorm, and an introduction to GPU programming in Triton. It also lays out the full roadmap for the 8-episode series including Flash Attention, RoPE, and custom kernels.
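The key difference between GQA and Multi-Head Attention can be sketched in NumPy (shapes are illustrative, not Bielik's actual configuration): several query heads share a single key/value head, shrinking the KV cache by the group factor.

```python
import numpy as np

def gqa_scores(q, k):
    """Attention scores under Grouped Query Attention.

    q: (num_q_heads, seq, d), k: (num_kv_heads, seq, d), where
    num_q_heads is a multiple of num_kv_heads. Each KV head is shared
    by a group of num_q_heads // num_kv_heads query heads."""
    group = q.shape[0] // k.shape[0]
    k_rep = np.repeat(k, group, axis=0)   # broadcast each KV head to its group
    return q @ k_rep.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
```

With `num_kv_heads == num_q_heads` this reduces to ordinary Multi-Head Attention; with `num_kv_heads == 1` it becomes Multi-Query Attention.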

Resources:

Polish 🇵🇱 LLM Bielik v2.5 v2.6 v3.0 - Tool Calling and Structured Output 🚀

This video (in Polish 🇵🇱) explores the new capabilities introduced across Bielik versions 2.5, 2.6, and 3.0 - specifically tool calling and structured output.

The episode walks through running Bielik on free Google Colab with the Unsloth library, explains what tool calling and structured output are and how to use them in practice, and traces the prompt formats used under the hood for both features. A sample application called e-bazarek demonstrates tool calling end-to-end, while a separate example shows practical use cases for structured output.
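The mechanics of tool calling can be illustrated generically (this is not Bielik's actual prompt format, and `search_products` is a hypothetical tool in the spirit of the e-bazarek demo): the model emits a structured JSON object naming a tool and its arguments, which the application parses and dispatches.

```python
import json

# Hypothetical tool registry - in a real app these would hit a database or API.
TOOLS = {
    "search_products": lambda query: [f"local {query}"],
}

# A tool-calling model responds with structured JSON instead of free text.
model_output = '{"tool": "search_products", "arguments": {"query": "honey"}}'

call = json.loads(model_output)               # parse the model's tool call
result = TOOLS[call["tool"]](**call["arguments"])  # dispatch to the tool
```

Structured output works the same way in reverse: the model is constrained to emit JSON matching a schema, so `json.loads` on its reply is guaranteed to succeed.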

Resources: