Bielik Anatomy 🧠 Ep.4 - Implementing RoPE: From Mathematical Formula to Triton Code

26 Mar 2026

The fourth episode of the Bielik Anatomy series. Rotary Position Embeddings (RoPE) give Transformers the positional awareness needed to distinguish “Dog bites man” from “Man bites dog”. This episode implements them from scratch in OpenAI Triton.

The episode starts with the core RoPE math, then builds a custom zero-allocation Triton kernel that handles sequence data efficiently. But it goes deeper than a coding tutorial: benchmarking reveals a surprising 670 GB/s bandwidth anomaly, which leads to a real hardware profiling session with NVIDIA Nsight Compute (ncu). The investigation uncovers the “L2 Cache Illusion”, wave quantization, and kernel launch overhead. It is a practical lesson in why software benchmarks can’t be trusted without an understanding of GPU memory architecture.
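For readers who want the core RoPE math in front of them before watching, here is a minimal pure-Python reference sketch. It is not the Triton kernel from the episode. It assumes the common formulation in which consecutive pairs of features are rotated by an angle of position × base^(-2i/d), with base 10000. The function names `rope_rotate` and `dot`, the interleaved pairing, and the base value are illustrative assumptions, and the episode's kernel may lay out the pairs differently.

```python
import math


def rope_rotate(x, position, base=10000.0):
    """Rotate one head vector x for a given token position.

    Assumption: features are paired as (x[0], x[1]), (x[2], x[3]), ...
    and pair i is rotated by the angle position * base**(-2*i/d).
    """
    d = len(x)
    if d % 2 != 0:
        raise ValueError("head dimension must be even")
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[2 * i], x[2 * i + 1]
        # Standard 2D rotation of the pair (a, b) by theta.
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out


def dot(u, v):
    """Plain dot product, standing in for one attention score."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    q = [1.0, 2.0, 3.0, 4.0]
    k = [0.5, -1.0, 2.0, 0.25]
    # Both query/key pairs are 2 positions apart, so the scores should match
    # up to floating-point rounding.
    print(dot(rope_rotate(q, 5), rope_rotate(k, 3)))
    print(dot(rope_rotate(q, 12), rope_rotate(k, 10)))
```

The two printed scores should agree up to floating-point rounding, because rotating the query and the key by their own positions leaves a dot product that depends only on the offset between them. That relative-position property is what lets attention tell word orders apart.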

Resources:

GitHub Repo

Qooba

Related Posts

Bielik Anatomy 🧠 Ep.7 - Assembling a Full LLM from Custom Triton Kernels 27 Jun 2026

Bielik Anatomy 🧠 Ep.6 - SwiGLU Kernel Fusion: 30% Faster with Triton on RTX 4060 16 May 2026

Bielik Anatomy 🧠 Ep.5 - Flash Attention vs Standard Attention | 20x Faster in Triton 02 May 2026