Bielik Anatomy 🧠 Ep.4 - Implementing RoPE: From Mathematical Formula to Triton Code

Fourth episode of the Bielik Anatomy series. Rotary Position Embeddings (RoPE) give Transformers the positional awareness needed to distinguish “Dog bites man” from “Man bites dog” - and this episode implements them from scratch in OpenAI Triton.

The episode starts with the core RoPE math, then builds a custom zero-allocation Triton kernel that handles sequence data efficiently. But it goes deeper than a coding tutorial: benchmarking reveals a surprising 670 GB/s bandwidth anomaly, which leads to a real hardware profiling session with NVIDIA Nsight Compute (ncu). The investigation uncovers the “L2 Cache Illusion”, wave quantization, and kernel launch overhead - a practical lesson in why software benchmarks can’t be trusted without an understanding of GPU memory architecture.
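For readers who want the core RoPE math in front of them before watching, here is a minimal pure-Python sketch. It is not the Triton kernel from the episode: the function names `rope_rotate` and `dot` are hypothetical, the base of 10000 is the commonly used default, and the pairing of adjacent elements (2i, 2i+1) is an assumption, since some implementations pair the first and second halves of the vector instead.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent pair (vec[2i], vec[2i+1]) by pos * base**(-2i/d).

    Assumption: interleaved pairing; a half-split layout is equally common.
    """
    d = len(vec)
    assert d % 2 == 0, "head dimension must be even"
    out = [0.0] * d
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)  # angle for this frequency band
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s          # standard 2D rotation
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Illustrative query/key vectors (made-up values, head dimension 4).
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Both pairs of positions are 2 tokens apart, so the scores should match.
score_a = dot(rope_rotate(q, 5), rope_rotate(k, 3))
score_b = dot(rope_rotate(q, 12), rope_rotate(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

The point of the sketch is the property that makes RoPE useful: after rotation, the query-key dot product depends only on the distance between the two positions, not on where they sit in the sequence.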

Resources: