Unveiling Position Encoding in Transformers - From Absolute to Relative with RoPE
Explore transformer position encodings, from absolute sinusoidal methods through T5 and XLNet innovations to Rotary Position Embedding (RoPE), detailing how complex exponentials and rotational geometry elegantly solve the challenge of encoding token positions in a way that preserves critical relative relationships.
- Introduction
- 1. Position Encoding in Transformers: Evolution and Approaches
- 2. Rotary Position Encoding: The Foundation
- 3. The Complex Number Approach
- 4. Complex Form: A Deeper Dive
- 5. Scaling to Higher Dimensions
- 6. RoPE: Relative Position Encoding
- 7. Advantages of Relative Position Encoding
- 8. Technical Implementation
- Conclusion
- References:
Introduction
Let's start with the mathematical foundations of absolute position encodings, examining the limitations that led to relative approaches, exploring complex number representations, and ultimately showing how these concepts enable transformer models to effectively capture positional relationships.
1. Position Encoding in Transformers: Evolution and Approaches
1.1 Absolute Position Encoding: The Original Approach
The original transformer model introduced sinusoidal position encodings:
$PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}})$ $PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}})$
Where $pos$ is the position and $i$ is the dimension. This encoding has two key properties:
- It provides a unique encoding for each position
- It may allow the model to extrapolate to sequence lengths longer than those seen during training
However, this absolute approach faces limitations when dealing with relative relationships between tokens, which are crucial for many linguistic structures.
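As a concrete reference, here is a minimal NumPy sketch of the sinusoidal table defined above; the function name and array layout are illustrative choices, not taken from any particular library.

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Build the (seq_len, d_model) sinusoidal position encoding matrix."""
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # 2i for each pair of dims
    angles = positions / (10000 ** (two_i / d_model))  # pos / 10000^(2i/d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)  # odd dimensions use cosine
    return pe

pe = sinusoidal_position_encoding(seq_len=128, d_model=64)
print(pe.shape)  # (128, 64)
```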
1.2 T5's Simplification: Decoupling Content and Position
T5 (Text-to-Text Transfer Transformer) simplified the attention mechanism by explicitly decoupling content and position. Start from the expansion of the attention score when absolute position embeddings $p_i$, $p_j$ are added to the token embeddings $x_i$, $x_j$:
$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$
This expands into four distinct terms:
- Input-input: $x_i W_Q W_K^T x_j^T$ (content-content interaction)
- Input-position: $x_i W_Q W_K^T p_j^T$ (content attends to position)
- Position-input: $p_i W_Q W_K^T x_j^T$ (position attends to content)
- Position-position: $p_i W_Q W_K^T p_j^T$ (position-position interaction)
T5's key insight was that the two cross terms (content-position and position-content) could be dropped entirely, while the position-position term $p_i W_Q W_K^T p_j^T$ depends only on the positions $i$ and $j$ and can therefore be replaced by a bias $B_{ij}$ trained as a parameter:
$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + B_{ij}$
Where $B_{ij} = f(i-j)$, a function assigning relative distances to "buckets." Each bucket has its own bias value, and all relative distances that fall into the same bucket share that value:
Example of bucketing:
- $i-j = 0: f(0) = 0$ (Same token)
- $i-j = 1: f(1) = 1$ (Adjacent tokens)
- $i-j = 2: f(2) = 2$ (2 tokens apart)
- ...
- $i-j = 8: f(8) = 8$ (8 tokens apart)
- $i-j = 9: f(9) = 8$ (Mapped to the same bucket as distance 8)
- $i-j = 10: f(10) = 8$ (Mapped to the same bucket as distance 8)
This bucketing approach maps small distances (0-7) to individual buckets, while larger distances map to broader buckets, capturing the intuition that precise positioning matters more for nearby tokens.
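A minimal sketch of the bucketing idea matching the example above (exact buckets for small distances, one shared bucket beyond a cutoff); T5's actual implementation uses more buckets, handles the sign of the distance, and spaces large-distance buckets logarithmically.

```python
def relative_bucket(distance: int, max_exact: int = 8) -> int:
    """Map a relative distance to a bucket index (simplified illustration)."""
    d = abs(distance)
    return d if d < max_exact else max_exact  # distances >= 8 share one bucket

# Each bucket owns a single learned bias value that is added to the attention score.
for d in [0, 1, 7, 8, 9, 15]:
    print(d, "->", relative_bucket(d))  # 0..7 map to themselves, 8+ collapse to 8
```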
1.3 XLNet-Style: Relative Position Encoding
XLNet introduced a more sophisticated approach to relative position encoding, starting with the expanded attention score:
$q_i \cdot k_j^T = (x_i + p_i) W_Q \cdot W_K^T (x_j + p_j)^T$
This expands to: $q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$
XLNet replaces $p_j$ with a relative position encoding $\mathbf{R}_{ij}$ (projected through its own matrix $W_{K,R}$) and replaces the $p_i W_Q$ factors with two global trainable vectors $u$ and $v$:
$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_{K,R}^T \mathbf{R}_{ij} + u W_K^T x_j^T + v W_{K,R}^T \mathbf{R}_{ij}$
Where:
- $u$ and $v$ are trainable vectors: a global content bias and a global position bias
- $\mathbf{R}_{ij}$ is a relative position encoding of the offset $i-j$ (fixed sinusoidal in the original formulation, learnable in some variants)
This approach addresses the challenge that encoding spaces may not match: the input embeddings live in one space (typically $\mathbb{R}^d$), while the relative position encoding $\mathbf{R}_{ij}$ is a vector that encodes distance. The projection matrices ($W_K$ for content, $W_{K,R}$ for relative positions) bring both into the same key space so they can interact in the attention calculation.
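To make the four terms concrete, here is a small sketch that assembles the XLNet-style score for a single $(i, j)$ pair; all names and shapes (d_model, d_k, W_kr, the random vectors) are illustrative assumptions, not taken from any library.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 16, 8

x_i, x_j = rng.normal(size=d_model), rng.normal(size=d_model)  # token contents
R_ij = rng.normal(size=d_model)         # relative position encoding for offset i - j
W_q = rng.normal(size=(d_model, d_k))   # query projection
W_k = rng.normal(size=(d_model, d_k))   # key projection for content
W_kr = rng.normal(size=(d_model, d_k))  # separate key projection for R_ij
u, v = rng.normal(size=d_k), rng.normal(size=d_k)  # global content / position biases

score = (
    (x_i @ W_q) @ (x_j @ W_k)      # content-content
    + (x_i @ W_q) @ (R_ij @ W_kr)  # content attends to relative position
    + u @ (x_j @ W_k)              # global content bias term
    + v @ (R_ij @ W_kr)            # global position bias term
)
print(float(score))
```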
1.4 The Complex Form Approach
Taking a step back, we can view relative position encoding through the lens of complex numbers, which provides an elegant mathematical framework.
For a vector $[x, y]$, we can treat this as a complex number $x + iy$. Using the natural representation of rotations in the complex plane, we can encode positional information by multiplying by a complex exponential:
$(x + iy) \cdot e^{in\theta} = (x + iy) \cdot (\cos(n\theta) + i\sin(n\theta))$
Where $n$ is the position and $\theta$ is a fixed angle (which could be position-dependent in more sophisticated schemes).
This naturally leads us to Rotary Position Embedding (RoPE), which builds on this complex number intuition but formalizes it for high-dimensional transformer embeddings.
2. Rotary Position Encoding: The Foundation
2.1 The Core Idea
The central insight of RoPE is elegantly simple: represent token positions as rotations in vector space. For a query (or key) vector $q = (q_0, q_1, ..., q_{d-2}, q_{d-1})$, we apply a rotation to each pair of dimensions, with an angle proportional to the token's position index.
Let's start with the rotation matrix $\mathbf{R}_m$ which is block-diagonal:
$$\mathbf{R}_m = \text{diag}(\mathbf{R}_m^0, \mathbf{R}_m^1, ..., \mathbf{R}_m^{d/2-1})$$
where each block $\mathbf{R}_m^i$ is a 2D rotation matrix:
$$\mathbf{R}_m^i = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$
Here, $m$ indexes the position, and $\theta_i$ is a fixed angle for dimension $i$.
2.2 Applying Rotations to Embeddings
In the context of transformers, we apply these rotations to query and key vectors before computing attention:
$$f(q, m) = \mathbf{R}_m q$$ $$f(k, n) = \mathbf{R}_n k$$
The dot product between transformed vectors then becomes:
$$\langle f(q,m), f(k,n) \rangle = q^T \mathbf{R}_m^T \mathbf{R}_n k = q^T \mathbf{R}_{n-m} k$$
This remarkable property shows that the attention score between positions $m$ and $n$ depends only on their relative distance $n-m$, not their absolute positions.
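A quick numeric check of this property for a single 2D rotation block; the helper below is an illustrative sketch, not a library function.

```python
import numpy as np

def rot(angle: float) -> np.ndarray:
    """2D rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(0)
q, k = rng.normal(size=2), rng.normal(size=2)
theta = 0.3

# Same relative offset n - m = 3 at very different absolute positions.
for m, n in [(2, 5), (12, 15), (100, 103)]:
    score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
    print(m, n, round(float(score), 6))  # identical for all three pairs

# Equivalently: q^T R_m^T R_n k == q^T R_{n-m} k
print(round(float(q @ rot(3 * theta) @ k), 6))
```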
3. The Complex Number Approach
3.1 Reframing in the Complex Domain
We can express the rotation operations more elegantly using complex numbers. Let's assume $f(q,m)$ has an exponential form analogous to rotation in the complex plane:
$$f(q,m) = qe^{im\theta}$$
where $q$ is a complex vector and $e^{im\theta}$ rotates $q$ by an angle $m\theta$.
3.2 Verifying the Properties
Let's verify that this form exhibits the desired behavior:
For a query at position $m$: $f(q, m) = qe^{im\theta}$ For a key at position $n$: $f(k, n) = ke^{in\theta}$
Computing the inner product: $$\langle f(q,m), f^*(k,n) \rangle = (qe^{im\theta})(k^*e^{-in\theta}) = qk^*e^{i(m-n)\theta}$$
Taking the real part: $$\text{Re}[qk^*e^{i(m-n)\theta}] = \text{Re}[qk^*]\cos((m-n)\theta) - \text{Im}[qk^*]\sin((m-n)\theta)$$
This formula confirms that the dot product result depends on $q$, $k$, and their relative position $m-n$.
3.3 Expanding the Multiplication
To understand how this works with real-valued vectors, let's expand the complex multiplication:
$$(x+iy)(\cos(n\theta) + i\sin(n\theta)) = x\cos(n\theta) - y\sin(n\theta) + i(x\sin(n\theta) + y\cos(n\theta))$$
Grouping real and imaginary parts:
- Real part: $x\cos(n\theta) - y\sin(n\theta)$
- Imaginary part: $x\sin(n\theta) + y\cos(n\theta)$
As a vector, this becomes: $$\begin{pmatrix} x\cos(n\theta) - y\sin(n\theta) \\ x\sin(n\theta) + y\cos(n\theta) \end{pmatrix}$$
This is equivalent to applying the rotation matrix: $$\begin{pmatrix} \cos(n\theta) & -\sin(n\theta) \\ \sin(n\theta) & \cos(n\theta) \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$
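The equivalence can be confirmed numerically: multiplying $x + iy$ by $e^{in\theta}$ yields the same two numbers as applying the rotation matrix to $[x, y]$. The values below are arbitrary illustrations.

```python
import numpy as np

x, y, n, theta = 1.5, -0.7, 3, 0.25

# Complex form: (x + iy) * e^{i n theta}
z = (x + 1j * y) * np.exp(1j * n * theta)

# Matrix form: rotate the vector [x, y] by the angle n * theta
c, s = np.cos(n * theta), np.sin(n * theta)
v = np.array([[c, -s], [s, c]]) @ np.array([x, y])

print(z.real, z.imag)  # matches v[0], v[1]
print(v)
```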
4. Complex Form: A Deeper Dive
4.1 Multiple Word Vectors and Their Representation
In this view, each word $j$ is assigned three vectors $(v_j, w_j, \theta_j)$, each with $d$ components, that are independent of its position:
- $v_{j,m}$: Magnitude for dimension $m$
- $w_{j,m}$: Frequency (how fast the phase changes with position)
- $θ_{j,m}$: Initial phase (starting angle)
For each word $j$ at position $k$, its embedding is a vector of $d$ complex numbers, where each component is:
$v_{j,m} \cdot e^{i(w_{j,m}k+θ_{j,m})}$
Breaking down this formula:
- The term $e^{i(w_{j,m}k+θ_{j,m})}$ is a point on the unit circle in the complex plane with angle $w_{j,m}k+θ_{j,m}$
- Multiplying by $v_{j,m}$ scales this point to have radius $v_{j,m}$
- As $k$ (position) changes, the angle increases by $w_{j,m}$ times the position shift, like a rotating arrow
Consider a word "cat" at position $k=1$ with dimension $m=1$. If $v_{j,1}=2, w_{j,1}=0.5, θ_{j,1}=0$, the component would be:
$2e^{i(0.5 \cdot 1 + 0)} = 2e^{i0.5} = 2(\cos(0.5) + i\sin(0.5))$
At $k=2$, it becomes $2e^{i1.0}$, rotating further. This form elegantly encodes where the word is positioned while preserving its content characteristics.
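Evaluating this "cat" example numerically (the values are the illustrative ones above, not learned parameters):

```python
import numpy as np

v, w, theta0 = 2.0, 0.5, 0.0  # magnitude, frequency, initial phase for dimension m = 1

for k in (1, 2):  # word positions
    component = v * np.exp(1j * (w * k + theta0))
    print(k, component)  # k=1 -> 2*e^{0.5i}, k=2 -> 2*e^{1.0i}: same radius, rotated phase
```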
4.2 Understanding the Two-Dimensional Vector Representation
To understand how these complex numbers are implemented in practice, we can view a 2D vector $[x,y]$ as the complex number $x+iy$.
Step 1: Treat the vector $[x, y]$ as the complex number $x + iy$.
Step 2: Multiply by a complex exponential:
$(x+iy)e^{inθ} = (x+iy)(\cos(nθ)+i\sin(nθ))$
Expanding this multiplication:
$(x+iy)(\cos(nθ)+i\sin(nθ)) = x\cos(nθ) + xi\sin(nθ) + iy\cos(nθ) + iy\cdot i\sin(nθ)$
Since $i^2 = -1$, substitute: $yi^2\sin(nθ) = y(-1)\sin(nθ) = -y\sin(nθ)$
Rewriting the expression: $x\cos(nθ) + xi\sin(nθ) + iy\cos(nθ) - y\sin(nθ)$
Grouping real and imaginary parts:
- Real part: $x\cos(nθ) - y\sin(nθ)$
- Imaginary part: $x\sin(nθ) + y\cos(nθ)$
As a vector, this becomes: $\begin{pmatrix} x\cos(nθ) - y\sin(nθ) \\ x\sin(nθ) + y\cos(nθ) \end{pmatrix}$
This is equivalent to applying the rotation matrix: $\begin{pmatrix} \cos(nθ) & -\sin(nθ) \\ \sin(nθ) & \cos(nθ) \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$
We've now established a direct link between complex number multiplication and rotation matrices, which is the foundation for RoPE.
5. Scaling to Higher Dimensions
5.1 Extending to d-dimensional Vectors
For a $d$-dimensional vector $q = (q_0, q_1, ..., q_{d-2}, q_{d-1})$, we group coordinates into pairs and apply a rotation to each pair:
- Split the $d$-dimensional vector into $d/2$ pairs: $(q_0, q_1), (q_2, q_3), ..., (q_{d-2}, q_{d-1})$
- Apply rotation to each pair with potentially different angles $\theta_k$
For example, with a 4D vector $[x_1, x_2, x_3, x_4]$ (a code sketch follows this list):
- Pair 1: $(x_1, x_2)$ → Rotate by $n\theta_1$
- Pair 2: $(x_3, x_4)$ → Rotate by $n\theta_2$
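Below is a minimal sketch of this pairwise rotation for a $d$-dimensional vector, using the common frequency schedule $\theta_k = 10000^{-2k/d}$; the interleaved pair-by-pair layout shown here is one of several equivalent conventions.

```python
import numpy as np

def rope_rotate(vec: np.ndarray, position: int) -> np.ndarray:
    """Rotate consecutive pairs (vec[0], vec[1]), (vec[2], vec[3]), ... by position * theta_k."""
    d = vec.shape[0]
    assert d % 2 == 0
    k = np.arange(d // 2)
    theta = 10000.0 ** (-2.0 * k / d)   # one angle increment per pair
    angles = position * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x, y = vec[0::2], vec[1::2]         # first / second element of each pair
    out = np.empty_like(vec)
    out[0::2] = x * cos - y * sin
    out[1::2] = x * sin + y * cos
    return out

q = np.array([1.0, 2.0, 3.0, 4.0])
print(rope_rotate(q, position=5))
```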
5.2 Position Encoding Scheme
The transformation encodes position $n$ into the vector by making the rotation angle $n\theta$ depend directly on position $n$:
- Absolute position: Each position gets a unique transformation
- Relative position: In attention mechanisms, we compute dot products between vectors (query $q_m$ and key $k_n$). The dot product of two rotated vectors includes terms like $\cos((m-n)\theta)$, which depends on the relative position $(m-n)$.
6. RoPE: Relative Position Encoding
6.1 Complex Form: A Unified Approach
Before diving into the specifics of RoPE, let's explore the complex form approach that combines complex numbers and positional encoding:
The idea is to represent a word $j$ at position $k$ with a vector of complex numbers:
$[v_{j,1}e^{i(w_{j,1}k+\theta_{j,1})}, v_{j,2}e^{i(w_{j,2}k+\theta_{j,2})}, ..., v_{j,d}e^{i(w_{j,d}k+\theta_{j,d})}]$
Where:
- $j$ is the word in vocabulary
- $k$ is the position of the word in the sentence
- $d$ is the number of dimensions in the embedding
- $v_j, w_j, \theta_j$ are three vectors, each with $d$ components, unique to word $j$
- $v_{j,m}$: Magnitude for dimension $m$
- $w_{j,m}$: Frequency (how fast the phase changes with position)
- $\theta_{j,m}$: Initial phase (starting angle)
For each word $j$ at position $k$, each component of its embedding is:
$v_{j,m}e^{i(w_{j,m}k+\theta_{j,m})}$
Breaking down this formula:
- The term $e^{i(w_{j,m}k+\theta_{j,m})}$ represents a point on the unit circle with angle $w_{j,m}k+\theta_{j,m}$
- As $k$ (position) changes, the angle increases by $w_{j,m}$ times the position shift
- The phase changes with position, encoding where the word is in a continuous and periodic way
- $v_{j,m}$ reflects the word's inherent strength or importance
6.2 Key Idea of RoPE
RoPE (Rotary Position Encoding) builds on these insights, modifying token embeddings by applying a function $f$ that embeds position:
$\tilde{q}_m = f(q, m)$ $\tilde{k}_n = f(k, n)$
The dot product becomes: $\langle \tilde{q}_m, \tilde{k}_n \rangle = \langle f(q,m), f(k,n) \rangle = g(q, k, m-n)$
Here, $g$ is some function of the original vectors $q$, $k$, and their relative distance $m-n$.
6.3 Long-Range Attenuation
RoPE exhibits a natural decay of influence with distance:
As $|m-n|$ grows, the per-pair terms oscillate at frequencies $\theta_i = 10000^{-2i/d}$ for dimension pair $i$; lower indices oscillate quickly while higher indices oscillate slowly, so the terms drift out of phase and their sum shrinks through cancellation. This decay aligns with the linguistic intuition that distant tokens usually matter less.
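A small numeric illustration of this attenuation, under the simplifying assumption that we just sum $\cos((m-n)\theta_i)$ over the RoPE frequencies: the sum is largest at distance 0 and, on the whole, shrinks as the distance grows.

```python
import numpy as np

d = 128
i = np.arange(d // 2)
theta = 10000.0 ** (-2.0 * i / d)  # RoPE frequency schedule

for dist in [0, 1, 4, 16, 64, 256]:
    envelope = np.cos(dist * theta).sum()  # crude proxy for the attention envelope
    print(dist, round(float(envelope), 2))
```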
6.4 Implementation in Attention
Linear attention approximates softmax attention for efficiency:
$$\text{Attention}(q_i, k_j, v_j) = \frac{\sum_j \phi(q_i)^T \phi(k_j) v_j}{\sum_j \phi(q_i)^T \phi(k_j)}$$
In RoPE's linear-attention formulation, the rotation is applied to the feature-mapped queries and keys in the numerator only, leaving the normalizing denominator unchanged while preserving the relative position property:
$$\text{Attention}(q_i, k_j, v_j) = \frac{\sum_j (\mathbf{R}_i \phi(q_i))^T (\mathbf{R}_j \phi(k_j)) v_j}{\sum_j \phi(q_i)^T \phi(k_j)}$$
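In the standard softmax setting, the implementation reduces to rotating each query and key by its own position and then scoring as usual. The sketch below reuses the pairwise rotation from Section 5 (redefined so the snippet is self-contained); the shapes and names are illustrative.

```python
import numpy as np

def rope(vec: np.ndarray, pos: int) -> np.ndarray:
    """Rotate consecutive pairs of vec by pos * theta_k (same scheme as in Section 5)."""
    d = vec.shape[0]
    k = np.arange(d // 2)
    ang = pos * 10000.0 ** (-2.0 * k / d)
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(vec)
    out[0::2] = vec[0::2] * c - vec[1::2] * s
    out[1::2] = vec[0::2] * s + vec[1::2] * c
    return out

rng = np.random.default_rng(0)
seq_len, d = 6, 8
Q = rng.normal(size=(seq_len, d))
K = rng.normal(size=(seq_len, d))

# Rotate each query/key by its own position, then compute scores as usual.
Qr = np.stack([rope(Q[m], m) for m in range(seq_len)])
Kr = np.stack([rope(K[n], n) for n in range(seq_len)])
scores = Qr @ Kr.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax
print(weights.shape)  # (6, 6)
```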
7. Advantages of Relative Position Encoding
7.1 Why Relative Position Works Better
Relative position encoding offers several compelling advantages:
- Invariance to absolute positions: For tokens at positions $(i,j)$ and $(i+k, j+k)$, the relative distance $i-j$ remains unchanged, ensuring the model generalizes across positions.
- Efficiency: By clipping relative distances to a fixed range (e.g., $\pm 2$), the number of position parameters stays constant instead of growing with sequence length $n$.
- Linguistic relevance: Syntactic dependencies (e.g., subject-verb) often depend on proximity. For "cat sat" at positions $(2,3)$ or $(3,4)$, the relative distance $i-j = -1$ ensures consistent attention weighting.
Example: When encoding "The cat sat quietly", the attention score between "cat" and "sat" uses the same relative distance encoding regardless of where they appear in the sentence, allowing the model to learn consistent syntactic relationships.
7.2 Justification
Claim: Relative position encoding captures dependencies based on token proximity, independent of absolute positions.
Argument:
- Invariance to shifts: For tokens at positions $(i,j)$ and $(i+k, j+k)$, the relative distance $i-j$ remains unchanged, so their attention score $a_{ij}$ is identical, ensuring the model generalizes across positions.
- Efficiency: By clipping distances to a fixed range (e.g., $-2$ to $+2$), we need only a constant number of embeddings regardless of sequence length.
- Linguistic justification: For "cat sat" at positions $(2,3)$ with $i-j = -1$, the encoding $\mathbf{R}_{2-3} = \mathbf{R}_{-1}$ emphasizes adjacency regardless of absolute positions.
8. Technical Implementation
8.1 Some Observations
A useful observation concerns how relative position information is used in the model:
→ The model uses relative position information when deciding which positions to attend to (via the attention scores $a_{ij}$), but when it retrieves information from those positions (via $v_j$), it uses only the content itself, not positional data.
This led to T5's simplification approach: decoupling content and position. In the full attention calculation, the attention score can be expanded as:
$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$
T5 noted that these can be interpreted as:
1) Content-Content Term: $(x_i W_Q W_K^T x_j^T)$
- Measures how much the content of token $i$ (e.g., cat) attends to the content of token $j$
2) Content-Position Term: $(x_i W_Q W_K^T p_j^T)$
- Measures how much the content of token $i$ attends to the absolute positions of token $j$
3) Position-Content Term: $(p_i W_Q W_K^T x_j^T)$
- Measures how much the absolute position of token $i$ attends to the content of token $j$
4) Position-Position Term: $(p_i W_Q W_K^T p_j^T)$
- Measures interaction between the absolute positions of $i$ & $j$
The idea: decouple content from position, so the cross terms (content-position and position-content) are dropped. The position-position term $(p_i W_Q W_K^T p_j^T)$ depends only on $(i,j)$ and can therefore be trained directly as a bias parameter.
8.2 Standard Self-Attention with Absolute Position Encoding
In standard transformer models with absolute position encoding:
$q_i = (x_i + p_i)W_Q$
$k_j = (x_j + p_j)W_K$
$v_j = (x_j + p_j)W_V$
$a_{ij} = \text{softmax}\left(\frac{q_i \cdot k_j^T}{\sqrt{d}}\right)$
$o_i = \sum_j a_{ij}v_j$
Expanding the attention score:
$q_i \cdot k_j^T = (x_i + p_i)W_Q W_K^T(x_j + p_j)^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$
This expands into four terms:
- Content-content interaction
- Content-position interaction
- Position-content interaction
- Position-position interaction
8.3 Learnable Relative Position Vectors
An additive family of relative position methods (distinct from RoPE's multiplicative rotations) replaces the absolute position terms with learnable relative position vectors:
- Remove position bias from queries: $q_i = x_i W_Q$
- Replace absolute positions with relative position vector $\mathbf{R}_{ij}^K$: $k_j = x_j W_K + \mathbf{R}_{ij}^K$
- Modify the attention score: $$a_{ij} = \text{softmax}\left(\frac{x_i W_Q (x_j W_K + \mathbf{R}_{ij}^K)^T}{\sqrt{d}}\right)$$
8.4 Defining Relative Position Vectors
The relative position vectors depend on $i-j$, clipped to a fixed range $[p_{min}, p_{max}]$:
$$\mathbf{R}_{ij}^K = P_K[\text{clip}(i-j, p_{min}, p_{max})]$$ $$\mathbf{R}_{ij}^V = P_V[\text{clip}(i-j, p_{min}, p_{max})]$$
Where $P_K$ and $P_V$ are learnable embeddings for each clipped distance.
Example: If $p_{min} = -2$ and $p_{max} = 2$, distances beyond $\pm 2$ are clipped (a code sketch follows this list):
- For $i-j = 5$, $\text{clip}(5, -2, 2) = 2$
- Only 5 embeddings are needed: -2, -1, 0, 1, 2
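A minimal sketch of the clip-and-lookup step; the table size, names, and random initialization are illustrative assumptions.

```python
import numpy as np

p_min, p_max, d_k = -2, 2, 8
rng = np.random.default_rng(0)

# One learnable embedding per clipped distance: -2, -1, 0, 1, 2 -> 5 rows.
P_K = rng.normal(size=(p_max - p_min + 1, d_k))

def relative_key_vector(i: int, j: int) -> np.ndarray:
    """Look up R_ij^K for the clipped relative distance i - j."""
    clipped = int(np.clip(i - j, p_min, p_max))
    return P_K[clipped - p_min]  # shift so that distance p_min maps to row 0

print(relative_key_vector(10, 5))  # i - j = 5, clipped to +2
print(relative_key_vector(3, 3))   # distance 0
```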
8.5 Why Projection Matrices Are Needed
An interesting question arises: why does the relative encoding $\mathbf{R}_{ij}$ need to pass through a projection matrix before it enters the attention score?
The key vector for position $j$ is computed as $k_j = x_j W_K$, where $W_K$ projects $x_j$ into the key space (e.g., $\mathbb{R}^{64}$).
The relative position encoding $\mathbf{R}_{ij}$ likewise needs to be brought into a compatible space so that it can interact with the query and key vectors in the attention score.
- $W_K^T$ → transpose of the key projection matrix used by $x_j$
- Key point: $\mathbf{R}_{ij}$ gets its own dedicated projection matrix, $W_{K,R}$
- $W_{K,R}$ is a separate matrix designed to transform $\mathbf{R}_{ij}$ into the key space, ensuring it aligns with dimensions and structure needed for attention calculation
Similarly with query-side projections:
- $u$ and $v$ are used as standalone vectors without needing additional transformations by $W_Q$
- In the attention score, $u$ interacts with $x_j W_K$ and $v$ interacts with projected $\mathbf{R}_{ij} W_{K,R}$
The full attention calculation with these projections becomes: $x_i W_Q W_K^T x_j^T + x_i W_Q W_{K,R}^T \mathbf{R}_{ij} + u W_K^T x_j^T + v W_{K,R}^T \mathbf{R}_{ij}$
Conclusion
We've taken a comprehensive journey through the evolution of position encoding in transformer models, from absolute position encodings in the original transformer, through T5's simplifications and XLNet's innovations, to the elegant mathematics of RoPE.
The key insights we've covered:
- The Problem: The inherent permutation invariance of attention mechanisms necessitates explicit position information
- Early Solutions: Absolute position encodings added to token embeddings worked but had limitations
- T5's Contribution: Decoupling content and position by simplifying the attention mechanism
- XLNet's Approach: Combining content and relative position information with dedicated parameter vectors
- Complex Numbers: Providing an elegant mathematical framework for understanding rotations
- RoPE's Innovation: Encoding positions as rotations in vector space that naturally preserve relative distances
- Practical Advantages: Relative position methods generalize better, require fewer parameters, and align with linguistic intuition
The mathematical elegance of position encoding methods, particularly RoPE, demonstrates how first principles can lead to powerful practical techniques. By embracing the underlying geometry of the problem, we gain a position encoding method that not only works well in practice but also has theoretical properties that justify its success.
The advantages of relative position encoding are clear:
- Generalization: Handles variable sentence lengths and structures
- Efficiency: Fewer parameters than absolute encoding
- Linguistic Relevance: Prioritizes local dependencies that are critical for syntax
As transformer architectures continue to evolve, understanding the mathematical foundations of components like position encoding becomes increasingly important for developing more efficient and effective models. The journey from absolute to relative position encodings illustrates how theoretical insights can lead to practical improvements in model performance and capabilities.