Introduction

Let's start with the mathematical foundations of absolute position encodings, examine the limitations that motivated relative approaches, explore complex-number representations, and ultimately show how these concepts enable transformer models to capture positional relationships effectively.

1. Position Encoding in Transformers: Evolution and Approaches

1.1 Absolute Position Encoding: The Original Approach

The original transformer model introduced sinusoidal position encodings:

$PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}})$ $PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}})$

Where $pos$ is the position and $i$ is the dimension. This encoding has two key properties:

  1. It provides a unique encoding for each position
  2. It may allow the model to extrapolate to sequence lengths longer than those seen during training
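
As a concrete illustration, here is a minimal NumPy sketch of this encoding (the function name and array shapes are illustrative, not taken from any particular implementation):

```python
import numpy as np

def sinusoidal_position_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Build the (max_len, d_model) sinusoidal position-encoding matrix."""
    positions = np.arange(max_len)[:, None]           # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2), i.e. 2i
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

pe = sinusoidal_position_encoding(max_len=128, d_model=64)
print(pe.shape)                          # (128, 64)
print(np.unique(pe, axis=0).shape[0])    # 128: every position gets a distinct encoding
```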

However, this absolute approach faces limitations when dealing with relative relationships between tokens, which are crucial for many linguistic structures.

1.2 T5's Simplification: Decoupling Content and Position

T5 (Text-to-Text Transfer Transformer) modified the attention mechanism by explicitly decoupling content and position. The starting point is the fully expanded attention score between positions $i$ and $j$:

$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$

This expands into four distinct terms:

  • Input-input: $x_i W_Q W_K^T x_j^T$ (content-content interaction)
  • Input-position: $x_i W_Q W_K^T p_j^T$ (content attends to position)
  • Position-input: $p_i W_Q W_K^T x_j^T$ (position attends to content)
  • Position-position: $p_i W_Q W_K^T p_j^T$ (position-position interaction)

T5's key insight was that the content-position and position-content terms can be dropped, and that the remaining position-position term $p_i W_Q W_K^T p_j^T$ depends only on the positions, so it can be replaced by a trainable bias $B_{ij}$:

$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + B_{ij}$

Where $B_{ij} = f(i-j)$, a function assigning relative distances to "buckets." Each bucket has its own bias value, and all relative distances that fall into the same bucket share that value:

Example of bucketing:

  • $i-j = 0: f(0) = 0$ (Same token)
  • $i-j = 1: f(1) = 1$ (Adjacent tokens)
  • $i-j = 2: f(2) = 2$ (2 tokens apart)
  • ...
  • $i-j = 8: f(8) = 8$ (8 tokens apart)
  • $i-j = 9: f(9) = 8$ (Mapped to the same bucket as distance 8)
  • $i-j = 10: f(10) = 8$ (Mapped to the same bucket as distance 8)

This bucketing approach maps small distances (0-7) to individual buckets, while larger distances map to broader buckets, capturing the intuition that precise positioning matters more for nearby tokens.
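Below is a minimal sketch of this simplified bucketing rule. Note that the actual T5 implementation uses log-spaced buckets for larger distances and treats negative offsets separately; the hard cutoff at distance 8 here simply mirrors the example above, and the bias table is a random stand-in for learned parameters.

```python
import numpy as np

def bucket(relative_distance: int, max_exact: int = 8) -> int:
    """Map a relative distance i - j to a bucket index.

    Distances 0..max_exact-1 each get their own bucket; anything farther
    shares the bucket `max_exact` (a simplification of T5's scheme that
    matches the example in the text).
    """
    d = abs(relative_distance)
    return d if d < max_exact else max_exact

num_buckets = 9
bias = np.random.randn(num_buckets)   # one (random placeholder) bias per bucket

for dist in [0, 1, 2, 8, 9, 10]:
    print(dist, "-> bucket", bucket(dist), "-> bias", round(bias[bucket(dist)], 3))
```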

1.3 XLNet-Style: Relative Position Encoding

XLNet introduced a more sophisticated approach to relative position encoding, starting with the expanded attention score:

$q_i \cdot k_j^T = (x_i + p_i) W_Q \cdot W_K^T (x_j + p_j)^T$

This expands to: $q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$

XLNet replaces $p_j$ with a relative position encoding $\mathbf{R}_{ij}$, replaces the position-dependent query terms with trainable vectors $u$ and $v$, and gives $\mathbf{R}_{ij}$ its own key-side projection $W_{K,R}$:

$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_{K,R}^T \mathbf{R}_{ij} + u W_K^T x_j^T + v W_{K,R}^T \mathbf{R}_{ij}$

Where:

  • $u$ and $v$ are trainable vectors that interact with content and position
  • $\mathbf{R}_{ij}$ is a relative position encoding that depends only on the offset $i - j$

This approach addresses the challenge that the encoding spaces may not match: the input embeddings live in one space (typically $\mathbb{R}^d$), while the relative position encoding $\mathbf{R}_{ij}$ is a vector that encodes distance. The projection matrices ($W_Q$, $W_K$, and the dedicated $W_{K,R}$) bring these into compatible spaces for the attention calculation; Section 8.5 returns to this point.

1.4 The Complex Form Approach

Taking a step back, we can view relative position encoding through the lens of complex numbers, which provides an elegant mathematical framework.

For a vector $[x, y]$, we can treat this as a complex number $x + iy$. Using the natural representation of rotations in the complex plane, we can encode positional information by multiplying by a complex exponential:

$(x + iy) \cdot e^{in\theta} = (x + iy) \cdot (\cos(n\theta) + i\sin(n\theta))$

Where $n$ is the position and $\theta$ is a fixed angle (which could be position-dependent in more sophisticated schemes).

This naturally leads us to Rotary Position Embedding (RoPE), which builds on this complex number intuition but formalizes it for high-dimensional transformer embeddings.

2. Rotary Position Encoding: The Foundation

2.1 The Core Idea

The central insight of RoPE is elegantly simple: represent token positions as rotations in vector space. For a query or key vector $q = (q_0, q_1, ..., q_{d-2}, q_{d-1})$, we apply a rotation to each pair of dimensions, with an angle proportional to the token's position index.

Let's start with the rotation matrix $\mathbf{R}_m$ which is block-diagonal:

$$\mathbf{R}_m = \text{diag}(\mathbf{R}_m^0, \mathbf{R}_m^1, ..., \mathbf{R}_m^{d/2-1})$$

where each block $\mathbf{R}_m^i$ is a 2D rotation matrix:

$$\mathbf{R}_m^i = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix}$$

Here, $m$ indexes the position, and $\theta_i$ is a fixed angle for dimension $i$.
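
A small sketch that materializes $\mathbf{R}_m$ explicitly can make the structure concrete (real implementations never build this matrix, and the choice $\theta_i = 10000^{-2i/d}$ is the standard RoPE schedule assumed here):

```python
import numpy as np

def rope_rotation_matrix(m: int, d: int) -> np.ndarray:
    """Explicit (d, d) block-diagonal rotation matrix R_m for position m."""
    assert d % 2 == 0
    thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)   # theta_i for each 2D block
    R = np.zeros((d, d))
    for i, theta in enumerate(thetas):
        c, s = np.cos(m * theta), np.sin(m * theta)
        R[2*i:2*i+2, 2*i:2*i+2] = np.array([[c, -s],
                                            [s,  c]])
    return R

R3 = rope_rotation_matrix(m=3, d=8)
print(R3.shape)                            # (8, 8)
print(np.allclose(R3 @ R3.T, np.eye(8)))   # True: rotations are orthogonal
```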

2.2 Applying Rotations to Embeddings

In the context of transformers, we apply these rotations to query and key vectors before computing attention:

$$f(q, m) = \mathbf{R}_m q$$ $$f(k, n) = \mathbf{R}_n k$$

The dot product between transformed vectors then becomes:

$$\langle f(q,m), f(k,n) \rangle = q^T \mathbf{R}_m^T \mathbf{R}_n k = q^T \mathbf{R}_{n-m} k$$

This remarkable property shows that the attention score between positions $m$ and $n$ depends only on their relative distance $n-m$, not their absolute positions.
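
This property is easy to check numerically; the sketch below rebuilds the explicit rotation matrix (so it stands alone) and compares the two sides on random vectors:

```python
import numpy as np

def rope_rotation_matrix(m: int, d: int) -> np.ndarray:
    """Block-diagonal rotation matrix R_m with theta_i = 10000^(-2i/d)."""
    thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
    R = np.zeros((d, d))
    for i, t in enumerate(thetas):
        c, s = np.cos(m * t), np.sin(m * t)
        R[2*i:2*i+2, 2*i:2*i+2] = [[c, -s], [s, c]]
    return R

d, m, n = 8, 5, 12
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)

score_absolute = (rope_rotation_matrix(m, d) @ q) @ (rope_rotation_matrix(n, d) @ k)
score_relative = q @ (rope_rotation_matrix(n - m, d) @ k)
print(np.allclose(score_absolute, score_relative))   # True: only n - m matters
```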

3. The Complex Number Approach

3.1 Reframing in the Complex Domain

We can express the rotation operations more elegantly using complex numbers. Let's assume $f(q,m)$ has an exponential form analogous to rotation in the complex plane:

$$f(q,m) = qe^{im\theta}$$

where $q$ is a complex vector and $e^{im\theta}$ rotates $q$ by an angle $m\theta$.

3.2 Verifying the Properties

Let's verify that this form exhibits the desired behavior:

For a query at position $m$: $f(q, m) = qe^{im\theta}$

For a key at position $n$: $f(k, n) = ke^{in\theta}$

Computing the (Hermitian) inner product, i.e., multiplying $f(q,m)$ by the conjugate of $f(k,n)$: $$\langle f(q,m), f(k,n) \rangle = f(q,m)\,\overline{f(k,n)} = (qe^{im\theta})(k^*e^{-in\theta}) = qk^*e^{i(m-n)\theta}$$

Taking the real part: $$\text{Re}[qk^*e^{i(m-n)\theta}] = \text{Re}[qk^*]\cos((m-n)\theta) - \text{Im}[qk^*]\sin((m-n)\theta)$$

This formula confirms that the dot product result depends on $q$, $k$, and their relative position $m-n$.
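
A quick numerical check of this claim, treating $q$ and $k$ as single complex numbers (scalar toy values chosen arbitrarily):

```python
import numpy as np

theta = 0.3
q = 1.2 + 0.7j           # toy "query" component
k = -0.4 + 0.9j          # toy "key" component

def score(m: int, n: int) -> float:
    """Real part of f(q, m) * conj(f(k, n)) = Re[q * conj(k) * e^{i(m-n)theta}]."""
    return np.real(q * np.exp(1j * m * theta) * np.conj(k * np.exp(1j * n * theta)))

# Shifting both positions by the same offset leaves the score unchanged.
print(np.isclose(score(2, 5), score(7, 10)))   # True: same m - n
print(np.isclose(score(2, 5), score(3, 5)))    # False: different m - n
```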

3.3 Expanding the Multiplication

To understand how this works with real-valued vectors, let's expand the complex multiplication:

$$(x+iy)(\cos(n\theta) + i\sin(n\theta)) = x\cos(n\theta) - y\sin(n\theta) + i(x\sin(n\theta) + y\cos(n\theta))$$

Grouping real and imaginary parts:

  • Real part: $x\cos(n\theta) - y\sin(n\theta)$
  • Imaginary part: $x\sin(n\theta) + y\cos(n\theta)$

As a vector, this becomes: $$\begin{pmatrix} x\cos(n\theta) - y\sin(n\theta) \\ x\sin(n\theta) + y\cos(n\theta) \end{pmatrix}$$

This is equivalent to applying the rotation matrix: $$\begin{pmatrix} \cos(n\theta) & -\sin(n\theta) \\ \sin(n\theta) & \cos(n\theta) \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$$

4. Complex Form: A Deeper Dive

4.1 Multiple Word Vectors and Their Representation

In this complex-form view, each word $j$ in the vocabulary is assigned three vectors $(v_j, w_j, θ_j)$, each with $d$ components, that do not depend on position:

  • $v_{j,m}$: Magnitude for dimension $m$
  • $w_{j,m}$: Frequency (how fast the phase changes with position)
  • $θ_{j,m}$: Initial phase (starting angle)

For each word $j$ at position $k$, its embedding is a vector of $d$ complex numbers, where each component is:

$v_{j,m} \cdot e^{i(w_{j,m}k+θ_{j,m})}$

Breaking down this formula:

  • The term $e^{i(w_{j,m}k+θ_{j,m})}$ is a point on the unit circle in the complex plane with angle $w_{j,m}k+θ_{j,m}$
  • Multiplying by $v_{j,m}$ scales this point to have radius $v_{j,m}$
  • As $k$ (position) changes, the angle increases by $w_{j,m}$ times the position shift, like a rotating arrow

Consider a word "cat" at position $k=1$ with dimension $m=1$. If $v_{j,1}=2, w_{j,1}=0.5, θ_{j,1}=0$, the component would be:

$2e^{i(0.5 \cdot 1 + 0)} = 2e^{i0.5} = 2(\cos(0.5) + i\sin(0.5))$

At $k=2$, it becomes $2e^{i1.0}$, rotating further. This form elegantly encodes where the word is positioned while preserving its content characteristics.
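
The same arithmetic in code, using the numbers from the example (NumPy's complex type stands in for the hand calculation):

```python
import numpy as np

v, w, theta = 2.0, 0.5, 0.0        # magnitude, frequency, initial phase

for k in (1, 2):
    component = v * np.exp(1j * (w * k + theta))
    print(f"k={k}: {component.real:.3f} + {component.imag:.3f}i "
          f"(angle {np.angle(component):.2f} rad)")   # angle 0.50 then 1.00
```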

4.2 Understanding the Two-Dimensional Vector Representation

To see how these complex numbers are implemented in practice, we can again view a 2D vector $[x,y]$ as the complex number $x+iy$ and walk through the multiplication step by step (expanding on the derivation in Section 3.3).

Step 1: Treat the vector $[x, y]$ as the complex number $x + iy$.

Step 2: Multiply by a complex exponential:

$(x+iy)e^{inθ} = (x+iy)(\cos(nθ)+i\sin(nθ))$

Expanding this multiplication:

$(x+iy)(\cos(nθ)+i\sin(nθ)) = x\cos(nθ) + xi\sin(nθ) + iy\cos(nθ) + iy\cdot i\sin(nθ)$

Since $i^2 = -1$, substitute: $yi^2\sin(nθ) = y(-1)\sin(nθ) = -y\sin(nθ)$

Rewriting the expression: $x\cos(nθ) + xi\sin(nθ) + iy\cos(nθ) - y\sin(nθ)$

Grouping real and imaginary parts:

  • Real part: $x\cos(nθ) - y\sin(nθ)$
  • Imaginary part: $x\sin(nθ) + y\cos(nθ)$

As a vector, this becomes: $\begin{pmatrix} x\cos(nθ) - y\sin(nθ) \\ x\sin(nθ) + y\cos(nθ) \end{pmatrix}$

This is equivalent to applying the rotation matrix: $\begin{pmatrix} \cos(nθ) & -\sin(nθ) \\ \sin(nθ) & \cos(nθ) \end{pmatrix} \begin{pmatrix} x \\ y \end{pmatrix}$

We've now established a direct link between complex number multiplication and rotation matrices, which is the foundation for RoPE.
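
This equivalence can be confirmed in a few lines; the toy values below are arbitrary, and the check simply compares the two routes to the same pair of numbers:

```python
import numpy as np

x, y, n, theta = 0.8, -1.3, 4, 0.25

# Complex route: (x + iy) * e^{i n theta}
z = (x + 1j * y) * np.exp(1j * n * theta)
via_complex = np.array([z.real, z.imag])

# Matrix route: 2x2 rotation applied to [x, y]
c, s = np.cos(n * theta), np.sin(n * theta)
via_matrix = np.array([[c, -s], [s, c]]) @ np.array([x, y])

print(np.allclose(via_complex, via_matrix))   # True
```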

5. Scaling to Higher Dimensions

5.1 Extending to d-dimensional Vectors

For a $d$-dimensional vector $q = (q_0, q_1, ..., q_{d-2}, q_{d-1})$, we group coordinates into pairs and apply a rotation to each pair:

  1. Split the $d$-dimensional vector into $d/2$ pairs: $(q_0, q_1), (q_2, q_3), ..., (q_{d-2}, q_{d-1})$
  2. Apply rotation to each pair with potentially different angles $\theta_k$

For example, with a 4D vector $[x_1, x_2, x_3, x_4]$:

  • Pair 1: $(x_1, x_2)$ → Rotate by $n\theta_1$
  • Pair 2: $(x_3, x_4)$ → Rotate by $n\theta_2$
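
Continuing this example, here is a sketch of the pairwise rotation applied without building any matrix (adjacent dimensions are paired, and $\theta_i = 10000^{-2i/d}$ is assumed as the angle schedule):

```python
import numpy as np

def apply_rope(x: np.ndarray, n: int) -> np.ndarray:
    """Rotate each adjacent pair (x[2i], x[2i+1]) by angle n * theta_i."""
    d = x.shape[0]
    thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
    c, s = np.cos(n * thetas), np.sin(n * thetas)
    out = np.empty_like(x, dtype=float)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

x = np.array([1.0, 2.0, 3.0, 4.0])
print(apply_rope(x, n=0))   # position 0: vector unchanged
print(apply_rope(x, n=5))   # position 5: each pair rotated by its own angle
```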

5.2 Position Encoding Scheme

The transformation encodes position $n$ into the vector by making the rotation angle $n\theta$ depend directly on position $n$:

  • Absolute position: Each position gets a unique transformation
  • Relative position: In attention mechanisms, we compute dot products between vectors (query $q_m$ and key $k_n$). The dot product of two rotated vectors includes terms like $\cos((m-n)\theta)$, which depends on the relative position $(m-n)$.

6. RoPE: Relative Position Encoding

6.1 Complex Form: A Unified Approach

Before diving into the specifics of RoPE, let's briefly recap the complex-form representation from Section 4, which combines complex numbers and positional encoding:

The idea is to represent a word $j$ at position $k$ with a vector of complex numbers:

$[v_{j,1}e^{i(w_{j,1}k+\theta_{j,1})}, v_{j,2}e^{i(w_{j,2}k+\theta_{j,2})}, ..., v_{j,d}e^{i(w_{j,d}k+\theta_{j,d})}]$

Where:

  • $j$ is the word in vocabulary
  • $k$ is the position of the word in the sentence
  • $d$ is the number of dimensions in the embedding
  • $v_j, w_j, \theta_j$ are three vectors, each with $d$ components, unique to word $j$
    • $v_{j,m}$: Magnitude for dimension $m$
    • $w_{j,m}$: Frequency (how fast the phase changes with position)
    • $\theta_{j,m}$: Initial phase (starting angle)

For each word $j$ at position $k$, each component of its embedding is:

$v_{j,m}e^{i(w_{j,m}k+\theta_{j,m})}$

Breaking down this formula:

  • The term $e^{i(w_{j,m}k+\theta_{j,m})}$ represents a point on the unit circle with angle $w_{j,m}k+\theta_{j,m}$
  • As $k$ (position) changes, the angle increases by $w_{j,m}$ times the position shift
  • The phase changes with position, encoding where the word is in a continuous and periodic way
  • $v_{j,m}$ reflects the word's inherent strength or importance

6.2 Key Idea of RoPE

RoPE (Rotary Position Encoding) builds on these insights, modifying token embeddings by applying a function $f$ that embeds position:

$\tilde{q}_m = f(q, m)$ $\tilde{k}_n = f(k, n)$

The dot product becomes: $\langle \tilde{q}_m, \tilde{k}_n \rangle = \langle f(q,m), f(k,n) \rangle = g(q, k, m-n)$

Here, $g$ is some function of the original vectors $q$, $k$, and their relative distance $m-n$.

6.3 Long-Range Attenuation

RoPE implements a natural decay of influence with distance:

As $|m-n|$ grows, the per-pair terms oscillate at frequencies $\theta_i = 10000^{-2i/d}$, where $i$ indexes the dimension pair. Lower $i$ means a larger $\theta_i$ and hence faster oscillation; because the pairs oscillate at different rates, their contributions increasingly cancel as the relative distance grows, so the aggregate score tends to shrink. This decay aligns with the linguistic intuition that distant tokens matter less.
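
A rough numerical illustration of this cancellation: the magnitude of $\sum_i e^{i(m-n)\theta_i}$, which is what the score would be if every product $q_i \bar{k}_i$ equaled 1, tends to shrink as the relative distance grows. This is only an idealized proxy, not the formal bound derived in the RoPE paper.

```python
import numpy as np

d_model = 128
thetas = 10000.0 ** (-2.0 * np.arange(d_model // 2) / d_model)

for dist in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    magnitude = np.abs(np.sum(np.exp(1j * dist * thetas)))
    print(f"relative distance {dist:4d}: |sum| = {magnitude:8.2f}")
```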

6.4 Implementation in Attention

Linear attention approximates softmax attention for efficiency:

$$\text{Attention}(q_i, k_j, v_j) = \frac{\sum_j \phi(q_i)^T \phi(k_j) v_j}{\sum_j \phi(q_i)^T \phi(k_j)}$$

In the RoPE formulation for linear attention, the rotation is applied to the feature-mapped queries and keys in the numerator, while the denominator is left unrotated so that it remains a positive normalizer:

$$\text{Attention}(q_i, k_j, v_j) = \frac{\sum_j \big(\mathbf{R}_m \phi(q_i)\big)^T \big(\mathbf{R}_n \phi(k_j)\big) v_j}{\sum_j \phi(q_i)^T \phi(k_j)}$$
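
A compact sketch of this combination for a single query, assuming $\phi(x) = \text{elu}(x) + 1$ as the feature map (a common choice, not one mandated by RoPE) and the pairwise rotation from earlier:

```python
import numpy as np

def phi(x):
    """elu(x) + 1: a positive feature map often used with linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def rope(x, pos, thetas):
    """Rotate adjacent pairs of x by pos * theta_i (same scheme as above)."""
    c, s = np.cos(pos * thetas), np.sin(pos * thetas)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * c - x[1::2] * s
    out[1::2] = x[0::2] * s + x[1::2] * c
    return out

d, seq_len, m = 8, 6, 3
thetas = 10000.0 ** (-2.0 * np.arange(d // 2) / d)
rng = np.random.default_rng(0)
q = rng.normal(size=d)                      # query at position m
keys = rng.normal(size=(seq_len, d))        # keys at positions 0..seq_len-1
values = rng.normal(size=(seq_len, d))

rq = rope(phi(q), m, thetas)                # rotated, feature-mapped query
num, den = np.zeros(d), 0.0
for n in range(seq_len):
    num += (rq @ rope(phi(keys[n]), n, thetas)) * values[n]   # rotated numerator
    den += phi(q) @ phi(keys[n])                              # plain normalizer
print(num / den)                            # output for the query at position m
```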

7. Advantages of Relative Position Encoding

7.1 Why Relative Position Works Better

Relative position encoding offers several compelling advantages:

  1. Invariance to Absolute Positions: For tokens at positions $(i,j)$ and $(i+k, j+k)$, the relative distance $(i-j)$ remains unchanged, ensuring the model generalizes across positions.

  2. Efficiency: By clipping relative distances (e.g., to $\pm 2$), the number of positional parameters stays constant rather than growing with the sequence length $n$.

  3. Linguistic Relevance: Syntactic dependencies (e.g., subject-verb) often depend on proximity. For "cat sat" at positions $(2,3)$ or $(3,4)$, the relative distance $i-j = -1$ ensures consistent attention weighting.

Example: When encoding "The cat sat quietly", the attention score between "cat" and "sat" uses the same relative distance encoding regardless of where they appear in the sentence, allowing the model to learn consistent syntactic relationships.

7.2 Mathematical Proof

Claim: Relative position encoding captures dependencies based on token proximity, independent of absolute positions.

Proof:

  1. Invariance to Shifts: For tokens at positions $(i,j)$ and $(i+k, j+k)$, the relative distance $(i-j)$ remains unchanged. Their attention score $a_{ij}$ is identical, ensuring the model generalizes across positions.

  2. Efficiency: By clipping distances to a fixed range (e.g., $-2$ to $+2$), we need only a constant number of embeddings regardless of sequence length.

  3. Linguistic Justification: For "cat sat" at positions $(2,3)$ or $(3,4)$, the offset is $i-j = -1$ in both cases, so the same encoding $\mathbf{R}_{-1}$ emphasizes adjacency regardless of absolute position.

8. Technical Implementation

8.1 Some Observations

Another useful observation relates to how relative position information is used in the model:

→ The model uses relative position information to decide which positions to pay attention to (in the attention scores), but when it retrieves information from those positions (via $v_j$), it uses only the content itself, not positional data.

This led to T5's simplification approach: decoupling content and position. In the full attention calculation, the attention score can be expanded as:

$q_i \cdot k_j^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$

T5 noted that these can be interpreted as:

1) Content-Content Term: $(x_i W_Q W_K^T x_j^T)$

  • Measures how much the content of token $i$ (e.g., cat) attends to the content of token $j$

2) Content-Position Term: $(x_i W_Q W_K^T p_j^T)$

  • Measures how much the content of token $i$ attends to the absolute position of token $j$

3) Position-Content Term: $(p_i W_Q W_K^T x_j^T)$

  • Measures how much the absolute position of token $i$ attends to the content of token $j$

4) Position-Position Term: $(p_i W_Q W_K^T p_j^T)$

  • Measures interaction between the absolute positions of $i$ & $j$

The idea: decouple content from position → there should be no interaction between the content-position and position-content terms, so they are dropped. The remaining position-position term $(p_i W_Q W_K^T p_j^T)$ depends only on the position pair $(i,j)$ — in practice only on $i-j$ — and can be trained directly as a bias parameter.

8.2 Standard Self-Attention with Absolute Position Encoding

In standard transformer models with absolute position encoding:

$q_i = (x_i + p_i)W_Q$

$k_j = (x_j + p_j)W_K$

$v_j = (x_j + p_j)W_V$

$a_{ij} = \text{softmax}\left(\frac{q_i \cdot k_j^T}{\sqrt{d}}\right)$

$o_i = \sum_j a_{ij}v_j$

Expanding the attention score:

$q_i \cdot k_j^T = (x_i + p_i)W_Q W_K^T(x_j + p_j)^T = x_i W_Q W_K^T x_j^T + x_i W_Q W_K^T p_j^T + p_i W_Q W_K^T x_j^T + p_i W_Q W_K^T p_j^T$

This expands into four terms:

  1. Content-content interaction
  2. Content-position interaction
  3. Position-content interaction
  4. Position-position interaction
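
The expansion is easy to verify numerically; the matrices below are random toy stand-ins, and the check only confirms the algebra:

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=d), rng.normal(size=d)    # token embeddings
p_i, p_j = rng.normal(size=d), rng.normal(size=d)    # absolute position embeddings
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

full = (x_i + p_i) @ W_Q @ W_K.T @ (x_j + p_j)

four_terms = (x_i @ W_Q @ W_K.T @ x_j    # content-content
              + x_i @ W_Q @ W_K.T @ p_j  # content-position
              + p_i @ W_Q @ W_K.T @ x_j  # position-content
              + p_i @ W_Q @ W_K.T @ p_j) # position-position

print(np.isclose(full, four_terms))   # True
```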

8.3 Modifying Attention with Relative Position Vectors

An alternative line of relative position encoding, which predates RoPE's rotations, replaces the absolute position terms with learnable relative position vectors:

  1. Remove position bias from queries: $q_i = x_i W_Q$
  2. Replace absolute positions with relative position vector $\mathbf{R}_{ij}^K$: $k_j = x_j W_K + \mathbf{R}_{ij}^K$
  3. Modify the attention score: $$a_{ij} = \text{softmax}\left(\frac{x_i W_Q (x_j W_K + \mathbf{R}_{ij}^K)^T}{\sqrt{d}}\right)$$

8.4 Defining Relative Position Vectors

The relative position vectors depend on $i-j$, clipped to a fixed range $[p_{min}, p_{max}]$:

$$\mathbf{R}_{ij}^K = P_K[\text{clip}(i-j, p_{min}, p_{max})]$$ $$\mathbf{R}_{ij}^V = P_V[\text{clip}(i-j, p_{min}, p_{max})]$$

Where $P_K$ and $P_V$ are learnable embeddings for each clipped distance.

Example: If $p_{min} = -2$ and $p_{max} = 2$, distances beyond $\pm 2$ are clipped:

  • For $i-j = 5$, $\text{clip}(5, -2, 2) = 2$
  • Only 5 embeddings are needed: -2, -1, 0, 1, 2
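
A minimal sketch of this lookup, with a random table standing in for the learned embeddings $P_K$ (indices are shifted by $-p_{min}$ so that distance $-2$ maps to row 0):

```python
import numpy as np

p_min, p_max, d_head = -2, 2, 8
num_embeddings = p_max - p_min + 1                    # 5 embeddings: -2..2
rng = np.random.default_rng(0)
P_K = rng.normal(size=(num_embeddings, d_head))       # stand-in for the learned table

def relative_key_bias(i: int, j: int) -> np.ndarray:
    clipped = int(np.clip(i - j, p_min, p_max))
    return P_K[clipped - p_min]                       # shift so -2 -> row 0

print(relative_key_bias(10, 5).shape)                                   # (8,)
print(np.allclose(relative_key_bias(10, 5), relative_key_bias(7, 5)))   # True: both clip to +2
```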

8.5 Why Projection Matrices Are Needed

An interesting question arises: why is $\mathbf{R}_{ij}$ preceded by projection matrices such as $W_{K,R}^T$ in the attention score, rather than used directly?

The key vector for position $j$ is computed as $k_j = x_j W_K$, where $W_K$ projects $x_j$ into the key space (e.g., $\mathbb{R}^{64}$).

For the relative position encoding $\mathbf{R}_{ij}$, we need to project it into a compatible space so that it can interact with the query and key vectors in the attention score.

  • $W_K^T$ → transpose of the key projection matrix used by $x_j$
  • Key point: $\mathbf{R}_{ij}$ gets its own dedicated projection matrix, $W_{K,R}$
  • $W_{K,R}$ is a separate matrix designed to transform $\mathbf{R}_{ij}$ into the key space, ensuring it aligns with dimensions and structure needed for attention calculation

Similarly with query-side projections:

  • $u$ and $v$ are used as standalone vectors without needing additional transformations by $W_Q$
  • In the attention score, $u$ interacts with $x_j W_K$ and $v$ interacts with projected $\mathbf{R}_{ij} W_{K,R}$

The full attention calculation with these projections becomes: $x_i W_Q W_K^T x_j^T + x_i W_Q W_{K,R}^T \mathbf{R}_{ij} + u W_K^T x_j^T + v W_{K,R}^T \mathbf{R}_{ij}$
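
Putting these pieces together, here is a toy evaluation of that score for a single $(i, j)$ pair. Everything is random: $\mathbf{R}_{ij}$ is just a stand-in for the relative encoding of offset $i-j$, and no softmax or scaling is applied.

```python
import numpy as np

d = 16
rng = np.random.default_rng(0)
x_i, x_j = rng.normal(size=d), rng.normal(size=d)     # content embeddings
R_ij = rng.normal(size=d)                             # relative encoding for offset i - j
W_Q, W_K, W_KR = (rng.normal(size=(d, d)) for _ in range(3))
u, v = rng.normal(size=d), rng.normal(size=d)         # trainable global vectors

score = (x_i @ W_Q @ W_K.T @ x_j        # content-content
         + x_i @ W_Q @ W_KR.T @ R_ij    # content attends to relative position
         + u @ W_K.T @ x_j              # global content bias
         + v @ W_KR.T @ R_ij)           # global position bias
print(score)
```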

Conclusion

We've taken a comprehensive journey through the evolution of position encoding in transformer models, from absolute position encodings in the original transformer, through T5's simplifications and XLNet's innovations, to the elegant mathematics of RoPE.

The key insights we've covered:

  1. The Problem: The inherent permutation invariance of attention mechanisms necessitates explicit position information
  2. Early Solutions: Absolute position encodings added to token embeddings worked but had limitations
  3. T5's Contribution: Decoupling content and position by simplifying the attention mechanism
  4. XLNet's Approach: Combining content and relative position information with dedicated parameter vectors
  5. Complex Numbers: Providing an elegant mathematical framework for understanding rotations
  6. RoPE's Innovation: Encoding positions as rotations in vector space that naturally preserve relative distances
  7. Practical Advantages: Relative position methods generalize better, require fewer parameters, and align with linguistic intuition

The mathematical elegance of position encoding methods, particularly RoPE, demonstrates how first principles can lead to powerful practical techniques. By embracing the underlying geometry of the problem, we gain a position encoding method that not only works well in practice but also has theoretical properties that justify its success.

The advantages of relative position encoding are clear:

  • Generalization: Handles variable sentence lengths and structures
  • Efficiency: Fewer parameters than absolute encoding
  • Linguistic Relevance: Prioritizes local dependencies that are critical for syntax

As transformer architectures continue to evolve, understanding the mathematical foundations of components like position encoding becomes increasingly important for developing more efficient and effective models. The journey from absolute to relative position encodings illustrates how theoretical insights can lead to practical improvements in model performance and capabilities.