Introduction

When we process sequential data like text, the order of elements matters tremendously. The sentence "dog bites man" conveys a very different meaning than "man bites dog," despite containing exactly the same words. This simple example highlights why position information is crucial for understanding sequences.

Transformer architectures revolutionized natural language processing with their self-attention mechanism, allowing models to process entire sequences in parallel rather than sequentially like RNNs. However, this parallelization comes with a challenge: self-attention is inherently position-agnostic. If we simply feed word embeddings into a transformer, the model has no way of knowing which word came first, second, or last.

This position-blindness creates a fundamental problem. How can transformers understand the sequential nature of language without sacrificing their parallelization advantage? The answer lies in positional encodings - specially designed vectors that inject position information into the model. In this blog, we'll take a deep mathematical dive into how transformer positional encodings work, particularly focusing on the elegant sinusoidal solution presented in the "Attention Is All You Need" paper.

Let's explore the mathematical elegance that allows transformers to understand sequence order while maintaining their computational advantages.

The Challenge of Position in Self-Attention

To understand why positional encodings are necessary, we must first examine how the self-attention mechanism works. In its most basic form, self-attention operates on a set of input vectors and computes weighted connections between them.

In the self-attention mechanism, input tokens are converted to query (q), key (k), and value (v) vectors through linear transformations. The attention weights are then computed via dot products between queries and keys, determining how much each token should "attend" to other tokens.

Mathematically, with the queries, keys, and values for all tokens stacked into the matrices Q, K, and V, the attention mechanism computes:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

The critical observation is that this formulation is permutation equivariant: if we shuffle the tokens in the input sequence, the rows of Q, K, and V are shuffled in exactly the same way, and the output is simply the original output with its rows re-ordered. Nothing in the core attention mechanism encodes or preserves information about the absolute or relative positions of tokens.

As noted in our analysis: "These are indistinguishable information for self-attention because the operation of self-attention is undirected." In other words, without an extra positional signal, self-attention cannot tell "dog bites man" from "man bites dog." This position-blindness is a fundamental limitation that needs to be addressed.
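To make this concrete, here is a minimal NumPy sketch of single-head self-attention (an illustration, not the code of any particular library): shuffling the input rows simply shuffles the output rows, so the computation itself carries no notion of order.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
d = 8
X = rng.normal(size=(5, d))                    # 5 token embeddings, no position info
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(5)                      # shuffle the token order
out_original = self_attention(X, Wq, Wk, Wv)
out_shuffled = self_attention(X[perm], Wq, Wk, Wv)

# The shuffled output is exactly the original output, re-ordered:
print(np.allclose(out_shuffled, out_original[perm]))  # True
```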

Positional Encoding Requirements

Before diving into specific encoding strategies, let's establish what makes a good positional encoding. From our analysis, we can identify three critical requirements:

  1. Absolute Position Representation: The encoding must uniquely identify the absolute position of each token in the sequence (e.g., first token is 1, second token is 2).

  2. Relative Position Consistency: When sequences have different lengths, the relative positions/distances between tokens must remain consistent. For example, the relative distance between positions 2 and 4 should be encoded the same way regardless of whether the sequence has 10 tokens or 100 tokens.

  3. Length Generalization: The encoding system must work for sequence lengths that the model has never seen during training. This is crucial for practical applications where input lengths can vary widely.

These requirements create interesting constraints on the mathematical properties our positional encoding must satisfy. Let's explore different approaches and see how they measure up against these requirements.

Approaches to Positional Encoding

Integer Positional Encoding

The most intuitive approach might be to simply use integer values to mark positions:

$$\text{position}_i = i$$

Where i represents the position in the sequence (1st, 2nd, 3rd, etc.). This natural encoding labels the first token as 1, the second as 2, and so on.

However, this approach faces a significant problem. As our analysis points out: "model may encounter sequences longer than training, not conducive to generalization of model." The issue is that position values become unbounded as the sequence length increases. If a model is trained on sequences of maximum length 512 but then encounters a sequence of length 1000, positions 513-1000 would be completely out of the training distribution.

Additionally, as the length of sequences increases, position values grow larger and larger, potentially causing numerical instability or dominating the actual content embeddings.

Bounded Range Encoding

To address the unbounded nature of integer encoding, we can normalize positions to a bounded range [0,1]:

$$\text{position}_i = \frac{i-1}{L-1}$$

Where L is the sequence length, mapping the first position to 0 and the last position to 1.

This approach ensures that regardless of sequence length, position values remain bounded between 0 and 1. For example:

  • For a 3-token sequence: [0, 0.5, 1]
  • For a 4-token sequence: [0, 0.33, 0.67, 1]

This neatly addresses the generalization problem for variable sequence lengths. However, it introduces a new issue: the relative distances between tokens now depend on sequence length. In a 3-token sequence, adjacent tokens have a positional difference of 0.5, while in a 4-token sequence, adjacent tokens have a difference of 0.33.

This inconsistency in relative positions violates our second requirement and can make it harder for the model to learn consistent patterns across sequences of different lengths.
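A tiny sketch (the helper name is just for illustration) makes the problem concrete: the numerical gap between adjacent tokens changes with the sequence length, so "one token apart" is encoded differently in different sequences.

```python
def normalized_positions(L):
    """Map positions 1..L to [0, 1] via (i - 1) / (L - 1)."""
    return [(i - 1) / (L - 1) for i in range(1, L + 1)]

for L in (3, 4, 10):
    pos = normalized_positions(L)
    gap = pos[1] - pos[0]   # distance between adjacent tokens
    print(f"L={L:2d}  positions={[round(p, 2) for p in pos]}  adjacent gap={gap:.3f}")
# L= 3: gap 0.500, L= 4: gap 0.333, L=10: gap 0.111 -- same offset, different encoding
```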

Vector-Based Positional Encoding

To overcome the limitations of scalar position values, we can move to vector-based representations where we use a vector with the same dimension as our token embeddings to represent position.

One approach is to use binary vectors for positional encoding. In this method, we represent positions using binary vectors where different dimensions encode different aspects of position.

For positions a₀ through a₇ encoded with d_model = 3 (one binary dimension per row, one position per column), we might have:

Position:     a₀  a₁  a₂  a₃  a₄  a₅  a₆  a₇
Dimension 1:  0   0   0   0   1   1   1   1
Dimension 2:  0   0   1   1   0   0   1   1
Dimension 3:  0   1   0   1   0   1   0   1

This creates a unique binary signature for each position. However, this approach still has limitations in terms of generalization to unseen sequence lengths and maintaining consistent relative distances.
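For concreteness, here is a short sketch (an illustrative choice, reading the bits most-significant first) that reproduces the table above:

```python
def binary_position_encoding(position, num_bits=3):
    """Return the bits of `position`, most significant bit first."""
    return [(position >> b) & 1 for b in reversed(range(num_bits))]

for p in range(8):
    print(p, binary_position_encoding(p))
# 0 [0, 0, 0], 1 [0, 0, 1], 2 [0, 1, 0], ..., 7 [1, 1, 1]
```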

Sinusoidal Positional Encoding: The Elegant Solution

The transformer architecture introduced an elegant solution: sinusoidal positional encodings. We need position functions that are bounded and continuous, and periodic functions such as sine and cosine are the simplest candidates.

The sinusoidal encoding defines each dimension of the positional encoding vector as follows:

$$PE_{(t,2i)} = \sin\left(\frac{t}{10000^{2i/d_{\text{model}}}}\right)$$

$$PE_{(t,2i+1)} = \cos\left(\frac{t}{10000^{2i/d_{\text{model}}}}\right)$$

Where:

  • t is the position in the sequence
  • i is the dimension index (ranging from 0 to d_model/2-1)
  • d_model is the dimensionality of the model embeddings

This creates a unique positional fingerprint for each position t, where each dimension oscillates at a different frequency. The frequencies form a geometric progression from 1 down to nearly 1/10000 (equivalently, wavelengths from $2\pi$ up to roughly $10000 \cdot 2\pi$), providing a rich spectrum of periodic signals.

We can express the full positional encoding vector as:

$$PE_t = \left[\sin(\omega_0 t), \cos(\omega_0 t), \sin(\omega_1 t), \cos(\omega_1 t), \ldots, \sin(\omega_{d_{\text{model}}/2-1} t), \cos(\omega_{d_{\text{model}}/2-1} t)\right]$$

Where:

$$\omega_i = \frac{1}{10000^{2i/d_{\text{model}}}}$$
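As a minimal sketch (plain NumPy, not tied to any particular framework, and assuming d_model is even), the formulas above translate directly into code:

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) array with PE[t, 2i] = sin(t * w_i) and
    PE[t, 2i+1] = cos(t * w_i), where w_i = 1 / 10000**(2i / d_model)."""
    t = np.arange(max_len)[:, None]                  # positions 0 .. max_len-1
    i = np.arange(d_model // 2)[None, :]             # index of each sin/cos pair
    angles = t / np.power(10000.0, 2 * i / d_model)  # shape (max_len, d_model/2)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cosine
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)                        # (50, 16)
print(pe.min() >= -1, pe.max() <= 1)   # values stay bounded in [-1, 1]
```

In the original transformer, this matrix is simply added to the token embeddings before the first attention layer.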

Let's explore why this formulation is particularly effective by examining its mathematical properties.

Basic Properties of Sinusoidal Positional Encoding

1. Uniqueness of Position Vectors

The vector representation for each position is unique. Intuitively, the lowest-frequency sine-cosine pair has a wavelength of $10000 \cdot 2\pi$ (roughly 63,000 positions), so the overall pattern does not repeat within any practical sequence length. As our analysis notes: "The vector of each token is unique (the frequency of each sin function is small enough)."

2. Bounded Values

All values in the positional encoding vector are bounded between -1 and 1, preventing numerical instability regardless of sequence length. This addresses the unboundedness issue of integer encodings.

3. Frequency Spectrum

The use of different frequencies for different dimensions creates a rich representation. If all frequencies were high (wavelengths short), the sinusoids would complete many cycles within a typical sequence, and different positions could end up with nearly identical vectors. The large base of 10000 pushes the higher dimension pairs toward very low frequencies (long wavelengths), keeping position vectors distinguishable across long spans.

As noted in our analysis: "Relationship between frequency, wavelength, and t → At lower values of t, frequency is high so there could be lots of overlap between position vectors. To avoid this, we try to lengthen the wavelength of function."

4. Alternating Sine and Cosine

The alternation between sine and cosine functions for consecutive dimensions serves multiple purposes:

  • Creates unique vector representations for each position
  • Ensures values remain bounded in a continuous space
  • Facilitates generalization to sequence lengths not seen during training

This alternating pattern also enables the encoding of relative positions through linear transformations, a property we'll explore next.

Advanced Mathematical Properties

The Linear Transformation Property

One of the most powerful properties of sinusoidal encodings is that they can represent both absolute and relative positions efficiently. The key insight from our analysis: "Different position vectors can be obtained through linear transformation → this would help represent both absolute & relative position of tokens."

Mathematically, we can express this as:

$$PE_{t+\Delta t} = [T_{\Delta t}] \cdot PE_t$$

Where $[T_{\Delta t}]$ is a linear transformation matrix that depends only on the offset $\Delta t$, not on the absolute position $t$.

This linear transformation corresponds to a rotation in the 2D subspace spanned by each sine-cosine pair. It follows from a fundamental property of sinusoidal functions:

$$\begin{pmatrix} \sin(t+\Delta t) \\ \cos(t+\Delta t) \end{pmatrix} = \begin{pmatrix} \cos \Delta t & \sin \Delta t \\ -\sin \Delta t & \cos \Delta t \end{pmatrix} \begin{pmatrix} \sin(t) \\ \cos(t) \end{pmatrix}$$

This property means that the model can learn to "shift" positions through linear transformations, enabling it to understand relative positions in the sequence without explicitly computing them.
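A short self-contained sketch (illustrative names only) verifies this numerically: each sine-cosine pair of $PE_t$ is rotated by its own angle $\omega_k \Delta t$, and stacking those 2×2 rotations into a block-diagonal matrix $T_{\Delta t}$ maps $PE_t$ exactly onto $PE_{t+\Delta t}$.

```python
import numpy as np

d_model, t, dt = 16, 7, 5
omega = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)

def pe(pos):
    """Positional encoding laid out as [sin, cos] pairs."""
    vec = np.empty(d_model)
    vec[0::2] = np.sin(omega * pos)
    vec[1::2] = np.cos(omega * pos)
    return vec

# Block-diagonal matrix of 2x2 rotations, one per (sin, cos) pair.
T = np.zeros((d_model, d_model))
for k, w in enumerate(omega):
    c, s = np.cos(w * dt), np.sin(w * dt)
    T[2 * k:2 * k + 2, 2 * k:2 * k + 2] = [[c, s], [-s, c]]

print(np.allclose(T @ pe(t), pe(t + dt)))  # True: PE_{t+dt} = T_dt @ PE_t
```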

Linear Position Encoding: A Mathematical Necessity

A crucial mathematical property of effective positional encodings is that they must satisfy the condition:

$$\phi_m - \phi_n = \phi_{m-n}$$

This condition ensures that the encoding correctly captures relative positions. Let's prove that linear encodings of the form $\phi_m = m\theta$ are the only solution to this condition.

Step 1: Key Cases Analysis

Case 1: If n=0 $$\phi_m - \phi_0 = \phi_{m-0} = \phi_m$$ This implies $\phi_0 = 0$.

Case 2: If n=1 $$\phi_m - \phi_1 = \phi_{m-1}$$ Rearranging: $$\phi_m = \phi_{m-1} + \phi_1$$

This gives us a recurrence relation.

Step 2: Solving the Recurrence

Setting $\phi_1 = \theta$ (a constant), we can solve the recurrence: $$\phi_2 = \phi_1 + \theta = 2\theta$$ $$\phi_3 = \phi_2 + \theta = 3\theta$$ $$\phi_4 = \phi_3 + \theta = 4\theta$$

By induction, we get: $$\phi_m = m\theta$$

Step 3: Verifying the Solution

This solution satisfies the original condition: $$\phi_m - \phi_n = m\theta - n\theta = (m-n)\theta = \phi_{m-n}$$

Step 4: Why Non-Linear Solutions Fail

If $\phi_m$ were non-linear (for example, $\phi_m = m^2\theta$): $$\phi_m - \phi_n = m^2\theta - n^2\theta = (m^2 - n^2)\theta = (m+n)(m-n)\theta$$

But: $$\phi_{m-n} = (m-n)^2\theta$$

Since $(m+n)(m-n) \neq (m-n)^2$ in general, non-linear functions don't satisfy the condition.

Therefore, the only solution to $\phi_m - \phi_n = \phi_{m-n}$ is the linear function $\phi_m = m\theta$. This mathematical necessity explains why transformers use sinusoidal encodings whose phase grows linearly with position (phase $\omega_i t$ for each sine-cosine pair).
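As a sanity check (purely illustrative), we can confirm numerically that a linear phase satisfies the identity for all integer pairs while a quadratic one does not:

```python
theta = 0.3

def linear(m):
    return m * theta

def quadratic(m):
    return m ** 2 * theta

pairs = [(m, n) for m in range(10) for n in range(10)]
print(all(abs(linear(m) - linear(n) - linear(m - n)) < 1e-12 for m, n in pairs))         # True
print(all(abs(quadratic(m) - quadratic(n) - quadratic(m - n)) < 1e-12 for m, n in pairs))  # False
```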

Inner Product and Relative Position Dependency

Perhaps the most remarkable property of sinusoidal positional encodings is how they encode relative distances through inner products. The inner product between two position encodings depends only on the relative distance between them, not their absolute positions.

Let's derive this mathematically, starting with a complex number representation:

$$\langle PE_m, PE_n \rangle = \text{Re}[P_m P_n^*]$$

Where, for a single sine-cosine pair with frequency $\theta$, the pair $(\cos(m\theta), \sin(m\theta))$ is written as the complex number $P_m = e^{im\theta}$ (and similarly $P_n = e^{in\theta}$), with $*$ denoting complex conjugation.

Then: $$P_m P_n^* = e^{im\theta} \cdot e^{-in\theta} = e^{i(m-n)\theta}$$

Taking the real part: $$\text{Re}[P_m P_n^*] = \cos((m-n)\theta)$$

This elegant result shows that the inner product between two position encodings depends only on their relative offset (m-n), not their absolute positions. This is a crucial property for the self-attention mechanism, which relies heavily on inner products.

Extending to the full d-dimensional case:

$$\langle PE_m, PE_n \rangle = \sum_{i=0}^{d_{\text{model}}/2-1} \cos((m-n)\theta_i)$$

This means that the similarity between position vectors naturally captures their relative distance: each sine-cosine pair contributes a term that depends only on the offset (m-n), and summed across all frequencies the similarity tends to decrease as the distance grows. This property elegantly addresses our requirement for consistent relative position encoding.
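A short numerical sketch (illustrative only) confirms this: pairs of positions with the same offset produce identical inner products, regardless of where they sit in the sequence.

```python
import numpy as np

d_model = 64
omega = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)

def pe(pos):
    # Layout as [all sines, all cosines]; the ordering does not affect the dot product.
    return np.concatenate([np.sin(omega * pos), np.cos(omega * pos)])

# Pairs with different absolute positions but the same offset m - n = 10:
print(pe(15) @ pe(5), pe(115) @ pe(105), pe(1015) @ pe(1005))
# All three values coincide, and they equal sum_k cos(10 * omega_k):
print(np.sum(np.cos(10 * omega)))
```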

Asymptotic Behavior and Frequency Choice

Why Inner Products Decay with Distance

A critical feature of the positional encoding is how the inner product between positions decays as their distance increases. This creates a natural attention bias toward local context while still allowing global interactions when needed.
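Before the formal analysis, a quick numerical sketch (illustrative only) shows the effect: the normalized inner product between PE_0 and PE_k broadly shrinks as the offset k grows, though with some oscillation.

```python
import numpy as np

d_model = 128
omega = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)

def pe(pos):
    return np.concatenate([np.sin(omega * pos), np.cos(omega * pos)])

for k in (1, 2, 5, 10, 50, 100, 500, 1000):
    similarity = pe(0) @ pe(k) / (d_model / 2)   # 1.0 would mean identical phase in every pair
    print(f"offset {k:5d}: normalized inner product = {similarity:+.3f}")
# The values drift downward with distance (with some oscillation),
# giving a soft bias toward nearby positions.
```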

For large values of |m-n|, we need to analyze integrals of the form:

$$I = \int_0^1 e^{ix\phi(t)} dt$$

Where x = m-n (the relative distance) and $\phi(t)$ is the phase function.

For large |m-n|, these integrals decay because the integrand oscillates rapidly and contributions cancel. This is a consequence of the Riemann-Lebesgue lemma, which implies that:

$$\lim_{x \to \infty} \int_a^b e^{ix\phi(t)} dt = 0$$

Provided $\phi(t)$ is smooth and not constant on any subinterval.

Mathematical Analysis of Decay Rate

For the specific case of transformer positional encodings with $\phi(t) = e^{-\lambda t}$ where $\lambda = \ln(10000)$:

$$I = \int_0^1 e^{ix e^{-\lambda t}} dt$$

The key insight is: "For large x, dominant contribution comes from regions where phase $xe^{-\lambda t}$ varies slowly. However, since $e^{-\lambda t}$ decreases slowly, there is no stationary point and the integral decays as 1/x."

We can analyze this more rigorously through substitution and integration by parts:

Step 1: Substitution

Let $s = e^{-\lambda t}$, then $t = -\frac{\ln s}{\lambda}$ and $dt = -\frac{1}{\lambda s}ds$.

This transforms the integral to:

$$I = \frac{1}{\lambda} \int_{e^{-\lambda}}^1 \frac{e^{ixs}}{s} ds$$

Step 2: Integration by Parts

Using $u = \frac{1}{s}$ and $dv = e^{ixs} ds$, we apply integration by parts:

$$I = \frac{1}{\lambda} \left[ \frac{e^{ixs}}{ixs} \right]_{e^{-\lambda}}^1 + \frac{1}{ix\lambda} \int_{e^{-\lambda}}^1 \frac{e^{ixs}}{s^2} ds$$

For large x, the remaining integral vanishes faster than 1/x (by the Riemann-Lebesgue lemma applied to $1/s^2$), so the boundary term dominates:

$$I \approx \frac{1}{ix\lambda} \left( e^{ix} - e^{\lambda}\, e^{ixe^{-\lambda}} \right) = O\left(\frac{1}{x}\right)$$

This confirms that the integral decays as O(1/x) for large values of x, which means the correlation between positions decays as the relative distance increases.

The Role of the Frequency Parameter

The specific choice of frequency parameter $\theta_k = 10000^{-2k/d_{\text{model}}}$ plays a critical role in how correlations decay with distance.

For the inner product between positions m and n:

$$\langle PE_m, PE_n \rangle = \sum_{k=0}^{d_{\text{model}}/2-1} \cos((m-n)\theta_k)$$

For large $d_{\text{model}}$ (high-dimensional embeddings), this sum can be approximated as an integral:

$$\langle PE_m, PE_n \rangle \approx \frac{d_{\text{model}}}{2} \cdot \text{Re}\left[ \int_0^1 e^{i(m-n)\theta_t} dt \right]$$

With $\theta_t = 10000^{-t}$ for $t \in [0,1]$, which corresponds to $\theta_t = e^{-\lambda t}$ where $\lambda = \ln(10000)$.

Our analysis shows that this integral decays as O(1/|m-n|) for large |m-n|, creating a smooth falloff in attention with distance.

Interestingly, alternate frequency choices would produce different decay rates:

  • For $\theta_t = t^{-1}$, $I \propto O(1/|x|)$
  • For $\theta_t = t^{-2}$, $I \propto O(1/|x|^{1/2})$

The exponential frequency spacing used in transformer positional encodings creates a balance between local and global attention that works well in practice.

Absence of Stationary Points

The specific mathematical behavior of the positional encoding comes from the absence of stationary points in the phase function. For the phase function $\phi(t) = e^{-\lambda t}$:

  • Phase Analysis: $\phi'(t) = -\lambda e^{-\lambda t} < 0$ for all $t > 0$
  • No Stationary Point: $\phi'(t) \neq 0$ for all $t \in [0,1]$
  • Monotonic Decay: $\phi(t)$ decreases exponentially and $\phi'(t)$ never changes sign

This absence of stationary points explains why the integral decays as O(1/|x|). Had the phase contained a stationary point, the stationary phase approximation would instead contribute a term decaying only as O(1/|x|^(1/2)), i.e., more slowly.

This creates a gradual decay in attention with distance, rather than a sharp cutoff or a perfectly uniform attention pattern.

Breaking Symmetry: A Taylor Expansion Perspective

Another insightful way to understand positional encodings is through Taylor expansion. For a pure attention model without position information, the function is fully symmetric:

$$f(\ldots, x_m, \ldots, x_n, \ldots) = f(\ldots, x_n, \ldots, x_m, \ldots)$$

This means transformers cannot recognize position - the output would be the same regardless of token order. By adding positional encodings, we break this symmetry.

Using Taylor expansion:

$$\tilde{f}(\ldots, x_m + p_m, \ldots, x_n + p_n, \ldots) = f(\ldots, x_m, \ldots, x_n, \ldots) + p_m \frac{\partial f}{\partial x_m} + p_n \frac{\partial f}{\partial x_n} + \frac{1}{2}p_m^2 \frac{\partial^2 f}{\partial x_m^2} + \frac{1}{2}p_n^2 \frac{\partial^2 f}{\partial x_n^2} + p_m p_n \frac{\partial^2 f}{\partial x_m \partial x_n} + \ldots$$

Where the single-position terms such as $p_m \frac{\partial f}{\partial x_m}$ carry absolute position information, and the cross term $p_m p_n \frac{\partial^2 f}{\partial x_m \partial x_n}$ couples the two positions. The key insight: "As long as encoding vector of each position is different, this breaks the symmetry."

This Taylor expansion shows how positional information gets integrated with content information, allowing the model to distinguish between different token arrangements.
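Finally, a small sketch (all names illustrative) ties this back to the permutation experiment from earlier: once distinct position vectors are added to the token embeddings, swapping two tokens genuinely changes the attention output, i.e., the symmetry is broken.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 5
X = rng.normal(size=(n, d))                     # token embeddings
P = rng.normal(size=(n, d))                     # distinct position vectors (any scheme works)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def attn(Z):
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    S = Q @ K.T / np.sqrt(d)
    W = np.exp(S - S.max(axis=-1, keepdims=True))
    return (W / W.sum(axis=-1, keepdims=True)) @ V

swap = [1, 0, 2, 3, 4]                          # exchange the first two tokens
# Without positions: swapped input gives the same rows, just re-ordered.
print(np.allclose(attn(X[swap]), attn(X)[swap]))            # True
# With positions added: the outputs genuinely differ -- order now matters.
print(np.allclose(attn(X[swap] + P), attn(X + P)[swap]))    # False
```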

Practical Implications for Transformer Models

The mathematical properties we've explored have significant practical implications for transformer models:

  1. Generalization to unseen sequence lengths: Since sinusoidal encodings are defined for any position value, they can naturally handle sequences longer than those seen during training. This directly follows from the continuous nature of sine and cosine functions.

  2. Consistent relative positioning: The inner product properties ensure that relative positions are encoded consistently regardless of sequence length. As we proved, the inner product $\langle PE_m, PE_n \rangle$ depends only on (m-n), not on absolute positions.

  3. Natural attention decay: The asymptotic decay properties (O(1/|m-n|)) align with the intuition that distant tokens typically have weaker relationships. This creates an inductive bias toward local context while still allowing global interactions.

  4. Parameter efficiency: Unlike learned positional embeddings, sinusoidal encodings don't require additional trainable parameters; the encoding is fully deterministic.

  5. Computational efficiency: The encodings can be computed on-the-fly rather than stored in a lookup table, saving memory for very long sequences.

Summary of Mathematical Properties

Bringing these mathematical analyses together:

  1. The condition φₘ - φₙ = φₘ₋ₙ forces positional encoding angles to follow a linear pattern φₘ = mθ, which is exactly what sinusoidal encoding provides.

  2. The specific choice of frequency parameter θₖ = 10000⁻²ᵏ/ᵈ creates an inner product that decays as O(1/|m-n|) for large distances, due to the absence of stationary points in the phase function.

  3. This decay property helps the model naturally focus more on local context while still maintaining the ability to detect long-range dependencies when needed.

As summarized in our analysis:

  • "Sinusoidal encodings use frequencies that decay exponentially across dimensions"
  • "Inner Product Decay: Results from destructive interference in high-frequency oscillatory integrals"
  • "Design choice: θₖ = 10000⁻²ᵏ/ᵈ ensures smooth frequency coverage & practical decay properties"

Conclusion

The sinusoidal positional encoding used in transformer models represents a beautiful intersection of mathematical elegance and practical utility. By encoding positions using sinusoidal functions at different frequencies, transformers gain the ability to understand sequence order while maintaining their parallelization advantages.

The key insights we've explored include:

  1. How sinusoidal functions provide a bounded, continuous representation of position
  2. Why the inner product between position encodings naturally captures relative distances
  3. How the specific frequency progression (10000⁻²ᵏ/ᵈ) creates a balanced representation
  4. Why linear position encoding (φₘ = mθ) is the only solution that correctly encodes relative positions
  5. How the asymptotic behavior creates a natural decay for distant token relationships

These mathematical properties combine to create a positional encoding scheme that satisfies all our requirements: representing absolute positions, maintaining consistent relative distances, and generalizing to unseen sequence lengths.

Understanding these mathematical foundations not only gives us deeper insight into transformer models but also opens doors to potential improvements and adaptations for specific tasks or domains. The elegant mathematics behind sinusoidal positional encodings reveals how transformers achieve their remarkable ability to understand sequence order while maintaining their computational advantages.

References

  • Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (Vol. 30).
  • Kazemnejad, A. (2019). Transformer Architecture: The Positional Encoding. https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
  • Wang, P., & Peng, X. (2021). Understanding and Improving Positional Encoding for Transformers. ArXiv:2012.15832.