Position Encodings
Since attention has no inherent notion of token order, transformers add position information directly to the token embeddings, originally via sinusoidal functions of different frequencies. This gives the model access to token order without reintroducing the sequential bottleneck of recurrent architectures.
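As a minimal sketch of the sinusoidal scheme from the original Transformer paper, the matrix below assigns each position a vector whose even dimensions are sines and odd dimensions are cosines at geometrically spaced frequencies (the function and parameter names here are illustrative, and `d_model` is assumed even):

```python
import numpy as np

def sinusoidal_position_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of fixed sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]   # (1, d_model/2), values 2i
    # Each dimension pair uses a different frequency: angle = pos / 10000^(2i / d_model)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe
```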
Position encodings represent each token's position in the sequence as a vector that is added to the token's embedding. This lets the model distinguish the same word appearing at different positions and capture notions like word order and proximity.
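Continuing the sketch above with stand-in random embeddings (purely illustrative), adding the position encodings makes two occurrences of the same token at different positions map to different input vectors:

```python
d_model, seq_len = 8, 10
token_embeddings = np.random.randn(seq_len, d_model)   # stand-in token embeddings
token_embeddings[7] = token_embeddings[2]              # same word at positions 2 and 7
inputs = token_embeddings + sinusoidal_position_encoding(seq_len, d_model)
print(np.allclose(inputs[2], inputs[7]))               # False: positions are now distinguishable
```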
The original implementation used fixed sinusoidal functions, but many modern variants instead learn the position embeddings during training, letting the model adapt its representation of positions to its specific tasks.
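A minimal PyTorch sketch of the learned variant might look like the following; the class name, `max_len`, and the lookup-table approach are assumptions for illustration rather than a specific library's implementation:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """Trainable position vectors, one per position up to max_len."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_embed = nn.Embedding(max_len, d_model)  # learned, updated by backprop

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        # token_embeddings: (batch, seq_len, d_model)
        seq_len = token_embeddings.size(1)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        return token_embeddings + self.pos_embed(positions)  # broadcast over the batch
```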