Feed-Forward Networks
After each attention sub-layer, a position-wise feed-forward network with a non-linear activation provides additional representational capacity, transforming the attention-weighted information through learned projections.
Each feed-forward network consists of two linear transformations with a ReLU activation in between, applied independently to each position. While attention layers capture interactions between positions, these feed-forward layers process each position's information independently.
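As a concrete illustration, here is a minimal PyTorch sketch of such a position-wise feed-forward block. The names `d_model` and `d_ff` are illustrative, and the default sizes follow the commonly used expansion where the inner dimension is several times the model dimension; they are not tied to any particular model in this text.

```python
import torch
import torch.nn as nn


class PositionwiseFeedForward(nn.Module):
    """Two linear layers with a ReLU in between, applied to each position independently."""

    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.linear1 = nn.Linear(d_model, d_ff)   # expand: d_model -> d_ff
        self.linear2 = nn.Linear(d_ff, d_model)   # project back: d_ff -> d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, sequence_length, d_model); the same weights are
        # applied at every position, so positions never interact here.
        return self.linear2(torch.relu(self.linear1(x)))


# Quick shape check: the output has the same shape as the input.
x = torch.randn(2, 10, 512)           # batch of 2 sequences, 10 tokens each
out = PositionwiseFeedForward()(x)
print(out.shape)                      # torch.Size([2, 10, 512])
```

Because the two linear layers are shared across positions, the computation can be batched over the whole sequence in a single matrix multiplication per layer.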
Despite their simplicity, these networks substantially increase the model's capacity to represent complex functions, and because the inner dimension is usually several times larger than the model dimension, they account for a large share of a transformer's parameters. Conceptually, they act as position-wise fully connected layers that transform each token's representation on its own.
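To make the parameter count concrete, here is a quick back-of-the-envelope calculation using the same assumed dimensions as the sketch above (d_model = 512, d_ff = 2048); the exact figure depends entirely on the chosen sizes.

```python
d_model, d_ff = 512, 2048  # assumed dimensions, matching the sketch above

# Two weight matrices plus their bias vectors
ffn_params = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
print(f"{ffn_params:,}")   # 2,099,712 -- roughly 2.1M parameters per feed-forward block
```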