Residual Connections and Layer Normalization
These two components address the optimization challenges that arise in deep networks. Residual connections create shortcuts for gradient flow by adding each sublayer's input to its output, while layer normalization keeps activations at a consistent scale; together they make much deeper models trainable.
Residual connections (also called skip connections) mitigate the vanishing gradient problem by giving gradients a direct path through the network. Each sublayer's output is added to its input, so even when a sublayer contributes little gradient of its own, the identity path carries the signal back to earlier layers during backpropagation.
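A minimal sketch of the idea, using NumPy and a hypothetical linear map as a stand-in for a real sublayer (attention or feed-forward):

```python
import numpy as np

rng = np.random.default_rng(0)

def sublayer(x, W):
    """Stand-in for a transformer sublayer (e.g., attention or feed-forward);
    here just a linear map so the example stays self-contained."""
    return x @ W

def residual_block(x, W):
    # Residual (skip) connection: add the sublayer's output to its input.
    # The identity term x gives gradients a direct path back to earlier
    # layers, even if the sublayer's own gradient is small.
    return x + sublayer(x, W)

x = rng.standard_normal((4, 8))           # 4 tokens, 8 features each
W = rng.standard_normal((8, 8)) * 0.02    # small illustrative weights
print(residual_block(x, W).shape)         # (4, 8): output shape matches the input
```

Because the output has the same shape as the input, these blocks can be stacked arbitrarily deep without the skip path ever being blocked.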
Layer normalization standardizes the activations entering each layer, reducing internal covariate shift and stabilizing the learning process. It operates across the feature dimension for each token: the token's feature vector is normalized to zero mean and unit variance, then rescaled and shifted by learned parameters, helping maintain a consistent activation scale throughout the network regardless of depth.
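A minimal sketch of the per-token computation, assuming a feature dimension d_model and learned scale and shift parameters (gamma, beta), here with NumPy for illustration:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each token's feature vector to zero mean and unit variance,
    then apply the learned elementwise scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)     # per-token mean over features
    var = x.var(axis=-1, keepdims=True)       # per-token variance over features
    x_hat = (x - mean) / np.sqrt(var + eps)   # standardized activations
    return gamma * x_hat + beta               # learned rescaling preserves capacity

d_model = 8
x = np.random.default_rng(1).standard_normal((4, d_model))  # 4 tokens
gamma = np.ones(d_model)   # initialized to identity scaling
beta = np.zeros(d_model)   # initialized to zero shift

y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))     # ~0 for every token
print(y.std(axis=-1))      # ~1 for every token
```

In practice the two pieces are composed around each sublayer, either as layer_norm(x + sublayer(x)) (post-norm, as in the original Transformer) or as x + sublayer(layer_norm(x)) (pre-norm).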