Layer Normalization in GPT models
In GPT models, Layer Normalization is applied before the self-attention and feed-forward blocks. This placement is key to keeping the gradient scales under control, which supports stable training. The normalization itself ensures that the feature values within each token have a mean of 0 and a standard deviation of 1.
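To make this concrete, here is a minimal sketch of such a normalization layer in PyTorch. The class and parameter names are illustrative rather than taken from any particular GPT implementation: each token's feature vector is normalized to zero mean and unit variance and then rescaled with a learnable gain and bias.

```python
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    """Minimal Layer Normalization sketch, as used per token in GPT-style models."""

    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))   # learnable scale (gain)
        self.bias = nn.Parameter(torch.zeros(d_model))    # learnable shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, seq_len, d_model); normalize over the last dimension,
        # i.e. over the features of each token independently.
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        x_hat = (x - mean) / torch.sqrt(var + self.eps)
        return self.weight * x_hat + self.bias
```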
Why use Layer Normalization?
Normalization helps manage the scale of the gradients, which can substantially affect training. Without it, gradients can grow too large, leading to unstable training. By normalizing each token's activations, Layer Normalization keeps gradient magnitudes in a manageable range, so the model can learn more effectively and efficiently.
Furthermore, Layer Normalization changes the magnitude of a token's feature values, or "spikes", but preserves their relative ordering: the largest activation before normalization is still the largest afterwards. This property has been shown to shorten training time and improve model performance, as the short example below illustrates.
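A small illustration of this ordering-preserving property, using PyTorch's built-in nn.LayerNorm at its default initialization (the tensor values below are made up for demonstration):

```python
import torch
import torch.nn as nn

ln = nn.LayerNorm(8)

# A single token with an exaggerated "spike" in one feature.
token = torch.tensor([[0.1, 0.3, 5.0, -0.2, 0.7, 0.0, 1.2, -1.0]])
normalized = ln(token)

# The magnitudes change, but the ranking of the features does not:
print(torch.argsort(token, dim=-1))       # ordering before normalization
print(torch.argsort(normalized, dim=-1))  # same ordering after normalization
```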
Layer Normalization in pre-LN and post-LN transformers
The placement of Layer Normalization within the transformer architecture gives rise to two variants. The original post-LN transformer applies Layer Normalization between the residual blocks, after each residual connection, which leads to large expected gradients near the output layer at initialization; a learning rate warm-up stage is often used to handle these large gradients. The pre-LN transformer instead applies Layer Normalization inside each residual block, before the sub-layer, which keeps the gradients better behaved and allows training without a warm-up stage; this is the arrangement used in GPT models.
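The sketch below contrasts the two orderings in PyTorch. It is a simplified illustration under assumed names, not a reference implementation: causal masking, dropout, and the final LayerNorm that pre-LN models place before the output head are omitted.

```python
import torch
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Original Transformer ordering: sub-layer, residual add, then LayerNorm."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.ln1(x + self.attn(x, x, x)[0])   # LN after the residual add
        x = self.ln2(x + self.mlp(x))
        return x


class PreLNBlock(nn.Module):
    """GPT-2-style ordering: LayerNorm, then sub-layer, then residual add."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.ln1(x)                            # LN before the sub-layer
        x = x + self.attn(h, h, h)[0]
        x = x + self.mlp(self.ln2(x))
        return x


# Example: a batch of 2 sequences of length 16 with 64-dimensional tokens.
x = torch.randn(2, 16, 64)
print(PostLNBlock(64, 4)(x).shape)  # torch.Size([2, 16, 64])
print(PreLNBlock(64, 4)(x).shape)   # torch.Size([2, 16, 64])
```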