LongRoPE & Theta Extrapolation Scaling of RoPE for extreme context lengths, explained in scientific detail. To increase the context length of modern LLMs, we evaluate the methods and performance of LongRoPE and Theta Extrapolation/Scaling for extreme context-length extension: from an 8K to a 4M context length for a Llama 3-7B LLM.
RoPE (rotary position embedding) works well within the training context length, but performance drops when the sequence length at inference exceeds the length seen during training. This is primarily because the positional encodings become out-of-distribution (OOD), destabilizing the attention mechanism.
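To make the OOD issue concrete, here is a minimal NumPy sketch of standard RoPE (function names and the test values are illustrative, not from the video): each pair of dimensions is rotated by an angle proportional to the token position, with per-pair frequencies derived from the rotary base. Positions beyond the training length produce rotation angles the model has never seen, which is exactly where extrapolation breaks.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Standard RoPE frequencies: theta_i = base^(-2i/d) for each dim pair.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def apply_rope(x, positions, base=10000.0):
    # x: (seq_len, head_dim) query/key vectors; rotate each (even, odd) pair
    # by angle position * theta_i.
    freqs = rope_frequencies(x.shape[-1], base)   # (head_dim/2,)
    angles = np.outer(positions, freqs)           # (seq_len, head_dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

A useful property to check: the dot product of a rotated query and key depends only on their relative distance, which is what makes RoPE a relative position encoding.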
To overcome this issue, theta scaling is introduced. The idea is to adjust the "rotary base," a key parameter in RoPE. By increasing this base value, the model can extend its effective context length and handle longer sequences more accurately: the adjustment keeps the positional encodings for longer inputs within the distribution the model has learned, improving its ability to extrapolate and maintain performance.
Interestingly, decreasing the rotary base can also enhance extrapolation. A smaller base packs the rotation frequencies more tightly, so the model sees full positional periods within the training context and can learn the positional patterns completely, which helps it generalize to sequences longer than its training data. Both increasing and decreasing the rotary base therefore offer ways to extend the context length that RoPE-based models can handle effectively, providing a versatile solution for longer texts.
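The increase direction can be sketched numerically. A common rule of thumb (the NTK-aware scaling heuristic; the exact exponent used in any given implementation is an assumption here) raises the base so that even the lowest-frequency dimension stays in-distribution over the extended context:

```python
def scaled_base(base, scale, head_dim):
    # NTK-aware heuristic (an assumption, not the video's exact formula):
    # to extend context by factor `scale`, raise the rotary base by
    # scale^(d / (d - 2)) so low-frequency dims cover the new length
    # without leaving the angle range seen in training.
    return base * scale ** (head_dim / (head_dim - 2))

# Illustrative numbers: Llama-style RoPE (base 10000, head_dim 128)
# extended from an 8K to a 4M context window.
scale = 4_000_000 / 8_192          # ~488x context extension
new_base = scaled_base(10_000.0, scale, 128)
```

With these illustrative parameters the base grows by roughly the context-extension factor itself, i.e. into the millions, which matches the intuition that longer contexts need slower-rotating low-frequency dimensions.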
#airesearch
#aieducation