LongRoPE & Theta Scaling to 1 Million Tokens (2/2)

1,164 views

code_your_own_AI

1 month ago

LongRoPE & Theta Extrapolation/Scaling of RoPE for extreme context lengths, explained in scientific detail. To increase the context lengths of modern LLMs, we evaluate the performance and methods of LongRoPE and Theta Extrapolation/Scaling for extreme context-length extensions, from 8K to 4M tokens for a Llama 3 8B LLM.
RoPE encoding works well within the training context length but faces challenges when the sequence length at inference exceeds the training length, leading to a performance drop. This is primarily because the positional encodings become out-of-distribution (OOD), destabilizing the attention mechanism.
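To make the mechanism concrete, here is a minimal NumPy sketch of RoPE (an illustration, not the video's code): each pair of dimensions in a query or key vector is rotated by an angle proportional to the token position, with per-pair frequencies set by the rotary base theta = 10,000 from the original RoPE paper.

import numpy as np

def apply_rope(x, base=10_000.0):
    """Apply rotary position embedding to x: (seq_len, head_dim) query or key vectors."""
    seq_len, head_dim = x.shape
    # One frequency per dimension pair: inv_freq[i] = base^(-2i / head_dim)
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                   # pair up dimensions
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = apply_rope(np.random.randn(16, 64))              # position-encoded queries

At inference positions beyond the training length, the slow-rotating pairs reach angles the model never saw during training, which is exactly the OOD effect described above.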
To overcome this issue, theta scaling adjusts the "rotary base," a key parameter in RoPE. Increasing this base value extends the model's effective context length, allowing it to handle longer sequences more accurately. The adjustment aligns the positional encodings with the longer input texts, improving the model's ability to extrapolate while maintaining performance.
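As a hedged sketch of the increase direction (assuming the NTK-style rescaling rule base' = base * s^(d/(d-2)); the video's exact formula may differ), the key property is that after rescaling, the slowest dimension's rotation angle at position s*L matches the angle originally produced at position L, so extended positions stay in-distribution:

import numpy as np

def slowest_angle(pos, base, head_dim):
    """Rotation angle of the slowest RoPE dimension pair at token position pos."""
    return pos * base ** (-(head_dim - 2) / head_dim)

head_dim, train_len, scale = 64, 8_192, 4                # illustrative: 8K -> 32K
base = 10_000.0
new_base = base * scale ** (head_dim / (head_dim - 2))   # NTK-style base increase

# The rescaled base reproduces, at the far end of the extended context,
# the angle the model saw at the far end of the training context:
print(slowest_angle(scale * train_len, new_base, head_dim))
print(slowest_angle(train_len, base, head_dim))          # same value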
Interestingly, decreasing the rotary base can also enhance the model's extrapolation capabilities. With a smaller base, the positional encodings rotate faster, so every frequency completes full periods within the training context and the model can fully learn the positional patterns. This helps it generalize to sequences longer than its training data. Both increasing and decreasing the rotary base therefore offer ways to extend the context length that RoPE-based models can handle effectively, providing a versatile lever for improving their performance on longer texts.
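A complementary sketch for the decrease direction (the small base of 500 is an illustrative assumption, not a figure from the video): a smaller base shortens the longest rotation period, so even the slowest dimension completes full cycles inside the training window and every positional frequency is fully observed during training.

import numpy as np

def longest_period(base, head_dim):
    """Period in tokens of the slowest-rotating RoPE dimension pair."""
    slowest_inv_freq = base ** (-(head_dim - 2) / head_dim)
    return 2 * np.pi / slowest_inv_freq

train_len, head_dim = 8_192, 64
for base in (10_000.0, 500.0):                       # default vs. a small base
    period = longest_period(base, head_dim)
    seen = "fully seen" if period <= train_len else "only partially seen"
    print(f"base={base:>8.0f}: longest period ~ {period:,.0f} tokens ({seen} in training)")

With base 10,000 the slowest pair needs roughly 47K tokens for a full cycle, far beyond an 8K training window; with base 500 it cycles in about 2.6K tokens, so the pattern is fully covered.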
#airesearch
#aieducation

Comments: 4
@manslaughterinc.9135 · 1 month ago
On the topic of attention and context, I would love to see a video on needle-in-a-haystack and multi-needle-in-a-haystack performance of these different context-extension approaches.
@MattJonesYT · 1 month ago
Cutting-edge stuff, this is great!!
@simonstrandgaard5503 · 1 month ago
Excellent topic. Fine-tuning with a longer context length.
@joelvalim · 1 month ago
It seems they are doing the very opposite of quantization (I am being very visual here, ok?). Quantization is kind of squashing while preserving proportions and shape. LongRoPE seems to act as a kind of holographic projection... and a little bit of a hammer to adjust the edges... The final fine-tuning would be a way to fill the voids created by the projection, which is imperfect by nature, because it can only project a shadow, not a perfect picture. Final fine-tuning would fill these voids, connecting the points in that weak blueprint created by the rescaled new hyper-dimensional space.