Layers like this will be compressed
Posted: Mon Dec 23, 2024 5:05 am
Sequence models such as RNNs store historical context in a fixed-size hidden state. Although they are very efficient, their performance is limited by the expressiveness of that state. The attention mechanism, by contrast, keeps a KV cache that grows over time: this state does not compress any historical context, and it becomes increasingly expensive as the context gets longer. The team's thinking was: in that case, why not compress the context into the weights of a model, just as a language model learns from internet data? Such a "hidden-state model" stays fixed in size over time while greatly increasing expressive power.
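To make the contrast concrete, here is a minimal sketch (not the paper's code) of the three kinds of state described above: a fixed-size RNN vector, a KV cache that grows with the sequence, and a hidden state that is itself the weights of a small model. The dimensions, names, and update rule are illustrative assumptions.

```python
# Illustrative sketch only: three ways a sequence layer can carry context.
import numpy as np

d = 16                                        # feature dimension (assumed)

# 1) Classic RNN: context squeezed into a fixed-size vector (cheap, less expressive).
rnn_state = np.zeros(d)

# 2) Self-attention: context kept uncompressed in a KV cache that grows with length.
kv_cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}

# 3) "Hidden state as a model": context compressed into the weights of a small
#    model, here a single linear map W: fixed size, but trained on the fly.
W = np.zeros((d, d))

def step(x_t, lr=0.1):
    """Process one token x_t under each scheme (schematic only)."""
    global rnn_state, W
    # RNN: one fixed-size recurrence.
    rnn_state = np.tanh(rnn_state + x_t)
    # Attention: append to the cache; later tokens attend over all of it.
    kv_cache["K"] = np.vstack([kv_cache["K"], x_t])
    kv_cache["V"] = np.vstack([kv_cache["V"], x_t])
    # Hidden-state-as-model: nudge W so it reconstructs x_t a little better
    # (a stand-in for the self-supervised update described next).
    W -= lr * np.outer(W @ x_t - x_t, x_t)

for x_t in np.random.default_rng(0).normal(size=(8, d)):
    step(x_t)
print(rnn_state.shape, kv_cache["K"].shape, W.shape)   # (16,) (8, 16) (16, 16)
```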
The researchers used self-supervised learning to update the weights of the hidden state, performing a step of gradient descent for each token. By the time a sequence has been processed, the state has effectively been "trained" on the tokens in its context window. It is worth noting that the hidden state lives inside just one layer of the end-to-end architecture; other components, such as the QKV projection matrices, are learned during pre-training through the standard cross-entropy objective. The end-to-end architecture is therefore meta-learning the best way to compress the context so that it can better predict the next token, that is, "learning how to learn at test time".
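As a rough illustration of how such a layer might work, here is a self-contained sketch under stated assumptions: the hidden state is a single linear map W, the inner self-supervised loss is a reconstruction error between projected views of the token, and theta_K / theta_V / theta_Q are placeholders for the projections the outer loop would learn with cross-entropy (they are random and frozen here). This is not the authors' implementation, just the shape of the idea.

```python
# Sketch of a test-time-training style layer (assumptions as described above).
import numpy as np

rng = np.random.default_rng(0)
d, T = 16, 8                                   # feature dim, sequence length (assumed)
theta_K = rng.normal(size=(d, d)) / np.sqrt(d) # outer-loop parameters: in a real
theta_V = rng.normal(size=(d, d)) / np.sqrt(d) # model these are trained end to end
theta_Q = rng.normal(size=(d, d)) / np.sqrt(d) # by the next-token cross-entropy loss

def ttt_layer_forward(x, inner_lr=0.1):
    """x: (T, d) token features -> (T, d) layer outputs.

    Inner loop: one gradient step on the self-supervised loss per token,
    so W ends up 'trained' on the tokens in the context window."""
    W = np.zeros((d, d))                       # hidden state = weights of a tiny model
    outputs = []
    for x_t in x:
        k, v, q = theta_K @ x_t, theta_V @ x_t, theta_Q @ x_t
        grad = np.outer(W @ k - v, k)          # d/dW of 0.5 * ||W k - v||^2
        W = W - inner_lr * grad                # one step of gradient descent
        outputs.append(W @ q)                  # read out with the freshly updated state
    return np.stack(outputs)

out = ttt_layer_forward(rng.normal(size=(T, d)))
print(out.shape)                               # (8, 16)
```

In a full model the outer loop would backpropagate through these inner updates, so the projections are trained to make one gradient step per token compress the context as usefully as possible; that is where the meta-learning comes from.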
The results show that, compared with Mamba, TTT-Linear has better perplexity with fewer FLOPs (left) and makes better use of long context (right). The figure below shows the forward time (latency) per token as the context length grows, at a fixed batch size. All models have about 1.3B parameters (1.4B for Mamba). The Transformer's forward time per token increases linearly with context length, while the forward time of the other two methods stays essentially unchanged. At 8k context, TTT-Linear is faster than the Transformer and comparable to Mamba. An awkward reality highlighted by the 2020 scaling-law paper is that LSTM (one of the most popular RNNs) cannot scale the way Transformers do or use long context effectively.
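A back-of-the-envelope cost model (assumed unit costs, not measurements) shows why the latency curves look this way: attention must touch every cached token at each step, while a fixed-size state costs the same at every position.

```python
# Rough per-token cost comparison; numbers are illustrative assumptions.
def per_token_cost(context_len, d=16):
    attention = context_len * d    # attend over context_len cached key/value pairs
    fixed_state = d * d            # update/read a fixed-size state, independent of length
    return attention, fixed_state

for t in (1_000, 8_000, 32_000):
    attn, fixed = per_token_cost(t)
    print(f"context={t:>6}  attention~{attn:>7}  fixed-state~{fixed}")
# Per-token attention cost grows linearly with context length, while the
# fixed-state cost stays flat, consistent with the latency curves above.
```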