currentscurrents OP t1_j608oz5 wrote
TL;DR:
- In-context learning (ICL) is the ability of language models to "learn from example" and perform new tasks based purely on prompting. These researchers are studying the mechanism behind ICL.
- They show that the attention layers allow transformers to implement a gradient descent optimization process at inference time. This mechanism produces results very similar to explicit optimization through fine-tuning, but was itself learned by optimization through gradient descent.
- Based on this finding, they apply momentum, a technique known to improve optimizers, to transformer attention layers (rough sketch below). This produces a small-but-consistent improvement in performance on all tested tasks. They suggest that there are more improvements to be made by explicitly biasing transformers towards meta-optimization.
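To make the momentum bullet concrete, here's a minimal sketch of one way momentum could be grafted onto attention: add an exponential moving average of the value vectors to the attention output, the same way SGD momentum accumulates past gradients. This is my own simplified reading with a made-up `beta` decay rate, not necessarily the exact formulation in the paper.

```python
import torch
import torch.nn.functional as F

def momentum_attention(q, k, v, beta=0.9):
    """Scaled dot-product attention plus a momentum-style EMA over values.

    q, k, v: (n, d) tensors. The EMA term plays the role momentum plays in
    SGD: it folds in an exponentially decayed sum of earlier "updates"
    (here, the value vectors). A real decoder block would mask both terms
    causally; this sketch skips masking for brevity.
    """
    d = q.shape[-1]
    attn = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1) @ v

    # ema[t] = sum_{i <= t} beta^(t - i) * v[i]
    ema = torch.zeros_like(v)
    running = torch.zeros(d, dtype=v.dtype)
    for t in range(v.shape[0]):
        running = beta * running + v[t]
        ema[t] = running
    return attn + ema
```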
This reminds me of some meta-learning architectures that intentionally build gradient descent into the inference step (https://arxiv.org/abs/1909.04630) - the difference here is that LLMs somehow learned this technique on their own during training. The implication is pretty impressive: at enough scale, meta-learning just emerges by itself because it's a good solution to the problem.
Other researchers are looking into ICL as well, here's another recent paper on the topic: https://arxiv.org/abs/2211.15661
lucidraisin t1_j61h7lf wrote
and one more paper along the same lines! https://arxiv.org/abs/2212.07677
currentscurrents OP t1_j61ndkl wrote
Thanks for the link!
I think it's interesting that researchers spent so much time in the 90s trying to make meta-learning work, and now it appears emergently just from throwing scale at the problem.
DigThatData t1_j61zv3l wrote
Compute Is All You Need
endless_sea_of_stars t1_j627a9m wrote
Just rent out an AWS region for a month and you'll be good to go. Hold a couple bake sales to defray the cost.
robdogcronin t1_j61zvce wrote
That's the bitter lesson
currentscurrents OP t1_j623hb4 wrote
Yeah, but I want AI now. Not in 40 years when computers are 1000x better.
Also, I'm not sure computers will be 1000x better in 40 years; Moore's law isn't what it used to be.
EarthquakeBass t1_j64jhk3 wrote
https://en.m.wikipedia.org/wiki/Huang%27s_law
A bit of marketing flair for sure, but I think at the intersection of hardware improvements, ensembling, clever optimizations, etc., we will keep improving models at a pretty darn fast pace. GPT-3 alone has dramatically improved the productivity of engineers, I'm sure of it.
throwaway2676 t1_j68vbfq wrote
> Not in 40 years when computers are 1000x better.
It won't take anywhere near that long. We've barely scratched the surface of ASICs and analog matrix multiplication, which is where the real fun is going to begin.
ElectronicCress3132 t1_j629tix wrote
> implement a gradient descent optimization process at inference time
Could you expand on what this means? At inference time, I thought all weights were frozen, so how could the attention layers be somehow performing gradient descent?
Edit: I read the paper in detail and understood it (the math is walked through in Section 3). Basically, the query sentence X produces activations that go through the attention layer (recall how attention works: the sentence is embedded, then multiplied by the query, key, and value matrices). If you also give it some demonstration examples X' to learn from, the attention output naturally contains contributions from both X and X'. It turns out the contribution from X' is equivalent to taking a gradient descent step.
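If it helps, here's a toy numerical check of that equivalence. It uses unnormalized linear attention (the relaxation the paper's argument relies on) and random tensors of my own choosing: attending over the in-context examples is literally the same computation as applying an outer-product weight update, the shape a gradient step on a linear layer would take, to the test query.

```python
import torch

torch.manual_seed(0)
d = 8
K_ctx = torch.randn(5, d)  # keys from the in-context demonstration tokens
V_ctx = torch.randn(5, d)  # values from the in-context demonstration tokens
q = torch.randn(d)         # query for the test token

# "Attention view": linear (softmax-free) attention over the demonstrations
attn_out = V_ctx.T @ (K_ctx @ q)

# "Gradient-descent view": the same computation written as a weight update
# delta_W = sum_i v_i k_i^T applied to q -- the outer-product form a
# gradient step on a linear layer would produce
delta_W = V_ctx.T @ K_ctx
gd_out = delta_W @ q

print(torch.allclose(attn_out, gd_out))  # True: the two views coincide
```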
Acceptable-Cress-374 t1_j62qh5g wrote
Thank you for putting it into words, I was having trouble understanding this myself.
curiousshortguy t1_j61silr wrote
This is cool, thanks for sharing
throwaway2676 t1_j6d99fw wrote
So shouldn't this mean we can train transformers using forward passes alone? It seems that it wouldn't be too difficult to derive an algorithm that updates the attention weights based on these results, but I don't believe the authors mention the possibility.
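To illustrate what I mean (purely my own speculation, not something from the paper): for a single linear layer you can already write a weight update from forward-pass quantities alone, in the same outer-product form delta_W = v k^T that the paper says attention implicitly applies. Whether anything like this extends to full attention weights is exactly the open question.

```python
import torch

def forward_only_update(W, x, target, lr=0.01):
    """Speculative sketch: update a linear layer using only forward-pass
    quantities, via a rank-1 outer-product step (no autograd/backprop).

    W: (d_out, d_in), x: (d_in,), target: (d_out,).
    """
    y = W @ x                                # forward pass
    error = target - y                       # local error signal
    return W + lr * torch.outer(error, x)    # outer-product "gradient-like" step
```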