throwaway2676 t1_j6d99fw wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
So shouldn't this mean we can train transformers using forward passes alone? It seems it wouldn't be too difficult to derive an algorithm that updates the attention weights based on the implicit meta-gradients the paper describes, but I don't believe the authors mention the possibility.
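For context, the duality the paper relies on (in its linear-attention approximation, with the softmax dropped) says that attending over the demonstration tokens is mathematically the same as applying an implicit weight update, built from those tokens, directly to the query. Here's a minimal numpy sketch of that identity; the projection matrices `W_v`/`W_k` and the dimensions are made up for illustration, and this only demonstrates the equivalence, not an actual forward-pass training algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # embedding dimension (arbitrary)
n_demo = 4     # number of in-context demonstration tokens

# Hypothetical value/key projections for one linear-attention head.
W_v = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

X_demo = rng.normal(size=(d, n_demo))  # demonstration tokens as columns
q = rng.normal(size=(d,))              # query token

# Linear attention over the demonstrations:
#   out = (W_v X) (W_k X)^T q
attn_out = (W_v @ X_demo) @ (W_k @ X_demo).T @ q

# Equivalent view: the demonstrations induce an implicit weight update
#   delta_W = sum_i (W_v x_i)(W_k x_i)^T,
# which is then applied to the query directly.
delta_W = sum(np.outer(W_v @ X_demo[:, i], W_k @ X_demo[:, i])
              for i in range(n_demo))
implicit_out = delta_W @ q

assert np.allclose(attn_out, implicit_out)
```

If you wanted a forward-pass update rule, the obvious move would be to materialize `delta_W` and fold it into the weights permanently, but whether that yields a useful general-purpose training signal is exactly the open question.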