throwaway2676 t1_j6d99fw wrote
Reply to comment by currentscurrents in [R] Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers by currentscurrents
So shouldn't this mean we can train transformers using forward passes alone? It seems it wouldn't be too difficult to derive an algorithm that updates the attention weights based on the implicit meta-gradients the paper describes, but I don't believe the authors mention the possibility.
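For context, the duality the paper relies on (in its linear-attention approximation, with the softmax dropped) says that attending over the demonstration tokens is mathematically the same as applying an implicit weight update, built from those tokens, directly to the query. Here's a minimal numpy sketch of that identity; the projection matrices `W_v`/`W_k` and the dimensions are made up for illustration, and this only demonstrates the equivalence, not an actual forward-pass training algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8          # embedding dimension (arbitrary)
n_demo = 4     # number of in-context demonstration tokens

# Hypothetical value/key projections for one linear-attention head.
W_v = rng.normal(size=(d, d))
W_k = rng.normal(size=(d, d))

X_demo = rng.normal(size=(d, n_demo))  # demonstration tokens as columns
q = rng.normal(size=(d,))              # query token

# Linear attention over the demonstrations:
#   out = (W_v X) (W_k X)^T q
attn_out = (W_v @ X_demo) @ (W_k @ X_demo).T @ q

# Equivalent view: the demonstrations induce an implicit weight update
#   delta_W = sum_i (W_v x_i)(W_k x_i)^T,
# which is then applied to the query directly.
delta_W = sum(np.outer(W_v @ X_demo[:, i], W_k @ X_demo[:, i])
              for i in range(n_demo))
implicit_out = delta_W @ q

assert np.allclose(attn_out, implicit_out)
```

If you wanted a forward-pass update rule, the obvious move would be to materialize `delta_W` and fold it into the weights permanently, but whether that yields a useful general-purpose training signal is exactly the open question.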