throwaway2676 t1_j6d99fw wrote

So shouldn't this mean we can train transformers using forward passes alone? It doesn't seem too difficult to derive an algorithm that updates the attention weights based on these results, but I don't believe the authors mention the possibility.