[R] Is there any research on allowing Transformers to spend more compute on more difficult-to-predict tokens? Submitted by Chemont on January 12, 2023 at 1:07 PM in MachineLearning · 16 comments
cfoster0 wrote on January 14, 2023 at 9:29 AM: FWIW, in a certain sense this goes against the design philosophy of transformers, which is to jointly compute all representations within a layer at once, to maximize the degree of parallelism on GPUs and other accelerators.
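To make the tension concrete, here is a minimal PyTorch sketch (not from the thread, and not any specific paper's method): a standard layer that updates every token in one batched pass, plus a per-token early-exit loop loosely in the spirit of adaptive-computation / early-exit work. The names `adaptive_depth_forward`, `halt_proj`, and `threshold` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """A standard layer: every token is updated in the same batched pass."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # One fused pass over the whole sequence -- this is the
        # parallelism the comment above refers to.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

def adaptive_depth_forward(layers, halt_proj, x, threshold=0.5):
    """Per-token early exit (illustrative): once a token's halting score
    passes the threshold, later layers leave its representation frozen.
    Each layer still runs on the full sequence and the skip is applied
    by masking afterward, so on a GPU this saves no wall-clock time --
    which is exactly the tension with the parallel-by-design layer."""
    halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
    for layer in layers:
        updated = layer(x)
        # Keep the old representation for tokens that already halted.
        x = torch.where(halted.unsqueeze(-1), x, updated)
        halted |= torch.sigmoid(halt_proj(x)).squeeze(-1) > threshold
    return x

# Usage sketch
layers = nn.ModuleList(TransformerLayer() for _ in range(6))
halt_proj = nn.Linear(64, 1)        # scores how "done" each token is
tokens = torch.randn(2, 10, 64)     # (batch, seq_len, d_model)
out = adaptive_depth_forward(layers, halt_proj, tokens)
```

Actually saving compute would require gathering the still-active tokens into a smaller batch before each layer, which breaks the uniform batched matmuls that accelerators are optimized for.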