[R] Is there any research on allowing Transformers to spend more compute on more difficult-to-predict tokens? Submitted by Chemont on January 12, 2023 at 1:07 PM in MachineLearning · 16 comments
cfoster0 wrote on January 14, 2023 at 9:29 AM: FWIW, in a certain sense this goes against the design philosophy of transformers, which is to jointly compute all representations within a layer at once, to maximize the degree of parallelism on GPUs and other accelerators.
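To make the tension concrete, here is a minimal PyTorch sketch (not from the thread, and not any specific paper's method): a standard layer that updates every token in one batched pass, plus a per-token early-exit loop loosely in the spirit of adaptive-computation / early-exit work. The names `adaptive_depth_forward`, `halt_proj`, and `threshold` are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerLayer(nn.Module):
    """A standard layer: every token is updated in the same batched pass."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # One fused pass over the whole sequence -- this is the
        # parallelism the comment above refers to.
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        return self.norm2(x + self.ff(x))

def adaptive_depth_forward(layers, halt_proj, x, threshold=0.5):
    """Per-token early exit (illustrative): once a token's halting score
    passes the threshold, later layers leave its representation frozen.
    Each layer still runs on the full sequence and the skip is applied
    by masking afterward, so on a GPU this saves no wall-clock time --
    which is exactly the tension with the parallel-by-design layer."""
    halted = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
    for layer in layers:
        updated = layer(x)
        # Keep the old representation for tokens that already halted.
        x = torch.where(halted.unsqueeze(-1), x, updated)
        halted |= torch.sigmoid(halt_proj(x)).squeeze(-1) > threshold
    return x

# Usage sketch
layers = nn.ModuleList(TransformerLayer() for _ in range(6))
halt_proj = nn.Linear(64, 1)        # scores how "done" each token is
tokens = torch.randn(2, 10, 64)     # (batch, seq_len, d_model)
out = adaptive_depth_forward(layers, halt_proj, tokens)
```

Actually saving compute would require gathering the still-active tokens into a smaller batch before each layer, which breaks the uniform batched matmuls that accelerators are optimized for.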