visarga t1_j46b2po wrote
Reply to comment by Chemont in [R] Is there any research on allowing Transformers to spent more compute on more difficult to predict tokens? by Chemont
No, but if you use a decoder (autoregressive) model, you can generate more tokens for the same task depending on its difficulty. Chain-of-thought prompting exploits exactly this trick: the intermediate reasoning tokens are extra forward passes, so harder problems naturally get more compute. A quick sketch below.
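Here's a minimal illustration of the idea (assuming the Hugging Face `transformers` library, with GPT-2 as a stand-in decoder model; the "Let's think step by step" suffix is the zero-shot chain-of-thought prompt from Kojima et al. 2022). The point isn't the quality of GPT-2's answer, just that the CoT prompt lets the model emit intermediate tokens, and each extra token is an extra forward pass of compute, before committing to a final answer:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Direct prompt: the model has to commit to an answer almost immediately.
direct = "Q: What is 17 * 24? A:"

# Chain-of-thought prompt: invites intermediate reasoning tokens,
# i.e. more decoding steps (more compute) before the final answer.
cot = "Q: What is 17 * 24? A: Let's think step by step."

for prompt in (direct, cot):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )
    print(tokenizer.decode(out[0], skip_special_tokens=True))
    print("-" * 40)
```

Same weights, same per-token cost; the only thing that changes is how many tokens the model spends on the problem, which is the adaptive-compute behavior the question is asking about.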