Chemont OP t1_j41eamz wrote
Reply to comment by amrit_za in [R] Is there any research on allowing Transformers to spent more compute on more difficult to predict tokens? by Chemont
I should have been clearer with my question. What I was wondering was whether there are any extensions to the Transformer architecture that allow it, in theory, to spend an indefinite amount of compute on a single token. I suppose one could train a very deep Transformer, use CALM during inference, and only use all of the layers for tokens that are difficult to predict, but this would still arbitrarily limit the maximum amount of compute per token.
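To make that concrete, here is a rough PyTorch sketch of the kind of CALM-style early exit I mean (not the actual CALM implementation; the module names, the softmax-confidence criterion, and the threshold value are just placeholders I made up, and causal masking/KV caching are omitted):

```python
# Minimal sketch of per-token early exit over a stack of Transformer layers.
# Easy tokens exit after a few layers; hard tokens use all of them, but the
# budget is still hard-capped at n_layers.
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=12, vocab_size=1000,
                 confidence_threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One shared LM head, used both for the exit-confidence check after
        # every layer and for the final logits.
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.confidence_threshold = confidence_threshold

    @torch.no_grad()
    def forward_last_token(self, hidden):
        """Run layers one at a time and stop as soon as the prediction for
        the last position looks confident enough."""
        layers_used = 0
        for layer in self.layers:
            hidden = layer(hidden)
            layers_used += 1
            probs = self.lm_head(hidden[:, -1]).softmax(-1)
            if probs.max().item() >= self.confidence_threshold:
                break  # "easy" token: exit early
        return self.lm_head(hidden[:, -1]), layers_used

model = EarlyExitDecoder()
dummy_prefix = torch.randn(1, 10, 256)  # (batch, seq_len, d_model)
logits, layers_used = model.forward_last_token(dummy_prefix)
print(f"used {layers_used} of {len(model.layers)} layers")
```

The loop is still bounded by `len(self.layers)`, which is exactly the limitation I was asking about: no matter how hard a token is, the model can never spend more than `n_layers` worth of compute on it.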