Chemont OP t1_j41eamz wrote

I should have been clearer with my question. What I was wondering was whether there are any extensions to the Transformer architecture that allow it, in theory, to spend an indefinite amount of compute on a single token. I suppose one could train a very deep Transformer, use CALM during inference, and only use all of the layers for tokens that are difficult to predict, but this would still arbitrarily limit the maximum amount of compute per token.
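
To make the concern concrete, here is a minimal sketch of a CALM-style early-exit inference loop, not the actual CALM algorithm (which uses calibrated confidence measures and a decaying threshold). All names (`EarlyExitDecoder`, `exit_head`, `confidence_threshold`) are hypothetical. It illustrates the point above: an easy token may exit after a few layers, but a hard token can never receive more than `n_layers` worth of computation, so compute per token is capped by depth rather than unbounded.

```python
# Simplified illustration of per-token early exit, assuming a decoder whose
# layers can be applied one at a time. Hypothetical names throughout.
import torch
import torch.nn as nn

class EarlyExitDecoder(nn.Module):
    def __init__(self, d_model=256, n_layers=24, vocab_size=32000):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        # Shared exit classifier applied after every layer.
        self.exit_head = nn.Linear(d_model, vocab_size)

    @torch.no_grad()
    def predict_token(self, h, confidence_threshold=0.9):
        """Apply layers one by one; stop as soon as the exit head is confident.

        Compute per token is bounded above by len(self.blocks): difficult
        tokens simply use every layer, never more.
        """
        for depth, block in enumerate(self.blocks, start=1):
            h = block(h)
            probs = self.exit_head(h[:, -1]).softmax(-1)
            confidence, token = probs.max(-1)
            if confidence.item() >= confidence_threshold:
                return token, depth  # early exit: easy token
        return token, depth  # used all layers: compute is still capped


model = EarlyExitDecoder().eval()
hidden = torch.randn(1, 10, 256)  # stand-in for an embedded context window
token, layers_used = model.predict_token(hidden)
print(f"predicted token {token.item()} after {layers_used} layers")
```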