
amrit_za t1_j418a4l wrote

It sounds like what you're considering the "opposite" is just a reframing of the original task, i.e. if a token is difficult to predict, then more layers (and therefore more compute) would be used; if it's easy, fewer layers. Am I missing something from what you're asking?
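
Roughly something like this, in a toy PyTorch sketch (the confidence head, threshold, and layer count are all illustrative, not from any particular paper):

```python
# Toy sketch of per-token adaptive depth: easy tokens exit early, hard tokens
# keep going through more layers. Names like confidence_head / exit_threshold
# are placeholders for whatever difficulty estimate you'd actually use.
import torch
import torch.nn as nn

class AdaptiveDepthEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=12, exit_threshold=0.9):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )
        self.confidence_head = nn.Linear(d_model, 1)  # crude "is this token done?" score
        self.exit_threshold = exit_threshold

    def forward(self, x):
        # done[i, j] == True once token j of sequence i has "exited"
        done = torch.zeros(x.shape[:2], dtype=torch.bool, device=x.device)
        for layer in self.layers:
            new_x = layer(x)
            # exited tokens keep their frozen representation
            x = torch.where(done.unsqueeze(-1), x, new_x)
            conf = torch.sigmoid(self.confidence_head(x)).squeeze(-1)
            done = done | (conf > self.exit_threshold)
            if done.all():  # every token confident enough: stop early
                break
        return x
```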

18

Chemont OP t1_j41eamz wrote

I should have been clearer with my question. What I was wondering is whether there are any extensions to the Transformer architecture that allow it, in theory, to spend an indefinite amount of compute on one token. I suppose one could train a very deep Transformer, use CALM during inference and only use all of the layers for tokens which are difficult to predict, but this would still arbitrarily limit the maximum amount of compute per token.
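
To make the limitation concrete, here is a rough sketch of CALM-style early exiting at inference time (the confidence measure and threshold are placeholders; the point is that compute per token is still capped by the number of layers):

```python
# Per generated token: run layer by layer, exit once the softmax is confident.
# However hard the token is, compute can never exceed len(layers).
import torch

@torch.no_grad()
def decode_next_token(hidden, layers, lm_head, threshold=0.95):
    """hidden: (1, seq_len, d_model) decoder state for the current prefix."""
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        logits = lm_head(hidden[:, -1])     # predict from the last position
        probs = torch.softmax(logits, dim=-1)
        if probs.max() > threshold:         # confident enough: exit early
            break
    return probs.argmax(dim=-1), depth      # token id and layers actually used
```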

6

tdgros t1_j41f1nz wrote

You'll still pay the full price at train time, right? Early decoding works by attaching decoders to earlier layers at train time. Conversely, if you want to spend more compute on some tokens, you will need more layers at train time, so at some point you will hit your memory/complexity limits.
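
Something like this is what I mean (a minimal sketch, assuming auxiliary LM heads at a few intermediate layers; which layers get heads is an arbitrary choice here):

```python
# Why full depth is still paid at train time: intermediate exit heads are
# trained, but the forward/backward pass runs through every layer for every
# token on every step.
import torch.nn.functional as F

def early_exit_training_loss(x, targets, layers, exit_heads):
    """layers: full decoder stack; exit_heads: {layer_index: lm_head}."""
    loss = 0.0
    for i, layer in enumerate(layers):
        x = layer(x)                          # every layer runs, every step
        if i in exit_heads:                   # attach a decoder at this depth
            logits = exit_heads[i](x)
            loss = loss + F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1)
            )
    return loss / len(exit_heads)
```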

4

visarga t1_j46b2po wrote

No, but if you use a decoder (autoregressive) model, you can generate more tokens for the same task depending on its difficulty. Chain-of-thought prompting makes use of this trick.
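
In other words, the compute per answer is open-ended because the model decides how many tokens to emit before stopping. A minimal greedy-decoding sketch (`model` and `eos_id` are placeholders):

```python
# Each generated token costs one full forward pass, so total compute per task
# scales with how many tokens the model chooses to spend before the stop token.
import torch

@torch.no_grad()
def generate(model, input_ids, eos_id, max_new_tokens=512):
    steps = 0
    for _ in range(max_new_tokens):
        logits = model(input_ids)                       # one forward pass per token
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
        steps += 1
        if (next_id == eos_id).all():                   # model decides when it is done
            break
    return input_ids, steps                             # steps ~ compute spent on this task
```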

2