tdgros t1_j41f1nz wrote on January 12, 2023 at 2:47 PM

Reply to comment by Chemont in [R] Is there any research on allowing Transformers to spent more compute on more difficult to predict tokens? by Chemont

You'll still pay the full price at train time, right? Early decoding works by using decoders on earlier levels at train time. Conversely, if you want to spend more on some tokens, at train time, you will need to have more layers, so at some point you will hit your memory/complexity limits.