nucLeaRStarcraft

nucLeaRStarcraft t1_j08cjvc wrote

I agree with you: if we want to test the architecture, we should use the same training procedure, including pre-training.

My theory is that, given the current results of GPT-like models, which use Transformers under the hood, and given that these groups have the compute and data to train non-attention-based recurrent models, it's quite unlikely that the architecture isn't a main contributor.

2

nucLeaRStarcraft t1_j07bufu wrote

We're generally trying to make the most of the available labeled data. If the Transformer can ingest more data and, in the end, performs better than any non-attention-based model given the same amount of data, then it's a better architecture.

You are asking a fair question, but I think the body of recent work shows that the Transformer does generalize better. Otherwise, we'd see similar results with non-Transformer-based architectures, since the data and compute are already available to the groups doing this kind of research.

3