Submitted by TensorDudee t3_zloof9 in MachineLearning
Internal-Diet-514 t1_j07pfk6 wrote
Reply to comment by nucLeaRStarcraft in [P] Implemented Vision Transformers from scratch using TensorFlow 2.x by TensorDudee
Regarding your first paragraph: when you say "given the same amount of data", isn't it shown here that the ViT was actually given more data, since it was pre-trained on other datasets before being fine-tuned on CIFAR-10, and then compared to models that were most likely trained on CIFAR-10 alone? My worry is that a proper comparison between models requires them all to follow the same training procedure. You can reach SOTA performance on a dataset through techniques other than architecture alone.
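To make the data-budget concern concrete, here is a rough TF 2.x sketch (not from the linked repo; it uses ResNet50 from tf.keras.applications as a stand-in backbone, since tf.keras doesn't ship a pre-trained ViT). The only difference between the two models is whether the backbone has already seen ImageNet before it ever touches CIFAR-10:

```python
import tensorflow as tf

# CIFAR-10: 50k training images at 32x32x3.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = tf.keras.applications.resnet50.preprocess_input(x_train.astype("float32"))
x_test = tf.keras.applications.resnet50.preprocess_input(x_test.astype("float32"))

def build_classifier(pretrained: bool) -> tf.keras.Model:
    # weights="imagenet" means the backbone has already seen ~1.3M extra images;
    # weights=None means CIFAR-10 is the only data this model will ever see.
    backbone = tf.keras.applications.ResNet50(
        weights="imagenet" if pretrained else None,
        include_top=False,
        input_shape=(32, 32, 3),
        pooling="avg",
    )
    head = tf.keras.layers.Dense(10, activation="softmax")
    return tf.keras.Sequential([backbone, head])

# Comparing these two head-to-head conflates architecture with data budget,
# which is exactly the worry about a pre-trained ViT vs. CIFAR-10-only baselines.
finetuned = build_classifier(pretrained=True)
from_scratch = build_classifier(pretrained=False)
for model in (finetuned, from_scratch):
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, batch_size=128,
              validation_data=(x_test, y_test))
```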
nucLeaRStarcraft t1_j08cjvc wrote
I agree with you: if we want to test the architecture, we should use the same training procedure, including pre-training.
My theory is that, given the current results of GPT-like models, which use transformers under the hood, and given that these groups have the compute power and data to train non-attention-based recurrent models, it's quite unlikely that the architecture isn't a main contributor.
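For what it's worth, here is a minimal sketch of what "same training procedure" could look like in TF 2.x, with two toy Keras models standing in for the real architectures under comparison (the builders and hyperparameters are illustrative, not the ones from the post):

```python
import tensorflow as tf

def build_small_cnn(num_classes: int = 10) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

def build_small_mlp(num_classes: int = 10) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(32, 32, 3)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Identical procedure for every candidate: same data, optimizer, schedule, and
# epochs, so any accuracy gap is attributable to the architecture itself.
for name, builder in [("cnn", build_small_cnn), ("mlp", build_small_mlp)]:
    model = builder()
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=10, batch_size=128,
              validation_data=(x_test, y_test), verbose=2)
```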