
blueSGL t1_j8c26i1 wrote

> Experimental Settings

> As the Multimodal-CoT task requires generating the reasoning chains and leveraging the vision features, we use the T5 encoder-decoder architecture (Raffel et al., 2020). Specifically, we adopt UnifiedQA (Khashabi et al., 2020) to initialize our models in the two stages because it achieves the best fine-tuning results in Lu et al. (2022a). To verify the generality of our approach across different LMs, we also employ FLAN-T5 (Chung et al., 2022) as the backbone in Section 6.3. As using image captions does not yield significant performance gains in Section 3.3, we did not use the captions. We fine-tune the models up to 20 epochs, with a learning rate of 5e-5. The maximum input sequence length is 512. The batch sizes for the base and large models are 16 and 8, respectively. Our experiments are run on 4 NVIDIA Tesla V100 32G GPUs.
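
For a concrete picture, here is a rough sketch of what that text-side fine-tuning setup looks like with Hugging Face Transformers. The checkpoint name and the per-device batch split are assumptions on my part, and this leaves out the vision-feature fusion that Multimodal-CoT actually adds on top of T5:

```python
# Hypothetical sketch of the quoted fine-tuning hyperparameters.
# Checkpoint name and per-device batch split are assumptions;
# the paper's vision-feature fusion is not shown here.
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

checkpoint = "allenai/unifiedqa-t5-base"  # UnifiedQA is a T5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Quoted settings: up to 20 epochs, lr 5e-5, max input length 512,
# batch size 16 for the base model across 4 GPUs.
training_args = Seq2SeqTrainingArguments(
    output_dir="mm-cot-base",
    num_train_epochs=20,
    learning_rate=5e-5,
    per_device_train_batch_size=4,  # 4 GPUs x 4 = 16 total
    save_strategy="epoch",
)

inputs = tokenizer(
    "example question with context ...",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
```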

So the GPUs were used for training; there is nothing here to say what the system requirements will be for inference.
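
A quick back-of-envelope way to ballpark the inference side is weights × bytes per parameter. The checkpoint name below is an assumption, and activations come on top of this:

```python
# Rough estimate of inference weight memory from parameter count.
# Checkpoint name is an assumption; activation memory is not counted.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("allenai/unifiedqa-t5-base")
n_params = sum(p.numel() for p in model.parameters())

print(f"{n_params / 1e6:.0f}M parameters")
print(f"~{n_params * 4 / 1e9:.2f} GB of weights in fp32")
print(f"~{n_params * 2 / 1e9:.2f} GB of weights in fp16")
```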

25

phira t1_j8c3mqx wrote

Hrm. The 512-token limit on input might explain the performance vs. parameter count.
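
For illustration (model name is an assumption), this is what the cap does in practice: anything past 512 tokens gets cut off before the model ever sees it.

```python
# Sketch: with a 512-token cap, the tokenizer simply truncates the input.
# Model/tokenizer name is an assumption for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
long_text = "question plus retrieved context " * 200  # deliberately over-long

encoded = tokenizer(long_text, max_length=512, truncation=True)
print(len(encoded["input_ids"]))  # 512 -- everything after the cap is dropped
```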

9

rixtil41 t1_j8ctrij wrote

I wonder how much more efficient these models can get?

2