blueSGL t1_j8c26i1 wrote
Reply to comment by turnip_burrito in This is Revolutionary?! Amazon's 738 Million(!!!) parameter's model outpreforms humans on sience, vision, language and much more tasks. by Ok_Criticism_1414
> Experimental Settings
> As the Multimodal-CoT task requires generating the reasoning chains and leveraging the vision features, we use the T5 encoder-decoder architecture (Raffel et al., 2020). Specifically, we adopt UnifiedQA (Khashabi et al., 2020) to initialize our models in the two stages because it achieves the best fine-tuning results in Lu et al. (2022a). To verify the generality of our approach across different LMs, we also employ FLAN-T5 (Chung et al., 2022) as the backbone in Section 6.3. As using image captions does not yield significant performance gains in Section 3.3, we did not use the captions. We fine-tune the models up to 20 epochs, with a learning rate of 5e-5. The maximum input sequence length is 512. The batch sizes for the base and large models are 16 and 8, respectively. Our experiments are run on 4 NVIDIA Tesla V100 32G GPUs.
So the GPUs were used for training; the paper says nothing about what the system requirements will be for inference.
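
For concreteness, here's a minimal, text-only sketch of roughly those fine-tuning settings using Hugging Face transformers. The checkpoint name (`allenai/unifiedqa-t5-base`) and the use of `Seq2SeqTrainingArguments` are my assumptions, not the paper's code, and the actual Multimodal-CoT model also fuses vision features, which this sketch omits:

```python
# Rough sketch of the quoted text-only fine-tuning setup; the real
# Multimodal-CoT code additionally injects vision features into the encoder.
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "allenai/unifiedqa-t5-base"  # UnifiedQA initialization, per the quote

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

training_args = Seq2SeqTrainingArguments(
    output_dir="mm-cot-base",
    learning_rate=5e-5,              # as quoted
    num_train_epochs=20,             # "up to 20 epochs"
    per_device_train_batch_size=4,   # assuming the quoted batch size of 16 is split across 4 GPUs
    predict_with_generate=True,
)

# Inputs are truncated to the quoted maximum sequence length of 512 tokens.
def encode(example):
    return tokenizer(
        example["input_text"],
        max_length=512,
        truncation=True,
        padding="max_length",
    )
```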