
blueSGL t1_j8c26i1 wrote

> Experimental Settings

> As the Multimodal-CoT task requires generating the reasoning chains and leveraging the vision features, we use the T5 encoder-decoder architecture (Raffel et al., 2020). Specifically, we adopt UnifiedQA (Khashabi et al., 2020) to initialize our models in the two stages because it achieves the best fine-tuning results in Lu et al. (2022a). To verify the generality of our approach across different LMs, we also employ FLAN-T5 (Chung et al., 2022) as the backbone in Section 6.3. As using image captions does not yield significant performance gains in Section 3.3, we did not use the captions. We fine-tune the models up to 20 epochs, with a learning rate of 5e-5. The maximum input sequence length is 512. The batch sizes for the base and large models are 16 and 8, respectively. Our experiments are run on 4 NVIDIA Tesla V100 32G GPUs.
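
For a concrete picture, here is a rough sketch of what that text-side fine-tuning setup looks like with Hugging Face Transformers. The checkpoint name and the per-device batch split are assumptions on my part, and this leaves out the vision-feature fusion that Multimodal-CoT actually adds on top of T5:

```python
# Hypothetical sketch of the quoted fine-tuning hyperparameters.
# Checkpoint name and per-device batch split are assumptions;
# the paper's vision-feature fusion is not shown here.
from transformers import (
    AutoTokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainingArguments,
)

checkpoint = "allenai/unifiedqa-t5-base"  # UnifiedQA is a T5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = T5ForConditionalGeneration.from_pretrained(checkpoint)

# Quoted settings: up to 20 epochs, lr 5e-5, max input length 512,
# batch size 16 for the base model across 4 GPUs.
training_args = Seq2SeqTrainingArguments(
    output_dir="mm-cot-base",
    num_train_epochs=20,
    learning_rate=5e-5,
    per_device_train_batch_size=4,  # 4 GPUs x 4 = 16 total
    save_strategy="epoch",
)

inputs = tokenizer(
    "example question with context ...",
    max_length=512,
    truncation=True,
    return_tensors="pt",
)
```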

So the GPUs were used for training; there is nothing here to say what the system requirements will be for inference.
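
A quick back-of-envelope way to ballpark the inference side is weights × bytes per parameter. The checkpoint name below is an assumption, and activations come on top of this:

```python
# Rough estimate of inference weight memory from parameter count.
# Checkpoint name is an assumption; activation memory is not counted.
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained("allenai/unifiedqa-t5-base")
n_params = sum(p.numel() for p in model.parameters())

print(f"{n_params / 1e6:.0f}M parameters")
print(f"~{n_params * 4 / 1e9:.2f} GB of weights in fp32")
print(f"~{n_params * 2 / 1e9:.2f} GB of weights in fp16")
```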

25

phira t1_j8c3mqx wrote

Hrm. The 512-token limit on input might explain the performance vs. parameter count.
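
For illustration (model name is an assumption), this is what the cap does in practice: anything past 512 tokens gets cut off before the model ever sees it.

```python
# Sketch: with a 512-token cap, the tokenizer simply truncates the input.
# Model/tokenizer name is an assumption for illustration.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-base")
long_text = "question plus retrieved context " * 200  # deliberately over-long

encoded = tokenizer(long_text, max_length=512, truncation=True)
print(len(encoded["input_ids"]))  # 512 -- everything after the cap is dropped
```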

9

rixtil41 t1_j8ctrij wrote

I wonder how much more efficient these models can get?

2