Comments
blueSGL t1_j8c26i1 wrote
> Experimental Settings
> As the Multimodal-CoT task requires generating the reasoning chains and leveraging the vision features, we use the T5 encoder-decoder architecture (Raffel et al., 2020). Specifically, we adopt UnifiedQA (Khashabi et al., 2020) to initialize our models in the two stages because it achieves the best fine-tuning results in Lu et al. (2022a). To verify the generality of our approach across different LMs, we also employ FLAN-T5 (Chung et al., 2022) as the backbone in Section 6.3. As using image captions does not yield significant performance gains in Section 3.3, we did not use the captions. We fine-tune the models up to 20 epochs, with a learning rate of 5e-5. The maximum input sequence length is 512. The batch sizes for the base and large models are 16 and 8, respectively. Our experiments are run on 4 NVIDIA Tesla V100 32G GPUs.
So the GPUs were used for training; there's nothing here to say what the system requirements will be for inference.
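For reference, those numbers map onto a pretty standard seq2seq fine-tuning recipe. Here's a rough sketch in Hugging Face terms (my own reconstruction, not the authors' code; the model ID is my guess at the UnifiedQA checkpoint, and dataset loading plus the vision-feature fusion are left out):

```python
# Sketch only: hyperparameters come from the quoted settings, everything else is assumed.
from transformers import T5ForConditionalGeneration, T5Tokenizer, Seq2SeqTrainingArguments

# UnifiedQA initialization, as described in the paper (checkpoint name assumed; base size shown)
model = T5ForConditionalGeneration.from_pretrained("allenai/unifiedqa-t5-base")
tokenizer = T5Tokenizer.from_pretrained("allenai/unifiedqa-t5-base")

args = Seq2SeqTrainingArguments(
    output_dir="mm-cot-base",
    learning_rate=5e-5,              # reported learning rate
    num_train_epochs=20,             # "up to 20 epochs"
    per_device_train_batch_size=16,  # base model; the large model used 8
)

# The 512-token limit would be applied at tokenization time, e.g.:
# inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
```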
WithoutReason1729 t1_j8c97b1 wrote
GPT-2 XL is 1.5 billion parameters. Unless they added some very computationally expensive change to this new model that's unrelated to the parameter count, this could definitely run on consumer hardware. Very very cool!
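Quick back-of-envelope math, using the ~738M parameter count mentioned elsewhere in the thread (weights only, ignoring activations and generation overhead):

```python
# Rough VRAM needed just to hold the weights at different precisions
params = 738e6  # ~738M parameters for the larger Multimodal-CoT variant
for precision, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1)]:
    print(f"{precision}: {params * bytes_per_param / 1e9:.1f} GB")
# fp16 works out to roughly 1.5 GB of weights, so inference on a single consumer GPU looks very plausible
```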
Red-HawkEye t1_j8cd8mp wrote
God damn, Amazon has entered the game.
Just when you think you'd seen it all, an atomic bomb like this one gets announced.
This is like Black Frieza coming back in Dragon Ball and one-shotting the main characters.
Amazon's GPT one-shots ChatGPT and Google LaMDA out of nowhere.
grimorg80 t1_j8ctetu wrote
I want to see it in action out in the open, though.
DarkCeldori t1_j8cz4sg wrote
While on the topic of consumer hardware, Ryzen AI (XDNA) seems promising, as it'll be able to make use of main system memory, which will soon easily reach 256GB. That can fit very large models, and inference is usually far less computationally intensive than training.
gangstasadvocate t1_j8d43oc wrote
Apparently one of my professors (well, her husband) has a home-grade computer system with 2 TB of RAM. I tried searching it up; that kind of capacity only seems to come in server-type builds, but yeah.
DarkCeldori t1_j8dk62j wrote
I think some Threadripper Pro workstations can reach up to 2 TB of RAM. It will be very good once Threadrippers come with Ryzen XDNA AI built in, as that can directly use main memory for AI tasks.
Tiamatium t1_j8i6ien wrote
Yeah, 2 TB of RAM is doable with server/workstation hardware. Think Threadripper or Xeon for the CPU.
NapkinsOnMyAnkle t1_j8e4p9m wrote
I've trained CNNs of up to 200M parameters on my laptop's 3070 without any issues. I think it only has around 5 GB of available VRAM.
This is a big concern of mine: that AGI requires an insurmountable amount of VRAM to train in realistic timeframes and is therefore essentially impossible. I mean, we could calculate all of these models by hand, train them, and then use them to make predictions, but it would take forever, like literally forever!
NTIASAAHMLGTTUD t1_j8c0cj3 wrote
In the interest of skepticism, can anyone pour cold water on this, or is it as good as it sounds?
turnip_burrito t1_j8c1932 wrote
It's only tested on one benchmark, called ScienceQA. Maybe testing it on others would let us see how well it really stacks up.
el_chaquiste t1_j8c1z83 wrote
If I understand correctly, it seems the input set (a science exam with solved exercises and detailed responses) is smaller than GPT-3.5's own, but the model outperforms GPT-3.5 and humans at solving problems similar to those in that exam by a few percent, and by more when it has multimodal training that includes visual data.
I honestly don't know if we should get overly excited over this or not, but it seems like it would allow the creation of smaller models focused on particular scientific and technical domains, with better accuracy in their responses than generalist LLMs.
SoylentRox t1_j8cblun wrote
Theoretically it should query a large number of models and have a "confidence" based on how likely each model's answer is to be correct, then return the most confident answer.
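Something like this, conceptually (a toy sketch; the stand-in "experts" and their confidence scores are made up purely for illustration):

```python
# Ask every model on the "panel" and keep the answer backed by the highest confidence.
def most_confident_answer(prompt, models):
    best_answer, best_confidence = None, float("-inf")
    for model in models:
        answer, confidence = model(prompt)  # each expert returns (answer, score in [0, 1])
        if confidence > best_confidence:
            best_answer, best_confidence = answer, confidence
    return best_answer

# Stand-in experts, for illustration only:
physics_bot = lambda q: ("F = ma, so a = 3 m/s^2", 0.9)
history_bot = lambda q: ("It happened in 1066", 0.2)
print(most_confident_answer("A 2 kg mass is pushed with 6 N...", [physics_bot, history_bot]))
```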
duboispourlhiver t1_j8cpldd wrote
Artificial expert panel
RabidHexley t1_j8dzxsv wrote
I Am Legion
ReadSeparate t1_j8fb4cr wrote
One can easily imagine a generalist LLM outputting an action token that represents a prompt for a specialized LLM; the request gets routed to that specialist, and its response is then formatted and put back into context by the generalist.
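Roughly this kind of flow (the token name and the specialist function are hypothetical placeholders, not anything from the paper):

```python
def chemistry_expert(query):
    # stand-in for a call to a specialized LLM
    return "specialist answer to: " + query

SPECIALISTS = {"<ask_chem>": chemistry_expert}

def respond(prompt, generalist):
    draft = generalist(prompt)  # the generalist may emit an action token mid-generation
    for token, specialist in SPECIALISTS.items():
        if token in draft:
            sub_prompt = draft.split(token, 1)[1].strip()  # text after the token is the request
            expert_reply = specialist(sub_prompt)
            # fold the expert's reply back into the generalist's context and continue
            return generalist(prompt + "\n[expert]: " + expert_reply)
    return draft

# usage: respond("Balance this combustion reaction...", my_generalist_llm)
```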
Ishynethetruth t1_j8c6kn8 wrote
Unless it's released to the public, don't trust Amazon.
Cryptizard t1_j8d86dt wrote
The humans they tested on were random people on Mechanical Turk, so that data point is not very illuminating.
Borrowedshorts t1_j8erk8h wrote
It's as good as it sounds, and you can't really fake performance on a dataset like this. Multimodal models will change the game. I don't think multimodal models by themselves are the end game, but they appear poised to take over state-of-the-art performance for the foreseeable future.
bladecg t1_j8bznlz wrote
Maybe their model is just overfitting a lot to the test data? That’s always a thing in ML
94746382926 t1_j8c1uju wrote
Yeah I feel like we need more benchmarks
FusionRocketsPlease t1_j8diuz2 wrote
This type of paper should not be published before passing all possible tests that can refute the claim in the title...
Yesyesnaaooo t1_j8cs7dl wrote
Feels like even if this isn't 'it', the next stage is coming down the pipe really fucking soon.
If we simply consider the amount of human resources that went into reaching the moon or splitting the atom, both 'time sensitive' pursuits, and then weigh that against the amount of human resources going into AI?
Well, the race is on, and there's more research going into this than ever went into either of those projects.
It's going to take less than a decade for a winner to emerge.
maskedpaki t1_j8by3r7 wrote
This is like 2 weeks old. If it really does surpass GPT-3 with under a billion parameters, then why isn't this in the headlines?
d00m_sayer t1_j8bzbnm wrote
Because people cannot use it yet 🙄
maskedpaki t1_j8c0eso wrote
I've seen so many things like this that actually end up surpassing gpt3 on some narrow benchmark with more optimised prompting rather than just being a better model overall
I hope I'm wrong this time
94746382926 t1_j8c1t4r wrote
Yeah we need more benchmarks.
beezlebub33 t1_j8d62pw wrote
Benchmarks are really hard and expensive. And they are not fun or exciting for the people involved; the groups that make them really deserve more credit.
gay_manta_ray t1_j8c1uv8 wrote
This benchmark seems pretty comprehensive.
maskedpaki t1_j8g2v6v wrote
No, it actually seems like a pretty narrow science benchmark.
If you told me its 0-shot MMLU score was higher than 175-billion-parameter GPT-3.5's, with under a billion parameters, then I'd be absolutely shocked.
Ok_Criticism_1414 OP t1_j8bywbq wrote
Because of the ChatGPT hype? Who knows. I think OpenAI has already done it, they're just not showing it to the public. The main thing, I guess, is that Amazon took a different approach, integrating the two modalities by pre-finetuning the model to be multimodal. You can read about it in the paper. Plus, it looks like language + visual context gives a huge boost. But that was already done by the Flamingo model, so I guess the first point is the crucial one.
turnip_burrito t1_j8bz5rv wrote
10 days old, going by version 1 on arXiv.
RahnuLe t1_j8cz9sc wrote
These passages really stuck out to me.
> We speculate that such a phenomenon of hallucination is due to a lack of necessary vision contexts for performing effective Multimodal-CoT. To inject vision information, a simple way is to transform the paired image into a caption (Lu et al., 2022a) and then append the caption in the input of both stages. However, as shown in Table 3, using captions only yields marginal performance gains (↑0.59%). Then, we explore an advanced technique by incorporating vision features into the language model. Concretely, we feed the paired image to the DETR model (Carion et al., 2020) to extract vision features. Then we fuse the vision features with the encoded language representations before feeding to the decoder (more details will be presented in Section 4). Interestingly, with vision features, the RougeL score of the rationale generation has boosted to 96.97% (QCM→R), which correspondingly contributes to better answer accuracy of 84.91% (QCMR→A). With those effective rationales, the phenomenon of hallucination is mitigated — 62.5% hallucination mistakes in Section 3.2 have been corrected (Figure 3(b)), as an example shown in Figure 2 (right part). The analysis so far compellingly shows that vision features are indeed beneficial for generating effective rationales and contributing to accurate answer inference.
>
> [...]
>
> Compared with existing UnifiedQA and GPT-3.5 methods that leverage image captions in the context to provide vision semantics, the results indicate that using image features is more effective.
It's clear that there's still a long way to go in terms of representing the data in such a way that these networks can fully process it without factual errors, and this paper is a strong demonstration of the gains that can happen when you address this aspect specifically. Very promising stuff, and frankly, it's also kind of terrifying. Human obsolescence is frighteningly closer than I had imagined...
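For anyone curious what "fuse the vision features with the encoded language representations" could look like in practice, here's a rough sketch of the general idea (my own guess at a gated cross-attention fusion; the dimensions and details are assumptions, not the paper's actual code):

```python
import torch
import torch.nn as nn

class GatedVisionFusion(nn.Module):
    """Sketch: mix DETR-style vision features into the text encoder's hidden states."""
    def __init__(self, text_dim=768, vision_dim=256, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)  # map vision features into the LM's hidden size
        self.attn = nn.MultiheadAttention(text_dim, num_heads, batch_first=True)
        self.gate = nn.Linear(2 * text_dim, text_dim)

    def forward(self, text_states, vision_feats):
        # text_states: (batch, seq_len, text_dim) from the T5 encoder
        # vision_feats: (batch, num_regions, vision_dim) from DETR (e.g. 100 object queries)
        v = self.proj(vision_feats)
        # each text position attends over the image regions
        attended, _ = self.attn(text_states, v, v)
        # a sigmoid gate decides how much vision information each token absorbs
        g = torch.sigmoid(self.gate(torch.cat([text_states, attended], dim=-1)))
        return (1 - g) * text_states + g * attended  # fused states go on to the decoder

fused = GatedVisionFusion()(torch.randn(2, 512, 768), torch.randn(2, 100, 256))
```

The point isn't the exact wiring; it's that the model gets dense image features to attend over instead of a lossy text caption, which seems to be what kills the hallucinated rationales.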
SoylentRox t1_j8cb8c0 wrote
Obviously the next question is: what happens if you give it 1 trillion parameters? (That would be about 1,355 times as many params.)
urinal_deuce t1_j8hsln7 wrote
More is always better.
Sandbar101 t1_j8byoss wrote
Goddamn
Cryptizard t1_j8d7xv0 wrote
Keep in mind that the “human” line here is data they got from assigning the questions to randos on Mechanical Turk. It is not saying that their model is better than scientists. Still a really cool result, I just know how people here love to jump to insane conclusions.
davidolson22 t1_j8c7m62 wrote
Many more tasks including grammar
No_Ninja3309_NoNoYes t1_j8d9lii wrote
This is the principle of separation of concerns at work. Many focused capsules working together are stronger than one huge LLM. And it is easier to fine tune and prune individual capsules than one giant black box. It makes sense at inference time too. Eventually you could have a minimal model running locally with the sole purpose of figuring out which web services to contact for a given request.
user1342 t1_j8ds86t wrote
I wonder if the leap in Amazon's AI came from them using GPT or something similar to help make their new AI more efficient?
I always thought that was how we would see real acceleration: when the AIs start designing the next generation of themselves.
Denpol88 t1_j8dwlml wrote
This is huge.
FusionRocketsPlease t1_j8e71cc wrote
Why are there studies on GPT-3.5 in 2020 if it appeared in 2022?
Ok_Criticism_1414 OP t1_j8er8eu wrote
GPT-3.5 is a fine-tuned version of GPT-3, which appeared in 2020. ChatGPT is built on the same technology and foundation as GPT-3, just better optimized.
SmoothPlastic9 t1_j8e9qrh wrote
Let me use it first, then, instead of going by some benchmark test.
NarrowTea t1_j8f9l82 wrote
They just keep making them more efficient.
olivermyk t1_j8fnvwa wrote
The PDF isn't accessible from the original link anymore. Did anyone manage to download it? Could you please re-share?
thelonghauls t1_j8g04tz wrote
Speling too
urinal_deuce t1_j8hrp5o wrote
Does it pass the sience of copy paste?
PurpedSavage t1_j8ir1we wrote
Was this study tied in any way to funding from Amazon?
Akimbo333 t1_j8p643a wrote
The huge performance boost from a mere 738M-parameter model appears to be due to it being multimodal, i.e. able to use not only text but images and other modalities as well.
Black_RL t1_j8d913k wrote
Can it solve aging already…..?
No…..?
Ok then……
GayHitIer t1_j8dcv5i wrote
Solving aging will probably be a slow progression from repairing your body to reversing aging.
Don't stress it; if Aubrey de Grey is onto something, it will happen sooner rather than later.
Black_RL t1_j8dd5np wrote
Hope so!
But some prefer to use this kind of tech to make more money instead of solving aging; I guess they're counting on spending it in the afterlife…..
GayHitIer t1_j8de3p8 wrote
If we make a cure for aging, it will probably get rolled out quickly, like the corona vaccine.
It would be stupid to deny people the chance to live longer.
Also, economically it would save countries a lot of money.
Black_RL t1_j8dedg7 wrote
It would save plenty of countries; some are aging really fast.
It's already a huge problem.
GayHitIer t1_j8dgbim wrote
I agree. The biggest problem is the pro-aging trance; people think death gives meaning to life.
Death needs to be put down for good. If anything, death itself takes meaning away.
Black_RL t1_j8dhwp0 wrote
100% agreed! Death is what makes everything meaningless!
People think that after they die, someone is going to judge them regarding their achievements?
Death is emptiness, death is the vacuum, there’s nothing before nor after, it’s the complete absence of consciousness, if there’s no consciousness, there’s no meaning.
GayHitIer t1_j8dicpl wrote
True. While death doesn't scare me, we need to put it in its place for good.
Consciousness is the most valuable thing in the universe.
If there's nobody to observe the universe it might as well not exist in the first place.
Black_RL t1_j8dimnt wrote
It doesn't scare me either; if I'm dead I don't exist, therefore I can't care.
Aging on the other hand…..
Agreed! Consciousness is what gives meaning to stuff! Meaning is a construct of consciousness.
earthsworld t1_j8dghso wrote
maybe /r/Futurology is a better sub for you?
GayHitIer t1_j8djtly wrote
I don't get the point of linking r/Futurology?
Why can't we just accept people have different outlooks on life?
Sure some of them are doomers, but most of them are just skeptics, which makes sense.
Singularity sometimes sounds like some techno cult; at least let us discuss with them.
Also, downvoting Black_RL without giving any reason or discussion is dumb.
turnip_burrito t1_j8byoth wrote
Surpasses humans in this science test, across the board (natural science, social science, language science, etc).
Wow.
And it outperforms GPT-3.5 with about 0.4% of the parameter count.
Wonder how it does on other tests?
Would this run on consumer graphics cards, then? Seems like it's in the ballpark to run on a single 3090, but without knowing the total requirements, I can't say.
Edit: "Our experiments are run on 4 NVIDIA Tesla V100 32G GPUs" - paper
Paper link: https://arxiv.org/pdf/2302.00923.pdf#page=7