Comments


maskedpaki t1_j8by3r7 wrote

This is like 2 weeks old. If it really does surpass GPT-3 with under a billion parameters, then why isn't this in the headlines?

11

turnip_burrito t1_j8byoth wrote

Surpasses humans in this science test, across the board (natural science, social science, language science, etc).

Wow.

And it outperforms GPT-3.5 with about 0.4% of the parameters.

Wonder how it does on other tests?

Would this run on consumer graphics cards, then? Seems like it's in the ballpark to run on a single 3090, but without knowing the total requirements, I can't say.

Edit: "Our experiments are run on 4 NVIDIA Tesla V100 32G GPU" - paper


Paper link: https://arxiv.org/pdf/2302.00923.pdf#page=7
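The 0.4% figure and the single-3090 guess are easy to sanity-check with back-of-the-envelope arithmetic. A sketch: 738M is the paper's large model, 175B is the commonly cited GPT-3.5 size, and 2 bytes/param assumes fp16 weights only, so this is a floor, not a real requirement:

```python
# Back-of-the-envelope check: parameter ratio and fp16 inference memory.
PARAMS_MM_COT = 738e6   # Multimodal-CoT (large), from the paper
PARAMS_GPT35 = 175e9    # GPT-3.5, commonly cited figure

ratio = PARAMS_MM_COT / PARAMS_GPT35
print(f"parameter ratio: {ratio:.2%}")   # ~0.42%

# Rough inference footprint at 2 bytes/param (fp16), weights only;
# activations and intermediate buffers add more on top of this.
gib = PARAMS_MM_COT * 2 / 2**30
print(f"fp16 weights: {gib:.1f} GiB")    # ~1.4 GiB, easily fits a 24 GB 3090
```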

85

Ok_Criticism_1414 OP t1_j8bywbq wrote

Because of the ChatGPT hype? Who knows. I think OpenAI already did it, they're just not showing it to the public. The main thing, I guess, is that Amazon took a different approach, integrating the two modalities by fine-tuning the model to be multimodal. You can read about it in the paper. Plus, it looks like language + visual context gives a huge boost. But that was already done by the Flamingo model, so I guess the first point is the crucial one.

2

bladecg t1_j8bznlz wrote

Maybe their model is just overfitting a lot to the test data? That’s always a thing in ML

36

NTIASAAHMLGTTUD t1_j8c0cj3 wrote

In the interest of skepticism, can anyone pour cold water on this, or is it as good as it sounds?

62

maskedpaki t1_j8c0eso wrote

I've seen so many things like this that end up surpassing GPT-3 on some narrow benchmark through more optimized prompting, rather than actually being a better model overall.

I hope I'm wrong this time.

17

el_chaquiste t1_j8c1z83 wrote

If I understand correctly, the input set (a science exam with solved exercises and detailed answers) is smaller than GPT-3.5's own, but the model outperforms GPT-3.5 and humans by some percent on problems similar to those in the exam, and by more when its training is multimodal and includes visual data.

I honestly don't know if we should get overly excited over this or not, but it seems like it would allow the creation of smaller models focused on particular scientific and technical domains, with better accuracy in their responses than generalist LLMs.

33

blueSGL t1_j8c26i1 wrote

> Experimental Settings

> As the Multimodal-CoT task requires generating the reasoning chains and leveraging the vision features, we use the T5 encoder-decoder architecture (Raffel et al., 2020). Specifically, we adopt UnifiedQA (Khashabi et al., 2020) to initialize our models in the two stages because it achieves the best fine-tuning results in Lu et al. (2022a). To verify the generality of our approach across different LMs, we also employ FLAN-T5 (Chung et al., 2022) as the backbone in Section 6.3. As using image captions does not yield significant performance gains in Section 3.3, we did not use the captions. We fine-tune the models up to 20 epochs, with a learning rate of 5e-5. The maximum input sequence length is 512. The batch sizes for the base and large models are 16 and 8, respectively. Our experiments are run on 4 NVIDIA Tesla V100 32G GPUs.

So the GPUs were used in training, there is nothing to say what the system requirements will be for inference.
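For reference, the quoted settings can be gathered into a plain fine-tuning config. Everything here comes from the paper's quoted numbers except `train_examples`, which is a hypothetical dataset size used only to illustrate the step-count math:

```python
# Fine-tuning settings as quoted from the paper, collected in one place.
config = {
    "backbone": "unifiedqa-t5",   # UnifiedQA init; FLAN-T5 also tested
    "learning_rate": 5e-5,
    "max_epochs": 20,
    "max_input_length": 512,
    "batch_size": {"base": 16, "large": 8},
}

# Hypothetical training-set size, just to show the step arithmetic.
train_examples = 12_000
steps_per_epoch = -(-train_examples // config["batch_size"]["base"])  # ceil div
print(steps_per_epoch)  # 750
```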

25

SoylentRox t1_j8cb8c0 wrote

Obviously the next question is "what happens if you give it 1 trillion parameters?" (That would be about 1,355 times as many params.)

9

SoylentRox t1_j8cblun wrote

Theoretically it should query a large number of models, and have a "confidence" based on how likely each model's answer is to be correct. Then return the most confident answer.
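A minimal sketch of that selection rule; the models and confidence scores below are hypothetical stand-ins, not a real API:

```python
# "Query several models, return the most confident answer."
def most_confident(answers):
    """answers: list of (answer, confidence) pairs from different models."""
    return max(answers, key=lambda pair: pair[1])[0]

responses = [
    ("Paris", 0.72),   # model A
    ("Paris", 0.91),   # model B
    ("Lyon",  0.40),   # model C
]
print(most_confident(responses))  # Paris
```

A majority vote weighted by confidence would be the natural next refinement when several models disagree.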

11

Red-HawkEye t1_j8cd8mp wrote

God damn, Amazon has entered the game.

Just when you think you had seen it all, an atomic bomb like this one gets announced.

This is the equivalent of the villain Black Frieza coming back in Dragon Ball and one-shotting the main characters.

Amazon's model one-shots ChatGPT and Google's LaMDA out of nowhere.

36

Yesyesnaaooo t1_j8cs7dl wrote

Feels like even if this isn't 'it' then the next stage is coming down the pipe really fucking soon.

If we simply consider the amount of human resources that went into reaching the moon or splitting the atom, both "time sensitive" pursuits, and then weigh that against the amount of human resources going into AI?

Well, the race is on, and there's more research going into this than ever went into either of those projects.

It's going to take less than a decade for a winner to emerge.

18

DarkCeldori t1_j8cz4sg wrote

While we're on the topic of consumer hardware, Ryzen AI (XDNA) seems promising, since it'll be able to make easy use of main system memory, which will soon easily reach 256GB. That can fit very large models, and inference is usually far less computationally intensive than training.

3

RahnuLe t1_j8cz9sc wrote

These passages really stuck out to me.

>We speculate that such a phenomenon of hallucination is due to a lack of necessary vision contexts for performing effective Multimodal-CoT. To inject vision information, a simple way is to transform the paired image into a caption (Lu et al., 2022a) and then append the caption in the input of both stages. However, as shown in Table 3, using captions only yields marginal performance gains (↑0.59%). Then, we explore an advanced technique by incorporating vision features into the language model. Concretely, we feed the paired image to the DETR model (Carion et al., 2020) to extract vision features. Then we fuse the vision features with the encoded language representations before feeding to the decoder (more details will be presented in Section 4). Interestingly, with vision features, the RougeL score of the rationale generation has boosted to 96.97% (QCM→R), which correspondingly contributes to better answer accuracy of 84.91% (QCMR→A).3 With those effective rationales, the phenomenon of hallucination is mitigated — 62.5% hallucination mistakes in Section 3.2 have been corrected (Figure 3(b)), as an example shown in Figure 2 (right part).4 The analysis so far compellingly shows that vision features are indeed beneficial for generating effective rationales and contributing to accurate answer inference.
>
>[...]
>
>Compared with existing UnifiedQA and GPT-3.5 methods that leverage image captions in the context to provide vision semantics, the results indicate that using image features is more effective.

It's clear that there's still a lot more to go in terms of representing the data in such a way that these networks can fully process them without factual errors, and this paper is a strong demonstration of the gains that can happen when you address this aspect specifically. Very promising stuff, and frankly, it's also kind of terrifying. Human obsolescence is frighteningly closer than I had already imagined...
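A rough sketch of the fusion idea described in the quote: cross-attention from text tokens over vision patches, followed by a gated sum. The shapes, random projections, and gating details here are illustrative stand-ins, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features: the real model uses T5 encoder states and DETR
# patch features; these are random placeholders of plausible shape.
text = rng.normal(size=(20, 64))    # (text tokens, hidden)
vision = rng.normal(size=(49, 64))  # (image patches, hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Cross-attention: each text token attends over the vision patches.
scores = text @ vision.T / np.sqrt(64)                      # (20, 49)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)                     # rows sum to 1
vision_attended = attn @ vision                             # (20, 64)

# Gated fusion: in a trained model the gate would be a learned linear
# layer; the projection matrices here are random placeholders.
W_t = rng.normal(size=(64, 64))
W_v = rng.normal(size=(64, 64))
gate = sigmoid(text @ W_t + vision_attended @ W_v)
fused = (1 - gate) * text + gate * vision_attended
print(fused.shape)  # (20, 64)
```

The fused sequence keeps the text shape, so it can be fed to the decoder in place of the text-only encoder output.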

10

Cryptizard t1_j8d7xv0 wrote

Keep in mind that the “human” line here is data they got from assigning the questions to randos on Mechanical Turk. It is not saying that their model is better than scientists. Still a really cool result, I just know how people here love to jump to insane conclusions.

4

Black_RL t1_j8d913k wrote

Can it solve aging already…..?

No…..?

Ok then……

−3

No_Ninja3309_NoNoYes t1_j8d9lii wrote

This is the principle of separation of concerns at work. Many focused capsules working together are stronger than one huge LLM. And it is easier to fine tune and prune individual capsules than one giant black box. It makes sense at inference time too. Eventually you could have a minimal model running locally with the sole purpose of figuring out which web services to contact for a given request.
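A toy sketch of such a local router. A real one would be a small classifier model; the keyword table here is purely hypothetical:

```python
# Pick which specialized service should handle a request; fall back
# to the generalist model when nothing matches.
SERVICES = {
    "code": ("def ", "traceback", "compile", "python"),
    "math": ("integral", "solve", "equation", "derivative"),
    "search": ("who is", "latest", "news"),
}

def route(request: str) -> str:
    text = request.lower()
    for service, keywords in SERVICES.items():
        if any(k in text for k in keywords):
            return service
    return "general"

print(route("Solve this equation for x"))  # math
print(route("Tell me a story"))            # general
```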

1

GayHitIer t1_j8dcv5i wrote

Solving aging will probably be a slow progression from repairing the body to reversing aging.

Don't stress it; if Aubrey de Grey is onto something, it will happen sooner rather than later.

3

GayHitIer t1_j8de3p8 wrote

If we make a cure for aging, it will probably get rolled out quickly, like the COVID vaccine.

It would be stupid to deny people the chance to live longer.

Economically, it would also save countries a lot of money.

3

GayHitIer t1_j8dgbim wrote

I agree. The biggest problem is the pro-aging trance; people think death gives meaning to life.

Death needs to be put down for good. If anything, death itself takes away meaning.

3

Black_RL t1_j8dhwp0 wrote

100% agreed! Death is what makes everything meaningless!

People think that after they die, someone is going to judge them regarding their achievements?

Death is emptiness, death is the vacuum, there’s nothing before nor after, it’s the complete absence of consciousness, if there’s no consciousness, there’s no meaning.

1

GayHitIer t1_j8dicpl wrote

True, though death doesn't scare me.

We need to put death in its place for good.

Consciousness is the most valuable thing in the universe.

If there's nobody to observe the universe it might as well not exist in the first place.

3

Black_RL t1_j8dimnt wrote

It doesn’t scare me either; if I’m dead, I don’t exist, therefore I can’t care.

Aging on the other hand…..

Agreed! Consciousness is what gives meaning to things! Meaning is a construct of consciousness.

2

GayHitIer t1_j8djtly wrote

I don't get linking r/Futurology?

Why can't we just accept people have different outlooks on life?

Sure some of them are doomers, but most of them are just skeptics, which makes sense.

Singularity sometimes sounds like some techno cult, at least let us discuss with them.

Also, downvoting Black_RL for no reason, without giving any explanation or discussion, is dumb.

0

user1342 t1_j8ds86t wrote

I wonder if the leap in Amazon's AI came from using GPT or something similar to help them make their new AI more efficient.

I always thought that was how we would see real acceleration, when the AI's start designing the next generation of themselves.

1

NapkinsOnMyAnkle t1_j8e4p9m wrote

I've trained CNNs of up to 200M parameters on my laptop's 3070 without any issues. I think it only has around 5GB of available VRAM.

This is a big concern of mine; AGI actually requires an insurmountable amount of VRAM to train in realistic timeframes thereby being essentially impossible. I mean, we could calculate all of these models by hand to train and then use to make predictions but it would take forever, like literally forever!
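For scale, a common rule of thumb for full fp32 Adam training (weights, gradients, and two optimizer moments, activations excluded) puts training memory at roughly 16 bytes per parameter:

```python
# Rough training-memory floor per parameter under fp32 Adam:
# weights (4) + gradients (4) + Adam first/second moments (8) = 16 bytes.
BYTES_PER_PARAM_TRAIN = 4 + 4 + 8

def train_gib(params: float) -> float:
    return params * BYTES_PER_PARAM_TRAIN / 2**30

for n in (200e6, 738e6):
    print(f"{n/1e6:.0f}M params -> ~{train_gib(n):.1f} GiB before activations")
# ~3.0 GiB for 200M, ~11.0 GiB for 738M
```

Mixed precision and gradient checkpointing push these numbers down considerably, which is why a 200M model can squeeze onto a 5 GB card in practice.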

3

SmoothPlastic9 t1_j8e9qrh wrote

Let me use it first, then, instead of relying on some benchmark test.

1

Borrowedshorts t1_j8erk8h wrote

It's as good as it sounds, and you can't really fake performance on a dataset such as this. Multimodal models will change the game. I don't think multimodal models by themselves are the end game, but they appear to be poised to take over state-of-the-art performance for the foreseeable future.

1

NarrowTea t1_j8f9l82 wrote

They just keep making them more efficient.

1

ReadSeparate t1_j8fb4cr wrote

One can easily imagine a generalist LLM outputting an action token which represents prompting the specialized LLM, which then gets routed to the specialized LLM, then the response is formatted and put into context by the generalist.

1

olivermyk t1_j8fnvwa wrote

the pdf isn’t accessible from the original link anymore. did anyone manage to download it? could you please re share?

1

maskedpaki t1_j8g2v6v wrote

No, actually, it seems like a pretty narrow science benchmark.

If you told me the MMLU 0-shot score was higher than 175B-parameter GPT-3.5 with under a billion parameters, then I'd be absolutely shocked.

1

PurpedSavage t1_j8ir1we wrote

Was this study tied in any way to funding from Amazon?

1

Akimbo333 t1_j8p643a wrote

The huge performance boost from a mere 738M-parameter model appears to be due to it being multimodal: it can use not only text but images and other modalities as well.

1