We’re all waiting for the day that a GPT-3 scale model is released which integrates text, video, images, and audio. We’ve seen some progress on this front, namely Gato, but nothing that has really wowed us yet like ChatGPT or LaMDA. PaLM is really the only exception to this rule, but it was images and text only.

I think we all know this is coming soon. I’m wondering if anyone here is aware of any indications that this is actively being worked on, or has any predictions for release dates, especially for a video model.

A model which can take any combination of video, audio, image, and text tokens as input and output would most likely be very, very remarkable, making ChatGPT look like a toy in comparison.

Comments

adt t1_j831ml0 wrote on February 11, 2023 at 6:49 AM

There is an entire world outside of California...

Germany: Luminous 200B multimodal.

China: All of the ERNIE 260B cross-modal stuff.

Yeh, you need The Memo!

ReadSeparate OP t1_j8442mf wrote on February 11, 2023 at 2:33 PM

This is exactly the comment I was looking for when I made this thread, thanks so much

MysteryInc152 t1_j83uty8 wrote on February 11, 2023 at 1:13 PM

Only the 17b and 30b models are multimodal. Still pretty good though for sure.

We also have some recent advances that ground frozen language models to images, namely BLIP-2 and FROMAGe.

Akimbo333 t1_j8433cl wrote on February 11, 2023 at 2:25 PM

Wow!

Sashinii t1_j82ro4w wrote on February 11, 2023 at 4:57 AM

They're still being developed. When they're ready, they'll be released to the general public (granted, probably not by the big companies, but there'll be open source versions by Stability AI).

maskedpaki t1_j853khi wrote on February 11, 2023 at 6:16 PM

ChatGPT will grow into a multimodal model, I'm guessing. They are updating every couple of weeks and are charging real money now for Plus. It's going to take off really quick.

MysteryInc152 t1_j85rgjx wrote on February 11, 2023 at 9:02 PM

Recently two papers were released that dealt with making frozen LLMs multimodal (with code and models released).

BLIP-2 - https://arxiv.org/abs/2301.12597 https://huggingface.co/spaces/Salesforce/BLIP2

And FROMAGe - https://arxiv.org/abs/2301.13823 https://github.com/kohjingyu/fromage