Submitted by ReadSeparate t3_10zcig2 in singularity

We’re all waiting for the day that a GPT-3 scale model is released which integrates text, video, images, and audio. We’ve seen some progress on this front, namely Gato, but nothing that has really wowed us yet like ChatGPT or LaMDA. PaLM is really the only exception to this rule, but it was images and text only.

I think we all know this is coming soon. I’m wondering if anyone here is aware of any indications of this actively being worked on, or has any predictions for release dates, especially for a video model.

A model which can take any combination of video, audio, image, and text tokens as input and produce any combination of them as output would most likely be remarkable, making ChatGPT look like a toy in comparison.

22

Comments


adt t1_j831ml0 wrote

There is an entire world outside of California...

Germany: Luminous 200B multimodal.

China: All of the ERNIE 260B cross-modal stuff.

Yeh, you need The Memo!

17

ReadSeparate OP t1_j8442mf wrote

This is exactly the comment I was looking for when I made this thread, thanks so much.

5

MysteryInc152 t1_j83uty8 wrote

Only the 17b and 30b models are multimodal. Still pretty good though for sure.

We also have some recent advances that ground frozen language models to images, namely BLIP-2 and FROMAGe.

3

Sashinii t1_j82ro4w wrote

They're still being developed. When they're ready, they'll be released to the general public (granted, probably not by the big companies, but there'll be open-source versions by Stability AI).

15

maskedpaki t1_j853khi wrote

ChatGPT will grow into a multimodal model, I'm guessing. They are updating it every couple of weeks and are charging real money now for Plus. It's going to take off really quickly.

1