MysteryInc152 OP t1_jaccf9c wrote on February 28, 2023 at 12:31 PM

#2,100,650

>A big convergence of language, multimodal perception, action, and world modeling is a key step toward artificial general intelligence. In this work, we introduce Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). Specifically, we train Kosmos-1 from scratch on web-scale multimodal corpora, including arbitrarily interleaved text and images, image-caption pairs, and text data. We evaluate various settings, including zero-shot, few-shot, and multimodal chain-of-thought prompting, on a wide range of tasks without any gradient updates or finetuning. Experimental results show that Kosmos-1 achieves impressive performance on (i) language understanding, generation, and even OCR-free NLP (directly fed with document images), (ii) perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and (iii) vision tasks, such as image recognition with descriptions (specifying classification via text instructions). We also show that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. In addition, we introduce a dataset of Raven IQ test, which diagnoses the nonverbal reasoning capability of MLLMs.

blackkettle t1_jace0gh wrote on February 28, 2023 at 12:47 PM

#2,101,141

We’re moving fast now…

abnormal_human t1_jacjmrj wrote on February 28, 2023 at 1:37 PM

#2,103,004

Am I reading right that this is a 1.6B parameter model?

MysteryInc152 OP t1_jacjpk9 wrote on February 28, 2023 at 1:38 PM

#2,103,027

Replying to abnormal_human (#2,103,004)

Yeah

farmingvillein t1_jacq4fn wrote on February 28, 2023 at 2:28 PM

#2,105,092

The language-only performance was pretty meh, comparing the versions with and without images. We'll have to see whether scale up helps here (other research suggests yes?... But still need to see proof).

MysteryInc152 OP t1_jacswnq wrote on February 28, 2023 at 2:48 PM

#2,105,924

Replying to farmingvillein (#2,105,092)

There's pretty much no way it won't scale up.

zykezero t1_jacvr1g wrote on February 28, 2023 at 3:08 PM

#2,106,873

Finally kosmos has arrived. We need her help to fight the gnosis.

bigfish_in_smallpond t1_jacwddt wrote on February 28, 2023 at 3:12 PM

#2,107,059

Replying to blackkettle (#2,101,141)

The Internet is the primordial soup for agi

[deleted] t1_jacx9ai wrote on February 28, 2023 at 3:18 PM

#2,107,348

Replying to abnormal_human (#2,103,004)

That’s about x100 less than what I’d expected.

[deleted] t1_jacygxs wrote on February 28, 2023 at 3:26 PM

#2,107,714

Any idea when we will be able to use the model?

deliciously_methodic t1_jad1h8m wrote on February 28, 2023 at 3:46 PM

#2,108,641

Replying to MysteryInc152 (#2,105,924)

What does “scale up” mean in this context? I use “scale up” in a ML hardware context vs “scale out” to represent “making a cpu/GPU more powerful” vs “adding more gpus”, but I’m not clear if the analogy is used for AI models, scaling up and out. Or if you simply mean, “the model will get bigger”

MysteryInc152 OP t1_jad4h86 wrote on February 28, 2023 at 4:06 PM

#2,109,671

Replying to deliciously_methodic (#2,108,641)

I just mean a bigger model, that is more parameters.

Beli_Mawrr t1_jad4r9n wrote on February 28, 2023 at 4:08 PM

#2,109,764

Replying to [deleted] (#2,107,348)

That's almost in the realm of my computer can run it, no?

RetroPenguin_ t1_jad51qy wrote on February 28, 2023 at 4:10 PM

#2,109,855

Replying to abnormal_human (#2,103,004)

For the >10B closed source models, I’d be really curious how many of those weights are zero with fp16 precision.

abnormal_human t1_jad6qae wrote on February 28, 2023 at 4:21 PM

#2,110,401

Replying to Beli_Mawrr (#2,109,764)

Yeah, probably.

curiousshortguy t1_jad9s4t wrote on February 28, 2023 at 4:40 PM

#2,111,354

Replying to Beli_Mawrr (#2,109,764)

it is, you can probably do 2 to 8 billion on your average gaming pc, and 16 on a high end one

[deleted] t1_jadcm6k wrote on February 28, 2023 at 4:58 PM

#2,112,292

[removed]

dancingnightly t1_jadj7fa wrote on February 28, 2023 at 5:40 PM

#2,114,338

Replying to Beli_Mawrr (#2,109,764)

Edit: Seems like for this one yes. They do consider human instructions (similarish to the goal of a RLHF which requires more RAM), by adding them directly in the text dataset, as mentioned in 3.3 Language-Only Instruction Tuning-

For other models, like OpenAssistant coming up, one thing to note is that, although the generative model itself may be runnable locally, the reward model (the bit that "adds finishing touches" and ensures following instructions) can be much bigger. Even if the GPT-J underlying model is 11GB on RAM and 6B params, the RLHF could seriously increase that.

This models is in the realm of the smaller T5, BART and GPT-2 models released 3 years ago and runnable then on decent gaming GPUs

[deleted] t1_jadkcqd wrote on February 28, 2023 at 5:47 PM

#2,114,715

Replying to Beli_Mawrr (#2,109,764)

[deleted]

1azytux t1_jadmvbe wrote on February 28, 2023 at 6:03 PM

#2,115,486

can we download the model weights? is it open sourced? or maybe perform zero shot tasks by ourselves?

1azytux t1_jadp0aa wrote on February 28, 2023 at 6:17 PM

#2,116,131

Replying to [deleted] (#2,107,714)

do you know which foundation models we can use though, or are open sourced? It seems like every other model is either not available or their weights aren't released yet. It's case with, CoCa, Florence, Flamingo, BEiT3, FILIP, ALIGN. I was able to find weights for ALBEF.

farmingvillein t1_jadqg1l wrote on February 28, 2023 at 6:26 PM

#2,116,633

Replying to MysteryInc152 (#2,105,924)

You're missing the point here, or I wasn't clear--the question isn't whether performance will improve with more params (and potentially) data; no doubt there.

The question is whether a model trained at scale on text & images will outperform a model trained at scale solely on text, in the text-only domain (or similarly, the image-only).

To-date, all* of the public research (and Kosmos is no different) on multimodal models have showed, at best, multimodal models generally performing equal to unimodal variants in unimodal domains. And often they are a shade worse (like Kosmos).

(*=unless you count code+natural language.)

The holy grail, of course, is that the two help one another, so that your multimodal variant outperforms the unimodal variants on unimodal tasks. GPT-* gets better at talking to you because it has ingested all of the Youtube videos in the world, e.g.

If you can demonstrate that (and it certainly makes intuitive human sense that this could/should be true), then of course there is a giant truckload of image (including video!) and audio data you can slam into your text models to make text-based scenarios better (and similarly for images, etc.). (And it also more plausibly suggests that massive amounts of synthetic world exploration data could be accretive, too...)

There is a bunch of research (https://arxiv.org/abs/2301.03728 being one of the most exciting) suggesting that this can occur, with enough data/params, but no one has publicly demonstrated it. (And it'd surprise no one, probably, if this was part of GPT-4's or Gato-2's mix.)

farmingvillein t1_jadt897 wrote on February 28, 2023 at 6:43 PM

#2,117,540

Replying to deliciously_methodic (#2,108,641)

FWIW, I was trying to make a more subtle point than OP's response--see my other reply.

currentscurrents t1_jadte26 wrote on February 28, 2023 at 6:44 PM

#2,117,591

Replying to 1azytux (#2,116,131)

T5 and Flan-T5 have weights available.

[deleted] t1_jadxu2n wrote on February 28, 2023 at 7:12 PM

#2,119,029

Replying to 1azytux (#2,116,131)

I mean...

Google

Microsoft

ReasonablyBadass t1_jae7zhu wrote on February 28, 2023 at 8:16 PM

#2,122,401

Can't read the paper right now, can someone summarize: is it a new model or "just" the standard transformers but used on multi modal data? if it is new, what are the strucutral changes?

[deleted] t1_jaejynm wrote on February 28, 2023 at 9:33 PM

#2,126,846

Replying to dancingnightly (#2,114,338)

[removed]

[deleted] t1_jaekrcv wrote on February 28, 2023 at 9:38 PM

#2,127,170

[removed]

7734128 t1_jaemc4b wrote on February 28, 2023 at 9:49 PM

#2,127,804

Replying to RetroPenguin_ (#2,109,855)

Doesn't really change anything, does it? A zero still has an effect, so it has to be there, so I assume you mean that it could use less memory, right? But is that technically feasible to do in a practical manner? I can't imagine a practical way to have a tensor of split precision weights without ruinous reprocessing when trying to use the weights.

karius85 t1_jaeoyq7 wrote on February 28, 2023 at 10:06 PM

#2,128,884

Replying to 7734128 (#2,127,804)

Sparse matrices, but you would need quite a lot of zeros.

pawsibility t1_jaep5s5 wrote on February 28, 2023 at 10:07 PM

#2,128,966

Replying to abnormal_human (#2,103,004)

> The MLLM component has 24 layers with 2,048 hidden dimensions, 8,192 FFN intermediate size, and 32 attention heads, resulting in about 1.3B parameters. We use Magneto’s initialization for optimization stability. For faster convergence, the image representation is obtained from a pretrained CLIP ViT-L/14 model with 1,024 feature dimensions. The images are preprocessed into 224×224 resolution during training. We freeze the parameters of the CLIP model except for the last layer during training. The total number of parameters of KOSMOS-1 is about 1.6B.

If they use CLIP to generate image representations/embeddings as input to their model, isn't that kind of cheating when reporting numbers of parameters? Or is CLIP sufficiently small, and that's how they jumped from 1.3B to 1.6B?

AnOnlineHandle t1_jaeshwf wrote on February 28, 2023 at 10:30 PM

#2,130,377

Replying to curiousshortguy (#2,111,354)

Is there a way to convert parameter count into vram requirements? Presuming that's the main bottleneck?

AnOnlineHandle t1_jaesse4 wrote on February 28, 2023 at 10:32 PM

#2,130,498

Replying to pawsibility (#2,128,966)

The CLIP model in the Stable Diffusion 1.5 package is 480mb according to my directory where it was unpackaged by diffusers, though I don't know how that translate into parameter count.

currentscurrents t1_jaetvbb wrote on February 28, 2023 at 10:39 PM

#2,130,965

Replying to Beli_Mawrr (#2,109,764)

Definitely in the realm of running on your computer. Almost in the realm of running on high-end smartphones with TPUs.

currentscurrents t1_jaetyg1 wrote on February 28, 2023 at 10:40 PM

#2,130,994

Replying to dancingnightly (#2,114,338)

Can't the reward model be discarded at inference time? I thought it was only used for fine-tuning.

[deleted] t1_jaeu7ev wrote on February 28, 2023 at 10:42 PM

#2,131,087

Replying to AnOnlineHandle (#2,130,377)

[removed]

metal079 t1_jaeuymi wrote on February 28, 2023 at 10:47 PM

#2,131,396

Replying to AnOnlineHandle (#2,130,377)

Rule of thumb is vram needed = 2x per billion parameters, though I recall pygamillion which is 6B says it needs 16GB of ram so it depends.

lechatsportif t1_jaev941 wrote on February 28, 2023 at 10:49 PM

#2,131,541

Replying to bigfish_in_smallpond (#2,107,059)

Quote of the century.

curiousshortguy t1_jaf3aab wrote on February 28, 2023 at 11:47 PM

#2,135,591

Replying to AnOnlineHandle (#2,130,377)

Yeah, about 2-3. You can easily shove layers of the networks on disk, and then load even larger models that don't fit in vram BUT disk i/o will make inference painfully slow.

new_name_who_dis_ t1_jaf4lmy wrote on February 28, 2023 at 11:56 PM

#2,136,343

Replying to AnOnlineHandle (#2,130,377)

Each float32 is 4 bytes.

Comments