Submitted by Vegetable-Skill-9700 t3_121a8p4 in MachineLearning
Databricks' open-source LLM, Dolly, performs reasonably well on many instruction-following tasks while being ~25x smaller than GPT-3, challenging the notion that bigger is always better.
From my personal experience, the quality of a model depends a lot on the fine-tuning data rather than just sheer size. If you choose your fine-tuning data carefully, you can tune a smaller model to outperform the state-of-the-art GPT-X on your task. The future of LLMs might look a lot more open-source than we imagined three months ago.
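For concreteness, here is a minimal sketch of the kind of supervised fine-tuning I mean: a small base model tuned on a curated instruction dataset with LoRA adapters. This is not the actual Dolly training recipe; the base model (`EleutherAI/gpt-j-6b`), the dataset (`databricks/databricks-dolly-15k`), and the hyperparameters are illustrative assumptions, and it assumes Hugging Face `transformers`, `peft`, and `datasets` are installed.

```python
# Sketch: instruction fine-tuning a small model on a curated dataset with LoRA.
# Model/dataset names and hyperparameters are illustrative, not Dolly's recipe.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base_model = "EleutherAI/gpt-j-6b"  # ~6B params, roughly 25-30x smaller than GPT-3
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Wrap the base model with LoRA adapters so only a small fraction of weights train.
model = get_peft_model(
    model,
    LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, task_type="CAUSAL_LM"),
)

# A curated instruction dataset -- the quality of these pairs matters more than size.
data = load_dataset("databricks/databricks-dolly-15k", split="train")

def to_prompt(example):
    # Format each record as an instruction/response pair and tokenize it.
    text = (f"### Instruction:\n{example['instruction']}\n\n"
            f"### Response:\n{example['response']}")
    return tokenizer(text, truncation=True, max_length=512)

tokenized = data.map(to_prompt, remove_columns=data.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="small-instruct-sft",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-4, fp16=True),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM labels (next-token prediction).
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point is that the interesting lever here is the dataset passed to `load_dataset`, not the parameter count of `base_model`.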
Would love to hear everyone's opinions on how they see the future of LLMs evolving. Will it be a few players (e.g., OpenAI) cracking AGI and conquering the whole world, or a lot of smaller open-source models that ML engineers fine-tune for their own use cases?
P.S. I am kinda betting on the latter and building UpTrain, an open-source project that helps you collect that high-quality fine-tuning dataset.
wojapa t1_jdl23pj wrote
Did they use RLHF?