badabummbadabing t1_jdm1poy wrote
Well, if you apply all the tricks these smaller models use (to get decent performance) AND increase the parameter count, do you get an even better model? Who knows, "Open"AI might already be applying them.
The question is not: "Do fewer than 100B parameters suffice to get a model that performs 'reasonably' for a March 2023 observer?"
Chinchilla scaling rules tell us upper bounds on the number of parameters we can expect to still yield an improvement given the amount of available training data (PaLM, for instance, is too big), but even that only tells half of the story: How good can our models get if we make do with sub-optimal training efficiency (see LLaMA)? What is the influence of data quality/type? What if we (gasp) train multiple epochs on the same training set?
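As a rough back-of-envelope sketch of that upper-bound argument (the ~20-tokens-per-parameter figure is a common approximation of the Chinchilla result, not an exact rule, and the PaLM token count is approximate):

```python
def chinchilla_optimal_tokens(params, tokens_per_param=20):
    """Rough Chinchilla rule of thumb: compute-optimal training wants
    on the order of ~20 training tokens per model parameter."""
    return params * tokens_per_param

# PaLM (~540B parameters) would "want" ~10.8T tokens under this rule,
# far more than the ~780B tokens it was actually trained on -- i.e.,
# it is over-parameterized for its training data.
palm_params = 540e9
print(chinchilla_optimal_tokens(palm_params))  # -> 1.08e13
```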
badabummbadabing t1_jar3uab wrote
Reply to [N] EleutherAI has formed a non-profit by StellaAthena
Going forward, under which licences are you going to release your code/weights/data?
badabummbadabing t1_jajdjmr wrote
Reply to comment by jturp-sc in [D] OpenAI introduces ChatGPT and Whisper APIs (ChatGPT API is 1/10th the cost of GPT-3 API) by minimaxir
Honestly, I have become a lot more optimistic regarding the prospect of monopolies in this space.
When we were still in the phase of 'just add even more parameters', the future seemed to be headed that way. With Chinchilla scaling (and looking at results of e.g. LLaMA), things look quite a bit more optimistic. Consider that ChatGPT is reportedly much lighter than GPT-3. At some point, the availability of data will be the bottleneck (which is where an early market entry can help in building an advantage in collected data), whereas compute will become cheaper and cheaper.
The training costs lie in the low millions (10M was the cited number for GPT3), which is a joke compared to the startup costs of many, many industries. So while this won't be something that anyone can train, I think it's more likely that there will be a few big players (rather than a single one) going forward.
I think one big question is whether OpenAI can leverage user interaction for training purposes -- if that is the case, they can gain an advantage that will be much harder to catch up to.
badabummbadabing t1_ja8bzhc wrote
Reply to comment by LetterRip in [D] Training a UNet-like architecture for semantic segmentation with 200 outcome classes. by Scared_Employer6992
https://cardiacmr.hms.harvard.edu/files/cardiacmr/files/isensee_etal_nature2021_nnunet.pdf Check Figure 4. Architecture barely matters on average.
badabummbadabing t1_ja7yxbg wrote
Reply to comment by Scared_Employer6992 in [D] Training a UNet-like architecture for semantic segmentation with 200 outcome classes. by Scared_Employer6992
Don't use batch normalization. Lots of U-Nets use e.g. instance normalization instead. A batch size of 1 should be completely fine (but you will need to tune the learning rate when changing it). Check the 'no new U-Net' (nnU-Net) paper by Fabian Isensee for the definitive resource on what matters in U-Nets.
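A minimal PyTorch sketch of the swap (the layer sizes are made up for illustration): with batch size 1, batch statistics are meaningless, while InstanceNorm normalizes each sample and channel independently.

```python
import torch.nn as nn

# BatchNorm2d computes statistics across the batch dimension, which
# degenerates at batch size 1. InstanceNorm2d normalizes per sample
# and per channel, so it is unaffected by batch size.
block = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.InstanceNorm2d(64, affine=True),
    nn.ReLU(inplace=True),
)
```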
badabummbadabing t1_ja7yb9y wrote
Reply to [D] Training a UNet-like architecture for semantic segmentation with 200 outcome classes. by Scared_Employer6992
The problem might be the number of output channels at high resolution. Instead of computing the final layer's activations and gradients in parallel for all channels, you should be able to compute each channel's loss sequentially and accumulate the gradients. This works because the loss decomposes as a sum over the channels (and thus, so do the gradients).
In pytorch, this whole thing should then be as simple as running the backward pass for each channel of the final layer sequentially (before calling optimizer.step() and optimizer.zero_grad() once). You will probably also need to pass retain_graph=True on every backward call except the last, otherwise the activations in the preceding layers will be freed before you get to the next channel.
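A hypothetical PyTorch sketch of this idea (names like `model`, `loss_fn`, `x`, and `target` are illustrative placeholders, not from the original post):

```python
import torch

def channelwise_backward(model, loss_fn, x, target, optimizer):
    """Accumulate gradients one output channel at a time, so the loss
    (and its gradient buffers) are never materialized for all channels
    at once."""
    optimizer.zero_grad()
    logits = model(x)                      # shape (B, C, H, W)
    num_channels = logits.shape[1]
    for c in range(num_channels):
        loss_c = loss_fn(logits[:, c], target[:, c])
        # retain_graph keeps intermediate activations alive so the
        # next channel's backward pass can reuse the same graph.
        loss_c.backward(retain_graph=(c < num_channels - 1))
    optimizer.step()
```

Since gradients in PyTorch accumulate across backward calls by default, a single `optimizer.step()` at the end applies the summed per-channel gradients.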
badabummbadabing t1_j95kmxk wrote
Reply to comment by MysteryInc152 in [D] Toolformer implementation using only few-shot prompting by MysteryInc152
This is absolutely wild.
badabummbadabing t1_j76tfqt wrote
Reply to comment by jimmymvp in [D] Normalizing Flows in 2023? by wellfriedbeans
Fully agree with you from a technical perspective.
The difference is that at best, you only get the likelihood under your model of choice. If that happens to be a bad model of reality (which I'd argue is the case more often than not with NFs), you might be better off just using some approximate likelihood (or ELBO) of a more powerful model.
But I am not an expert in MCMC models, so I might be talking out of my depth here. I was mainly using these models for MAP estimation.
badabummbadabing t1_j6ypbxs wrote
Reply to comment by Imonfire1 in [N] Microsoft integrates GPT 3.5 into Teams by bikeskata
You mean you want to see more than the same random four people at once? I don't think there is a use case for that.
badabummbadabing t1_j6wfsok wrote
Reply to comment by jimmymvp in [D] Normalizing Flows in 2023? by wellfriedbeans
Exact likelihoods are what attracted me to normalizing flows once, too. But I soon found them too hard to train to yield any useful likelihoods. The bijectivity constraint (meaning that your 'latent' space is just as large as your data space) seems like too much of a restriction in practice. For my application, switching to variational models and just accepting that I'll only get lower bounds on the likelihood got me further in the end. Diffusion models would be a more 'modern' option in this regard as well.
Are you aware of any applications, where people actually use NFs for likelihoods? I am aware of some research papers, but I'd say that their experiments are too much of a contrived example to convince me that this will ever find its way into an actual application.
badabummbadabing t1_iu3ubql wrote
Reply to [D] Do companies actually care about their model's training/inference speed? by GPUaccelerated
My company cares about training time. We are iterating on one main model, and training a series of experiments in a shorter amount of time gives you a faster iteration cycle. I think many people underappreciate this. There are also critical training-time thresholds you may need to hit to make real use of it. For example, if your training time is on the order of 2 days or less, you may be able to get 2 (or even 3) iteration cycles in per week. A training time of 3 days reduces this to 1-2 iteration cycles per week. A training time of 4 days means that you can only realistically achieve 1 iteration cycle per week.
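The threshold effect above can be sketched as a back-of-envelope calculation (a rough floor-division model; real schedules also lose time to evaluation and weekends):

```python
def cycles_per_week(training_days, week_days=7):
    """Back-of-envelope: how many full train-and-evaluate iterations
    fit into one week, assuming experiments run back-to-back."""
    return week_days // training_days

print(cycles_per_week(2))  # -> 3
print(cycles_per_week(3))  # -> 2
print(cycles_per_week(4))  # -> 1
```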
Another way of thinking about this is that doubling your training speed also doubles the amount of hardware you have at your disposal, and halves the cost per experiment.
badabummbadabing t1_is4vg29 wrote
Reply to [D] Are GAN(s) still relevant as a research topic? or is there any idea regarding research on generative modeling? by aozorahime
GANs may be losing some ground to diffusion models in generative tasks, but the idea of playing an adversarial game with a learnable loss function is more general than generating pretty pictures.
badabummbadabing t1_ir9bv9x wrote
Reply to comment by fromnighttilldawn in [Discussion] Best performing PhD students you know by Light991
The Adam paper did have an error in its convergence proof (which was later rectified by other people). But
- that proof was only applicable to convex cost functions anyway (general convergence proofs in this sense are impossible for nonconvex problems like neural net training)
- Adam is literally the most used optimizer for neural network training; it would be crazy to deny its significance due to a technical error in a proof in an irrelevant (for this application) regime
Regarding "whatever Hinton was doing": Are you talking about RMSprop? Sure, it's another momentum optimizer. There are many of them.
badabummbadabing t1_je9cdf7 wrote
Reply to comment by Nhabls in [D] Training a 65b LLaMA model by Business-Lead2679
They just had their Series B funding, they should upscale their resources soon.