HateRedditCantQuitit

HateRedditCantQuitit t1_jcmdot7 wrote

I think of context as an end-to-end connected version of retrieval. You can backprop from loss to retrieved info, but you also want to backprop from loss to the non-retrieved info, which would basically be equivalent to having it all in context (in a handwavy way). Which is to say that just having more context is a simple solution.

I think everyone knows increasing context length is not 100% sufficient, but it sure is a simple, convenient solution.

3

HateRedditCantQuitit t1_j7l30f2 wrote

I hate getty as much as anyone, but I'm going to go against the grain and hope they win this. Imagine if instead of getty vs stability, it was artstation vs facebook or something. The same legal principles must apply.

In my ideal future, we'd have things like

- research use is free, but commercial use requires opt-in consent from content creators

- the community adopts open licenses, e.g. copyleft (if you use a GPL9000 dataset, the model must be GPL too, or whatever) or some other widely used opt-in license.

5

HateRedditCantQuitit t1_j7l1m4c wrote

>If you can't train models on copyrighted data this means that they can't learn information from the web outside of specific openly-licensed websites like Wikipedia. This would sharply limit their usefulness.

That would be great. It could lead to a future with things like copyleft data, where if you want to train on open stuff, your model legally *must* be open.

1

HateRedditCantQuitit t1_j647xm6 wrote

I'm not sure how long you've been around, but before BPE came along, large vocabularies were actually quite a pain in the ass. You can find lots of literature around it from before maybe 2016 (I can't remember the exact dates and I'm feeling too lazy to look them up).

IIRC, a big issue was the final prediction layer. Say you're predicting a sequence 4k tokens long. Then you have 4k times vocab-size predictions. With a 50k-token vocab, that's 200M predictions in memory (roughly 1 gig with floats). Let's say we want to equally compress 20x more languages, so we get 1M tokens (speaking super duper roughly), which means nearly 20GB just to represent the logits. If we wanted to handle a 40k-long sequence, it's the difference between 20GB and 200GB of logits.
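Back-of-the-envelope version of that math (assuming float32, so 4 bytes per logit; exact numbers depend on precision and implementation):

```python
def logit_bytes(seq_len, vocab_size, bytes_per_float=4):
    # One logit per (position, vocab entry) pair, float32 by default.
    return seq_len * vocab_size * bytes_per_float

gb = 1e9
print(logit_bytes(4_000, 50_000) / gb)       # ~0.8  -> "roughly 1 gig"
print(logit_bytes(4_000, 1_000_000) / gb)    # ~16   -> "nearly 20GB"
print(logit_bytes(40_000, 1_000_000) / gb)   # ~160  -> the "200GB" ballpark
```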

That said, BPE just takes in sequences of simpler tokens. If you want to feed it unicode, go ahead. If you want to feed it something else, that will work too. It seems like you're mostly frustrated that LLM investments are focused on English right now, which is valid. Tech investments in general have a strong Silicon Valley bias, and a zillion people want to recreate that elsewhere. But that's a very hard economic question.

1

HateRedditCantQuitit t1_j60rtsa wrote

This isn't the whole answer, but GANs are super hard to train, while diffusion models are an instance of much better understood methods (MLE, score matching, variational inference). That leads to a few things:

- Training converges more reliably (which leads to enthusiasm)

- It's easier to debug (which leads to progress)

- It's better understood (which leads to progress)

- It's simpler (which leads to progress)

- It's more modular (which leads to progress)

Hypothetically, it could even be that the best simple GAN is better than the best simple diffusion model, but it's easier to iterate on diffusion models, which means we'd still be better able to find the good ways to do diffusion.

tl;dr when I worked on GANs, I felt like a monkey hitting a computer with a wrench to make it work, while when I work on diffusion models, I feel like a mathematician deriving Right Answers™.
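To make the "mathematician deriving Right Answers" point concrete, here's a minimal sketch of a DDPM-style training loss in PyTorch. The noise schedule and the `model(x_t, t)` signature are made-up placeholders, not a tuned recipe; the point is that the whole objective is a plain MSE regression with no adversarial game anywhere.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, T=1000):
    """One training step's loss for a noise-prediction diffusion model."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)   # random timesteps
    noise = torch.randn_like(x0)                       # the regression target
    # Toy linear "alpha-bar" schedule, purely illustrative.
    alpha_bar = (1.0 - (t.float() + 1) / T).view(b, *([1] * (x0.dim() - 1)))
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    # Plain supervised regression: predict the noise that was added.
    return F.mse_loss(model(x_t, t), noise)
```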

60

HateRedditCantQuitit t1_j60qzvg wrote

I always see diffusion/score models contrasted against VAEs, but is there really a good distinction? Especially given latent diffusion and IAFs and all the other blurry lines. I feel like any time you're doing forward training & backwards inference trained with an ELBO objective, it should count as a VAE.

3

HateRedditCantQuitit t1_j5r5f69 wrote

You can represent any `m x n` matrix as the product of an `m x k` matrix and a `k x n` matrix, as long as `k >= min(m, n)`. If `k` is less than that, you're basically adding regularization.

Imagine you have some optimal `M` in `Y = M X`. Then if `A` and `B` (with `M ≈ A B`) are the right shape (big enough in the `k` dimension), they can represent that `M`. If they aren't big enough, then they can't learn that `M`. If the optimal `M` doesn't actually need a zillion degrees of freedom, then having a small `k` bakes that restriction into the model, which would be regularization.

Look up linear bottlenecks.
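A quick numpy illustration (the sizes are arbitrary): with `k >= min(m, n)` the factorization can be exact, and with a smaller `k` you get the best rank-`k` approximation instead.

```python
import numpy as np

m, n = 64, 32
M = np.random.randn(m, n)

def factor(M, k):
    # Truncated SVD gives the best rank-k factorization A @ B of M.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :k] * s[:k]   # m x k
    B = Vt[:k, :]          # k x n
    return A, B

for k in (8, 32):
    A, B = factor(M, k)
    err = np.linalg.norm(M - A @ B) / np.linalg.norm(M)
    print(k, err)  # k=8: nonzero error (the bottleneck regularizes); k=32: ~0 (exact)
```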

3

HateRedditCantQuitit t1_j24zv0q wrote

You’re still implicitly saying that you’re 100% certain that it’s either a cat or a dog, which is wrong. If a horse picture has p(cat)=1e-5 and p(dog) = 1e-7, that should also be fine, right? And if you normalize those such that p(cat) + p(dog) = 1, you end up with basically p(cat)=1. Testing for (approximately) p(cat) = p(dog) when it can be neither is a messy way to go about doing calibration.

It’s just a long way of saying that having the probabilities not sum to one is fine.
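Concretely, the arithmetic above in two lines:

```python
p_cat, p_dog = 1e-5, 1e-7        # both tiny: it's probably neither
print(p_cat / (p_cat + p_dog))   # ~0.99, i.e. "basically p(cat)=1" after normalizing
```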

4

HateRedditCantQuitit t1_j0jgtkj wrote

The root of your problem is that you are stuck assigning 100% confidence to the prediction that your horse is a cat or a dog. It’s perfectly rational that you might have p(cat)=1e-5 and p(dog)=1e-7 for a horse picture, right?

So when you normalize those, you get basically p(cat)=1.

Try binary classification of cat vs not cat and dog vs not dog. Don’t make them sum to one.
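A minimal PyTorch sketch of that setup (the feature size and batch are made up): two independent sigmoid/BCE heads instead of one softmax, so a horse can legitimately score low on both.

```python
import torch
import torch.nn as nn

head = nn.Linear(512, 2)              # "cat" and "dog" logits, assuming 512-d features
criterion = nn.BCEWithLogitsLoss()    # independent binary cross-entropy per label

features = torch.randn(4, 512)        # dummy batch
targets = torch.tensor([[1., 0.],     # cat
                        [0., 1.],     # dog
                        [0., 0.],     # horse: neither, and that's allowed
                        [1., 1.]])    # even "both" is representable
loss = criterion(head(features), targets)
probs = torch.sigmoid(head(features)) # per-label probabilities, no sum-to-1 constraint
```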

5

HateRedditCantQuitit t1_j0c0og4 wrote

This paper has some interesting points we might agree or disagree with, but the headline point seems important and much more universally agreeable:

We have to be much more precise in how we talk about these things.

For example, this comment section is full of people arguing whether current LLMs satisfy ill-defined criteria. It’s a waste of time because it’s just people talking past each other. To stop talking past each other, we should consider whether they satisfy precisely defined criteria.

9

HateRedditCantQuitit t1_j049e1g wrote

I just sent this to chatgpt, and it worked fine:


>What are the locations present in the following sentence?
>
>“I flew from SF to NY today, with a layover in Blorpington.”
>
>Please respond in a JSON list of the form
>
>```
>{
>  “locations”: […]
>}
>```
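If you wanted to do the same thing programmatically, something like this should work: a sketch using the `openai` Python client (v1-style interface); the model name and the assumption that the reply is bare JSON are mine, not guaranteed.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
prompt = (
    "What are the locations present in the following sentence?\n\n"
    '"I flew from SF to NY today, with a layover in Blorpington."\n\n'
    'Please respond in a JSON list of the form\n\n{\n  "locations": [...]\n}'
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # assumption: any chat model would do here
    messages=[{"role": "user", "content": prompt}],
)
locations = json.loads(resp.choices[0].message.content)["locations"]
print(locations)  # hopefully ["SF", "NY", "Blorpington"]
```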

10