HateRedditCantQuitit
HateRedditCantQuitit t1_j7nbrkz wrote
Reply to comment by EmbarrassedHelp in [N] Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement by Wiskkey
Not necessarily. If it turns out, for example, that language generation models trained on GPL code must be GPL, then it means that there's a possible path to more open models, if content creators continue creating copyleft content ecosystems.
HateRedditCantQuitit t1_j7l30f2 wrote
Reply to [N] Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement by Wiskkey
I hate Getty as much as anyone, but I'm going to go against the grain and hope they win this. Imagine if instead of Getty vs Stability, it was ArtStation vs Facebook or something. The same legal principles must apply.
In my ideal future, we'd have things like
- research use is free, but commercial use requires opt-in consent from content creators
- the community adopts open licenses, e.g. copyleft (if you use a GPL9000 dataset, the model must be GPL too, or whatever) or some other widely used opt-in license.
HateRedditCantQuitit t1_j7l1m4c wrote
Reply to comment by currentscurrents in [N] Getty Images sues AI art generator Stable Diffusion in the US for copyright infringement by Wiskkey
>If you can't train models on copyrighted data this means that they can't learn information from the web outside of specific openly-licensed websites like Wikipedia. This would sharply limit their usefulness.
That would be great. It could lead to a future with things like copyleft data, where if you want to train on open stuff, your model legally *must* be open.
HateRedditCantQuitit t1_j6upt7k wrote
Reply to comment by mongoosefist in [R] Extracting Training Data from Diffusion Models by pm_me_your_pay_slips
It's funny that the top comment right now is that it shouldn't be surprising, because whenever the legal argument comes in, the most common defense is that these models categorically don't memorize.
HateRedditCantQuitit t1_j647xm6 wrote
Reply to comment by madmax_br5 in [D] Moving away from Unicode for more equal token representation across global languages? by madmax_br5
I'm not sure how long you've been around, but before BPE came along, large vocabularies were actually quite a pain in the ass. You can find lots of literature around it from before maybe 2016 (can't remember the exact dates to look up, and I'm feeling lazy).
IIRC, a big issue was the final prediction layer. Say you're predicting a sequence 4k tokens long. Then you have 4k times vocab-size predictions. With a 50k-token vocab, that's 200M predictions in memory (roughly 1 gig with floats). Let's say we want to equally compress 20x more languages, so we get a 1M-token vocab (speaking super duper roughly), which means nearly 20GB just to represent the logits. If we wanted to handle a 40k-token sequence, it's the difference between 20GB and 200GB of logits.
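To make that arithmetic concrete, here's a rough back-of-the-envelope sketch (assuming float32 logits and batch size 1; the numbers are order-of-magnitude, matching the rounding above):

```python
# Rough memory footprint of the final logit tensor, assuming float32 (4 bytes per logit).
def logit_memory_gb(seq_len: int, vocab_size: int, bytes_per_logit: int = 4) -> float:
    return seq_len * vocab_size * bytes_per_logit / 1e9

print(logit_memory_gb(4_000, 50_000))      # ~0.8 GB: 4k-token sequence, 50k vocab
print(logit_memory_gb(4_000, 1_000_000))   # ~16 GB: same sequence, 1M-token vocab
print(logit_memory_gb(40_000, 1_000_000))  # ~160 GB: 40k-token sequence, 1M-token vocab
```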
That said, BPE just takes in sequences of simpler tokens. If you want to feed it Unicode, go ahead. If you want to feed it something else, that will work too. It seems like you're mostly frustrated that LLM investments are focused on English right now, which is valid. Tech investments in general have a strong Silicon Valley bias, and a zillion people want to recreate that elsewhere. But that's a very hard economic question.
HateRedditCantQuitit t1_j621uj8 wrote
Reply to comment by Zealousideal_Low1287 in [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
Isn't reconstructing the input exactly what the denoising objective does?
HateRedditCantQuitit t1_j60rtsa wrote
Reply to [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
This isn't the whole answer, but GANs are super hard to train, while diffusion models are an instance of much better-understood methods (MLE, score matching, variational inference). That leads to a few things:
- It's more reliable to converge (which leads to enthusiasm)
- It's easier to debug (which leads to progress)
- It's better understood (which leads to progress)
- It's simpler (which leads to progress)
- It's more modular (which leads to progress)
Hypothetically, it could even be that the best simple GAN is better than the best simple diffusion model, but it's easier to iterate on diffusion models, which means we'd still be better able to find the good ways to do diffusion.
tl;dr when I worked on GANs, I felt like a monkey hitting a computer with a wrench to make it work, while when I work on diffusion models, I feel like a mathematician deriving Right Answers™.
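To make "simpler" a bit more concrete, here's a minimal sketch of one diffusion training step in the usual noise-prediction parameterization (`model`, the optimizer, and `alphas_cumprod`, the noise schedule, are all placeholder names, not any particular codebase):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, x0, alphas_cumprod, optimizer):
    """One DDPM-style step: add noise at a random timestep, predict it, minimize MSE."""
    batch = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (batch,), device=x0.device)  # random timesteps
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(batch, *([1] * (x0.dim() - 1)))         # broadcast over x0's dims
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise                   # noised input
    loss = F.mse_loss(model(x_t, t), noise)                                # plain regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Compare that to tuning a generator/discriminator balance in a minimax game; most of the "monkey with a wrench" feeling lives in that gap.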
HateRedditCantQuitit t1_j60qzvg wrote
Reply to comment by dojoteef in [D] Why are GANs worse than (Latent) Diffusion Models for text2img generation? by TheCockatoo
I always see diffusion/score models contrasted against VAEs, but is there really a good distinction? Especially given latent diffusion and IAFs and all the other blurry lines. I feel like any time you're doing forward training & backwards inference trained with an ELBO objective, it should count as a VAE.
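For concreteness, the shared objective I have in mind is the standard ELBO (written for a generic latent z; diffusion models just instantiate it with a fixed noising chain as the approximate posterior):

```latex
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p(z)\big)
```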
HateRedditCantQuitit t1_j5r5f69 wrote
Reply to comment by [deleted] in [D] are two linear layers better than one? by alex_lite_21
You can represent any `m x n` matrix with the product of some `m x k` matrix with a `k x n` matrix, so long as k >= min(m, n). If k is less than that, you're basically adding regularization.
Imagine you have some optimal M in Y = M X. Then if A and B are the right shape (big enough in the k dimension), they can represent that M. If they aren't big enough, then they can't learn that M. If the optimal M doesn't actually need a zillion degrees of freedom, then having a small k bakes that restriction into the model, which would be regularization.
Look up linear bottlenecks.
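A quick numpy sketch of the rank argument (the shapes here are arbitrary examples):

```python
import numpy as np

m, n, k = 64, 32, 8              # bottleneck dimension k < min(m, n)
A = np.random.randn(m, k)
B = np.random.randn(k, n)
M = A @ B                        # two linear layers composed into a single matrix

print(np.linalg.matrix_rank(M))  # at most k (here 8), so M can't be an arbitrary 64x32 matrix
```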
HateRedditCantQuitit t1_j5r2kt5 wrote
If you have Y = A B X, then is M = A B full rank? If not, then they're not even equivalent.
HateRedditCantQuitit t1_j5hymmu wrote
Reply to [D] With more compute could it be easy to quickly un Mask all the people on Reddit by using text correlations to non masked publicly available text data? by Loquzofaricoalaphar
Could you? Probably, but with a nontrivial error rate. Should you? No, that would make you TA.
HateRedditCantQuitit t1_j42ogtm wrote
Reply to [D] Can someone point to research on determining usefulness of samples/datasets for training ML models? by HFSeven
Not exactly what you’re asking, but active learning has a lot to say on data point usefulness.
HateRedditCantQuitit t1_j3850b6 wrote
Reply to comment by Cpt_shortypants in [D] Is it a time to seriously regulate and restrict AI research? by Baturinsky
Bombs are just physics, but I'm glad we regulate them.
HateRedditCantQuitit t1_j2sinhq wrote
I think people often see this sort of p >> N data in genetics?
ESL II has a whole chapter on p >> N problems (ch 18) https://hastie.su.domains/ElemStatLearn/
HateRedditCantQuitit t1_j29yn1r wrote
First, learn foundations: linear algebra, vector calculus, probability, statistics. Try going through Kevin Murphy's books. They're relatively self-contained. If you reach a dependency that they don't cover, pick up a textbook on it.
HateRedditCantQuitit t1_j24zv0q wrote
Reply to comment by arcxtriy in [D] SOTA Multiclass Model Calibration by arcxtriy
You’re still implicitly saying that you’re 100% certain that it’s either a cat or a dog, which is wrong. If a horse picture has p(cat)=1e-5 and p(dog) = 1e-7, that should also be fine, right? And if you normalize those such that p(cat) + p(dog) = 1, you end up with basically p(cat)=1. Testing for (approximately) p(cat) = p(dog) when it can be neither is a messy way to go about doing calibration.
It’s just a long way of saying that having the probabilities not sum to one is fine.
HateRedditCantQuitit t1_j24cg3y wrote
Reply to comment by arcxtriy in [D] SOTA Multiclass Model Calibration by arcxtriy
If you train a model on a dataset of dogs and cats, then show it a picture of a horse, do you want p(dog)+p(cat) = 1?
HateRedditCantQuitit t1_j1v0fto wrote
Reply to comment by derpderp3200 in [D] Has any research been done to counteract the fact that each training datapoint "pulls the model in a different direction", partly undoing learning until shared features emerge? by derpderp3200
The whole SGD & optimizer field is kinda this. Think about how momentum and the problem you’re talking about interact, for a small example.
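As a tiny concrete example of that interaction, here's plain SGD with momentum (the names are illustrative, not from any specific library):

```python
def sgd_momentum_step(w, v, grad, lr=0.01, mu=0.9):
    """Momentum keeps a decaying running sum of recent per-example gradients, so a
    single datapoint that pulls in an odd direction only nudges the update, while
    directions shared across many datapoints accumulate."""
    v = mu * v + grad
    w = w - lr * v
    return w, v
```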
HateRedditCantQuitit t1_j1eq54d wrote
Take a look at Elicit. http://elicit.org/
It’s a more focused scope, but very effective at what it does.
HateRedditCantQuitit t1_j0jgtkj wrote
The root of your problem is that you are stuck assigning 100% confidence to the prediction that your horse is a cat or a dog. It’s perfectly rational that you might have p(cat)=1e-5 and p(dog)=1e-7 for a horse picture, right?
So when you normalize those, you get basically p(cat)=1.
Try binary classification of cat vs not cat and dog vs not dog. Don’t make them sum to one.
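A minimal sketch of that setup: independent binary heads with sigmoids instead of one softmax, so nothing forces the scores to sum to 1 (PyTorch-style; the names and sizes are made up):

```python
import torch
import torch.nn as nn

class OneVsRestHead(nn.Module):
    """One independent binary score per class: for an out-of-distribution input
    (e.g. a horse), p(cat) and p(dog) can both be tiny instead of being forced
    to sum to 1 the way softmax outputs are."""
    def __init__(self, feature_dim: int, num_classes: int):
        super().__init__()
        self.linear = nn.Linear(feature_dim, num_classes)

    def forward(self, features):
        return self.linear(features)  # raw per-class logits

head = OneVsRestHead(feature_dim=512, num_classes=2)  # classes: cat, dog
loss_fn = nn.BCEWithLogitsLoss()                      # independent binary losses, not softmax cross-entropy
probs = torch.sigmoid(head(torch.randn(1, 512)))      # per-class probabilities; no sum-to-1 constraint
```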
HateRedditCantQuitit t1_j0f3ilt wrote
Reply to comment by evil0sheep in [R] Talking About Large Language Models - Murray Shanahan 2022 by Singularian2501
If you give me a precise enough definition of what you mean by "understanding," we can talk, but otherwise we're not discussing what GPT does, we're just discussing how we think English ought to be used.
HateRedditCantQuitit t1_j0c0og4 wrote
This paper has some interesting points we might agree or disagree with, but the headline point seems important and much more universally agreeable:
We have to be much more precise in how we talk about these things.
For example, this comment section is full of people arguing whether current LLMs satisfy ill-defined criteria. It's a waste of time because it's just people talking past each other. To stop talking past each other, we should consider whether they satisfy precisely defined criteria.
HateRedditCantQuitit t1_j049e1g wrote
I just sent this to ChatGPT, and it worked fine:
> What are the locations present in the following sentence?
>
> “I flew from SF to NY today, with a layover in Blorpington.”
>
> Please respond in a JSON list of the form
>
> ```
> {
>   "locations": […]
> }
> ```
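If you wanted to do the same thing programmatically rather than in the chat UI, a rough sketch with the `openai` Python client might look like this (the model name, client version, and lack of error handling are all my assumptions, not part of the original comment):

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    "What are the locations present in the following sentence?\n\n"
    '"I flew from SF to NY today, with a layover in Blorpington."\n\n'
    'Please respond in a JSON object of the form {"locations": [...]}'
)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # hypothetical choice; any chat model would do
    messages=[{"role": "user", "content": prompt}],
)
print(json.loads(resp.choices[0].message.content)["locations"])  # e.g. ["SF", "NY", "Blorpington"]
```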
HateRedditCantQuitit t1_jcmdot7 wrote
Reply to comment by Spiritual-Reply5896 in [D] PyTorch 2.0 Native Flash Attention 32k Context Window by super_deap
I think of context as an end-to-end-connected version of retrieval. You can backprop from loss to retrieved info, but you also want to backprop from loss to the non-retrieved info, which would basically be equivalent to having it all in context (in a handwavy way). Which is to say that just having more context is a simple solution.
I think everyone knows increasing context length is not 100% sufficient, but it sure is a simple, convenient solution.