suflaj

suflaj t1_j3haky4 wrote

Why would it be used? It doesn't begin to compare to CUDA and cuDNN. Nothing really does. And Vulkan specifically is made for graphics pipelines, not for general-purpose compute. To stay cross-compatible, it usually offloads compute to the CPU.

It's not that there is a conspiracy to use proprietary NVIDIA software - there just isn't anything better than it.

13

suflaj t1_j3emtbh wrote

> I would love to know how to do this! I can run GPT2 locally, and that would be fantastic level of zero-shot learning to be able to play around with.

It depends on how much you can compress the prompts. GPT2 is severely limited by memory, which means you would need to train it on already condensed prompts. But in principle it has the same (albeit far less refined) capabilities as ChatGPT.

> But it still followed my instructions

Well, it turns out that following instructions can be reduced to a symbol manipulation task. Again, you're giving it too much credit. I do agree that it is broad, but it is not as broad as Google or Wikipedia, which is what I'd take as representing humanity, I guess.

> it is successfully generalizing zero-shot to new NLP tasks.

As are lesser models. Transformer-based models are fairly successful at this; we have hypothesized it since GPT2 and confirmed it with GPT3. One caveat: technically it generalized few-shot to a new NLP task. On genuinely zero-shot problems it generally hallucinates or states that it doesn't know. Ask it, for example, what a "gebutzeripanim" is. I made that up just now.

As for the task you gave it, you cannot claim it is zero-shot, since you cannot prove its components were not in the training data. Unless you want to say you're pretty sure the exact prompt you gave it was not in the data - but that applies to all generative models; that's what generalization is. And there are tasks it fails on because it simply cannot do some things. Ask it to integrate or differentiate certain functions and you'll quickly see what I mean.

It can tell you all you want to know about integration, and it can recite all the rules perfectly, but it simply cannot apply them reliably.

2

suflaj t1_j3ek7d0 wrote

> This is a complex and poorly-defined task

Not at all. First of all, ChatGPT does not understand complexity; it would do you well not to think of it as if there were some hierarchy of tasks. Secondly, there is no requirement for the task to be well defined. From what I could gather, ChatGPT requires you to convince it that it is not giving out an opinion, and then it can hallucinate pretty much anything.

Specifically, the task you gave it is likely implicitly present in the dataset, in the sense that the dataset allowed the model to learn the connections between the words you gave it. I hate to burst your bubble, but the task is achievable even with GPT2, a much less expressive model, since it can be represented as a prompt.

It will be easier to see the shortcomings there, but to put it simply, ChatGPT has them too, ex. by default it does not, in the general case, differentiate between uppercase and lowercase letters even when that is relevant for the task. Such things are too subtle for it. Once you realize the biases it has in this regard, you begin to see through the cracks. Or, more generally, give it a counting task: it says it can count, but it is not always successful at it.

What is fascinating is the amount of memory ChatGPT has. Compared to other models, it is very large. But it is limited, and it is not preserved outside of the session.

I would say that the people hyping it up probably just do not understand it that well. LLMs are fascinating, yes, but it's not ChatGPT specifically - it's how malleable the knowledge is. I would advise you not to dig too deep into how it works, because then the magic stays alive. I had a lot of fun for the first week I was using it, but nowadays I barely use it at all.

I would also advise you to approach it more critically. Start by looking into how blatantly racist and sexist it can be; in that you can see the reflection of its creators. And most of all, focus on its shortcomings. They are easy to find once you start talking to it more like you'd talk to a friend, and they will help you use it more effectively.

2

suflaj t1_j3e9yim wrote

Well, my first paragraph covers that.

> So, just use code generation as an example, it is conceivable that it generates a piece of code, then it actually executes the code and then learn about its accuracy, performance, etc. And hence it is self-taught.

It doesn't do that. It learns how to have a conversation; the rest is mostly a byproduct of learning to model language. Don't give it too much credit. As said previously, it cannot extrapolate.

1

suflaj t1_j3e3piv wrote

Based on the techniques ChatGPT uses, we cannot formally prove that it can generalize without infinite width. Even our training process mostly amounts to teaching the model to compress knowledge. ChatGPT made some strides by partially introducing something similar to reinforcement learning, but reinforcement learning by itself is not enough to extrapolate or come up with new concepts.

All the big names in AI claim that stochastic gradient descent techniques and our current direction are fascinating, but ultimately a dead end. Certainly the area has been stale for several years and has degenerated into a dick-measuring contest, only instead of dicks you measure parameters, TPUs and metrics on benchmark datasets. Blame transformers, which were in a sense us getting a taste of the forbidden fruit - but you know what followed after that.

Of course, out of this you do get some advances useful for industry, but nothing really of note in the bigger picture. And it seems to me that lately all these big models that imitate knowledge really well are generating negative sentiment in the population, which may end up hurting AI as a field.

2

suflaj t1_j3e1h8f wrote

Not by a long shot.

ChatGPT in practice is a politically-biased conversational Google and Wikipedia summarizer with a bit of polite talk. And it is less broad than both of them.

It is truly fascinating how DEEP it can go, ex. translating arbitrary code into almost-correct assembly, even for recent targets like Apple's M1, but that's that. It cannot fully reason, it cannot extrapolate, and most importantly, its training data is fairly old, so it cannot keep up with the pace of NLP research.

But it's nifty to chat with if none of your colleagues have the time.

12

suflaj t1_j3bubtm wrote

Another problem you will likely have is your very small convolutions. Output channels of 8 and 16 are probably only enough to solve MNIST. You should probably use something more like 32 and 64, with larger kernels and strides, to reduce the reliance on the linear layers to do the work for you.

Finally, you are not using nonlinear activations between layers, so your whole network essentially collapses into one smaller convolutional layer with a flatten and softmax.
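
For illustration, here is a minimal sketch of the kind of stack I mean, with wider channels, larger strided kernels, and nonlinearities in between; the 3-channel input and 10 output classes are assumptions, so adjust them to your data:

```python
import torch.nn as nn

# Minimal sketch: wider channels (32/64), larger strided kernels, and
# nonlinearities between layers. A 3-channel input and 10 classes are assumed.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),  # shrink the spatial dims so the linear head does less work
    nn.Flatten(),
    nn.Linear(64, 10),        # raw logits; pair with CrossEntropyLoss, no explicit softmax
)
```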

1

suflaj t1_j3bt2eq wrote

That learning rate is about 100 times higher than what you'd give Adam for that batch size, and that weight decay is also about 100 times too high. If you want to use weight decay with Adam, you should probably use the AdamW optimizer (which is more or less the same thing, it just fixes the interaction between Adam and weight decay).
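
Something along these lines - the exact numbers are guesses since I don't know your setup, but the order of magnitude is the point:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for your actual network

# Illustrative values, roughly 100x smaller than what you described.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # typical Adam-range learning rate for smallish batches
    weight_decay=1e-2,  # AdamW applies this decoupled from the gradient update
)
```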

Also, loss is not what determines how much a model has learned. You should check validation F1, or whatever metrics are relevant for the performance of your model.

1

suflaj t1_j319k9o wrote

It's basically just a higher abstraction layer for PyTorch. It's completely separate but works in tandem with PyTorch.

I use LightningModules (analogous to torch.nn.Module) basically as decorators over ordinary PyTorch models. So you have your model class, and then you create a LightningModule which is instantiated with said model, and in it you implement ex. which optimizers and schedulers you use, how your training, evaluation and testing go, what metrics you track and when, etc.
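
A bare-bones sketch of that pattern (the class and argument names here are just placeholders, and I'm assuming a plain classification setup):

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F

class LitWrapper(pl.LightningModule):
    """Thin Lightning 'decorator' over an ordinary PyTorch model."""

    def __init__(self, model, lr=1e-3):
        super().__init__()
        self.model = model  # plain nn.Module, still usable outside Lightning
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)  # metric tracking lives here, not in the model
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=self.lr)
```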

But once you're done with R&D you can just use ordinary PyTorch as-is; that's why I like it. It doesn't make setting stuff up for production any different, but it makes plenty of stuff during R&D effortless. It has some smelly parts, but IMO they're not a dealbreaker - it just takes a day or two to learn.

4

suflaj t1_j2yhjdw wrote

My man, I only recently convinced myself to start using PyTorch Lightning; no way I'd be able to switch to some other new, hip, marginally better framework when it was this hard to start using something that speeds stuff up 10x.

Unless there are clear benefits to switching to some other new technology, it's not worth it.

5

suflaj t1_j2w69wh wrote

I would ask myself why one would consider transformers useful for any task. They seem to transfer knowledge really well. If that is the only thing that makes them viable for a given task, ex. time series forecasting, then it becomes obvious how simpler models can outperform them.

But then the question becomes - are transformers the easiest models to transfer knowledge on for a given task? For time series forecasting, I do believe that is the case. For ex. CV, I am still not convinced.

If you're then bothered by their overhead, distill them into a simpler model. I don't think there's a better alternative architecture family for finetuning on tasks. Remember that transformers do not necessarily need to appear in the final product; they can be a really good intermediate proxy for getting to that final product.
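
If it helps, the core of that distillation step is just a soft-target loss between teacher and student outputs; here's a rough sketch, where the temperature and weighting values are assumptions you'd tune:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend of soft-target KL (teacher -> student) and the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard temperature scaling so gradients keep a sane magnitude
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```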

5

suflaj t1_j25xvso wrote

Unless you have money for an actual workstation like the top-end Dell Precisions, ThinkPads or Razer Blades, you should probably not get a laptop to do deep learning on. Those Macs will not be that much faster than running stuff on a CPU, even if you do get some Metal API to run on them.

8

suflaj t1_j16i1ci wrote

PyTorch is easier to read and write, and there are more resources for it. TensorFlow is easier to deploy and (sometimes) more performant.

That said, PyTorch 2.0 is in testing and should be out soon, and it apparently makes up for whatever performance gap there is between the two.

Regarding model performance, it probably doesn't matter which one you choose, but given how your LSTM formulation of the solution is just plain wrong, PyTorch will be easier to use for more complex networks. In practice, though, to fully utilize your resources you need to know both.

0

suflaj t1_j13pqhe wrote

While you can run large models (layer by layer, batch by batch, dimension by dimension or even element by element), the problem is getting to the weights. No one said you need to transform your input into the output in one go. All that matters is that there is no single operation that would make you go OOM.

Theoretically, there is no network where a single linear combination would exceed modern memory sizes, but that doesn't mean such a strategy would be fast. At the base level, all you need is 3 registers (2 for the operands of the multiply-add, 1 for the running sum) and enough storage to hold the network weights.
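
To make the idea concrete, here is a toy sketch of streaming an MLP through memory one layer at a time; the file paths and the ReLU activations are made up for illustration:

```python
import torch

def run_layer_by_layer(x, layer_files):
    """Toy sketch: only one layer's weights are ever resident in memory."""
    for path in layer_files:
        w = torch.load(path)     # pull just this layer's weight matrix off disk
        x = torch.relu(x @ w.T)  # the only op that has to fit in memory
        del w                    # drop it before touching the next layer
    return x
```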

6

suflaj t1_j0ztusm wrote

> Not the case in C++. Am I wrong here?

Probably. It seems you "have" to do these things because you want speed. But if you want speed, then you'll have to do them in C++ as well.

> I am not taking about researchers, I am talking more about businesses.

This applies to businesses more than anything. Your manager does not give a single fuck about the purity and performance of your code before it's deployed. Until then, the only thing that matters is that the product is developed before your competitors get the contracts, for as low a cost as possible.

And when the code is deployed, it will often not even be in C++. A lot of the time you have to port it to C because there are no C++ compilers for a platform, or you keep it in ONNX format and then deploy it on some runtime to keep maintenance easy.

8

suflaj t1_j0zthsd wrote

Looking at your post history, there are plenty of things I could make fun of. Dehumanize you even.

But instead of stooping to your level, all I will say is - I frequently program in C and (x86 & aarch64) Assembly, but I recognise that many of my R&D colleagues are not engineers, and that their strengths can be better utilised if they focus on the things they are good at.

2

suflaj t1_j0zq31h wrote

What if I told you that even if you were using C/C++, you'd still need to be calling library functions? Because the code ultimately doesn't run natively - it calls into Fortran, Assembly and CUDA libraries.

You cannot directly program in whatever CUDA compiles to because it's proprietary and GPU-model-specific, so why bother? Researchers chose Python not because they like snakey-boys or enjoy British comedy; they chose it because it is adequate to do research in, unlike C or C++, which are horrible to work with and too hideous to read and understand even when a pro writes them, let alone some researcher.

Ultimately, Python code is easier to maintain and work on, and there are more Python developers than C/C++ ones, so of course companies will use it over whatever C++ API exists for DL libraries.

As for your Rust/Go question: although Go has some potential, it has no community to work with, and it is also harder to use than Python. There is almost no benefit to using Go over Python even if the decision were being made now, let alone migrating, other than Go's nice concurrency model. And why would you use that when from joblib import Parallel, delayed does the trick? So far, the biggest problem Python has with concurrency is its lack of a good shared memory API, which will probably be fixed in a year or so now that it is part of Python. But this lack of an API does not significantly impact Python, because you'd do this kind of stuff via a C module anyway.
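
For reference, that joblib one-liner in practice looks like this (the work function is just a stand-in):

```python
from joblib import Parallel, delayed

def work(i):
    return i * i  # stand-in for whatever CPU-bound task you have

# Fans the calls out over worker processes, sidestepping the GIL entirely.
results = Parallel(n_jobs=4)(delayed(work)(i) for i in range(100))
```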

As for Rust, it will probably never become a serious contender for research and development because it is even more hideous and complex than C/C++. It is also slower to develop in, so what's the point? Unless you want to double the average wages of people in DL and kill 90% of the jobs, since barely anyone can use Rust effectively.

15

suflaj t1_j0upxwa wrote

And in doing this, everyone who wasn't an activist became one by definition.

The comment I made, especially the way you explained the reactions to it, is a self-fulfilling prophecy. So I would hope the sub is just being raided by anti-Elon bots, rather than conclude that there are more than a handful of hypocritical activist ML researchers.

−3

suflaj t1_j0jbo92 wrote

Depends on what you mean by confidence. With softmax, you model probability. You can train your network to give out near 100% probabilities per class, but this tells you nothing about how confident it is.

Instead, what you could do is the following:

  • get a prediction
  • define the target of your prediction as the resolved label
  • calculate loss on this
  • now define your target as the inverse of the initial one
  • calculate the loss again
  • divide the 2nd one by the first one

Voila, you've got a confidence score for your sample. However, this only gives you a number that is comparable across samples; it does not give you a confidence percentage. You do know, however, that the higher the score, the closer the confidence is to 100%, and the lower the score, the closer it is to 0%. Based on the range of your loss, you can probably figure out how to map it to whatever range you want.

For a multi-class problem, you could just sum the ratios of the loss for every class other than your predicted class over the loss of your prediction. So, if you had a classifier that classifies an image into dog, cat and fish, and your softmax layer spits out 0.9, 0.09, 0.01, your confidence score would be loss(cat)/loss(dog) + loss(fish)/loss(dog).
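
A rough sketch of that multi-class version, assuming cross-entropy as the loss (how you map the score to a percentage afterwards is up to you):

```python
import torch
import torch.nn.functional as F

def confidence_score(logits):
    """Sum over non-predicted classes of loss(that class) / loss(predicted class).
    Higher means more confident; comparable across samples, not a percentage."""
    logits = logits.unsqueeze(0)  # treat the input as a batch of one
    pred = logits.argmax(dim=-1)
    pred_loss = F.cross_entropy(logits, pred)
    score = torch.tensor(0.0)
    for c in range(logits.shape[-1]):
        if c == pred.item():
            continue
        score += F.cross_entropy(logits, torch.tensor([c])) / pred_loss
    return score

# Example: dog/cat/fish logits whose softmax is roughly 0.9 / 0.09 / 0.01
print(confidence_score(torch.tensor([2.2, 0.0, -2.2])))
```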

−4