suflaj

suflaj t1_j57ce83 wrote

This looks like something for XGBoost. In that case you're looking at the XGBRegressor class.

Your X are the first 4 features, your Y are the 3 outputs. You will need to convert the medication to a one-hot representation, and the diet can presumably be encoded as whole numbers ordered by healthiness.
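A minimal sketch of that setup, assuming a pandas DataFrame with made-up column names (the CSV, the extra features and the diet ordering are placeholders, not from your description), and wrapping the regressor so it handles the 3 outputs:

```python
import pandas as pd
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# Hypothetical data: "medication" is categorical, "diet" is ordinal,
# plus two numeric features and three numeric targets.
df = pd.read_csv("patients.csv")

diet_order = {"poor": 0, "average": 1, "good": 2}  # ordered by healthiness
features = df[["medication", "diet", "age", "weight"]].assign(
    diet=lambda d: d["diet"].map(diet_order)
)
X = pd.get_dummies(features, columns=["medication"])  # one-hot the medication
y = df[["output_1", "output_2", "output_3"]]

# XGBRegressor predicts one target at a time, so wrap it for the 3 outputs.
model = MultiOutputRegressor(XGBRegressor(n_estimators=300, max_depth=4))
model.fit(X, y)
```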

3

suflaj t1_j557l5p wrote

Initially I knew that the biggest difference from previous approaches was the base network. Previously it was a resnet, now a transformer. The transformer is free to rearrange features. Because classification worked well, I knew that it wasn't the features themselves. In fact, even underrepresented classes were solved perfectly. So I suspected it was the arrangement of features, since classification is done by a linear layer, which is also free to permute features however it wants.

Then, after trying out every implemented convolutional detector and getting the same results, I was even more suspicious. What nailed it down was tracking how the features changed. Anyways, as I trained the detector more and more I saw that the transformer's pooled features changed as well. But when I froze the transformer network weights and tried a different task, performance didn't change in a statistically significant way.
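The freezing itself is nothing special; a minimal sketch of what I mean (the backbone and head below are generic placeholders, not the actual models I used):

```python
import torch
import torchvision

# Placeholder backbone and head; the actual models in my experiments differed.
backbone = torchvision.models.vit_b_16(weights="IMAGENET1K_V1")
head = torch.nn.Linear(1000, 4)  # stand-in for the detection head

# Freeze the transformer weights so only the head trains on the new task.
for p in backbone.parameters():
    p.requires_grad = False
backbone.eval()

optimizer = torch.optim.AdamW(head.parameters(), lr=1e-4)

x = torch.randn(2, 3, 224, 224)
with torch.no_grad():
    feats = backbone(x)  # frozen features
logits = head(feats)     # only this part learns
```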

When I looked at the activations I saw that, in the transformer part, they do not correlate with spatial locations. So, based on how CNNs work, I knew that, unlike resnet-based feature extractors, the transformer was giving shuffled outputs.

And finally, because I observed double descent, I called upon previous work that hypothesized that the phenomenon might be the restructuring of the network itself. Because I confirmed that the restructuring happening in the transformer part didn't change the performance, I could hypothesize that the restructuring is likely related to spatial properties. I could not confirm whether it would ever converge or generalize, as the increases were from around 0.40 MAP50 to 0.50 MAP50, while I was contesting 0.93 MAP50 scores that were still quite flawed despite being state of the art. And outside of the metrics it was not that obvious that the performance was so much better - even my mentor said "Wow, it works well", until I showed him the messed up results.

4

suflaj t1_j5521eb wrote

Well, yes and no. The variable manipulation was just to prove that the implementation wouldn't work. I also had to go deeper into the theoretical reasons why it wouldn't work.

This is something you can't (easily) prove with the implementation (you don't even have a guarantee that the implementation is correct), but you can disprove the hypothesis that it is due to a specific component. I used this counterproof as the basis for arguing that it is not due to the changed components, but the base network. Then I had to compare 2 different tasks on the same data to prove that the poor performance is not tied to the actual base network or subcomponents being too weak, but rather to how the information from the base network is used by the subcomponents. Once I proved that the different, but similarly difficult, task works, I had proof that it's not the data, nor the modules, but either the task or the information flow. I knew the task was not flawed or too hard because smaller networks solved the problem (I was just aiming for better than solved).

Specifically, I argued that transformers don't necessarily feature the spatial bias CNNs have, and as such make it harder for convolutional detectors to work with arbitrarily permuted features. I also showed that with sufficiently prolonged training the detectors would become better, but I concluded that at that rate it would be more viable to pretrain everything from scratch, for which I didn't have the budget.

I also confirmed double descent behaviour, which made all of this out of scope for my graduate thesis. Consult with your mentor/colleagues to make sure you're not going out of scope, either.

4

suflaj t1_j54vkhl wrote

Well, you would at minimum need to explain why it didn't meet your expectations. You need to elaborate on what grounds you hypothesized what you hypothesized and why that was a wrong basis, or elaborate on what happened that you didn't predict would happen.

I, for example, also assumed that vision transformers would get better performance on the task I had. But when they didn't (they were sometimes outperformed by YOLO v1), I investigated why, and laid out the proof why it was not human error (aside from my judgement), as well as suggestions on how to proceed next. To do that I reran the experiment many times, changed hyperparameters and swapped out detectors, all to narrow down that it wasn't actually inadequate settings, but the architecture and the specific model themselves.

6

suflaj t1_j4kmugm wrote

If the task is present neither in the human language distribution it learned to mimic nor in your prompt, it will not be able to do it.

While counting is one task that shows it doesn't actually understand anything, there are many more, among them tasks it doesn't outright refuse to answer. Some examples are math in general (especially derivatives and integration), logic to some extent, or pretty much anything too big for its memory (my assumption is it is able to fit a hundred or two hundred sentences before it forgets things).

For things not present in your prompt, it is also heavily biased. For example, even though it claims it doesn't give out opinions, it prefers Go as a programming language, AWD for cars, hydrogen and EVs for fuel technology (possibly because of its eco-terrorist stances), the color red... These biases might be preventing it from doing some tasks it should otherwise be able to do.

For example, if you ask it to objectively tell you what the best car is, it might say the Toyota Mirai, even though it's actually a terrible car to have even in California, the best place to have one. You might think its reasoning is broken, but in reality, the biases screwed it over.

1

suflaj t1_j46gu2z wrote

I would proceed with caution because smaller models are generally not that easy to finetune. In fact, the whole point of a larger model is that it not only contains a lot of information, but that it is fairly easy to adapt to new tasks because it has plenty of "space" to restructure itself. A smaller model trying to restructure itself is more likely to diverge or not be able to adapt to the task at all.

It would be more viable in that case to run the larger model layer by layer, finetune it, and then distill it onto a smaller one. That way you use the maximum potential of the larger model to adapt to a different task, and then distill it into whatever you need.
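A rough sketch of that distillation step (teacher and student are placeholders, and the temperature and weighting are just common defaults, not a recommendation):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual task loss with a KL term that matches the teacher's
    softened output distribution (standard Hinton-style distillation)."""
    soft_teacher = F.log_softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, reduction="batchmean", log_target=True) * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce

# In the training loop, the finetuned large model acts as a frozen teacher:
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   loss = distillation_loss(student(batch), teacher_logits, labels)
```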

3

suflaj t1_j43urpb wrote

Sure, but it is not considered synonymous. When people say ML, they usually mean linear regression, Bayesian optimization and gradient boosting, not necessarily artificial neural networks with backpropagation and some version of gradient descent.

Expert learning is also a subset of ML, yet they are not considered synonymous.

The same way we say ML is distinct from AI because it implies learning, we hold DL to be distinct from ML because these are not exactly statistical methods and it's mostly alchemy, and we hold expert systems as distinct from ML because it's just a fancy way of saying rule-based AI and it doesn't imply there's any learning involved.

One must realize that mathematical relations do not perfectly map onto human language and communication. Similarly to how a skirt is a part of a dress, yet we consider them different things, subsets of ML are not considered ML itself in language and communication.

−11

suflaj t1_j43eanz wrote

Well, it depends on what usefulness is.

If you can prove that all of your samples belong to the same distribution, then simply looking at which samples have the greatest gradient norm will give you a measure of how useful they are for the model. Another approach is looking at how much their contribution would improve the performance on other samples, but then your dataset becomes a dependent variable.

But obviously this is dependent on the current weights, the loss function and various other biases. This is because gradient norm is proportional to the error, and so the samples for which the model predicts the most erroneous result will end up being most useful, given the perfect LR for it.
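A crude way to get that ranking under those caveats (the model, loss and data below are placeholders, and this recomputes a full backward pass per sample, so it's slow):

```python
import torch

def per_sample_grad_norms(model, loss_fn, xs, ys):
    """Score each sample by the gradient norm it induces under the current weights."""
    scores = []
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        loss.backward()
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        scores.append(torch.norm(torch.stack([g.norm() for g in grads])).item())
    return scores  # higher = more "useful" under this weight-dependent criterion
```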

1

suflaj t1_j4308sf wrote

Ah, wasn't aware they published the weights. But if that's too big I am not aware of anything significantly smaller that would retain most of the performance.

It should be relatively easy to pretrain a significantly smaller network yourself, given the pretrained resnet weights, good enough sampling and a month or so of training...

2

suflaj t1_j3vg5tm wrote

I think it's a matter of trial and error. The best ratios I've seen were 1:25, but these concerned transformer networks, which are much more sparse than resnets.

There are some tricks, but it depends on the model. For transformers, for example, it's not enough to just imitate the last layer. I suspect it's the same for resnets, given they're deep residual networks just like transformers.
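One such trick is imitating intermediate representations as well, not just the output; a loose sketch (the dimensions and layer choice here are arbitrary examples, not from any specific model):

```python
import torch
import torch.nn.functional as F

student_dim, teacher_dim = 384, 1024  # example widths

# Student and teacher widths rarely agree, so a learned projection bridges
# the chosen intermediate layers.
proj = torch.nn.Linear(student_dim, teacher_dim)

def hidden_state_loss(student_hidden, teacher_hidden):
    """MSE between projected student features and (detached) teacher features."""
    return F.mse_loss(proj(student_hidden), teacher_hidden.detach())

# The total objective then adds this to the usual last-layer imitation loss.
```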

1

suflaj t1_j3u4smq wrote

Google's BERT use is not a commercial, consumer product, it is an enterprise one (Google uses it and runs it on their hardware). They presumably use the large version, or something even larger than the pretrained weights available on the internet, and to achieve the latencies they have, they are using datacentres and non-trivial distribution schemes, not just consumer hardware.

Meanwhile, your average CPU will need anywhere from 1-4 seconds to do one inference pass in onnx runtime - of course much less on a GPU, but to be truly cross-platform you're targeting JS in most cases, which means CPU and not a stack as mature as what Python/C++/CUDA have.
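If you want to check such latencies yourself, a naive timing sketch looks roughly like this (the model file, input names and sequence length are hypothetical and depend on how you exported the model):

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical exported BERT-base model, sequence length 128 as an example.
sess = ort.InferenceSession("bert-base.onnx", providers=["CPUExecutionProvider"])
inputs = {
    "input_ids": np.random.randint(0, 30522, (1, 128), dtype=np.int64),
    "attention_mask": np.ones((1, 128), dtype=np.int64),
    "token_type_ids": np.zeros((1, 128), dtype=np.int64),
}

start = time.perf_counter()
sess.run(None, inputs)
print(f"one inference pass: {time.perf_counter() - start:.2f}s")
```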

What I'm saying is:

  • people have said no to paid services, they want free products
  • consumer hardware has not scaled nearly as fast as DL
  • even ancient models are still too slow to run on consumer hardware after years of improvement
  • distilling, quantizing and optimizing them seems to get them to run just fast enough to not be a nuisance, but is often too tedious to work out for a free product

−1

suflaj t1_j3twskh wrote

Half is not enough. We're thinking in the order of 100x or even more. Do not forget that even ordinary BERT is not really commercially viable as-is.

I mean sure you can use them to get a nicer distribution for your dataset. But at the end of the day the API is too slow to train any "real" model, and you can already probably collect and generate data for smaller models yourself. So as a replacement for lazy people - sure, I think ChatGPT by itself probably has the potential to solve most repetitive questions people have on the internet. But it won't be used like that at scale so ultimately it is not useful.

If it wasn't clear enough by now, I'm not skeptical because of what LLMs are, but because of how they simply do not scale up to real-world requirements. Ultimately, people do not have datacenters at home, and OpenAI and other vendors do not have the hardware for any actual volume of need other than a niche, hobbyist one. And the investment to develop something like ChatGPT is too big to justify for that use.

All of this was ignoring the obvious legal risks from using ChatGPT generations commercially!

−1

suflaj t1_j3tq0u2 wrote

Sure you could. But the cost is so high it probably outweighs the benefits. And that is even if you made training stable (we already know, based on recurrent networks, GANs and even transformers, that they're not particularly stable). Hooking it up to the REPL would make the task essentially reinforcement learning. And if you know something about reinforcement learning, you know that it generally doesn't work because the environment the agent has to traverse is too difficult to learn anything from - what DeepMind managed to achieve with their chess and Go engines is truly remarkable, but those are THEIR achievements despite the hardships RL introduces. This is not the achievement of RL. Meanwhile ChatGPT is mostly an achievement of a nice dataset, a clever task and deep learning. It is not that impressive from an engineering standpoint (other than syncing up all the hardware to preprocess the data and train it).

Unless LLMs are extremely optimized in regards to latency and cost, or unless compute becomes even cheaper (not likely), they have no practical future for the consumer.

So far, it's still a dick measuring contest, as if a larger model and dataset will make much of a difference. I do not see much interest in making them more usable or accessible, I see only effort in beating last year's paper and getting investors to dump more money into a bigger model for next year. I also see ChatGPT as being a cheap marketing scheme all the while it's being used for some pretty nefarious things, some of them being botted Russian or Ukrainian war propaganda.

So you can forget the REPL idea. Who would it serve? Programmers have shown they are not willing to pay for something like GitHub Copilot. Large companies can always find people to hire and do the programming for them. Unless it makes strides in something very expensive, like formal verification, it's not something a large company, the kind that has the resources to research LLMs, would go into.

Maybe the next step is training it on WolframAlpha. But at that point you're just catching up to almost 15-year-old software. Maybe that "almost 15 years old" shows you how overhyped ChatGPT really is for commercial use.

0

suflaj t1_j3oepus wrote

You underestimate how hard cross-platform is to achieve. Especially with GPUs. There is no GPGPU API standard, first and foremost, so ensuring cross-platform support is a tedious task which essentially means either creating an API that has to accommodate every GPU, or writing "drivers" for every different GPU. GPUs can be vastly different between generations and models, unlike x86 and x86-64 CPU architectures, which have mostly stayed the same for several decades now.

Vulkan Compute is nowhere near reaching feature parity with CUDA and cuDNN. ROCm's CUDA compatibility layer is way better, and it is still too much of a pain to install and keep working.

Furthermore, open standards mean nothing when a graphics vendor can just gimp the API, like, ironically, nvidia already does with Vulkan.

There is an open variant called OpenCL. But it will probably never be as mature as CUDA, even though 3.0 is apparently making great strides. There is absolutely no reason to push for Vulkan due to how cancerous developing anything in it is.

3

suflaj t1_j3ld7gy wrote

Exactly. It's not exactly meant to be a zero-shot model, but rather a conversational agent. If it were a zero-shot model, here is what it could say:

  • gebutzeripanim is a new phrase for me
  • based on its start it sounds german
  • based on its middle it sounds vaguely korean
  • based on its end it sounds vaguely turkish
  • it might be made up slang

But in conversation you wouldn't just start decomposing stuff you've never heard about. Here is what it says if you try to fish out these answers:

  • I'm sorry, but I'm not familiar with the term "gebutzeripanim." Could you provide more context or clarify what you are referring to?
  • Without more context or information about the word, it is difficult for me to determine whether it has a connection to the German language. "Gebutzeripanim" is not a word that I am familiar with, and a search of online resources did not yield any results. It is possible that this is a made-up or invented word, or that it is used in a specific context that is not familiar to me.
  • Without more context or information about the word, it is difficult for me to determine whether it has a connection to the Korean language. "Gebutzeripanim" is not a word that I am familiar with, and a search of online resources did not yield any results. It is possible that this is a made-up or invented word, or that it is used in a specific context that is not familiar to me. It is also possible that the word includes elements from multiple languages or that it is a hybrid of several languages. Without more information, it is difficult to say for sure.
  • Without more context or information about the word, it is difficult for me to determine whether it has a connection to the Turkish language. "Gebutzeripanim" is not a word that I am familiar with, and a search of online resources did not yield any results. It is possible that this is a made-up or invented word, or that it is used in a specific context that is not familiar to me. It is also possible that the word includes elements from multiple languages or that it is a hybrid of several languages. Without more information, it is difficult to say for sure.
  • It is not possible for me to determine whether "gebutzeripanim" is a made-up or invented word, as I do not have sufficient information about the word. It is possible that "gebutzeripanim" is a slang term that is not widely known or used, or that it is a word that has been created recently and has not yet gained widespread recognition. Without more context or information, it is difficult to say for sure what the word means or how it is used.

It mostly bases its answers on the fact that it doesn't know. And given that it uses past answers as part of the prompt, you can see that the second time I ask it for a language, it assumes the word can be made out of more languages - not because it understands it, only because I mentioned it.

If you ask it in a new session whether it is made out of words or phrases from several languages, it answers with

> I'm sorry, but I am unable to find any information about a word spelled "gebutzeripanim." It is possible that this is a made-up word or a word from a language that I am not familiar with. Can you provide any context or additional information about the word that might help me to better understand it?

Since it basically needs to explicitly see things in training, it's not really a zero-shot, but rather a few-shot model. There are instances where it seems like it can connect the dots, but you can't really say that happens in the general case...

2