suflaj t1_j55ogya wrote
Well, you should learn them, even if only to know how to convert models from them to PyTorch.
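For instance, a minimal sketch of one such conversion path, assuming the models in question are TensorFlow/Keras checkpoints of an architecture that already has a PyTorch implementation in HuggingFace transformers (the checkpoint path is a placeholder):

```python
# Minimal sketch: loading a TensorFlow checkpoint into the PyTorch version of the
# same architecture. Assumes a HuggingFace-style BERT checkpoint; arbitrary Keras
# models would typically go through ONNX instead.
from transformers import BertModel

# from_tf=True converts the TF weights to PyTorch on load
pt_model = BertModel.from_pretrained("path/to/tf_checkpoint", from_tf=True)
pt_model.eval()
```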
suflaj t1_j557l5p wrote
Reply to comment by No_Possibility_7588 in How to proceed scientifically when your hypothesis is falsified? by No_Possibility_7588
Initially I knew that the biggest difference from previous approaches was the base network. Previously it was a resnet, now a transformer. The transformer is free to rearrange features. Because classification worked well, I knew it wasn't the features themselves. In fact, even the less frequent classes were solved perfectly. So I suspected it was the arrangement of the features, since classification is done by a linear layer, which is also free to permute features however it wants.
Then, after trying out every implemented convolutional detector and getting the same results, I was even more suspicious. What nailed it down was tracking how the features changed. As I trained the detector longer and longer, I saw that the transformer's pooled features changed as well. But when I froze the transformer weights and tried a different task, performance didn't change in a statistically significant way.
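A minimal sketch of that freezing experiment, with a purely illustrative stand-in for the real backbone/detector modules (none of these names come from the actual thesis code):

```python
import torch
import torch.nn as nn

# Illustrative stand-in for "transformer backbone + convolutional detector head"
model = nn.ModuleDict({
    "backbone": nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
        num_layers=4,
    ),
    "detector": nn.Conv2d(256, 5, kernel_size=1),
})

# Freeze the transformer feature extractor so only the detector is trained
for param in model["backbone"].parameters():
    param.requires_grad = False
model["backbone"].eval()  # also fixes dropout behaviour during the comparison run

# The optimizer only sees the parameters that are still trainable
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```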
When I looked at the activations, I saw that in the transformer part they did not correlate with the locations spatially. So, based on how CNNs work, I knew that, as opposed to resnet-based feature extractors, the transformer was giving shuffled outputs.
And finally, because I observed double descent, I called upon previous work that hypothesized the phenomenon might be the restructuring of the network itself. Because I confirmed that the restructuring happening in the transformer part didn't change performance, I could hypothesize that the restructuring is likely related to spatial properties. I could not confirm whether it would ever converge or generalize, as the increases were from around 0.40 MAP50 to 0.50 MAP50, while I was contesting 0.93 MAP50 scores that were still quite flawed despite being state of the art. And outside of the metrics it was not that obvious that the performance was so much better - even my mentor said "Wow, it works well". Until I showed him the messed up results.
suflaj t1_j5521eb wrote
Reply to comment by No_Possibility_7588 in How to proceed scientifically when your hypothesis is falsified? by No_Possibility_7588
Well, yes and no. The variable manipulation was just to prove that the implementation wouldn't work. I also had to go deeper into the theoretical reasons why it wouldn't work.
This is something you can't (easily) prove with the implementation (you don't even have the guarantee that the implementation is correct), but you can disprove the hypothesis that it is due to a specific component. I used this counterproof as a basis for arguing that it was not due to the changed components, but the base network. Then I had to compare 2 different tasks on the same data to prove that the poor performance was not tied to the base network or subcomponents being too weak, but rather to how the information from the base network is used by the subcomponents. Once I showed that a different but similarly difficult task worked, I had proof that it's not the data, nor the modules, but either the task or the information flow. I knew the task was not flawed or too hard because smaller networks had solved the problem (I was just aiming for better than solved).
Specifically, I argued that transformers don't necessarily feature the spatial bias CNNs have, and as such make it harder for the convolutional detectors to work with arbitrarily permuted features. I also showed that with sufficiently prolonged training the detectors would become better, but I concluded that at that rate it would be more viable to pretrain everything from scratch, which I didn't have the budget for.
I also confirmed double descent behaviour, which made all of this out of scope for my graduate thesis. Consult with your mentor/colleagues to make sure you're not going out of scope, either.
suflaj t1_j54vkhl wrote
Well, you would at minimum need to explain why it didn't meet your expectations. You need to elaborate on what grounds you hypothesized what you hypothesized and why that basis was wrong, or elaborate on what happened that you didn't predict.
I, for example, assumed that vision transformers would get better performance on the task I had. But when they didn't (they were sometimes outperformed by YOLO v1), I investigated why, and laid out the proof that it was not human error (aside from my judgement), as well as suggestions on how to proceed next. To do that I reran the experiment many times, changed hyperparameters, and swapped out detectors, all to narrow down that it wasn't actually inadequate settings, but the architecture and the specific model themselves.
suflaj t1_j4wndsx wrote
Reply to comment by Acceptable-Cress-374 in [D] Do you know of any model capable of detecting generative model(GPT) generated text ? by CaptainDifferent3116
Using a black box model for this kind of stuff looks like a nice way to get sued
suflaj t1_j4pdd6o wrote
Reply to comment by elf7979 in Is 100 mega byte text corpus big enought to train? by elf7979
You're closer, but not quite there - the smaller Google News dataset that W2V is trained on is 10 GB. The full one used is around 300 GB IIRC.
suflaj t1_j4l1i5l wrote
Likely not enough, at least not for what is considered good. But I fail to see why you'd want to train it yourself - there are plenty of readily available w2v weights or vocabularies.
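For example, a minimal sketch of loading one of those readily available weight sets through gensim's downloader (this is the commonly distributed Google News variant; the download is large):

```python
import gensim.downloader as api

# Fetches the 300-dimensional Google News word2vec vectors on first use
wv = api.load("word2vec-google-news-300")

print(wv["king"].shape)                 # (300,)
print(wv.most_similar("king", topn=3))  # nearest neighbours in the vector space
```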
suflaj t1_j4kmugm wrote
Reply to [D] ChatGPT can't count by CosmicTardigrades
Unless the task is present in the human language distribution it learned to mimic or in your prompt, it will not be able to do it.
While counting is one task that shows it doesn't actually understand anything, there are many more, among them ones it doesn't outright refuse to answer. Some examples are math in general (especially derivatives and integration), logic to some extent, or pretty much anything too big for its memory (my assumption is that it is able to fit a hundred or two hundred sentences before it forgets things).
For things not present in your prompt, it is also heavily biased. For example, even though it claims it doesn't give out opinions, it prefers Go as a programming language, AWD for cars, hydrogen and EVs for fuel technology (possibly because of its eco-terrorist stances), the color red... These biases might be preventing it from doing some tasks it usually should be able to do.
For example, if you ask it to objectively tell you what the best car is, it might say Toyota Mirai, even though it's actually a terrible car to have even in California, the best place to have one. You might be thinking that its thinking is broken, but in reality, the biases screwed it over.
suflaj t1_j46gu2z wrote
Reply to comment by alkibijad in [D] Is there a distilled/smaller version of CLIP, or something similar? by alkibijad
I would proceed with caution because smaller models are generally not that easy to finetune. In fact, the whole point of a larger model is that it not only contains a lot of information, but that it is fairly easy to adapt to new tasks because it has plenty of "space" to restructure itself. A smaller model trying to restructure itself is more likely to diverge or not be able to adapt to the task at all.
It would be more viable in that case to run the larger model layer by layer, finetune it, and then distill onto a smaller one. That way you use the maximum potential of a larger model to adapt to a different task, and you distill it into whatever you need.
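A minimal sketch of that second step, i.e. plain logit distillation from the finetuned large model onto the smaller one (teacher, student and the loop are placeholders, not CLIP-specific code):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft targets: match the teacher's softened output distribution
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: still learn the original task labels directly
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Inside the training loop (teacher frozen, student being trained):
# with torch.no_grad():
#     teacher_logits = teacher(x)
# loss = distillation_loss(student(x), teacher_logits, y)
```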
suflaj t1_j45ogcz wrote
Reply to comment by sabertoothedhedgehog in [D] Has ML become synonymous with AI? by Valachio
They really do not without further context.
suflaj t1_j43urpb wrote
Reply to comment by TeamRocketsSecretary in [D] Has ML become synonymous with AI? by Valachio
Sure, but it is not considered synonymous. When people say ML, they usually mean linear regression, bayesian optimization and gradient boosting, not necessarily artificial neural networks with backpropagation and some version of gradient descent.
Expert learning is also a subset of ML, yet they are not considered synonymous.
The same way we say ML is distinct from AI because it implies learning, we hold DL to be distinct from ML because it is not exactly a statistical method and is mostly alchemy, and we hold expert systems as distinct from ML because they're just a fancy way of saying rule-based AI and don't imply there's any learning involved.
One must realize that mathematical relations do not perfectly map onto human language and communication. Similarly to how a skirt is a part of a dress, yet we consider them different things, subsets of ML are not considered ML itself in language and communication.
suflaj t1_j43gqqp wrote
Reply to [D] Has ML become synonymous with AI? by Valachio
They're not synonymous - ex. DL is not considered ML, and of course there is other AI that is not a strict subset of ML, ex. expert systems.
suflaj t1_j43eanz wrote
Reply to [D] Can someone point to research on determining usefulness of samples/datasets for training ML models? by HFSeven
Well depends on what usefulness is.
If you can prove that all of your samples belong to the same distribution, then simply looking up which have the greatest gradient norm will be a measure of how useful they are for the model. Another approach is looking at how much their contribution would be in improving the performance of other samples, but then your dataset becomes a dependent variable.
But obviously this is dependent on the current weights, the loss function and various other biases. This is because gradient norm is proportional to the error, and so the samples for which the model predicts the most erroneous result will end up being most useful, given the perfect LR for it.
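A minimal sketch of that gradient-norm ranking, assuming a standard supervised model and loss (everything here is illustrative, not tied to a specific dataset):

```python
import torch

def sample_grad_norm(model, loss_fn, x, y):
    """Norm of the gradient a single sample induces at the current weights."""
    loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
    grads = torch.autograd.grad(
        loss, [p for p in model.parameters() if p.requires_grad]
    )
    return torch.sqrt(sum(g.pow(2).sum() for g in grads)).item()

# Usage, with whatever model, loss_fn and dataset you already have:
# norms = [sample_grad_norm(model, loss_fn, x, y) for x, y in dataset]
# most_useful_first = sorted(range(len(norms)), key=norms.__getitem__, reverse=True)
```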
suflaj t1_j4308sf wrote
Reply to comment by manOnPavementWaving in [D] Is there a distilled/smaller version of CLIP, or something similar? by alkibijad
Ah, wasn't aware they published the weights. But if that's too big I am not aware of anything significantly smaller that would retain most of the performance.
It should be relatively easy to pretrain a significantly smaller network yourself given the pretrained resnet weights with good enough sampling and a month or so training...
suflaj t1_j42i6pu wrote
Nope. Authors experimented with it but said performance is lost. You can try to replace the transformers with ResNet50, but you'll have to do it yourself AFAIK.
suflaj t1_j3vg5tm wrote
Reply to comment by xenotecc in [D] Have you ever used Knowledge Distillation in practice? by fredlafrite
I think it's a matter of trial and error. The best ratios I've seen were 1:25, but these concerned transformer networks, which are much more sparse than resnets.
There are some tricks, but it depends on the model. Ex. for transformers, it's not enough to just imitate the last layer. I suspect it's the same for resnets, given they're deep residual networks just like transformers.
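One such trick, sketched minimally: matching intermediate hidden states on top of the last-layer imitation (the layer mapping and sizes are made-up examples, roughly in the spirit of TinyBERT-style distillation):

```python
import torch
import torch.nn as nn

# Illustrative sizes: 12-layer teacher with hidden size 768, 4-layer student with 312
teacher_hidden, student_hidden = 768, 312
layer_map = {0: 2, 1: 5, 2: 8, 3: 11}             # student layer -> teacher layer
proj = nn.Linear(student_hidden, teacher_hidden)  # bridges the differing widths

def hidden_state_loss(student_states, teacher_states):
    """MSE between projected student hidden states and the mapped teacher layers.

    Both arguments are lists of (batch, seq_len, hidden) tensors, one per layer.
    This is added on top of the usual last-layer (logit) distillation term.
    """
    loss = 0.0
    for s_idx, t_idx in layer_map.items():
        loss = loss + nn.functional.mse_loss(
            proj(student_states[s_idx]), teacher_states[t_idx]
        )
    return loss / len(layer_map)
```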
suflaj t1_j3u4smq wrote
Reply to comment by Think_Olive_1000 in [D] Will NLP Researchers Lose Our Jobs after ChatGPT? by singularpanda
Google's BERT use is not a commercial, consumer product - it is an enterprise one (Google uses it and runs it on their own hardware). They presumably use the large version or something even larger than the pretrained weights available on the internet, and to achieve the latencies they have, they use datacentres and non-trivial distribution schemes, not just consumer hardware.
Meanwhile, your average CPU will need anywhere from 1-4 seconds to do one inference pass in onnx runtime, of course much less on a GPU, but to be truly cross-platform you're targeting JS in most cases, which means CPU and not a stack as mature as what Python/C++/CUDA have.
What I'm saying is:
- people have said no to paid services, they want free products
- consumer hardware has not scaled nearly as fast as DL
- even ancient models are still too slow to run on consumer hardware after years of improvement
- distilling, quantizing and optimizing them seems to get them to run just fast enough to not be a nuisance, but is often too tedious to work out for a free product
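A minimal sketch of what that last bullet looks like in practice: export to ONNX, dynamic int8 quantization, then CPU inference with onnxruntime (the resnet is just a stand-in for whatever distilled model you'd actually ship):

```python
import torch
import torchvision
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType

# Stand-in model; in practice this would be the distilled/smaller network
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)

# 1. Export to ONNX with a dynamic batch dimension
torch.onnx.export(model, dummy, "model.onnx",
                  input_names=["input"], output_names=["logits"],
                  dynamic_axes={"input": {0: "batch"}})

# 2. Dynamically quantize weights to int8 (smaller file, faster CPU inference)
quantize_dynamic("model.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)

# 3. Run inference on CPU with onnxruntime
sess = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])
logits = sess.run(None, {"input": dummy.numpy()})[0]
```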
suflaj t1_j3twskh wrote
Reply to comment by Think_Olive_1000 in [D] Will NLP Researchers Lose Our Jobs after ChatGPT? by singularpanda
Half is not enough. We're thinking in the order of 100x or even more. Do not forget that even ordinary BERT is not really commercially viable as-is.
I mean sure you can use them to get a nicer distribution for your dataset. But at the end of the day the API is too slow to train any "real" model, and you can already probably collect and generate data for smaller models yourself. So as a replacement for lazy people - sure, I think ChatGPT by itself probably has the potential to solve most repetitive questions people have on the internet. But it won't be used like that at scale so ultimately it is not useful.
If it wasn't clear enough by now, I'm not skeptical because of what LLMs are, but because of how they simply do not scale up to real-world requirements. Ultimately, people do not have datacenters at home, and OpenAI and other vendors do not have the hardware for any actual volume of need other than a niche, hobbyist one. And the investment to develop something like ChatGPT is too big to justify for that use.
All of this was ignoring the obvious legal risks from using ChatGPT generations commercially!
suflaj t1_j3tq0u2 wrote
Reply to comment by Think_Olive_1000 in [D] Will NLP Researchers Lose Our Jobs after ChatGPT? by singularpanda
Sure you could. But the cost is so high it probably outweighs the benefits. And that is even if you made training stable (we already know, based on recurrent networks, GANs and even transformers, that they're not particularly stable). Hooking it up to the REPL would make the task essentially reinforcement learning. And if you know something about reinforcement learning, you know that it generally doesn't work because the environment the agent has to traverse is too difficult to learn anything from - what DeepMind managed to achieve with their chess and Go engines is truly remarkable, but these are THEIR achievements despite the hardships RL introduces. This is not the achievement of RL. Meanwhile, ChatGPT is mostly an achievement of a nice dataset, a clever task and deep learning. It is not that impressive from an engineering standpoint (other than syncing up all the hardware to preprocess the data and train it).
Unless LLMs are extremely optimized in regards to latency and cost, or unless compute becomes even cheaper (not likely), they have no practical future for the consumer.
So far, it's still a dick measuring contest, as if a larger model and dataset will make much of a difference. I do not see much interest in making them more usable or accessible, I see only effort in beating last year's paper and getting investors to dump more money into a bigger model for next year. I also see ChatGPT as being a cheap marketing scheme all the while it's being used for some pretty nefarious things, some of them being botted Russian or Ukrainian war propaganda.
So you can forget the REPL idea. Who would it serve? Programmers have shown they are not willing to pay for something like GitHub Copilot. Large companies can always find people to hire and do the programming for them. Unless these are strides in something very expensive, like formal verification, it's not something a large company, one that has the resources to research LLMs, would go into.
Maybe the next step is training it on WolframAlpha. But at that point you're just catching up to almost 15 year old software. Maybe that "almost 15 year old" shows you how overhyped ChatGPT really is for commercial use.
suflaj t1_j3oepus wrote
Reply to comment by jacobgorm in [D] Why is Vulkan as a backend not used in ML over some offshoot GPU specification? by I_will_delete_myself
You underestimate how hard cross-platform is to achieve, especially with GPUs. There is no GPGPU API standard, first and foremost, so ensuring cross-platform support is a tedious task which essentially means either creating an API that has to accommodate every GPU, or writing "drivers" for every different GPU. GPUs can be vastly different between generations and models, unlike x86 and x86-64 CPU architectures, which have mostly stayed the same for several decades now.
Vulkan Compute is nowhere near reaching feature parity with CUDA and cuDNN. ROCm's CUDA counterpart is way better and is still too much of a pain to install and maintain.
Furthermore, open standards mean nothing when a graphics vendor can just gimp the API, like, ironically, nvidia already does with Vulkan.
There is an open variant called OpenCL. But it will probably never be as mature as CUDA, even though 3.0 is apparently making great strides. There is absolutely no reason to push for Vulkan due to how cancerous developing anything in it is.
suflaj t1_j3od3r6 wrote
Reply to comment by smileawe3211 in [D] Do cloud gpu's run while my laptop is switched off? by smileawe3211
What instance?
suflaj t1_j3ocp85 wrote
Depends how you access it.
Generally yes, but if you access it via SSH you'll need to run your script inside screen or tmux, since disconnecting from the session would otherwise also shut it down.
suflaj t1_j3ld7gy wrote
Reply to comment by blose1 in [D] Will NLP Researchers Lose Our Jobs after ChatGPT? by singularpanda
Exactly. It's not really meant to be a zero-shot model, but rather a conversational agent. If it were a zero-shot model, here is what it could say:
- gebutzeripanim is a new phrase for me
- based on its start it sounds german
- based on its middle it sounds vaguely korean
- based on its end it sounds vaguely turkish
- it might be made up slang
But in conversation you wouldn't just start decomposing stuff you'd never heard about. Here is what it says if you try to fish out these answers:
- I'm sorry, but I'm not familiar with the term "gebutzeripanim." Could you provide more context or clarify what you are referring to?
- Without more context or information about the word, it is difficult for me to determine whether it has a connection to the German language. "Gebutzeripanim" is not a word that I am familiar with, and a search of online resources did not yield any results. It is possible that this is a made-up or invented word, or that it is used in a specific context that is not familiar to me.
- Without more context or information about the word, it is difficult for me to determine whether it has a connection to the Korean language. "Gebutzeripanim" is not a word that I am familiar with, and a search of online resources did not yield any results. It is possible that this is a made-up or invented word, or that it is used in a specific context that is not familiar to me. It is also possible that the word includes elements from multiple languages or that it is a hybrid of several languages. Without more information, it is difficult to say for sure.
- Without more context or information about the word, it is difficult for me to determine whether it has a connection to the Turkish language. "Gebutzeripanim" is not a word that I am familiar with, and a search of online resources did not yield any results. It is possible that this is a made-up or invented word, or that it is used in a specific context that is not familiar to me. It is also possible that the word includes elements from multiple languages or that it is a hybrid of several languages. Without more information, it is difficult to say for sure.
- It is not possible for me to determine whether "gebutzeripanim" is a made-up or invented word, as I do not have sufficient information about the word. It is possible that "gebutzeripanim" is a slang term that is not widely known or used, or that it is a word that has been created recently and has not yet gained widespread recognition. Without more context or information, it is difficult to say for sure what the word means or how it is used.
It mostly bases its answers on the fact that it doesn't know. And given that it uses past answers as part of the prompts, you can see that the second time I ask it for a language it assumes the word can be made out of more languages - but not because it understands it, only because I mentioned it.
If you ask it in a new session whether it is made out of words or phrases from several languages, it answers with
> I'm sorry, but I am unable to find any information about a word spelled "gebutzeripanim." It is possible that this is a made-up word or a word from a language that I am not familiar with. Can you provide any context or additional information about the word that might help me to better understand it?
Since it basically needs to explicitly see things in training, it's not really a zero-shot, but rather a few-shot model. There are instances where it seems like it can connect the dots but you can't really say it happens in the general case...
suflaj t1_j3igfzr wrote
Yes, it's the only way to get high-throughput, high-performance models ATM.
With KD and TensorRT you can get close to 100x throughput (compared to eager TF/PyTorch on the full model) with a 1% performance hit on some models and tasks.
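A rough sketch of the TensorRT half of that, via torch-tensorrt (the resnet is a stand-in for an already-distilled model, and the exact API differs a bit between torch-tensorrt versions):

```python
import torch
import torch_tensorrt
import torchvision

# Stand-in for an already-distilled student model
model = torchvision.models.resnet18(weights=None).eval().cuda()
example = torch.randn(1, 3, 224, 224, device="cuda")

# Compile the module into a TensorRT engine with FP16 kernels enabled
trt_model = torch_tensorrt.compile(
    model,
    inputs=[example],
    enabled_precisions={torch.half},
)

with torch.no_grad():
    logits = trt_model(example)  # served at much higher throughput than eager mode
```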
suflaj t1_j57ce83 wrote
Reply to [D] Not sure if time series or multiple classifications? by spiritualquestions
This looks like something for XGBoost. In that case you're looking at the XGBRegressor class. Your X are the first 4 features, your Y are the 3 outputs. You will need to convert the medication to a one-hot vector representation, and the diet will presumably be enumerated into whole numbers sorted by healthiness.
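A minimal sketch of that setup (the column names, diet categories and file name are made up for illustration; substitute your actual data):

```python
import pandas as pd
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

df = pd.read_csv("patients.csv")  # hypothetical file with 4 features and 3 outputs

# One-hot encode the medication; enumerate diet by (assumed) healthiness
X = pd.get_dummies(df[["age", "weight", "diet", "medication"]],
                   columns=["medication"])
X["diet"] = X["diet"].map({"poor": 0, "average": 1, "healthy": 2})

y = df[["output_1", "output_2", "output_3"]]  # the 3 target columns

# XGBRegressor predicts a single target, so wrap it to cover all 3 outputs
model = MultiOutputRegressor(XGBRegressor(n_estimators=300, max_depth=4))
model.fit(X, y)
preds = model.predict(X)  # shape: (n_samples, 3)
```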