suflaj

suflaj t1_j7u2qyt wrote

You want eco mode so it runs cooler and more efficiently. As I said, the bottleneck is in the GPU, specifically its memory bandwidth, not in whatever the CPU can transfer. Modern CPUs can easily handle three high-end GPUs at the same time, not just two.

PCIe speed has not been a bottleneck for several years, and will probably never be a bottleneck again with this form factor of GPU. The GPU MEMORY is the bottleneck nowadays.

EDIT: And as someone else has said, yes, you can use fast NVMe drives as swap to avoid loading from disk. There used to be Optane for this kind of thing, but well, that's dead.

2

suflaj t1_j7s57dy wrote

At the moment a 7950X in eco mode combined with a ROG Strix X670E seems to be the best combo.

Running it in x8 mode on PCIe gen 4 doesn't really matter; according to benchmarks, the performance difference is a few percent. It will take just about as long in x16 because it's pretty much the same speed. It will not get significantly faster with a different mobo, either; you're limited by the GPU itself, not the interface.

3

suflaj t1_j7cpf1d wrote

Azure

This is due to two issues both of these have, which Azure mitigates to an extent:

  • they both lack humanity, i.e. at best they can be convincing as human prompt readers, but not as anything else
  • those without a better ear and headphones probably won't notice a certain ring those two have, which a human voice cannot replicate - it might be that this effect is added to make the voices sharper, but ultimately it lets people like me, as well as robovoice detectors, distinguish them as TTS more easily

1

suflaj t1_j7ckp0d wrote

Yes. Although it's impressive in the number of languages and voices, it doesn't match Azure's more expressive prosody. I have listened to far too many robocalls, so that kind of magic is gone for me.

Someone else might consider it more humanlike, as it's all subjective. Have they published benchmark scores yet?

1

suflaj t1_j7c73af wrote

Make no mistake - there is no TTS more humanlike than Azure ATM, but the exact voice was likely fiddled with a bit to get the exact pronunciation, or run through a filter.

Two days ago I was comparing all the state-of-the-art TTS systems, and while Google's Neural2 came close, it does not feature voices similar to the one in the video.

1

suflaj t1_j731s6u wrote

I mean kernels in the sense of functions.

> Why wouldn't GPU parallelization make inference faster?

Because most DL models are deep, not particularly wide. As I've explained already, deep means a long serial chain: each layer has to wait for the previous one's output. That isn't parallelizable beyond data parallelism, which doesn't speed up the latency of a single inference, and model parallelism, which is generally not implemented and carries heavy I/O costs.

Wide models and how they become equivalent to deep ones are unexplored, although they are theoretically just as expressive.
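
Here's a toy PyTorch sketch of that deep-versus-wide distinction; the layer sizes and counts are arbitrary placeholders, just to show where the serial dependency comes from:

```python
import torch
import torch.nn as nn

# Toy illustration: a "deep" stack of small layers versus a "wide" model
# with two large layers. Sizes are arbitrary.
deep = nn.Sequential(*[nn.Linear(256, 256) for _ in range(32)])   # 32 serial matmuls
wide = nn.Sequential(nn.Linear(256, 8192), nn.Linear(8192, 256))  # 2 large matmuls

x = torch.randn(1, 256)  # batch size 1: the latency-bound inference case

# The deep model has to run its layers one after another, since each layer
# needs the previous layer's output; a GPU can only parallelize within each
# small matmul. The wide model exposes one big matmul per layer that the GPU
# can spread across its cores. Data parallelism (bigger batches) raises
# throughput but does nothing for the latency of this single sample.
y_deep = deep(x)
y_wide = wide(x)
```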

1

suflaj t1_j71c8j2 wrote

Generally, no. It would be better to just use all the classes you need now, and then use masks to regulate which classes are being tested at a given moment. The thing you are suggesting, even when done correctly, would not let the model learn about the relationships between different classes.
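
A minimal sketch of the masking idea, assuming a PyTorch-style classifier head; the class counts and tensors are placeholders:

```python
import torch
import torch.nn.functional as F

# Keep a head over all classes you expect to eventually need, and mask out
# the ones that are not active yet when evaluating.
num_classes = 100                      # every class you will eventually need
active = torch.zeros(num_classes, dtype=torch.bool)
active[:10] = True                     # only the first 10 classes exist right now

logits = torch.randn(4, num_classes)   # stand-in for model output on a batch of 4
masked = logits.masked_fill(~active, float("-inf"))
probs = F.softmax(masked, dim=-1)      # inactive classes get zero probability
preds = probs.argmax(dim=-1)           # predictions restricted to active classes
```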

With neural network surgery, it's trivial to downscale, but fairly hard to upscale.

One thing you could test, for example, is clustering your images with features from a vanilla pretrained ResNet. Then, once you need to add new classes, you can look at which images from the new class are most similar to those from the existing classes, and maybe get away with finetuning on only that subset instead of the whole dataset.

Obviously, finalization will include at least one epoch on the whole dataset, but that might not be viable to do n times, while the similarity method will be: you can just adjust the similarity threshold.
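
A rough sketch of that similarity filtering, assuming a vanilla pretrained ResNet-50 from torchvision; the file paths and the 0.8 threshold are placeholders you'd tune:

```python
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

# Embed images with a pretrained ResNet-50 and compare them by cosine similarity.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
resnet.fc = torch.nn.Identity()        # keep pooled features, drop the classifier
resnet.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

@torch.no_grad()
def embed(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return F.normalize(resnet(img), dim=-1)

new_feat = embed("new_class/sample.jpg")       # placeholder path
old_feat = embed("existing_class/sample.jpg")  # placeholder path
similarity = (new_feat @ old_feat.T).item()    # cosine similarity in [-1, 1]
use_for_finetuning = similarity > 0.8          # adjustable threshold
```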

3

suflaj t1_j6zq1k9 wrote

Well, one reason I can think of is custom kernels. To really get the most out of your model's performance, you will likely be optimizing the kernels you use for your layers, sometimes fusing them. A GPU can't adapt to that as well. The best you can do is use TensorRT to optimize for a specific model of GPU, but why do that when you can create, e.g., the optimal CNN kernel in hardware on an FPGA? On a GPU you can only work with the hardware that came with it.

That being said, this is about processing, not necessarily scaling it up. And it may make sense for inference, where it would be nice to have a processor built specifically to run some architecture, one that doesn't necessarily need to process things in large batches.

But for training, obviously nothing is going to beat a GPU/TPU cluster because of pricing and seemingly infinite scaling of GPUs. If money is not a problem you can always just buy more GPUs and your training will be faster. But parallelization will probably not make your inference faster, since the "deep" in DL refers to the long serial chain of processing, and that's where a hardware implementation of the optimized model makes sense.

Ideally, though, you'd want a TPU, not FPGA processors. TPUs are cheaper and you can use them for research as well.

5

suflaj t1_j6hfkdj wrote

> BN is used to reduce covariate shift, it just happened to regularize.

The first part was hypothesized, but never proven. It is a popular belief, like all the other hypotheses about why BN works so well.

> Dropout as a regularizing technique didn't become big before ResNet (2014 vs. 2015).

What does "becoming big" mean? Dropout was introduced in 2012 and has been used ever since. It was never big in the sense that you would always use it.

It is certainly false that Dropout was adopted because of ResNets, or immediately after them, for CNNs, as the first paper showing a benefit of Dropout in convolutional layers only appeared in 2017: https://link.springer.com/chapter/10.1007/978-3-319-54184-6_12

> I doubt what you're saying is true, that they're effectively the same.

I never said that.

0

suflaj t1_j6eqh0b wrote

It depends. If it only learned A to B we say it is overfit. If you give it enough different A to Bs, it might learn to generalize, and then for any A to B pair it will be able to find the path.

If it learned on paths without obstacles, it will not be able to deal with obstacles. That means it will go right through them, or run into them, if your environment does not allow an agent to pass through them.

2

suflaj t1_j63bf1q wrote

Well for starters, it would probably have worse performance due to so many redundant features, and it would be much slower.

Remember that the embedding layer carries loads of overhead, as we're talking V × d matrices. So for a corpus of 250k tokens and an embedding vector of size 768, for example, we're talking about 192M parameters just for the embedding layer. Maybe you could save some space with a sparse embedder, but find me a free implementation of sparse layers that works as well as dense ones. Other than that, the 192M parameters come out to roughly 768 MB in fp32 before compression techniques, and that's just the weights; the gradient, unless sparsified, will be another 768 MB PER BATCH.
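
To make the arithmetic explicit (assuming fp32 parameters):

```python
# Back-of-the-envelope arithmetic for the numbers above.
vocab_size = 250_000
embed_dim = 768

params = vocab_size * embed_dim      # 192_000_000 parameters
megabytes = params * 4 / 1e6         # 4 bytes per fp32 parameter -> ~768 MB
print(f"{params / 1e6:.0f}M params ~ {megabytes:.0f} MB; a dense gradient adds the same again per batch")
```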

This is without mentioning that you would likely need to increase the embedding dim to account for the roughly 8 times bigger vocabulary.

2

suflaj t1_j5zlq6k wrote

Aside from what others have mentioned, let's assume that we don't have a symmetrical situation, i.e. that the range of the function we're learning, as well as the domain of weights and biases, is [0, inf). Then it makes more sense to add the bias than to subtract it, as that leads to smaller weights and less chance of overflow or exploding gradients. For example, to produce an output of 10 from an input of 1 with nonnegative parameters, an additive bias allows W = 5 and b = 5, while a subtractive bias forces W to be at least 10.

It makes more sense to subtract the bias if, in the scenario described above, you want a more expressive layer with less numerical stability. This is because a subtractive bias allows the weights to be of greater magnitude, which in turn gives you more effective range for the weights.

But note that neural networks are not trained with integer weights, and some libraries don't even have autograd for integers.

3

suflaj t1_j5wgdsj wrote

One thing people haven't mentioned is you could create synthetic images via 3D modelling. If you can get someone to set up realistic 3D models of those microchips, and then randomly generate cracks, you can get a pretty good baseline model you can then finetune on real data.

There are companies that could do that too, but I'm not sure the price would be approachable, or whether outsourcing is a viable option given trade secrets. Datagen, for example, is one company that can do it.

1

suflaj t1_j5r5bfw wrote

For the learning rate you should just use a good starting point based on the batch size and architecture, and relegate everything else to the scheduler and optimizer. I don't think there's any point messing with the learning rate once you find one that doesn't blow up your model; just use warmup or plateau schedulers to manage it for you after that.

Since you mentioned Inception, I believe that unless you are using quite big batch sizes, your starting LR should be the magical 3e-4 for Adam or 1e-2 for SGD. Then just use a ReduceLROnPlateau scheduler with, e.g., a patience of 3 epochs, a cooldown of 2, and a factor of 0.1, and probably employ early stopping if the metric doesn't improve after 6 epochs.
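
A minimal PyTorch sketch of that setup; the model, data, and training loop are placeholders, only the scheduler numbers mirror what I described:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                                   # placeholder for your network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)  # the "magical" 3e-4 for Adam
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3, cooldown=2
)

def validate(model):
    # Placeholder validation loss; substitute your own evaluation loop.
    with torch.no_grad():
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        return nn.functional.mse_loss(model(x), y).item()

best_loss, stale_epochs = float("inf"), 0
for epoch in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)         # placeholder training batch
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    val_loss = validate(model)
    scheduler.step(val_loss)                               # LR drops by 0.1x on plateau
    if val_loss < best_loss:
        best_loss, stale_epochs = val_loss, 0
    else:
        stale_epochs += 1
        if stale_epochs >= 6:                              # early stopping after 6 bad epochs
            break
```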

2

suflaj t1_j5qb32y wrote

There is this: https://www.microsoft.com/en-us/research/blog/%C2%B5transfer-a-technique-for-hyperparameter-tuning-of-enormous-neural-networks/

However, it's unlikely to help in your case. The best thing you can do is grid search if you know something about the problem, or just random search otherwise. I prefer random search even if I'm an expert on the problem, ESPECIALLY with ML models.

But I'm curious why it takes a long time. You don't have to train on the whole dataset. Take 10% for training and 10% for validation, or less if the dataset is huge; you just need enough data to learn something. The hyperparameters you find that way are a good enough approximation of the optimal ones.
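
A minimal sketch of random search on a small split; the search ranges and the train_and_eval placeholder are things you'd fill in yourself:

```python
import random

def train_and_eval(lr, weight_decay):
    # Placeholder: train on ~10% of the data, validate on another ~10%,
    # and return the validation metric. Replace with your own training code.
    return random.random()

best_score, best_cfg = float("-inf"), None
for _ in range(20):                                  # 20 random trials
    cfg = {
        "lr": 10 ** random.uniform(-5, -2),          # sample LR log-uniformly
        "weight_decay": 10 ** random.uniform(-6, -2),
    }
    score = train_and_eval(**cfg)
    if score > best_score:
        best_score, best_cfg = score, cfg
```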

Also, it might help to just not tune redundant hyperparameters. Layer sizes are usually redundant, as is almost any hyperparameter in the Adam family of optimizers besides the learning rate and, to a lesser extent, the first momentum. Which ones are you optimizing?

4

suflaj t1_j5mnzq3 wrote

Why would this matter?

If such examples are present in the training set and adequately expressed, then the model will learn whatever it needs to learn from those words.

If they are not in the training set, you should not expect the model to understand them the same way you do.

I realize this defeats the point of generalization, but LLMs learn to mimic generalization through exposure, not by actually learning to understand the underlying principles. These models do not analyze text like we humans do, but they have been shown to outperform the average human despite that.

Ultimately, to do what you are doing, you would need a tokenizer with all the syntactic knowledge for the given subset of the language embedded within it. Wasn't AlexNet, a decade ago, enough to convince you to always relegate these kinds of tasks to the DL model, which will always beat a human provided it has the capacity and the data?

0

suflaj t1_j57te64 wrote

It's not necessarily better, but it will help you if your data is not really abundant...

For example, if you look at it as regression, the model uses your features and tries to figure out how they correlate with the grade. The grade is continuous and monotonic, meaning that if the features contribute to the grade in "sane" ways, it will map easily.

If you consider it a classification problem, then each class basically has its own degree of freedom. This can cause your model to be overconfident, whereas with the regression formulation your model will at least try to fit a continuous, monotonic function.

With the regression task, the loss itself tells your model that grade 2 is better than 1 and worse than 3. With a classification model, because each class is independent, your model can only learn this ordering implicitly through data. That means that if your data is insufficient for the model to learn it, it won't work, whereas with a regression task the model might still interpolate correctly even from insufficient data.
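
A toy sketch of the two framings, with placeholder features and grades, just to show where the ordering lives:

```python
import torch
import torch.nn as nn

features = torch.randn(8, 16)                        # placeholder inputs
grades = torch.tensor([1., 2., 2., 3., 4., 1., 3., 2.])

# Regression: one output, MSE ties nearby grades together on a single axis.
reg_head = nn.Linear(16, 1)
reg_loss = nn.functional.mse_loss(reg_head(features).squeeze(-1), grades)

# Classification: four independent logits; cross-entropy treats 1 vs 2
# exactly like 1 vs 4, so the ordering has to be learned from data.
cls_head = nn.Linear(16, 4)
cls_loss = nn.functional.cross_entropy(cls_head(features), grades.long() - 1)
```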

1

suflaj t1_j57gnky wrote

Well, this is a regression task, not classification. You could classify 1, 2, 3, and 4 for each output, but it seems like they are continuous. You can always just round and clamp your result, e.g. with y = max(1, min(4, ceil(x + 0.5))). With classification you could argmax a class, but then you'll overfit more easily. You would probably benefit from the bias of the regression task itself, which tells the algorithm that 2 is close to 3 and 1, but far away from 4.

1