sigmoid_amidst_relus t1_j1qbo3x wrote

MFCCs are more prone to noise than melspectrograms.

They're better suited to classical methods (primarily due to their lower dimensionality, in my opinion), but most recent papers don't use MFCCs: they either go raw waveform or melspectrograms.
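
If it helps, here's a minimal sketch of computing both feature types with torchaudio; the file path and the parameter values (400-sample FFT, 160-sample hop, 80 mel bins, 13 coefficients) are just common defaults I picked for illustration, not anything from your setup.

```python
# Mel spectrogram vs. MFCC features with torchaudio (illustrative defaults).
import torchaudio

waveform, sr = torchaudio.load("utterance.wav")   # placeholder path

# Log-mel spectrogram: (channels, n_mels, n_frames)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sr, n_fft=400, hop_length=160, n_mels=80
)
log_mel = torchaudio.transforms.AmplitudeToDB()(mel(waveform))

# MFCCs computed from the same mel settings: (channels, n_mfcc, n_frames)
mfcc = torchaudio.transforms.MFCC(
    sample_rate=sr, n_mfcc=13,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 80},
)
coeffs = mfcc(waveform)
```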

You have a 400-600 hour dataset and utterance-level labels. For the task at hand, option 2 is the best, and there are a lot of variations of it you can try.

You can experiment with a frame-wise setup: repeat the utterance-level label for every frame computed from the utterance and train on one frame and its corresponding label at a time. Or take a sequence of frames and their label. Or take crops of different sizes from each utterance. There are a lot of options. At test time, just aggregate the per-frame predictions back into a single prediction per utterance.
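
To make the frame-wise idea concrete, here's a rough PyTorch sketch; the data layout (a list of per-utterance frame tensors plus integer labels) and the mean-probability aggregation are my assumptions for illustration, not the only way to do it.

```python
# Frame-wise training setup: repeat the utterance label for every frame,
# then aggregate frame predictions per utterance at test time.
import torch
from torch.utils.data import Dataset

class FrameWiseDataset(Dataset):
    """utterances: list of (n_frames_i, n_feats) tensors, e.g. log-mel frames.
    labels: list of integer utterance-level labels."""

    def __init__(self, utterances, labels):
        self.frames = torch.cat(utterances, dim=0)          # (total_frames, n_feats)
        self.labels = torch.cat([
            torch.full((u.shape[0],), y, dtype=torch.long)  # label repeated per frame
            for u, y in zip(utterances, labels)
        ])
        # remember which utterance each frame came from, for test-time aggregation
        self.utt_ids = torch.cat([
            torch.full((u.shape[0],), i, dtype=torch.long)
            for i, u in enumerate(utterances)
        ])

    def __len__(self):
        return self.frames.shape[0]

    def __getitem__(self, idx):
        return self.frames[idx], self.labels[idx]

@torch.no_grad()
def utterance_prediction(model, utterance_frames):
    """One way to aggregate: average per-frame class probabilities, then argmax."""
    probs = model(utterance_frames).softmax(dim=-1)         # (n_frames, n_classes)
    return probs.mean(dim=0).argmax().item()
```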

Use whatever DNN you want; I suggest starting simple with CNNs.

Either way, you'll have to experiment and see how this affects performance.

It's possible that an utterance-level "mental state" classifier ends up leveraging semantic information because your dataset doesn't have enough speakers. That's even more likely if your dataset is an "acted" dataset. You'll end up doing well on the benchmark, but your model isn't learning jack shit. Therein lies a big problem with the speech emotion classification domain. And if the model doesn't fall into that trap, it's just aggregating actual emotional-state information over time anyway, so why not do that more efficiently.

Mental state is not reflected only at the level of the entire utterance: if a person is, say, sad, it shows throughout their voice at the sub-utterance segment level.

Edit

Also, apart from all this jazz, there's also the option of using features from large pretrained models and training downstream classifiers on them.
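
For reference, this is roughly what that option looks like using torchaudio's wav2vec 2.0 pipeline as the pretrained model; the specific checkpoint, the file path, and the mean-pooling over time are my example choices, not a recommendation of any particular model.

```python
# Extract features from a pretrained speech model, then train any downstream
# classifier (linear probe, small MLP, etc.) on the pooled embeddings.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE        # example pretrained model
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("utterance.wav")    # placeholder path
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.no_grad():
    # list of per-layer feature tensors, each (1, n_frames, feature_dim)
    features, _ = model.extract_features(waveform)

# Pool one layer over time to get an utterance-level embedding for the classifier
utterance_embedding = features[-1].mean(dim=1)     # (1, feature_dim)
```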

Also, finally, read papers (especially if you're in industry). I've basically regurgitated the current state of the field here; all this information is out there. I'm not being passive-aggressive or cheeky: if you're in industry, I know where you're coming from, but you'll have to roll your sleeves up and read papers. If you're in research, again, I know where you're coming from, but that's the way you gotta do it.

2

sigmoid_amidst_relus t1_j0wsqyz wrote

A 3090 is not as good as an A100 in terms of pure performance.

It's much better than an A100 in perf/$, though.

A single consumer-grade deep learning node won't scale past 3x 3090s without diminishing returns, unless all you work with are datasets that fit in memory or you have a great storage solution. Top-end prosumer and server-grade platforms will do fine with up to 4-6x in a non-rack-mounted setting, but not without custom cooling.

The problem isn't just how well you can feed the GPUs: 3090s simply aren't designed to run at the high densities that server-grade cards are. That's why companies are happy to pay a pretty penny for A100s and other server-grade cards (even ignoring certification requirements and Nvidia's mandates): the infrastructure and running costs of a good-quality server facility far outweigh GPU costs and the money lost to potential downtime.

Multi-node setups are connected through high-bandwidth interconnects, like Mellanox InfiniBand gear.

Most mining farms don't run GPUs at full PCIe x16 because mining barely needs PCIe bandwidth, so you're not going to scale the way they do.

You can certainly scale to a 64-GPU "farm", but it's going to be a pain in a consumer-grade-only setup, especially in terms of interconnects, not to mention terribly space- and cooling-inefficient.

3

sigmoid_amidst_relus t1_ix8gx9z wrote

Although you've gotten some good answers, here are some things I've learned in the past 1.5 years working with transformers on audio and speech data.

  1. The learning rate schedule matters more with audio data that is more "in the wild", i.e. has large variations in SNR.
  2. Is your music data loudness-normalized? It might help, although following point 3 should take care of it.
  3. While training without standardizing the data (zero mean, unit std) works, standardizing has proven critical for consistent training runs on spectral data in my setup. Without it, while there was not much difference in the best runs, my model would give very different results for different seeds. I'd recheck that your data is mean/std-normalized correctly, and if you aren't doing it, you should. You can do it at either the per-instance or the dataset level (computing mean/std statistics over the entire dataset), and standardize every frequency bin independently or not, based on your use case (see the first sketch after this list).
  4. Keep an eye on your gradient norms during training to check if your learning rate schedule is appropriate or not.
  5. Use linear warmup. Also, try Adam or AdamW if you're not already; SGD will need significantly more hyperparameter tuning for transformers (see the second sketch after this list).
  6. Just in case you're doing this, do not use different depthwise learning rates if training from scratch.
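
Here's a rough sketch of point 3, dataset-level, per-frequency-bin standardization; the NumPy layout (one (n_frames, n_bins) array per utterance) and function names are my assumptions for illustration.

```python
# Per-frequency-bin standardization of spectral features.
import numpy as np

def fit_bin_stats(spectrograms):
    """Dataset-level statistics: per-bin mean/std over all frames of all utterances."""
    stacked = np.concatenate(spectrograms, axis=0)   # (total_frames, n_bins)
    return stacked.mean(axis=0), stacked.std(axis=0) + 1e-8

def standardize(spec, mean, std):
    """Zero-mean, unit-std per bin; apply the *training-set* stats everywhere."""
    return (spec - mean) / std

def standardize_per_instance(spec):
    """Per-instance alternative: each utterance uses its own statistics."""
    return (spec - spec.mean(axis=0)) / (spec.std(axis=0) + 1e-8)
```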
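And a rough sketch of points 4-5 together: AdamW with linear warmup, logging the gradient norm every step; the warmup/total step counts, the toy stand-in model, and the clipping value are placeholders, not recommendations.

```python
# AdamW + linear warmup schedule, with per-step gradient-norm monitoring.
import torch

model = torch.nn.Linear(80, 4)   # stand-in for your DNN
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)

warmup_steps, total_steps = 1_000, 100_000

def lr_lambda(step):
    if step < warmup_steps:                      # linear warmup
        return step / max(1, warmup_steps)
    # linear decay afterwards (cosine is another common choice)
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

def train_step(batch_x, batch_y):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(batch_x), batch_y)
    loss.backward()
    # returns the total gradient norm (and clips in place); track it to
    # sanity-check whether the LR schedule is appropriate
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
    return loss.item(), grad_norm.item()
```
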
3