iamnotlefthanded666 t1_iw7vatj wrote on November 13, 2022 at 5:14 PM

Reply to comment by IntelArtiGen in [D] When was the last time you wrote a custom neural net? by cautioushedonist

Can you elaborate (task, input, output, architecture) on the audio spectrogram auto encoder thing if you don't mind?

IntelArtiGen t1_iw860x6 wrote on November 13, 2022 at 6:25 PM

Task: reproduce how humans learn new words from images and sounds. I used 3 models. For the autoencoder, the task was just to rebuild the input (the loss is a distance between the original spectrogram and the rebuilt one).
Input: video (multiple images and sounds in a continuous stream + real-time constraint)
The input of the audio autoencoder is the sound from the mic (as a mel spectrogram of that sound); the output is the rebuilt mel spectrogram (autoencoding task).
Architecture: for audio I just used convolutions to compress the spectrogram and transposed convolutions to rebuild it.

So I just stacked multiple convolutions and "deconvolutions". I ran some hyperparameter optimization, but the architecture is not SOTA (that wasn't the goal); I just needed a model that could autoencode mel spectrograms of human voices in real time. I wanted to use a vocal synthesizer, but they didn't fit my constraints.
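To make the "stacked convolutions and transposed convolutions" idea concrete, here is a minimal PyTorch sketch of that kind of autoencoder. It is not the actual model: the class name, the number of layers, the channel widths, the 64 mel bands x 96 frames input size, and the use of MSE as the "distance" are all assumptions picked for illustration.

```python
import torch
from torch import nn


class SpectrogramAutoencoder(nn.Module):
    """Small convolutional autoencoder for mel spectrograms (illustrative only)."""

    def __init__(self, latent_channels: int = 32):
        super().__init__()
        # Encoder: each stride-2 convolution halves the mel and time axes.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, latent_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        # Decoder: each stride-2 transposed convolution doubles both axes.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 16, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))


torch.manual_seed(0)
model = SpectrogramAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # assumed distance; any reconstruction loss would fit here

# Random placeholder batch standing in for real mel spectrograms:
# (batch, channel, mel bands, time frames). Both axes must be divisible by 4
# for the output to match the input size with these layers.
batch = torch.randn(8, 1, 64, 96)

# One training step: rebuild the input and minimise the reconstruction error.
reconstruction = model(batch)
loss = loss_fn(reconstruction, batch)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

With two stride-2 layers the 64 x 96 spectrogram is compressed to a 16 x 24 feature map before being rebuilt. For a real-time setup you would feed short windows of the mic signal through the same forward pass.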

iamnotlefthanded666 t1_iwdsa30 wrote on November 14, 2022 at 10:05 PM

Thanks for the elaborate answer.