
IntelArtiGen t1_iw860x6 wrote

  • Task: reproduce how humans learn new words from images and sounds. I used 3 models. For the autoencoder, the task was just to rebuild the input (the loss is a distance between the original spectrogram and the reconstructed spectrogram)
  • Input: video (multiple images and sounds in a continuous stream + a real-time constraint)
  • Input of the audio autoencoder is the sound from the mic (the mel spectrogram of that sound); output is the reconstructed mel spectrogram (autoencoding task). See the mel-spectrogram sketch after this list
  • Architecture: for audio I just used convolutions to compress the spectrogram and transposed convolutions to rebuild it
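
To make the input concrete, here is a minimal sketch of turning mic audio into a log-mel spectrogram, assuming torchaudio; the sample rate and mel parameters (n_fft, hop_length, n_mels) are illustrative defaults, not my actual settings:

```python
import torch
import torchaudio

# Hypothetical settings; the actual sample rate and mel parameters
# from my project are not given here, these are common defaults.
SAMPLE_RATE = 16000
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

# One second of mono audio from the mic, as a (1, SAMPLE_RATE) tensor.
waveform = torch.randn(1, SAMPLE_RATE)

# Log-mel spectrogram, shape (1, n_mels, n_frames).
mel = torch.log(mel_transform(waveform) + 1e-6)
```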

So I just stacked multiple convolutions and "deconvolutions" (transposed convolutions). I ran some hyperparameter optimization, but the architecture is not SOTA (that wasn't the goal); I just needed a model that could autoencode mel spectrograms of human voices in real time. I wanted to use a vocal synthesizer instead, but none fit my constraints.
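
A minimal sketch of that kind of autoencoder, assuming PyTorch; the layer counts, channel widths, and MSE loss are my guesses for illustration, since the comment only says convolutions were stacked with transposed convolutions and the loss is a distance between spectrograms:

```python
import torch
import torch.nn as nn

class MelAutoencoder(nn.Module):
    """Convolutional autoencoder over log-mel spectrograms.

    Depth and channel widths are hypothetical; the original only
    describes "convolutions to compress, transposed convolutions
    to rebuild".
    """
    def __init__(self):
        super().__init__()
        # Encoder: strided convs compress the (1, n_mels, n_frames) input.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
        )
        # Decoder: transposed convs rebuild the spectrogram.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MelAutoencoder()
mel = torch.randn(8, 1, 80, 128)           # batch of log-mel spectrograms
recon = model(mel)
loss = nn.functional.mse_loss(recon, mel)  # distance between original and rebuilt
```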
