Submitted by cautioushedonist t3_yto34q in MachineLearning
iamnotlefthanded666 t1_iw7vatj wrote
Reply to comment by IntelArtiGen in [D] When was the last time you wrote a custom neural net? by cautioushedonist
Can you elaborate (task, input, output, architecture) on the audio spectrogram auto encoder thing if you don't mind?
IntelArtiGen t1_iw860x6 wrote
- Task: reproduce how humans learn new words from images and sounds. I used 3 models. For the autoencoder, the task was just to rebuild the input (the loss is a distance between the original spectrogram and the rebuilt spectrogram).
- Input: video (multiple images and sounds in a continuous stream + a real-time constraint).
- Input of the audio autoencoder is the sound from the mic (the mel spectrogram of that sound); output is the reconstructed mel spectrogram (autoencoding task).
- Architecture: for audio I just used convolutions to compress the spectrogram and transposed convolutions to rebuild it.
So I just stacked multiple convolutions and "deconvolutions". I ran some hyperparameter optimization, but the architecture isn't SOTA (that wasn't the goal); I just needed a model that could autoencode mel spectrograms of human voices in real time. I wanted to use a vocal synthesizer, but none fit my constraints.
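A rough sketch of what that kind of setup looks like in PyTorch, if it helps picture it. The layer sizes, sample rate, and spectrogram parameters here are illustrative placeholders, not my actual configuration:

```python
# Minimal sketch (not the exact model): a convolutional autoencoder over
# mel spectrograms, trained with a reconstruction (distance) loss.
# All hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn
import torchaudio

# Mel-spectrogram front end: raw mic audio -> (n_mels, time) spectrogram.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
)

class SpecAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: stacked strided convolutions compress the spectrogram.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder: transposed convolutions ("deconvolutions") rebuild it.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SpecAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # distance between original and rebuilt spectrogram

# One training step on a dummy 1-second audio clip (stand-in for mic input).
waveform = torch.randn(1, 16000)
spec = mel(waveform).log1p().unsqueeze(0)    # (batch, 1, n_mels, frames)
spec = spec[..., :spec.shape[-1] // 8 * 8]   # crop so three stride-2 stages round-trip cleanly
optimizer.zero_grad()
recon = model(spec)
loss = loss_fn(recon, spec)
loss.backward()
optimizer.step()
```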
iamnotlefthanded666 t1_iwdsa30 wrote
Thanks for the elaborate answer.