
IntelArtiGen t1_iw860x6 wrote

  • Task: reproduce how humans learn new words from images and sounds. I used 3 models. For the autoencoder, the task was just to rebuild the input (the loss is a distance between the original spectrogram and the reconstructed spectrogram)
  • Input: video (multiple images and sounds in a continuous stream + a real-time constraint)
  • Input of the audio autoencoder is the sound from the mic (the mel spectrogram of that sound); output is the reconstructed mel spectrogram (autoencoding task). See the mel-spectrogram sketch after this list
  • Architecture: for audio I just used convolutions to compress the spectrogram and transposed convolutions to rebuild it
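
To make the input concrete, here is a minimal sketch of turning mic audio into a log-mel spectrogram, assuming torchaudio; the sample rate and mel parameters (n_fft, hop_length, n_mels) are illustrative defaults, not my actual settings:

```python
import torch
import torchaudio

# Hypothetical settings; the actual sample rate and mel parameters
# from my project are not given here, these are common defaults.
SAMPLE_RATE = 16000
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

# One second of mono audio from the mic, as a (1, SAMPLE_RATE) tensor.
waveform = torch.randn(1, SAMPLE_RATE)

# Log-mel spectrogram, shape (1, n_mels, n_frames).
mel = torch.log(mel_transform(waveform) + 1e-6)
```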

So I just stacked multiple convolutions and "deconvolutions" (transposed convolutions). I ran some hyperparameter optimization, but the architecture is not SOTA (that wasn't the goal); I just needed a model that could autoencode mel spectrograms of human voices in real time. I wanted to use a vocal synthesizer instead, but none fit my constraints.
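
A minimal sketch of that kind of autoencoder, assuming PyTorch; the layer counts, channel widths, and MSE loss are my guesses for illustration, since the comment only says convolutions were stacked with transposed convolutions and the loss is a distance between spectrograms:

```python
import torch
import torch.nn as nn

class MelAutoencoder(nn.Module):
    """Convolutional autoencoder over log-mel spectrograms.

    Depth and channel widths are hypothetical; the original only
    describes "convolutions to compress, transposed convolutions
    to rebuild".
    """
    def __init__(self):
        super().__init__()
        # Encoder: strided convs compress the (1, n_mels, n_frames) input.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),
        )
        # Decoder: transposed convs rebuild the spectrogram.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = MelAutoencoder()
mel = torch.randn(8, 1, 80, 128)           # batch of log-mel spectrograms
recon = model(mel)
loss = nn.functional.mse_loss(recon, mel)  # distance between original and rebuilt
```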
