Submitted by cautioushedonist t3_yto34q in MachineLearning
IntelArtiGen t1_iw5gjpi wrote
When needed, I usually take an existing architecture and adapt only small parts of it to my task. I have also written a custom autoencoder, layer by layer, for audio spectrograms (I couldn't find an existing model that met my constraints), and a model that converts embeddings from one self-supervised model to another (not a complex architecture), with all three models training simultaneously.
Tbh I would prefer to use existing architectures, because redesigning an architecture takes a long time to build, optimize, and train, but existing models are often tightly adapted to one task and perform badly on unexpected new tasks. You may also have constraints (real-time, memory efficiency, etc.) that easy-to-reuse published models don't take into account.
Images have plenty of pretrained CNNs, but if you want a model that does self-supervised continual learning and real-time inference on images with just one RTX card, it's harder to find an existing optimized solution for that task.
iamnotlefthanded666 t1_iw7vatj wrote
Can you elaborate (task, input, output, architecture) on the audio spectrogram auto encoder thing if you don't mind?
IntelArtiGen t1_iw860x6 wrote
- Task: reproduce how humans learn new words from images and sounds. I used 3 models. For the autoencoder, the task was just to rebuild the input (the loss is a distance between the original spectrogram and the rebuilt spectrogram)
- Input: video (multiple images and sounds in a continuous stream, plus a real-time constraint)
- Input of the audio autoencoder is the sound from the mic (the mel spectrogram of that sound); output is the rebuilt mel spectrogram (autoencoding task).
- Architecture: for audio I just used convolutions to compress the spectrogram and transposed convolutions to rebuild it
So I just stacked multiple convolutions and "deconvolutions". I ran some hyperparameter optimization, but the architecture is not SOTA (that wasn't the goal); I just needed a model that could autoencode mel spectrograms of human voices in real time. I wanted to use a vocal synthesizer, but they didn't fit my constraints.
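The stacked-convolutions idea can be sketched roughly as below. This is a hypothetical minimal version, not the commenter's actual model: layer counts, channel widths, and the 80-mel input size are assumptions, and the loss is plain MSE as one example of "a distance between original and rebuilt spectrogram":

```python
import torch
import torch.nn as nn

class SpectrogramAutoencoder(nn.Module):
    """Strided convolutions compress the mel spectrogram;
    transposed convolutions rebuild it. Sizes are illustrative."""

    def __init__(self):
        super().__init__()
        # Treat the (mel bins, frames) spectrogram as a 1-channel image.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = SpectrogramAutoencoder()
x = torch.randn(1, 1, 80, 128)           # (batch, channel, mel bins, frames)
recon = model(x)
loss = nn.functional.mse_loss(recon, x)  # distance between original and rebuilt
print(recon.shape)                       # torch.Size([1, 1, 80, 128])
```

With stride-2 layers and `output_padding=1` on the transposed convolutions, each decoder stage exactly doubles what the matching encoder stage halved, so the reconstruction has the same shape as the input; a model this small is also cheap enough that real-time inference on one GPU is plausible.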
iamnotlefthanded666 t1_iwdsa30 wrote
Thanks for the detailed answer.