Submitted by valdanylchuk t3_xy3zfe in MachineLearning

LM-based; in contrast to other recent audio generation experiments, which worked from transcribed text or MIDI notes, AudioLM works directly on the audio signal, resulting in outstanding consistency and high-fidelity sound.

Google blog post from yesterday: https://ai.googleblog.com/2022/10/audiolm-language-modeling-approach-to.html

Demo clip on Youtube: https://www.youtube.com/watch?v=_xkZwJ0H9IU

Paper: https://arxiv.org/abs/2209.03143

Abstract:

>We introduce AudioLM, a framework for high-quality audio generation with long-term consistency. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure, and we propose a hybrid tokenization scheme to achieve both objectives. Namely, we leverage the discretized activations of a masked language model pre-trained on audio to capture long-term structure and the discrete codes produced by a neural audio codec to achieve high-quality synthesis. By training on large corpora of raw audio waveforms, AudioLM learns to generate natural and coherent continuations given short prompts. When trained on speech, and without any transcript or annotation, AudioLM generates syntactically and semantically plausible speech continuations while also maintaining speaker identity and prosody for unseen speakers. Furthermore, we demonstrate how our approach extends beyond speech by generating coherent piano music continuations, despite being trained without any symbolic representation of music.
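
To make the pipeline in the abstract concrete: audio is mapped to coarse "semantic" tokens (from a masked audio model, w2v-BERT in the paper) plus fine "acoustic" tokens (from the SoundStream neural codec), and generation is then just autoregressive next-token prediction over those token streams, decoded back to a waveform. Below is a rough, illustrative sketch; the tokenizers and the toy bigram model are placeholders standing in for the real w2v-BERT, SoundStream, and Transformer stages, not AudioLM's actual code:

```python
# Illustrative sketch of the tokenize-then-language-model idea from the abstract.
# Everything here is a stand-in (toy quantizers, a bigram "LM"), not the real models.

import numpy as np

rng = np.random.default_rng(0)

def semantic_tokens(waveform, frame=1600, vocab=1024):
    """Placeholder for coarse 'semantic' tokens capturing long-term structure.
    Real AudioLM derives these from a masked audio LM (w2v-BERT)."""
    frames = waveform[: len(waveform) // frame * frame].reshape(-1, frame)
    # Toy quantizer: bucket each frame's mean energy into a token id.
    return (np.abs(frames).mean(axis=1) * 1e4).astype(int) % vocab

def acoustic_tokens(waveform, frame=320, vocab=1024):
    """Placeholder for fine 'acoustic' tokens carrying high-fidelity detail.
    Real AudioLM takes these from a neural codec (SoundStream)."""
    frames = waveform[: len(waveform) // frame * frame].reshape(-1, frame)
    return (np.abs(frames).max(axis=1) * 1e4).astype(int) % vocab

class ToyTokenLM:
    """Stand-in for the autoregressive Transformer stages: given a prompt of
    tokens, keep sampling the next token (here from a smoothed bigram model)."""
    def __init__(self, vocab=1024):
        self.counts = np.ones((vocab, vocab))  # Laplace-smoothed bigram counts

    def fit(self, tokens):
        for a, b in zip(tokens[:-1], tokens[1:]):
            self.counts[a, b] += 1

    def continue_from(self, prompt, n_steps):
        out = list(prompt)
        for _ in range(n_steps):
            probs = self.counts[out[-1]] / self.counts[out[-1]].sum()
            out.append(int(rng.choice(len(probs), p=probs)))
        return out

# "Training" on raw audio: no transcripts, no MIDI, just waveforms.
audio = rng.standard_normal(16000 * 10)   # 10 s of fake 16 kHz audio
sem = semantic_tokens(audio)
lm = ToyTokenLM()
lm.fit(sem)

# Generation: continue the semantic tokens from a short prompt; the real system
# then conditions acoustic-token stages on them and decodes back to audio.
prompt = sem[:30]
continuation = lm.continue_from(prompt, n_steps=100)
print(len(continuation), "semantic tokens generated from a 30-token prompt")
```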

100

Comments

Flag_Red t1_irfrost wrote

This really does pass the audio-continuation Turing test.

20

E_Snap t1_irg5x3z wrote

That is incredible. And to think I was watching musicians joke about graphic artists’ job security just a few days ago.

29

progressgang t1_irggpsb wrote

It’s not the achievements themselves that astound me with ML; it’s the rate at which they happen. The cycle between “prediction by expert in field” and “we created something to fulfil that prediction” gets shorter and shorter at a crazy rate. The craziest part is that the vast majority of people are entirely unaware this is happening. There is, without doubt, a massive opportunity to capitalise on that information gap.

25

jazmaan t1_irgo3n5 wrote

Funny thing is, when I first got into AI Art and ML, it was through a question I asked on Reddit almost two years ago. And it's still my dream.

"Would it be possible to train an AI on high quality recordings of Jimi Hendrix live in concert, and then have the AI listen to a crappy audience bootleg and make it sound like a high quality recording?"

AI Art was still in its infancy back then, but the people who offered their opinions on my question were the same ones on the cutting edge of VQGAN+CLIP. It still looks like the answer to my question is "Someday, but probably not within the next five years". But hope springs eternal! Someday that crappy recording of Jimi in Phoenix (one of the best sets he ever played) may be transformed into something that sounds as good as Jimi at Woodstock!

13

valdanylchuk OP t1_iri3civ wrote

…and prepare a suitable dataset, and train the model. Those are huge parts of the effort.

With big companies teasing stuff like this (AlphaZero, GPT-3, DALL-E, etc.) all the time, I wonder whether the open community could come up with a modern-day equivalent of GNU/GPL, backed by a non-profit GPU-time donation fund, to build practical open-source replicas of important projects.

3

PC-Bjorn t1_isnicrn wrote

Soon, we might be upscaling beyond higher bitrate, bit depth, and fidelity into multi-channel reproductions, or maybe even into individual streams for each instrument and actor on stage, plus a volumetric model of the stage layout itself, allowing us to render the experience as it would be heard from any coordinate on, or around, the stage.

Pair that with a real-time, hardware-accelerated reproduction of the visual experience of being there, based on a network trained on photos from the concert, and we'll all be able to go to Woodstock in 1969.

2