TwitchTvOmo1 OP t1_j8vrlmj wrote

I agree. I checked almost every VALL-E sample, and Eleven Labs is simply more realistic, with more varied and natural inflections in the tone of voice.

The one thing VALL-E seems to do better is the voice cloning. It also preserves the acoustic character of the original sample in the cloned result (noise profile, EQ profile, etc.). But it's debatable whether that should be called a feature or a bug: one could argue that crystal-clear, pro-level recording quality on the cloned voice is the desired outcome.

Of course, if your use case is fooling people with the cloned voice, then yeah, you'd care about preserving the noise/EQ profile of the original sample too.

I also didn't like the "emotions" settings much as the outputs weren't very natural.
