Mysterious_Ad_8286 t1_j8t15jq wrote on February 16, 2023 at 7:18 PM

Microsoft has their own text-to-speech (Wall-E), which is significantly better than even ElevenLabs' models, so they would probably use that. But they're probably already testing as many possibilities as they can dream up internally.

PM_ME_A_STEAM_GIFT t1_j8teq6t wrote on February 16, 2023 at 8:41 PM

FYI it's VALL-E. The other one is the movie.

Stijn t1_j8typvu wrote on February 16, 2023 at 10:47 PM

That VALL-E is uncanny.

flyinSpaghetiMonstr t1_j8vbdph wrote on February 17, 2023 at 5:03 AM

Thanks for the link, but I honestly think that ElevenLabs sounds better. You can still hear a robotic quality in VALL-E's voices. What's good about it is that it tries to add emotion, but some of the emotions, like amused, sounded pretty rough.

TwitchTvOmo1 OP t1_j8vrlmj wrote on February 17, 2023 at 8:12 AM

I agree. I checked almost every VALL-E sample, and ElevenLabs is simply more realistic, with more varied and natural inflections in the tone of voice.

The one thing that VALL-E seems to do better is the voice cloning. It also keeps the acoustic character of the original sample in the cloned result (noise profile, EQ profile, etc.). But it's debatable whether that should be called a feature or a bug. One could argue that crystal-clear, pro-level recording quality on the cloned voice is the desired outcome.

Of course, if your use case is fooling people with the cloned voice, then yeah, you'd care about preserving the noise/EQ profile of the original sample too.

I also didn't like the "emotions" settings much, as the outputs weren't very natural.