gantork t1_j8t7gq8 wrote

I expect this by the end of the year at the latest. Was just reading about a Whisper implementation that works in real time with no delay (it can do 1hr of audio in 10 seconds), could be really useful for something like this.
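A quick sanity check on that claim: 1 hour of audio in 10 seconds works out to a ~360x real-time factor, i.e. hundreds of seconds of audio processed per second of wall-clock time — far more than enough headroom for live transcription. A minimal sketch of the arithmetic:

```python
# Sanity check on the claimed Whisper throughput: 1 hour of audio
# transcribed in 10 seconds corresponds to a ~360x real-time factor.
def real_time_factor(audio_seconds: float, processing_seconds: float) -> float:
    """Seconds of audio processed per second of wall-clock time."""
    return audio_seconds / processing_seconds

rtf = real_time_factor(audio_seconds=3600, processing_seconds=10)
print(rtf)  # 360.0 -- anything above 1.0 can keep up with a live stream
```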

9

TwitchTvOmo1 OP t1_j8t7w4n wrote

The only limitation I see currently isn't how long it takes to generate audio. I'm sure that will be taken care of. It's how long it takes an LLM to generate a response. I haven't tried Bing yet but with ChatGPT it's always 5+ seconds.

For a "realistic" conversation with an AI to be immersive, you need realistic response time. Which would be under 0.5 seconds. Not sure if any LLM can handle that by the end of the year.

5

ShowerGrapes t1_j8tdfe8 wrote

i'm not sure it has to be in real time. if you think about it, people use all kinds of ways to fill time before they finally get to the point, after innumerable little pauses, sidebars and parentheticals (like this). i'm guessing it will take some complex "manager" neural network that handles real-time "small talk" while it translates, parses and discretely separates data in order to facilitate responses. a sufficiently complex one that is able to adjust its simpler UI neural net, one that can "learn" and remember who it was talking to, an imperfect state that occasionally makes mistakes, would be functionally no different from a human being in whatever medium of interaction other than reality alpha. a vr avatar of its own design would be icing on the cake.

it will also be functionally a higher being at that point. we're organizing a religion to get the jump on it over in the /r/CircuitKeepers sub.

3

TwitchTvOmo1 OP t1_j8tdung wrote

>i'm not sure it has to be in real time. if you think about it, people use all kinds of ways to fill time before they finally get to the point, after innumerable little pauses, sidebars and parentheticals (like this)

Definitely. What I'm saying is that, if we want full immersion, it will at the very least need to be able to respond as fast as a human. And in natural conversation that is often nearly instant.

And of course, even when it gets to the point where it can respond instantly, to keep the realism it will need a logic system that decides how long to pretend it's "thinking" before it starts voicing its response, based on the nature of the conversation and how long a regular human would need to think before responding to a particular statement.
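That "pretend thinking" logic could be as simple as scaling a pause to the length of the incoming message, with a little randomness so it doesn't feel robotic. A minimal sketch — the function names and every tuning constant here are made up for illustration, not from any real system:

```python
import random
import time

def thinking_delay(message: str, min_s: float = 0.3, max_s: float = 2.5) -> float:
    """Pick a human-plausible 'thinking' pause for a given incoming message.

    Longer statements earn a longer pause; jitter avoids identical,
    robotic-feeling delays. All constants are made-up tuning values.
    """
    words = len(message.split())
    base = 0.3 + 0.04 * words             # longer statements get longer "thought"
    jitter = random.uniform(-0.15, 0.15)  # small variation between responses
    return max(min_s, min(max_s, base + jitter))

def respond(message: str, reply_audio) -> None:
    time.sleep(thinking_delay(message))  # pretend to think before speaking
    reply_audio.play()                   # then start voicing the reply (stub)
```

In a real pipeline the delay would presumably also overlap with actual LLM/TTS generation time, so the fake pause only covers whatever latency is left over.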

3

ShowerGrapes t1_j8tfcvz wrote

the easiest would be to go R2 instead of C-3PO: give it some cute "animations" while it's waiting for a response, or maybe just a hoarse "working..." like in the original star trek.

1

htaming t1_j8thgic wrote

Replika has a great voice and AR talking interface.

2

blueSGL t1_j8urq1q wrote

> I haven't tried Bing yet but with ChatGPT it's always 5+ seconds.

> For a "realistic" conversation with an AI to be immersive, you need realistic response time.

"just a second..."

"keyboard clacking.... mouse clicks.... another mouse click.... more keyboard noises"

"Sorry about all this, the system is being slow today. Can I put you on hold?"

5 seconds is faster than some agents I've dealt with (not their fault, computer systems can be absolute shit at times)

2

FpRhGf t1_j8v9gfh wrote

Do you mean 5+ seconds to finish the entire text? Because ChatGPT's generation was always instant and fast for me until they had constant server overload from the traffic. The time it took to generate entire paragraphs was faster than any TTS reading it in 2x speed.

The slow response nowadays is just an issue stemming from too many people using it at the same time and prioritising the paid version over the free one. ChatGPT was already good in its response time during the first few weeks. But I've yet to hear a TTS that can generate audio right off the bat without waiting for a few seconds.

2