I am working on a speech-to-text project, and I want to recognise different voices so I know which person said what and can write the conversation out as text with the speakers' names. I have not found any parameter that distinguishes human voices mathematically. Is there a way to do this? There can be any number of people in the conversation.

Comments

VectorSpaceModel t1_iqwjnnd wrote on October 3, 2022 at 4:52 PM

#22,407

Google open-sourced this:

https://arxiv.org/abs/1810.04719

https://ai.googleblog.com/2018/11/accurate-online-speaker-diarization.html?m=1

theLanguageSprite t1_iqxs1u4 wrote on October 3, 2022 at 9:37 PM

#24,123

Replying to VectorSpaceModel (#22,407)

Would you have to train this on your own data before running inference, or could the model be used out of the box?

gulab__jamun t1_iqxugpd wrote on October 3, 2022 at 9:54 PM

#24,230

You can use the pyannote Python library. It will identify the different speakers in the audio and create small audio files for those speakers.

VectorSpaceModel t1_iqy4s4w wrote on October 3, 2022 at 11:10 PM

#24,590

Replying to theLanguageSprite (#24,123)

No clue. Read up!

VectorSpaceModel t1_iqy50kf wrote on October 3, 2022 at 11:12 PM

#24,597

Replying to theLanguageSprite (#24,123)

Looks like you can get it out of the box here: https://github.com/google/uis-rnn

DBCon t1_iqyonbz wrote on October 4, 2022 at 1:43 AM

#25,492

Without knowing much about the subject, my immediate thought goes to spectral analysis.

Start by creating a spectrogram of the waveform: essentially, get the spectral components of the audio over time, much like running an FFT at successive time steps. Then identify the fundamental frequency of the speech, which is probably close to the dominant frequency in the signal. A speaker’s fundamental frequency will likely stay within a narrow band, maybe 50 Hz wide. If you have two similar speakers, you will probably have to look at the secondary and tertiary dominant frequencies as well. There may even be an advantage to breaking the signals down using PCA first. You can additionally build a matched spectral filter that is sensitive to specific speakers.
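
For anyone who wants to try the first two steps, here is a rough sketch using only the Python standard library. It computes a short-time spectrum restricted to an assumed pitch band of 50 to 400 Hz and takes the strongest bin in each frame as the dominant frequency. The function names, the band limits, the frame sizes, and the synthetic "voices" are all my own assumptions for illustration. Real speech is far messier than two sine waves, so treat this as a starting point only.

```python
import cmath
import math


def band_spectrum(frame, sample_rate, f_lo=50.0, f_hi=400.0):
    """Return (frequencies, magnitudes) of the DFT bins between f_lo and f_hi."""
    n = len(frame)
    # A Hann window reduces leakage between neighbouring bins.
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    k_lo = max(1, math.ceil(f_lo * n / sample_rate))
    k_hi = min(n // 2, math.floor(f_hi * n / sample_rate))
    freqs, mags = [], []
    for k in range(k_lo, k_hi + 1):
        w = -2j * math.pi * k / n
        # Direct DFT for one bin; only the pitch band is evaluated.
        coeff = sum(x * cmath.exp(w * i) for i, x in enumerate(windowed))
        freqs.append(k * sample_rate / n)
        mags.append(abs(coeff))
    return freqs, mags


def dominant_frequency_track(samples, sample_rate, frame_len=1024, hop=512):
    """Dominant frequency (Hz) inside the pitch band, one value per frame."""
    track = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        freqs, mags = band_spectrum(frame, sample_rate)
        peak = max(range(len(mags)), key=mags.__getitem__)
        track.append(freqs[peak])
    return track


def synth_voice(f0, seconds, sample_rate):
    """Hypothetical stand-in for a voice: a fundamental plus one weaker harmonic."""
    n = int(seconds * sample_rate)
    return [math.sin(2 * math.pi * f0 * t / sample_rate)
            + 0.5 * math.sin(2 * math.pi * 2 * f0 * t / sample_rate)
            for t in range(n)]


if __name__ == "__main__":
    sr = 8000
    print("lower voice: ", dominant_frequency_track(synth_voice(120.0, 0.5, sr), sr))
    print("higher voice:", dominant_frequency_track(synth_voice(220.0, 0.5, sr), sr))
```

With 1024-sample frames at 8 kHz the bins are about 7.8 Hz apart, so the estimate can only be as fine as that. Longer frames give finer frequency resolution at the cost of time resolution.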

You will need some logic to tell when speakers are done speaking or if multiple speakers are speaking over each other. An ML model can help with this to reduce processing overhead.
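
As a non-ML baseline for the "when is someone done speaking" part, a simple energy threshold can be sketched like this. It marks frames whose RMS level is above a threshold as speech and closes a segment after a short run of quiet frames. The function names, the threshold, and the frame sizes are assumptions I picked for illustration, and this does nothing about overlapping speakers.

```python
import math


def frame_rms(samples, frame_len, hop):
    """Root-mean-square level of each frame."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return levels


def speech_segments(samples, sample_rate, frame_len=400, hop=200,
                    threshold=0.05, min_silence_frames=3):
    """Return (start_sec, end_sec) spans where the frame RMS stays above threshold.

    A segment is closed once min_silence_frames consecutive frames fall
    below the threshold, so short dips do not split an utterance.
    """
    levels = frame_rms(samples, frame_len, hop)
    segments = []
    seg_start = None   # index of the first active frame of the open segment
    silent_run = 0     # consecutive quiet frames seen inside the open segment

    def close(last_active):
        start_sec = seg_start * hop / sample_rate
        end_sec = (last_active * hop + frame_len) / sample_rate
        segments.append((start_sec, end_sec))

    for i, level in enumerate(levels):
        if level >= threshold:
            if seg_start is None:
                seg_start = i
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                close(i - silent_run)
                seg_start = None
                silent_run = 0
    if seg_start is not None:
        # Audio ended while a segment was still open.
        close(len(levels) - 1 - silent_run)
    return segments


if __name__ == "__main__":
    sr = 8000
    tone = [0.5 * math.sin(2 * math.pi * 150 * t / sr) for t in range(sr)]
    silence = [0.0] * (sr // 2)
    audio = silence + tone + silence + tone + silence
    for start, end in speech_segments(audio, sr):
        print(f"speech from {start:.3f}s to {end:.3f}s")
```

A fixed threshold like this breaks down quickly with background noise or varying microphone levels, which is one reason a trained model tends to be the more practical choice here.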

A quick Google search shows that unsupervised ML models for speaker detection have been studied for a while. Although spectral and Fourier analysis have been optimized for decades, emerging ML methods might be more reliable in highly complex auditory environments.