Submitted by 2blazen t3_10bsef1 in MachineLearning
I am trying to create speaker-aware transcripts from (multiple) audio files of a podcast. Right now I'm using OpenAI Whisper for the transcripts and pyannote.audio for speaker diarization (speaker segmentation + centroid clustering).
To speed up the process (diarization time doesn't seem to scale linearly), I'd like to fit the centroids on the first audio file and use them to predict the speakers (i.e., the clusters of the speaker embeddings) in the other audio files, since the speakers don't change across episodes.
However, the default pyannote.audio diarization pipeline refits the clusters for each audio file. Do you know of any other Python framework that allows reusing the fitted clusters, or any way pyannote.audio supports this? Is this even possible? Is there another way to achieve the desired result?
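For context, the kind of workaround I'm imagining looks roughly like this: run the stock pipeline per episode, average an embedding per local speaker label, and map those labels onto the speakers found in the first episode by cosine distance. This is just a sketch, not something I've verified end to end; the model names ("pyannote/speaker-diarization", "pyannote/embedding"), the auth token placeholder, the minimum segment length, and the `speaker_embeddings` helper are all assumptions on my part.

```python
import numpy as np
from scipy.spatial.distance import cdist
from pyannote.audio import Pipeline, Inference

# Pretrained diarization pipeline (still refits clusters per file)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="HF_TOKEN"  # placeholder token
)

# Pretrained speaker embedding model, used only for cross-episode matching
embedder = Inference("pyannote/embedding", window="whole")


def speaker_embeddings(audio_file, diarization, min_duration=1.0):
    """Average one embedding per local speaker label (hypothetical helper)."""
    per_speaker = {}
    for segment, _, label in diarization.itertracks(yield_label=True):
        if segment.duration < min_duration:  # skip very short turns
            continue
        per_speaker.setdefault(label, []).append(embedder.crop(audio_file, segment))
    return {
        label: np.mean(np.vstack(embs), axis=0)
        for label, embs in per_speaker.items()
    }


# Episode 1: establish the reference speakers
ref_file = "episode_01.wav"
reference = speaker_embeddings(ref_file, pipeline(ref_file))
ref_names = list(reference)
ref_matrix = np.vstack([reference[name] for name in ref_names])

# Later episodes: relabel local speakers with the closest reference speaker
new_file = "episode_02.wav"
diarization = pipeline(new_file)
local = speaker_embeddings(new_file, diarization)
mapping = {
    label: ref_names[int(np.argmin(cdist(emb[None, :], ref_matrix, metric="cosine")))]
    for label, emb in local.items()
}
relabeled = diarization.rename_labels(mapping)
```

This avoids touching the pipeline's internal clustering state, but it still pays the full diarization cost per file, which is exactly what I'm hoping to avoid.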
hayder978 t1_j4soixh wrote
How much time does it take to carry out speaker diarization on 1 hour of audio?