Submitted by 2blazen t3_10bsef1 in MachineLearning
I am trying to create speaker-aware transcripts from (multiple) audio files of a podcast. Right now I'm using OpenAI Whisper for the transcripts and pyannote.audio for speaker diarization (speaker segmentation + centroid clustering).
To speed up the process (diarization time doesn't seem to scale linearly), I'd like to fit the centroids on the first audio file and use them to predict the speakers (i.e., the clusters of the speaker embeddings) in the other audio files, since the speakers don't change across episodes.
However, the default pyannote.audio diarization pipeline refits the clusters for each audio file. Do you know of any other Python framework that allows reusing the fitted clusters, or any way pyannote.audio supports this? Is this even possible? Is there another way to achieve the desired result?
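For context, the kind of workaround I'm imagining looks roughly like this: run the stock pipeline per episode, average an embedding per local speaker label, and map those labels onto the speakers found in the first episode by cosine distance. This is just a sketch, not something I've verified end to end; the model names ("pyannote/speaker-diarization", "pyannote/embedding"), the auth token placeholder, the minimum segment length, and the `speaker_embeddings` helper are all assumptions on my part.

```python
import numpy as np
from scipy.spatial.distance import cdist
from pyannote.audio import Pipeline, Inference

# Pretrained diarization pipeline (still refits clusters per file)
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization", use_auth_token="HF_TOKEN"  # placeholder token
)

# Pretrained speaker embedding model, used only for cross-episode matching
embedder = Inference("pyannote/embedding", window="whole")


def speaker_embeddings(audio_file, diarization, min_duration=1.0):
    """Average one embedding per local speaker label (hypothetical helper)."""
    per_speaker = {}
    for segment, _, label in diarization.itertracks(yield_label=True):
        if segment.duration < min_duration:  # skip very short turns
            continue
        per_speaker.setdefault(label, []).append(embedder.crop(audio_file, segment))
    return {
        label: np.mean(np.vstack(embs), axis=0)
        for label, embs in per_speaker.items()
    }


# Episode 1: establish the reference speakers
ref_file = "episode_01.wav"
reference = speaker_embeddings(ref_file, pipeline(ref_file))
ref_names = list(reference)
ref_matrix = np.vstack([reference[name] for name in ref_names])

# Later episodes: relabel local speakers with the closest reference speaker
new_file = "episode_02.wav"
diarization = pipeline(new_file)
local = speaker_embeddings(new_file, diarization)
mapping = {
    label: ref_names[int(np.argmin(cdist(emb[None, :], ref_matrix, metric="cosine")))]
    for label, emb in local.items()
}
relabeled = diarization.rename_labels(mapping)
```

This avoids touching the pipeline's internal clustering state, but it still pays the full diarization cost per file, which is exactly what I'm hoping to avoid.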
hayder978 t1_j4soixh wrote
How much time does it take to carry out speaker diarization on 1 hour of audio?