@erichaydel

This adds the language that the tokenizer already tracks to the output of each segment.

This lets faster-whisper serve as a language detector in multilingual mode with no extra processing overhead, in both regular and batched mode. When multilingual mode is off, each segment reports the primary language.

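from faster_whisper import WhisperModel

# The model size here is only an example; use whichever model you normally load.
model = WhisperModel("small")
segments, _ = model.transcribe("audio.mp3", multilingual=True)

for segment in segments:
    print("[%.2fs -> %.2fs] (%s) %s" % (segment.start, segment.end, segment.language, segment.text))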

[0.02s -> 21.42s] (es) Esta mañana, la primera comisión continuará adoptando medidas sobre los proyectos de resolución y de decisión restantes.
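
As a rough illustration of the language-detection use case, the per-segment language could be aggregated into a per-language duration summary. This is only a sketch: `FakeSegment`, `language_durations`, and `dominant_language` are hypothetical names that are not part of faster-whisper, and the code assumes only that each segment exposes `start`, `end`, and the `language` attribute proposed in this PR.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class FakeSegment:
    """Hypothetical stand-in with only the fields used below (not the real Segment class)."""
    start: float
    end: float
    language: str
    text: str


def language_durations(segments: Iterable) -> Dict[str, float]:
    """Sum the seconds of audio attributed to each language code."""
    totals = defaultdict(float)
    for seg in segments:
        totals[seg.language] += seg.end - seg.start
    return dict(totals)


def dominant_language(segments: Iterable) -> Optional[str]:
    """Return the language covering the most audio, or None if there are no segments."""
    totals = language_durations(segments)
    if not totals:
        return None
    return max(totals, key=totals.get)


# Made-up segments for demonstration; real ones would come from model.transcribe(...).
demo = [
    FakeSegment(0.0, 4.0, "es", "..."),
    FakeSegment(4.0, 6.5, "en", "..."),
    FakeSegment(6.5, 12.0, "es", "..."),
]
print(language_durations(demo))
print(dominant_language(demo))
```

Because `transcribe` returns segments lazily, a caller would likely need to collect them into a list first if both helpers are applied to the same transcription.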

@erichaydel force-pushed the add-language-to-segments branch from 0156140 to 0fcdfa4 on March 28, 2025 01:27
@erichaydel force-pushed the add-language-to-segments branch from 0fcdfa4 to ba620ba on March 28, 2025 01:30