Add language output to each segment #1274

erichaydel · 2025-03-28T01:25:21Z

This adds the tokenizer's existing language to the output of each segment.

This allows faster-whisper to be used as a language detector in multilingual mode without any extra processing burden, in both regular and batched mode. If multilingual mode is off, each segment reports the primary language.

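from faster_whisper import WhisperModel

model = WhisperModel("large-v3")  # model size is only an example
# multilingual=True lets the language be detected per segment
segments, _ = model.transcribe("audio.mp3", multilingual=True)

for segment in segments:
    print("[%.2fs -> %.2fs] (%s) %s" % (segment.start, segment.end, segment.language, segment.text))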

[0.02s -> 21.42s] (es) Esta mañana, la primera comisión continuará adoptando medidas sobre los proyectos de resolución y de decisión restantes.
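
As a rough illustration of what the per-segment language makes possible, the sketch below totals the speech duration for each detected language. It is not part of this PR: `FakeSegment` and `seconds_per_language` are hypothetical names, and the stand-in class only mirrors the `start`, `end`, `language`, and `text` fields used above so the example runs without a model or audio file.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class FakeSegment:
    """Hypothetical stand-in for a segment; only the fields used here."""
    start: float
    end: float
    language: str
    text: str


def seconds_per_language(segments: Iterable) -> Dict[str, float]:
    """Sum segment durations (end - start) for each language code."""
    totals = defaultdict(float)
    for segment in segments:
        totals[segment.language] += segment.end - segment.start
    return dict(totals)


# Made-up segments for demonstration; real ones would come from model.transcribe().
demo_segments = [
    FakeSegment(0.0, 4.0, "es", "Hola"),
    FakeSegment(4.0, 6.5, "en", "Hello"),
    FakeSegment(6.5, 10.0, "es", "Adiós"),
]

totals = seconds_per_language(demo_segments)
for language, seconds in sorted(totals.items()):
    print("%s: %.2fs" % (language, seconds))
```

With real output, the same function should accept the `segments` iterable returned by `model.transcribe(...)` directly, assuming each segment exposes `language` as described in this PR.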

erichaydel force-pushed the add-language-to-segments branch from 0156140 to 0fcdfa4 on March 28, 2025 01:27

Add language output to each segment

ba620ba

erichaydel force-pushed the add-language-to-segments branch from 0fcdfa4 to ba620ba on March 28, 2025 01:30
