2023-11-13: Transcribing Audio using SeamlessM4T
Introduction
Speech-to-text capabilities have many applications. Online meeting tools like Microsoft Teams use them to transcribe meetings, even live ones, to automate note taking. Video streaming websites and applications transcribe audio to support closed captioning (CC). Music files are transcribed to provide lyrics for your favorite karaoke night. Podcasts, audiograms, and other video and audio files posted to social media may also be transcribed as part of web scraping.
Many Python libraries are available for those wishing to incorporate speech-to-text capabilities into their own applications and research. Whisper, developed by OpenAI, is one such library that can perform not only transcriptions but also translations into multiple languages. Other major companies like Google and IBM have released their own libraries that provide these capabilities. One advantage that models developed by large companies often have over others is the amount of data they have been trained with.
A recently published paper described a new model developed by Meta named SeamlessM4T, where M4T stands for Massively Multilingual & Multimodal Machine Translation. SeamlessM4T provides transcription and translation capabilities for up to 100 different languages. This model, trained on 1 million hours of speech audio, was designed to improve speech-to-text transcription and translation quality. With that said, in this blog post I will take a closer look at the SeamlessM4T model's performance by comparing its transcription of a YouTube video to one generated by OpenAI's Whisper model.
Implementation
Video to Audio Conversion
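Before transcription, the audio track has to be extracted from the downloaded YouTube video. The post's own conversion code is not shown here; the sketch below is one minimal way to do it, assuming ffmpeg is installed and on the PATH, with placeholder file names:

```python
# A minimal sketch: extract the audio track with ffmpeg (assumed to be
# installed and on the PATH); "video.mp4" and "audio.wav" are placeholders.
import subprocess

subprocess.run(
    ["ffmpeg", "-i", "video.mp4",
     "-vn",            # drop the video stream
     "-ar", "16000",   # resample to 16 kHz, the rate both models expect
     "-ac", "1",       # downmix to mono
     "audio.wav"],
    check=True,
)
```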
Whisper Transcription Implementation
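A sketch of transcribing with the openai-whisper package, using the medium model (as discussed in the Results section) on the CPU; the audio file name is the placeholder from the previous sketch:

```python
# Transcribe the extracted audio with OpenAI's Whisper medium model.
import whisper

model = whisper.load_model("medium", device="cpu")  # timings below were measured on the CPU
result = model.transcribe("audio.wav")              # placeholder file name
print(result["text"])
```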
SeamlessM4T Transcription Implementation
I downloaded the SeamlessM4T code from GitHub as a zip file, extracted it, and then from the root directory used pip to install it as the authors describe in their README file:
pip install .
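With the package installed, transcription follows the usage shown in the project's README at the time (the v1 seamless_communication API); a sketch assuming the medium model on the CPU and the placeholder audio file from earlier:

```python
# ASR with SeamlessM4T via the seamless_communication package, following
# the usage documented in the v1 README.
import torch
from seamless_communication.models.inference import Translator

translator = Translator(
    "seamlessM4T_medium",   # model card name; "seamlessM4T_large" also exists
    "vocoder_36langs",      # companion vocoder named in the README
    torch.device("cpu"),    # torch.device("cuda:0") if a GPU is available
    torch.float32,
)
# Task "asr" transcribes the audio; "eng" is the target language code.
transcribed_text, _, _ = translator.predict("audio.wav", "asr", "eng")
print(transcribed_text)
```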
Results
Whisper produced the following transcription of the video's audio:

Here at Old Dominion University, we are learning how to be the protectors of the digital world. The tools and skills that I learned here at ODU is what ultimately landed me my full-time job here at Centaur HealthCare. To be able to protect people in cybersecurity is my main passion. The professors that I've interacted with, they're all very helpful. They send emails out about opportunities that you can go to like internships. One of my favorite professors brings a personal touch and an energy that really makes you as a student want to engage and be a part of the classroom.
To compare the transcriptions against a ground-truth transcript, I computed the following metrics (see the code sketch after this list):

- Word Error Rate (WER) - the percentage of words that were incorrectly predicted.
- Match Error Rate (MER) - the percentage of words that were incorrectly predicted and inserted.
- Word Information Lost (WIL) - the percentage of incorrectly predicted words between a set of ground-truth sentences and a set of hypothesis sentences.
- Word Information Preserved (WIP) - the percentage of correctly predicted words between a set of ground-truth sentences and a set of hypothesis sentences.
- Character Error Rate (CER) - the percentage of characters that were incorrectly predicted.
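One way to compute all five metrics in Python is with the jiwer package; a minimal sketch, with placeholder strings standing in for the ground-truth and model transcripts:

```python
# Computing WER, MER, WIL, WIP, and CER with the jiwer package
# (pip install jiwer); the two strings here are placeholders.
import jiwer

reference = "ground-truth transcript of the audio"
hypothesis = "transcript produced by the model"

print(f"WER: {jiwer.wer(reference, hypothesis):.2%}")
print(f"MER: {jiwer.mer(reference, hypothesis):.2%}")
print(f"WIL: {jiwer.wil(reference, hypothesis):.2%}")
print(f"WIP: {jiwer.wip(reference, hypothesis):.2%}")
print(f"CER: {jiwer.cer(reference, hypothesis):.2%}")
```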
Table 1: Execution times and error metrics for the Whisper and SeamlessM4T transcriptions

Metric | Whisper | SeamlessM4T (My Implementation) | SeamlessM4T (HuggingFace)
---|---|---|---
Time (sec) | 189.0 | 50.19 | NA
WER | 3.84% | 32.69% | 56.73%
MER | 3.80% | 32.69% | 54.62%
WIL | 5.68% | 48.22% | 67.93%
WIP | 94.31% | 51.77% | 32.06%
CER | 1.39% | 23.25% | 38.98%
None of the models was able to properly spell "Sentara" from the audio recording, which contributed to the error rates. Although I used the medium rather than the large versions of the models, I was a little surprised that the SeamlessM4T results were not better. As a sanity check, I uploaded my test audio file to the HuggingFace SeamlessM4T model test site. Using the transcription it produced, I generated a new set of metrics. As shown in the last column of Table 1, there were significant errors even in the transcription produced by that site's implementation.
The tools and skills that I've learned to be the protector of the digital world will eventually land me in my full-time job here at the Center for Health Care to be able to protect people in cybersecurity, the professors that I've interacted with, they're all very helpful, they're all e-mails about opportunities that you can go to, like internships, and one of my favorite professors brings a personal touch and energy.
Although execution time was not the main focus, I did provide the times in Table 1. These times do not include the time required to load the models into memory, which is especially long the first time it is performed. Note that I used the CPU (instead of a GPU) when running these models, and the execution times are largely hardware dependent. Changing the device type to cuda to take advantage of a GPU would be expected to dramatically improve those times.
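For example, with Whisper the device can be chosen when the model is loaded; a sketch assuming PyTorch is available to detect the GPU:

```python
import torch
import whisper

# Use a GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model("medium", device=device)
```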