2023-11-13: Transcribing Audio using SeamlessM4T



Speech-to-text has many applications. Online meeting tools like Microsoft Teams use it to transcribe meetings, and even live meetings can be transcribed to automate note taking. Video streaming websites and applications transcribe audio to support closed captioning (CC). Music files are transcribed to provide lyrics for your favorite karaoke night. Podcasts, audiograms, and other video and audio files posted to social media may also be transcribed as part of web scraping.

Many Python libraries are available for individuals wishing to incorporate speech-to-text capabilities in their own applications and research. Whisper, developed by OpenAI, is one such library; it can not only transcribe audio but also translate speech from many languages into English. Other major companies like Google and IBM have released their own libraries that provide similar capabilities. One advantage that models developed by large companies often have over others is the amount of data they have been trained on.

A recently published paper described a new model developed by Meta named SeamlessM4T, where M4T stands for Massively Multilingual & Multimodal Machine Translation. Meta promotes SeamlessM4T as supporting transcription and translation for up to 100 different languages. This model, trained with 1 million hours of speech audio, was designed to improve speech-to-text transcription and translation quality. In this blog I take a closer look at the SeamlessM4T model's performance by comparing its transcription of a YouTube video to the one generated by the OpenAI Whisper model.


Code was developed using Python in a Linux environment (Ubuntu 22.04.3).  Linux was selected largely because SeamlessM4T is not currently supported in other operating environments.  The FFmpeg command line tool also needed to be installed in the Linux environment in order for the models to function properly.

Video to Audio Conversion

I selected a YouTube video for this test case, which is 30 seconds long and in English. I chose a short file to keep this example simple, since the literature I reviewed indicated that recordings longer than 60 seconds would need to be split before processing with the SeamlessM4T model. I confirmed this when reviewing the logic in the source code on GitHub.

I created a function in Python to convert the selected YouTube video to an MP3 audio file. This was accomplished using the pytube library, which simplified the conversion process as shown in Fig. 1.

Fig. 1. Python Method for YouTube Video Conversion
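
Since the figure is an image, here is a minimal sketch of what such a conversion function could look like. It is not the exact code from the figure. It assumes pytube's YouTube class with audio-only stream filtering, plus an FFmpeg call to re-encode the download as MP3; the function names and the output directory are hypothetical.

```python
import subprocess
from pathlib import Path


def mp3_path_for(downloaded_path):
    """Return the MP3 path that sits next to the downloaded audio file."""
    return Path(downloaded_path).with_suffix(".mp3")


def youtube_to_mp3(url, out_dir="audio"):
    """Download the audio-only stream of a video and re-encode it as MP3."""
    # Imported lazily so the helper above works without pytube installed.
    from pytube import YouTube

    stream = YouTube(url).streams.filter(only_audio=True).first()
    if stream is None:
        raise RuntimeError("no audio-only stream found for " + url)
    downloaded = stream.download(output_path=out_dir)
    target = mp3_path_for(downloaded)
    # pytube saves the stream in its original container (often .mp4),
    # so FFmpeg performs the actual conversion to MP3.
    subprocess.run(["ffmpeg", "-y", "-i", downloaded, str(target)], check=True)
    return target
```

Calling youtube_to_mp3 with the video URL would return the path of the new MP3 file, assuming FFmpeg is on the PATH.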

I captured the transcript provided by YouTube in a text file to serve as the "ground truth" transcription that I later used to compute performance metrics. I will provide more details about that later in this blog.

Whisper Transcription Implementation

The Whisper model can be tested using the Hugging Face Transformers package. However, I installed the OpenAI Whisper package as follows:

pip install openai-whisper

The version of the openai-whisper package I installed was released on September 18, 2023, so OpenAI had clearly made some recent updates. I could then import the whisper library into my test file without any issues. Whisper provides five model sizes to choose from. Although I also tested the large model, this blog focuses only on the medium model.

Fig. 2. Python Implementation Using the Whisper Model

The smaller models may not be quite as accurate, but if you can afford a little less accuracy, execution times and resource requirements can be greatly reduced. Whisper makes the code for the actual audio transcription very simple. A transcription can be completed in two lines: one that loads the desired model and one that calls the transcribe method, as depicted in Figure 2.
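
The two-line pattern described above can be sketched as follows. This is an illustration rather than the code from the figure: the wrapper function and the WHISPER_SIZES tuple are hypothetical names, and fp16=False is an assumption I added so that the call runs quietly on a CPU.

```python
# The five model sizes referred to in the text.
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")


def transcribe_with_whisper(audio_path, model_name="medium", device="cpu"):
    """Load a Whisper model and return the transcript text for one file."""
    # Imported lazily; requires `pip install openai-whisper` and FFmpeg.
    import whisper

    model = whisper.load_model(model_name, device=device)  # line 1: load
    result = model.transcribe(audio_path, fp16=False)      # line 2: transcribe
    return result["text"].strip()
```

Swapping model_name for a smaller size such as "base" is the simplest way to trade accuracy for speed.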

SeamlessM4T Transcription Implementation

I downloaded the SeamlessM4T code from GitHub as a zip file, extracted it, and then used pip from the root directory to install it as the authors describe in their readme file:

pip install .

After installing the SeamlessM4T library, I was able to import the Translator module from it. SeamlessM4T has two model options to select from: seamlessM4T_medium and seamlessM4T_large. I found them both to be much more resource intensive than the Whisper models, requiring me to add significantly more memory to the virtual machine I was working with in order to install the library (including dependencies), download the seamlessM4T_medium model as well as the vocoder model it needed, and run the code.

One drawback to using the SeamlessM4T library is that the audio file had to be resampled before running. Without this step, the audio file could not be decoded. SeamlessM4T also currently limits the audio file duration to 60 seconds, which should not be an issue for this particular case. However, I noticed that the initial results were poor, with a word error rate and character error rate of 77% and 63%, respectively. Therefore, I split the original file into 4 individual files using FFmpeg and passed each one to the SeamlessM4T model's predict function separately, as shown in Figure 3. This produced better results and even reduced the transcription time by 3 seconds. However, the results were still not quite as good as those produced by the Whisper model.
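
The resampling and splitting steps can be sketched with FFmpeg roughly as follows. The 16 kHz mono target and the chunk length are assumptions on my part for illustration, and the function and file names are hypothetical; only the FFmpeg options themselves (-ar, -ac, and the segment muxer) are standard.

```python
import subprocess


def resample_cmd(src, dst, rate=16000):
    """Build an FFmpeg command that writes a mono copy at the given rate."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", str(rate), "-ac", "1", str(dst)]


def split_cmd(src, pattern, seconds):
    """Build an FFmpeg command that cuts src into fixed-length chunks."""
    return ["ffmpeg", "-y", "-i", str(src),
            "-f", "segment", "-segment_time", str(seconds),
            "-c", "copy", str(pattern)]


def prepare_audio(src, seconds=8):
    """Resample the file, then split it into numbered WAV chunks."""
    subprocess.run(resample_cmd(src, "resampled.wav"), check=True)
    subprocess.run(split_cmd("resampled.wav", "chunk_%03d.wav", seconds),
                   check=True)
```

A chunk length of about 8 seconds would cut a 30-second clip into 4 files, although FFmpeg may place the cut points slightly differently.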

Fig. 3. Python Implementation Using the SeamlessM4T Model
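
As a rough sketch of the per-chunk approach (not the exact code in the figure), the loop could look like this. The import path, the Translator arguments, the vocoder name, and the three-value return of predict are assumptions based on the repository readme around the time of writing, and later releases may differ. The helper names are hypothetical.

```python
def join_transcripts(parts):
    """Join per-chunk transcripts into one string with single spaces."""
    return " ".join(p.strip() for p in parts if p and p.strip())


def transcribe_chunks(chunk_paths, model_name="seamlessM4T_medium", lang="eng"):
    """Run speech recognition on each chunk and stitch the text together."""
    # Imported lazily; both packages are heavy and optional for the helper above.
    import torch
    from seamless_communication.models.inference import Translator

    translator = Translator(
        model_name_or_card=model_name,
        vocoder_name_or_card="vocoder_36langs",  # assumed vocoder name
        device=torch.device("cpu"),
        dtype=torch.float32,  # assumed safer than float16 on a CPU
    )
    texts = []
    for path in chunk_paths:
        # "asr" asks for a transcription in the given language.
        text, _, _ = translator.predict(str(path), "asr", lang)
        texts.append(str(text))
    return join_transcripts(texts)
```

Loading the Translator once and reusing it for every chunk avoids paying the model load time four times.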


Through just a visual inspection, we can see in Figure 4 that the Whisper transcription was very accurate with the exception of the proper noun Sentara, which it rendered as "Centaur".

Here at Old Dominion University, we are learning how to be the protectors of the digital world. The tools and skills that I learned here at ODU is what ultimately landed me my full-time job here at Centaur HealthCare. To be able to protect people in cybersecurity is my main passion. The professors that I've interacted with, they're all very helpful. They send emails out about opportunities that you can go to like internships. One of my favorite professors brings a personal touch and an energy that really makes you as a student want to engage and be a part of the classroom.
Fig. 4. Audio Transcription using the Whisper Medium Model

The JiWER Python library was also used to evaluate the results of the two approaches in comparison to the ground truth. Several metrics were generated using this library as follows:
  1. Word Error Rate (WER) - the number of substituted, deleted, and inserted words divided by the number of words in the ground truth.
  2. Match Error Rate (MER) - the proportion of word matches between the ground truth and the hypothesis that are errors (substitutions, deletions, or insertions).
  3. Word Information Lost (WIL) - an estimate of the proportion of word information lost between a set of ground-truth sentences and a set of hypothesis sentences; it equals 1 minus WIP.
  4. Word Information Preserved (WIP) - an estimate of the proportion of word information preserved between a set of ground-truth sentences and a set of hypothesis sentences.
  5. Character Error Rate (CER) - the same calculation as WER, performed on characters instead of words.
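
To make the word-level definitions concrete, here is a small sketch that computes them from alignment counts using the standard formulas. It illustrates what the metrics mean and is not JiWER's own implementation; the function name and the example counts are hypothetical.

```python
def word_metrics(hits, subs, dels, ins):
    """Compute WER, MER, WIP, and WIL from word alignment counts."""
    n_ref = hits + subs + dels   # words in the ground truth
    n_hyp = hits + subs + ins    # words in the hypothesis
    errors = subs + dels + ins
    wip = (hits / n_ref) * (hits / n_hyp)
    return {
        "wer": errors / n_ref,
        "mer": errors / (hits + errors),
        "wip": wip,
        "wil": 1.0 - wip,
    }


# Hypothetical counts: 8 correct words, 1 substitution, 1 deletion.
example = word_metrics(hits=8, subs=1, dels=1, ins=0)
```

With no insertions, as in this example, WER and MER come out identical, and WIL and WIP always sum to 1.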
The performance measures computed using JiWER are found in Table 1 below. Minor post-processing was performed on the results from each model to convert all characters to lowercase and remove punctuation before computing the metrics. As shown, Whisper's results were very good, especially given that I did not use the largest model and no preprocessing was performed on the audio file itself to improve the results.
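
The post-processing and scoring steps might look like the sketch below. The normalize function is a plain-Python stand-in for the lowercase and punctuation-removal step, and score simply calls JiWER's wer, mer, wil, wip, and cer functions; both function names are hypothetical.

```python
import string


def normalize(text):
    """Lowercase the text, strip punctuation, and collapse whitespace."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.split())


def score(reference, hypothesis):
    """Return the five measures for one reference/hypothesis pair."""
    # Imported lazily; requires `pip install jiwer`.
    import jiwer

    ref, hyp = normalize(reference), normalize(hypothesis)
    return {
        "wer": jiwer.wer(ref, hyp),
        "mer": jiwer.mer(ref, hyp),
        "wil": jiwer.wil(ref, hyp),
        "wip": jiwer.wip(ref, hyp),
        "cer": jiwer.cer(ref, hyp),
    }
```

Note that this normalization also removes apostrophes, so "I've" becomes "ive" in both the ground truth and the hypothesis.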

Table 1. Model Performance Measures

Metric        Whisper    SeamlessM4T            SeamlessM4T
                         (My Implementation)    (HuggingFace)
Time (sec)    189.0      50.19                  NA
WER           3.84%      32.69%                 56.73%
MER           3.80%      32.69%                 54.62%
WIL           5.68%      48.22%                 67.93%
WIP           94.31%     51.77%                 32.06%
CER           1.39%      23.25%                 38.98%

None of the models were able to properly spell Sentara in the audio recording, which contributed to the error rates. Although I used the medium and not the large version of the models, I was a little surprised that the SeamlessM4T results were not better. As a sanity check, I uploaded my test audio file to the HuggingFace SeamlessM4T model test site. Using the transcription it produced, I generated a new set of metrics. As shown in the last column of Table 1, there were significant errors even in the transcription produced by that site's implementation.

The tools and skills that I've learned to be the protector of the digital world will eventually land me in my full-time job here at the Center for Health Care to be able to protect people in cybersecurity, the professors that I've interacted with, they're all very helpful, they're all e-mails about opportunities that you can go to, like internships, and one of my favorite professors brings a personal touch and energy.

Fig. 5. Audio Transcription using the HuggingFace SeamlessM4T site

Although execution time was not the main focus, I did provide the times in Table 1. These times do not include the time required to load the models into memory, which is especially long the first time a model is loaded. Note that I used the CPU (instead of a GPU) when running these models, and execution times are largely hardware dependent. Changing the device type to cuda to take advantage of a GPU would be expected to dramatically improve those times.
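
One simple way to time only the transcription call, with the model already loaded, is sketched below. The helper names are hypothetical, and the device check assumes PyTorch's torch.cuda.is_available, falling back to the CPU when PyTorch is not installed.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return its result with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def pick_device():
    """Return "cuda" when a GPU is usable through PyTorch, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

For example, with a model loaded beforehand, timed(model.transcribe, "audio.mp3") would measure the transcription alone and leave the load time out.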


OpenAI's Whisper model and Meta's SeamlessM4T are two of many options available for programmatically transcribing audio. I encourage you to try them yourselves, including the larger versions of the models to further evaluate the results and determine the best approach for your application needs.

--Bathsheba Farrow (@sheisheba)