
AI in audio: Exploring Whisper and Its Variants with English Audio Samples Analysis

Ushnah Abbasi · 13-14 Min Read Time

Artificial Intelligence is the development of machines capable of performing tasks that previously required human intelligence. AI has transformed nearly every industry, including the audio industry, where innovations in AI and data science services are driving state-of-the-art advancements.

 

AI in audio is deeply embedded in our daily lives, enabling powerful applications such as voice assistants, real-time transcription, speech recognition, language translation, and sound processing.

 

Among the various breakthroughs, one of the most powerful models in the audio domain is Whisper. It is an open-source model developed by OpenAI, designed for speech recognition and transcription. 

 

In this blog, we are going to explore Whisper in depth, along with its variants and adaptations. We will analyze them using English-language audio samples and compare their performance.

 

Whisper and its Variants

The first version of Whisper was released by OpenAI in September 2022 for the purpose of performing Automatic Speech Recognition (ASR) with human-level capabilities, especially for the English language. 

 

Whisper was trained on a massive 680,000 hours of audio data, which makes it robust to accents, background noise, and technical language. It is multilingual, i.e., it can convert audio to text in multiple languages. Moreover, Whisper can translate speech from the languages it supports into English.

 

Therefore, it can process complex audio inputs, handle diverse languages, and perform well even in challenging acoustic conditions. Whisper can also identify the language of an audio clip.

 

In short, Whisper can do the following (a minimal usage sketch follows the list):

1. Perform ASR, i.e., convert audio into text

2. Perform language translation, i.e., convert audio from other languages into English

3. Detect the language of the audio
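
As a minimal sketch of these three capabilities, here is how they map onto the openai-whisper Python package (the model name and audio file below are placeholders):

```python
import whisper

# Load one of the pre-trained checkpoints (the variants are covered below).
model = whisper.load_model("base")

# 1. ASR: convert audio into text.
result = model.transcribe("audio.mp3")
print(result["text"])

# 2. Speech translation: transcribe non-English audio directly into English.
translated = model.transcribe("audio.mp3", task="translate")
print(translated["text"])

# 3. Language detection: transcribe() also reports the detected language.
print(result["language"])
```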

 

Model Architecture

Whisper is an end-to-end transformer-based model with a robust encoder-decoder architecture, exemplifying the advanced deep learning solutions that drive modern speech recognition.

 

[Figure: Whisper's encoder-decoder transformer architecture]

Source: Paper “Robust Speech Recognition via Large-Scale Weak Supervision”

 

As shown in the image above, the Whisper model comprises two parts: the encoder and the decoder.

 

1. The encoder takes the audio as input and generates its encodings. It consists of multiple stacked transformer blocks, each with a self-attention mechanism. The audio input is first converted into a Log-Mel spectrogram, which captures frequency information over time and is a standard feature representation for audio processing (a short preprocessing sketch follows this list).

 

2. The decoder generates the output sequence for the transcription or translation task, using the encoded audio features produced by the encoder. It uses multi-head attention: cross-attention to attend to the encoder output, and self-attention within the generated text sequence.
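
For illustration, here is a minimal sketch of that encoder input being prepared with the helpers shipped in the openai-whisper package (the audio file is a placeholder; models up to large-v2 use 80 Mel bins, while large-v3 uses 128):

```python
import whisper

model = whisper.load_model("base")

# Load the waveform at 16 kHz and pad/trim it to Whisper's 30-second window.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the Log-Mel spectrogram that the encoder consumes.
mel = whisper.log_mel_spectrogram(audio).to(model.device)
print(mel.shape)  # (80, 3000): 80 Mel bins over 3000 frames of 10 ms each

# The encoder turns the spectrogram into audio features; detect_language()
# runs it internally and scores the language tokens.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # most likely language code, e.g. "en"
```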

 

In the final output layer, special tokens are used to tell the single model which task to perform: ASR, translation into English, or language detection.
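
As a rough illustration, the tokenizer bundled with the openai-whisper package exposes this start-of-transcript sequence, which encodes the chosen language and task (a sketch, assuming the whisper.tokenizer helper module):

```python
from whisper.tokenizer import get_tokenizer

# Multilingual tokenizer configured for English transcription.
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# The decoder is primed with special tokens that select the language and task,
# e.g. <|startoftranscript|><|en|><|transcribe|>.
print(tokenizer.sot_sequence)
print(tokenizer.decode(list(tokenizer.sot_sequence)))
```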

 

The Different Variants of Whisper

To allow users to select a model according to their requirements and use cases, OpenAI has released multiple variants of the Whisper model. They differ in size, resource utilization, speed, and accuracy.

 

Six variants have been released so far: tiny, base, small, medium, large (large, large-v2, large-v3), and turbo.

Tiny

  • Parameters: 39 Million
  • Required VRAM: ~1GB
  • Relative speed: Approximately 10 times faster than a large model
  • It has an English-only model as well, named tiny.en 

Base

  • Parameters: 74 Million
  • Required VRAM: ~1GB
  • Relative speed: Approximately 7 times faster than a large model
  • It also has an English-only model named base.en

Small

  • Parameters: 244 Million
  • Required VRAM: ~2GB
  • Relative speed: Approximately 4 times faster than a large model
  • Its English-only model is named small.en

Medium

  • Parameters: 769 Million
  • Required VRAM: ~5GB
  • Relative speed: Approximately 2 times faster than a large model
  • An English-only model is also available, named medium.en

Large

  • Parameters:  1550 Million
  • Required VRAM: ~10 GB
  • Relative speed: Slowest among all variants 
  • Two more variants:
    • Large v2:
      • The same model size and parameters as a large model 
      • Trained for 2.5 times more epochs
      • Trained with SpecAugment (a data augmentation method for ASR), stochastic depth, and BPE dropout for regularization
    • Large v3
      • The same model size and parameters as a large model
      • A new language token for Cantonese was introduced
      • 128 Mel frequency bins were used in input instead of 80
      • Trained on 4 million hours of pseudo-labeled audio collected using large-v2 and 1 million hours of weakly labeled audio. It was trained for 2.0 epochs over this mixture dataset

Turbo

  • Parameters: 809 Million
  • Required VRAM: ~6GB
  • Relative speed: Approximately 8 times faster than the large model
  • An optimized version of large v3; it has 4 decoder layers instead of 32 (all large models have 32 decoder layers)
  • It was trained for two more epochs on the same data used to train the large series of models

 

Below are the architectural details of these models. 

 

[Figure: Architectural details of the Whisper model family]

Source: Paper “Robust Speech Recognition via Large-Scale Weak Supervision”

 

Here are some available benchmarks:

 

[Figure: WER by language for Whisper large-v2, large-v3, and turbo]

Source: https://github.com/openai/whisper/discussions/2363

 

 

 

[Figure: Average WER versus relative inference speed for the Whisper models]

Source: https://github.com/openai/whisper/discussions/2363

 

From the available benchmarks by OpenAI, we can see that Whisper Large-v3 delivers the best performance. For most languages, both Large-v2 and Large-v3 have a lower “Word Error Rate” (WER) compared to the Turbo model. However, for some languages, the Turbo model achieves a WER close to that of Large-v2. The second figure indicates that the Turbo model has an average WER similar to Medium and Large-v2, but its inference speed is closer to that of the Base and Tiny models.

 

The Turbo model is a good choice when a user requires high inference speed alongside good ASR accuracy. On the other hand, Tiny and Base models are significantly faster, making them ideal for applications where speed is a higher priority than accuracy. For scenarios involving low-resource languages or cases where high transcription quality is essential, Whisper Large-v2 and Large-v3 are more suitable.

 

Analysis of All Models

Let’s test the models for transcription using English audio.

 

The Ground Truth

The actual transcription of an audio file with a duration of 1 minute and 5 seconds is as follows:

Today, we are going to talk about the power of AI in everyday life. 

 

Hello and welcome!

 

Today, let’s take a moment to explore how artificial intelligence is transforming the way we live, work, and connect.

 

Think about it—every time you unlock your phone with your face, ask a smart assistant for the weather, or get recommendations for your next favorite show, AI is quietly working behind the scenes.

 

But AI isn’t just about convenience. It’s solving big problems too. From diagnosing diseases early to helping farmers grow more food with less waste, the possibilities are endless.

 

As AI evolves, so do the questions we ask: How can we ensure it’s ethical? How do we make it accessible to everyone? These are challenges worth solving.

 

So, as we continue this journey into the future, remember: that AI isn’t just about machines getting smarter. It’s about making our lives better.

 

Thanks for listening—and here’s to the incredible world of possibilities ahead!

 

Experimentation Summary

The results of the experimentation, conducted on a T4 GPU with 15 GB VRAM, are summarized in the following table. For WER calculations, all punctuation and line breaks were removed from both the ground truth and model results to ensure a fair comparison.

Model       WER (%)   Inference time (secs)   GPU VRAM (GB)
Tiny        11.32      2.25                    0.4
Base        10.69      2.41                    0.6
Small        8.8       4.30                    1.7
Medium       8.17      8.26                    4.6
Large        8.17     12.25                    9.4
Large v2     8.17     13.34                    9.5
Large v3     8.17     13.53                    9.5
Turbo       11.94      3.20                    5.1

 

As seen in the table above, the WER is lowest for the Medium, Large, Large-v2, and Large-v3 models for this English audio. The WER of the Turbo model is closer to that of the Tiny model. Consistent with the official claims, we also observe the lowest inference time for the Tiny model, which increases as the model size grows. The inference time of the Turbo model falls between that of the Base and Small models.
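
For reference, a measurement of this kind can be sketched in a few lines of Python, assuming a recent openai-whisper release (one that includes the turbo checkpoint) and the jiwer package for WER; the ground-truth string and audio path below are placeholders:

```python
import time

import torch
import whisper
from jiwer import wer

GROUND_TRUTH = "today we are going to talk about the power of ai in everyday life ..."  # full script


def normalize(text: str) -> str:
    # Drop punctuation and line breaks so WER reflects word differences only.
    text = text.lower().replace("\n", " ")
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(text.split())


def evaluate(model_name: str, audio_path: str) -> None:
    torch.cuda.reset_peak_memory_stats()
    model = whisper.load_model(model_name, device="cuda")

    start = time.time()
    result = model.transcribe(audio_path)
    elapsed = time.time() - start

    error = wer(normalize(GROUND_TRUTH), normalize(result["text"]))
    vram_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{model_name}: WER={error:.2%}, time={elapsed:.2f}s, peak VRAM={vram_gb:.1f} GB")

    # Free the model before loading the next one.
    del model
    torch.cuda.empty_cache()


for name in ["tiny", "base", "small", "medium", "large", "large-v2", "large-v3", "turbo"]:
    evaluate(name, "audio.mp3")
```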

 

Adaptations of Whisper Model

Faster-Whisper

Faster-Whisper is a reimplementation of OpenAI's Whisper model that uses CTranslate2, a fast inference engine for transformer models. It is claimed to be up to four times faster than the original implementation at the same accuracy, while using fewer resources. Both segment-level and word-level timestamps are available in the output.
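
A minimal usage sketch, assuming the faster-whisper package (model name and audio file are placeholders):

```python
from faster_whisper import WhisperModel

# CTranslate2 backend; float16 on GPU (use compute_type="int8" on CPU).
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Segments are generated lazily; word_timestamps=True adds word-level timing.
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    for word in segment.words:
        print(f"  {word.start:.2f}s {word.word}")
```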

HuggingFace Implementation

The HuggingFace implementation gives users the option of batched inference, which speeds up transcription at the cost of increased GPU memory usage. It is compatible with all of Whisper's decoding options, giving users control over parameters such as temperature, beam size, etc.

 

It also offers the option to use Flash Attention 2, which further accelerates inference. Additionally, there is an option to use a chunked long-form approach. This approach chunks the audio into smaller overlapping parts, processes each part separately, and then stitches them together, unlike the sequential processing of the Whisper model. Users can obtain both segment-level and word-level timestamps in the output.
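
A rough sketch of this route, assuming the transformers library and a CUDA device (the chunk length and batch size below are illustrative):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    # model_kwargs={"attn_implementation": "flash_attention_2"},  # optional, needs flash-attn
)

result = pipe(
    "audio.mp3",
    chunk_length_s=30,       # chunked long-form: overlapping 30-second windows
    batch_size=8,            # batched inference: faster, more GPU memory
    return_timestamps=True,  # use "word" for word-level timestamps
    generate_kwargs={"task": "transcribe"},
)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```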

Insanely Fast Whisper

It is a lightweight CLI for running Whisper models locally on a user’s device, built on the HuggingFace implementation of Whisper. The use of batching and Flash Attention makes it even faster than other implementations, and it also supports MPS (Macs with Apple Silicon). It is claimed that 2.5 hours of audio can be transcribed in less than 98 seconds. Users can obtain both segment-level and word-level timestamps in the output.

WhisperX

WhisperX uses Faster-Whisper as its transcription model. Unlike the typical approach, where word-level timestamps are derived from utterance- and sentence-level predictions, WhisperX employs a separate model (wav2vec2) to align the spoken audio with the transcribed text (a process known as forced alignment). This implementation also supports batched inference, at the cost of increased GPU resource usage.

 

However, there are some limitations, such as the inability to provide timestamps for numerical tokens like "$70" or "1999," because such tokens are typically not in the alignment model's vocabulary. Additionally, a language-specific alignment model is required for optimal performance.
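
A sketch of this two-stage pipeline, assuming the whisperx package (the model names and parameters below are illustrative):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# Stage 1: transcription with the Faster-Whisper backend (batched).
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Stage 2: forced alignment with a language-specific wav2vec2 model
# to refine the word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```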

WhisperLive

WhisperLive is an application designed to provide near-live transcriptions of audio from a microphone or an audio file. It works by running a server on the host machine (on a specific port) and clients on the user machine. The user can set limits on the number of clients a host server can handle simultaneously, as well as the maximum timeout.

 

By default, the host initializes a separate Whisper model for each client. However, if the user enables the 'single model mode,' it ensures that the same model is shared across all clients.

Whisper.cpp

Whisper.cpp is a lightweight and efficient implementation, particularly useful for resource-limited applications. It stores the model weights in GGML format, enabling it to run in C/C++ environments, thereby eliminating the need for Python environments or GPU resources. It can be easily executed on macOS, embedded systems, and CPUs.

 

Comparison

The author of Faster-Whisper benchmarked several implementations using a 13-minute audio file, as shown below:

 

[Figure: Benchmark of Whisper implementations on a 13-minute audio file]

Source: https://github.com/SYSTRAN/faster-whisper

 

It can be seen that Faster-Whisper is the fastest among the implementations shown.

 

Conclusion

As discussed throughout the blog, Whisper models have emerged as state-of-the-art in the ASR domain. Many developers have created their own implementations based on these models. Additionally, several API services are available for Whisper inference. OpenAI itself offers a transcription API that uses the Whisper Large-v2 model, and Groq provides APIs for Whisper models with blazing-fast inference speeds. Organizations can leverage machine learning solutions to implement similar advanced models. There are also many other third-party paid and free options available.
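
For example, a minimal sketch of calling OpenAI's hosted transcription endpoint, assuming the official openai Python SDK (v1+) and an API key in the OPENAI_API_KEY environment variable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hosted transcription with the whisper-1 model.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```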

 

Various applications have now been built on Whisper models. Developers have even created distilled versions of Whisper and JAX implementations.

 

In conclusion, each Whisper model is suited to specific use cases. There is a trade-off between accuracy and speed, and users must select models based on their needs. For fast, accurate results with control over decoding options and ease of implementation, Faster-Whisper would be the ideal choice. If highly accurate word-level timestamps are a priority, WhisperX is a great alternative.
