
AI in audio: Exploring Whisper and Its Variants with English Audio Samples Analysis

Ushnah Abbasi · 13-14 Min Read Time

Artificial Intelligence is the development of machines capable of performing tasks that previously required human intelligence. AI has transformed nearly every industry, including the audio industry, where innovations in AI and data science services are driving state-of-the-art advancements.

 

AI in audio is deeply embedded in our daily lives, enabling powerful applications such as voice assistants, real-time transcription, speech recognition, language translation, and sound processing.

 

Among the various breakthroughs, one of the most powerful models in the audio domain is Whisper. It is an open-source model developed by OpenAI, designed for speech recognition and transcription. 

 

In this blog, we are going to explore Whisper in depth, along with its variants and adaptations. We will analyze them using English-language audio samples and compare their performance.

 

Whisper and its Variants

The first version of Whisper was released by OpenAI in September 2022 for the purpose of performing Automatic Speech Recognition (ASR) with human-level capabilities, especially for the English language. 

 

Whisper was trained on a massive 680,000 hours of audio data, which makes it robust to accents, background noise, and technical language. It is multilingual, i.e., it can convert audio to text in multiple languages. Moreover, Whisper can translate speech from the languages it supports into English.

 

Therefore, it can process complex audio inputs, handle diverse languages, and perform well even in challenging acoustic conditions. Whisper can also identify the language of an audio clip.

 

In short, Whisper can do the following (a minimal usage sketch follows the list):

1. Perform ASR, i.e., convert audio into text

2. Perform language translation, i.e., convert audio from other languages into English

3. Detect the language of the audio
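
As a minimal sketch of these three capabilities, here is how they map onto the openai-whisper Python package (the model name and audio file below are placeholders):

```python
import whisper

# Load one of the pre-trained checkpoints (the variants are covered below).
model = whisper.load_model("base")

# 1. ASR: convert audio into text.
result = model.transcribe("audio.mp3")
print(result["text"])

# 2. Speech translation: transcribe non-English audio directly into English.
translated = model.transcribe("audio.mp3", task="translate")
print(translated["text"])

# 3. Language detection: transcribe() also reports the detected language.
print(result["language"])
```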

 

Model Architecture

Whisper is an end-to-end transformer-based model with a robust encoder-decoder architecture, exemplifying the advanced deep learning solutions that drive modern speech recognition.

 

[Figure: Whisper's encoder-decoder transformer architecture]

Source: Paper “Robust Speech Recognition via Large-Scale Weak Supervision”

 

As shown in the image above, the Whisper model comprises two parts: the encoder and the decoder.

 

1. The encoder takes the audio as input and generates its encodings. It consists of multiple stacked transformer blocks, each with a self-attention mechanism. The audio input is first converted into a Log-Mel spectrogram, which captures frequency information over time and is a standard feature representation for audio processing (a short preprocessing sketch follows this list).

 

2. The decoder generates the output sequence for the transcription or translation task, using the encoded audio features produced by the encoder. It uses multi-head attention: cross-attention to attend to the encoder output, and self-attention within the generated text sequence.
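
For illustration, here is a minimal sketch of that encoder input being prepared with the helpers shipped in the openai-whisper package (the audio file is a placeholder; models up to large-v2 use 80 Mel bins, while large-v3 uses 128):

```python
import whisper

model = whisper.load_model("base")

# Load the waveform at 16 kHz and pad/trim it to Whisper's 30-second window.
audio = whisper.load_audio("audio.mp3")
audio = whisper.pad_or_trim(audio)

# Compute the Log-Mel spectrogram that the encoder consumes.
mel = whisper.log_mel_spectrogram(audio).to(model.device)
print(mel.shape)  # (80, 3000): 80 Mel bins over 3000 frames of 10 ms each

# The encoder turns the spectrogram into audio features; detect_language()
# runs it internally and scores the language tokens.
_, probs = model.detect_language(mel)
print(max(probs, key=probs.get))  # most likely language code, e.g. "en"
```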

 

In the final output layer, special tokens are used to tell the single model which task to perform: ASR, translation into English, or language detection.
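
As a rough illustration, the tokenizer bundled with the openai-whisper package exposes this start-of-transcript sequence, which encodes the chosen language and task (a sketch, assuming the whisper.tokenizer helper module):

```python
from whisper.tokenizer import get_tokenizer

# Multilingual tokenizer configured for English transcription.
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# The decoder is primed with special tokens that select the language and task,
# e.g. <|startoftranscript|><|en|><|transcribe|>.
print(tokenizer.sot_sequence)
print(tokenizer.decode(list(tokenizer.sot_sequence)))
```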

 

The Different Variants of Whisper

To allow users to select a model according to their requirements and use cases, OpenAI has released multiple variants of the Whisper model. They differ in size, resource utilization, speed, and accuracy.

 

Six variants have been released so far: tiny, base, small, medium, large (large, large-v2, large-v3), and turbo.

Tiny

  • Parameters: 39 Million
  • Required VRAM: ~1GB
  • Relative speed: Approximately 10 times faster than a large model
  • It has an English-only model as well, named tiny.en 

Base

  • Parameters: 74 Million
  • Required VRAM: ~1GB
  • Relative speed: Approximately 7 times faster than a large model
  • It also has an English-only model named base.en

Small

  • Parameters: 244 Million
  • Required VRAM: ~2GB
  • Relative speed: Approximately 4 times faster than a large model
  • Its English-only model is named small.en

Medium

  • Parameters: 769 Million
  • Required VRAM: ~5GB
  • Relative speed: Approximately 2 times faster than a large model
  • An English-only model is also available, named medium.en

Large

  • Parameters:  1550 Million
  • Required VRAM: ~10 GB
  • Relative speed: Slowest among all variants 
  • Two more variants:
    • Large v2:
      • The same model size and parameters as a large model 
      • Trained for 2.5 times more epochs
      • Trained with SpecAugment (a data augmentation method for ASR), stochastic depth, and BPE dropout for regularization
    • Large v3
      • The same model size and parameters as a large model
      • A new language token for Cantonese was introduced
      • 128 Mel frequency bins were used in input instead of 80
      • Trained on 4 million hours of pseudo-labeled audio collected using large-v2 and 1 million hours of weakly labeled audio. It was trained for 2.0 epochs over this mixture dataset

Turbo

  • Parameters: 809 Million
  • Required VRAM: ~6GB
  • Relative speed: Approximately 8 times faster than the large model
  • An optimized version of large v3; it has 4 decoder layers instead of 32 (all large models have 32 decoder layers)
  • It was trained for two more epochs on the same data used to train the large series of models

 

Below are the architectural details of these models. 

 

[Figure: Architectural details of the Whisper model family]

Source: Paper “Robust Speech Recognition via Large-Scale Weak Supervision”

 

Here are some available benchmarks:

 

[Figure: WER by language for Whisper large-v2, large-v3, and turbo]

Source: https://github.com/openai/whisper/discussions/2363

 

 

 

[Figure: Average WER versus relative inference speed for the Whisper models]

Source: https://github.com/openai/whisper/discussions/2363

 

From the available benchmarks by OpenAI, we can see that Whisper Large-v3 delivers the best performance. For most languages, both Large-v2 and Large-v3 have a lower “Word Error Rate” (WER) compared to the Turbo model. However, for some languages, the Turbo model achieves a WER close to that of Large-v2. The second figure indicates that the Turbo model has an average WER similar to Medium and Large-v2, but its inference speed is closer to that of the Base and Tiny models.

 

The Turbo model is a good choice when a user requires high inference speed alongside good ASR accuracy. On the other hand, Tiny and Base models are significantly faster, making them ideal for applications where speed is a higher priority than accuracy. For scenarios involving low-resource languages or cases where high transcription quality is essential, Whisper Large-v2 and Large-v3 are more suitable.

 

Analysis of All Models

Let’s test the models for transcription using English audio.

 

The Ground Truth

The actual transcription of an audio file with a duration of 1 minute and 5 seconds is as follows:

Today, we are going to talk about the power of AI in everyday life. 

 

Hello and welcome!

 

Today, let’s take a moment to explore how artificial intelligence is transforming the way we live, work, and connect.

 

Think about it—every time you unlock your phone with your face, ask a smart assistant for the weather, or get recommendations for your next favorite show, AI is quietly working behind the scenes.

 

But AI isn’t just about convenience. It’s solving big problems too. From diagnosing diseases early to helping farmers grow more food with less waste, the possibilities are endless.

 

As AI evolves, so do the questions we ask: How can we ensure it’s ethical? How do we make it accessible to everyone? These are challenges worth solving.

 

So, as we continue this journey into the future, remember: that AI isn’t just about machines getting smarter. It’s about making our lives better.

 

Thanks for listening—and here’s to the incredible world of possibilities ahead!

 

Experimentation Summary

The results of the experimentation, conducted on a T4 GPU with 15 GB VRAM, are summarized in the following table. For WER calculations, all punctuation and line breaks were removed from both the ground truth and model results to ensure a fair comparison.

Model       WER (%)   Inference time (secs)   GPU VRAM (GB)
Tiny        11.32      2.25                    0.4
Base        10.69      2.41                    0.6
Small        8.8       4.30                    1.7
Medium       8.17      8.26                    4.6
Large        8.17     12.25                    9.4
Large v2     8.17     13.34                    9.5
Large v3     8.17     13.53                    9.5
Turbo       11.94      3.20                    5.1

 

As seen in the table above, the WER is lowest for the Medium, Large, Large-v2, and Large-v3 models for this English audio. The WER of the Turbo model is closer to that of the Tiny model. Consistent with the official claims, we also observe the lowest inference time for the Tiny model, which increases as the model size grows. The inference time of the Turbo model falls between that of the Base and Small models.
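
For reference, a measurement of this kind can be sketched in a few lines of Python, assuming a recent openai-whisper release (one that includes the turbo checkpoint) and the jiwer package for WER; the ground-truth string and audio path below are placeholders:

```python
import time

import torch
import whisper
from jiwer import wer

GROUND_TRUTH = "today we are going to talk about the power of ai in everyday life ..."  # full script


def normalize(text: str) -> str:
    # Drop punctuation and line breaks so WER reflects word differences only.
    text = text.lower().replace("\n", " ")
    text = "".join(ch for ch in text if ch.isalnum() or ch.isspace())
    return " ".join(text.split())


def evaluate(model_name: str, audio_path: str) -> None:
    torch.cuda.reset_peak_memory_stats()
    model = whisper.load_model(model_name, device="cuda")

    start = time.time()
    result = model.transcribe(audio_path)
    elapsed = time.time() - start

    error = wer(normalize(GROUND_TRUTH), normalize(result["text"]))
    vram_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"{model_name}: WER={error:.2%}, time={elapsed:.2f}s, peak VRAM={vram_gb:.1f} GB")

    # Free the model before loading the next one.
    del model
    torch.cuda.empty_cache()


for name in ["tiny", "base", "small", "medium", "large", "large-v2", "large-v3", "turbo"]:
    evaluate(name, "audio.mp3")
```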

 

Adaptations of Whisper Model

Faster-Whisper

Faster-Whisper is a reimplementation of OpenAI's Whisper model that uses CTranslate2, a fast inference engine for transformer models. It is claimed to be up to four times faster than the original implementation at the same accuracy, while using fewer resources. Both segment-level and word-level timestamps are available in the output.
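
A minimal usage sketch, assuming the faster-whisper package (model name and audio file are placeholders):

```python
from faster_whisper import WhisperModel

# CTranslate2 backend; float16 on GPU (use compute_type="int8" on CPU).
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Segments are generated lazily; word_timestamps=True adds word-level timing.
segments, info = model.transcribe("audio.mp3", word_timestamps=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    for word in segment.words:
        print(f"  {word.start:.2f}s {word.word}")
```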

HuggingFace Implementation

The HuggingFace implementation gives users the option of batched inference, which speeds up transcription at the cost of increased GPU memory usage. It is compatible with all of Whisper's decoding options, giving users control over parameters such as temperature, beam size, etc.

 

It also offers the option to use Flash Attention 2, which further accelerates inference. Additionally, there is an option to use a chunked long-form approach. This approach chunks the audio into smaller overlapping parts, processes each part separately, and then stitches them together, unlike the sequential processing of the Whisper model. Users can obtain both segment-level and word-level timestamps in the output.
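
A rough sketch of this route, assuming the transformers library and a CUDA device (the chunk length and batch size below are illustrative):

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
    # model_kwargs={"attn_implementation": "flash_attention_2"},  # optional, needs flash-attn
)

result = pipe(
    "audio.mp3",
    chunk_length_s=30,       # chunked long-form: overlapping 30-second windows
    batch_size=8,            # batched inference: faster, more GPU memory
    return_timestamps=True,  # use "word" for word-level timestamps
    generate_kwargs={"task": "transcribe"},
)

print(result["text"])
for chunk in result["chunks"]:
    print(chunk["timestamp"], chunk["text"])
```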

Insanely Fast Whisper

It is a lightweight CLI for running Whisper models locally on a user’s device, built on the HuggingFace implementation of Whisper. The use of batching and Flash Attention makes it even faster than other implementations, and it also supports MPS (Macs with Apple Silicon). It is claimed that 2.5 hours of audio can be transcribed in less than 98 seconds. Users can obtain both segment-level and word-level timestamps in the output.

WhisperX

WhisperX uses Faster-Whisper as its transcription model. Unlike the typical approach, where word-level timestamps are derived from utterance- and sentence-level predictions, WhisperX employs a separate model (wav2vec2) to align the spoken audio with the transcribed text (a process known as forced alignment). This implementation also supports batched inference, at the cost of increased GPU resource usage.

 

However, there are some limitations, such as the inability to provide timestamps for numerical tokens like "$70" or "1999," because such tokens are typically not in the alignment model's vocabulary. Additionally, a language-specific alignment model is required for optimal performance.
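
A sketch of this two-stage pipeline, assuming the whisperx package (the model names and parameters below are illustrative):

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# Stage 1: transcription with the Faster-Whisper backend (batched).
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Stage 2: forced alignment with a language-specific wav2vec2 model
# to refine the word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)

for segment in aligned["segments"]:
    print(segment["start"], segment["end"], segment["text"])
```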

WhisperLive

WhisperLive is an application designed to provide near-live transcriptions of audio from a microphone or an audio file. It works by running a server on the host machine (on a specific port) and clients on the user machine. The user can set limits on the number of clients a host server can handle simultaneously, as well as the maximum timeout.

 

By default, the host initializes a separate Whisper model for each client. However, if the user enables the 'single model mode,' it ensures that the same model is shared across all clients.

Whisper.cpp

Whisper.cpp is a lightweight and efficient implementation, particularly useful for resource-limited applications. It stores the model weights in GGML format, enabling it to run in C/C++ environments, thereby eliminating the need for Python environments or GPU resources. It can be easily executed on macOS, embedded systems, and CPUs.

 

Comparison

The author of Faster-Whisper benchmarked several implementations using a 13-minute audio file, as shown below:

 

[Figure: Benchmark of Whisper implementations on a 13-minute audio file]

Source: https://github.com/SYSTRAN/faster-whisper

 

It can be seen that Faster-Whisper is the fastest among the implementations shown.

 

Conclusion

As discussed throughout the blog, Whisper models have emerged as state-of-the-art in the ASR domain. Many developers have created their own implementations based on these models. Additionally, several API services are available for Whisper inference. OpenAI itself offers a transcription API that uses the Whisper Large-v2 model, and Groq provides APIs for Whisper models with blazing-fast inference speeds. Organizations can leverage machine learning solutions to implement similar advanced models. There are also many other third-party paid and free options available.
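
For example, a minimal sketch of calling OpenAI's hosted transcription endpoint, assuming the official openai Python SDK (v1+) and an API key in the OPENAI_API_KEY environment variable:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hosted transcription with the whisper-1 model.
with open("audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```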

 

Various applications have now been built on Whisper models. Developers have even created distilled versions of Whisper and JAX implementations.

 

In conclusion, each Whisper model is suited to specific use cases. There is a trade-off between accuracy and speed, and users must select models based on their needs. For fast, accurate results with control over decoding options and ease of implementation, Faster-Whisper would be the ideal choice. If highly accurate word-level timestamps are a priority, WhisperX is a great alternative.
