AI in Audio: Exploring Whisper and Its Variants with an English Audio Sample Analysis
Artificial Intelligence is, at its core, the development of machines capable of performing tasks that previously required human intelligence. AI has transformed virtually every industry, and the audio industry is no exception.
AI in audio is deeply embedded in our daily lives, enabling powerful applications like voice assistants, real-time transcription, speech recognition, language translation, and sound processing.
Among the various breakthroughs, one of the most powerful models in the audio domain is Whisper. It is an open-source model developed by OpenAI, designed for speech recognition and transcription.
In this blog, we will explore Whisper in depth, along with its variants and adaptations. We will analyze them using English-language audio samples and compare their performance.
Whisper and its Variants
The first version of Whisper was released by OpenAI in September 2022 to perform Automatic Speech Recognition (ASR) with near-human accuracy, especially for English.
Whisper was trained on a massive 680,000 hours of data, which makes it robust to accents, background noise, and technical language. It is multilingual, i.e., it can convert audio to text in multiple languages, and it can also translate speech from any supported language into English.
As a result, it can process complex audio inputs, handle diverse languages, and perform well even in challenging acoustic conditions. Whisper can also identify the language of an audio clip.
In short, Whisper can do the following (a minimal usage sketch follows the list):
1. Perform ASR, i.e., convert audio into text
2. Perform language translation, i.e. convert audio from any language to English
3. Detect the language of the audio
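Here is a minimal sketch of these three capabilities using the openai-whisper Python package; the model size and file names are illustrative.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")  # any variant name works here

# 1. ASR: convert audio into text
result = model.transcribe("speech.mp3")
print(result["text"])

# 2. Translation: transcribe non-English audio directly into English
translated = model.transcribe("speech_french.mp3", task="translate")
print(translated["text"])

# 3. Language detection: transcribe() also reports the detected language
print(result["language"])
```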
Model Architecture
Whisper is an end-to-end Transformer-based model with an encoder-decoder architecture.
As shown in the image above, the Whisper model comprises two parts: the encoder and the decoder.
1. The encoder takes the audio as input and generates its encodings. It consists of multiple stacked Transformer blocks, each with a self-attention mechanism. The audio input is first converted into a log-Mel spectrogram, which captures frequency information over time and is a standard feature representation for audio processing.
2. The decoder generates the output sequence, either a transcription or a translation, using the encoded audio features from the encoder. It uses multi-head attention: cross-attention to attend to the encoder output and self-attention over the text sequence generated so far.
Special tokens are used to tell this single model which task to perform and what output to produce: ASR, translation into English, or language detection.
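The same pipeline can be driven step by step with the lower-level API of the openai-whisper package, which mirrors the encoder/decoder split described above. This is a minimal sketch; the file name is illustrative.

```python
import whisper

model = whisper.load_model("base")

# Load the audio, pad/trim it to 30 seconds, and compute the log-Mel
# spectrogram that the encoder consumes.
audio = whisper.load_audio("speech.mp3")
audio = whisper.pad_or_trim(audio)
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Language detection from the encoded audio.
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# The task option controls the special token fed to the decoder
# (transcribe vs. translate).
options = whisper.DecodingOptions(task="transcribe")
result = whisper.decode(model, mel, options)
print(result.text)
```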
The Different Variants of Whisper
To let users pick a model that suits their requirements and use case, OpenAI has released multiple variants of Whisper. They differ in size, resource utilization, speed, and accuracy.
Six variants have been released so far: tiny, base, small, medium, large (large, large-v2, large-v3), and turbo.
Tiny
Parameters: 39 Million
Required VRAM: ~1GB
Relative speed: Approximately 10 times faster than a large model
It has an English-only model as well, named tiny.en
Base
Parameters: 74 Million
Required VRAM: ~1GB
Relative speed: Approximately 7 times faster than a large model
It also has an English-only model named base.en
Small
Parameters: 244 Million
Required VRAM: ~2GB
Relative speed: Approximately 4 times faster than a large model
Its English-only model is named small.en
Medium
Parameters: 769 Million
Required VRAM: ~5GB
Relative speed: Approximately 2 times faster than a large model
An English-only model is also available, named medium.en
Large
Parameters: 1550 Million
Required VRAM: ~10 GB
Relative speed: Slowest among all variants
Two more variants:
Large v2:
The same model size and parameters as a large model
Trained for 2.5 times more epochs
Trained with SpecAugment (a data augmentation method for ASR), stochastic depth, and BPE dropout for regularization
Large v3
The same model size and parameters as a large model
A new language token for Cantonese was introduced
128 Mel frequency bins were used in input instead of 80
Trained on 4 million hours of pseudo-labeled audio collected using large-v2 and 1 million hours of weakly labeled audio. It was trained for 2.0 epochs over this mixture dataset
Turbo
Parameters: 809 Million
Required VRAM: ~6GB
Relative speed: Approximately 8 times faster than the large model
An optimized version of large v3; it has 4 decoder layers instead of 32 (all large models have 32 decoder layers)
It was trained for two additional epochs on the same data used to train the large series
Below are the architectural details of these models.
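As a quick complementary check, the openai-whisper package can list the available checkpoints and confirm a variant's parameter count locally (a minimal sketch; loading a model downloads its weights).

```python
import whisper

# Checkpoint names exposed by the package (tiny, base, ..., turbo).
print(whisper.available_models())

# Confirm the parameter count of a variant, e.g. tiny (~39M).
model = whisper.load_model("tiny", device="cpu")
n_params = sum(p.numel() for p in model.parameters())
print(f"tiny: {n_params / 1e6:.0f}M parameters")
```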
From the available benchmarks by OpenAI, we can see that Whisper Large-v3 delivers the best performance. For most languages, both Large-v2 and Large-v3 have a lower “Word Error Rate” (WER) compared to the Turbo model. However, for some languages, the Turbo model achieves a WER close to that of Large-v2. The second figure indicates that the Turbo model has an average WER similar to Medium and Large-v2, but its inference speed is closer to that of the Base and Tiny models.
The Turbo model is a good choice when a user requires high inference speed alongside good ASR accuracy. On the other hand, Tiny and Base models are significantly faster, making them ideal for applications where speed is a higher priority than accuracy. For scenarios involving low-resource languages or cases where high transcription quality is essential, Whisper Large-v2 and Large-v3 are more suitable.
Analysis of All Models
Let’s test the models for transcription using English audio.
The Ground Truth
The actual transcription of an audio file with a duration of 1 minute and 5 seconds is as follows:
Today, we are going to talk about the power of AI in everyday life.
Hello and welcome!
Today, let’s take a moment to explore how artificial intelligence is transforming the way we live, work, and connect.
Think about it—every time you unlock your phone with your face, ask a smart assistant for the weather, or get recommendations for your next favorite show, AI is quietly working behind the scenes.
But AI isn’t just about convenience. It’s solving big problems too. From diagnosing diseases early to helping farmers grow more food with less waste, the possibilities are endless.
As AI evolves, so do the questions we ask: How can we ensure it’s ethical? How do we make it accessible to everyone? These are challenges worth solving.
So, as we continue this journey into the future, remember: that AI isn’t just about machines getting smarter. It’s about making our lives better.
Thanks for listening—and here’s to the incredible world of possibilities ahead!
Experimentation Summary
The results of the experimentation, conducted on a T4 GPU with 15 GB VRAM, are summarized in the following table. For WER calculations, all punctuation and line breaks were removed from both the ground truth and model results to ensure a fair comparison.
| Model | WER (%) | Inference time (secs) | GPU VRAM (GB) |
| --- | --- | --- | --- |
| Tiny | 11.32 | 2.25 | 0.4 |
| Base | 10.69 | 2.41 | 0.6 |
| Small | 8.8 | 4.30 | 1.7 |
| Medium | 8.17 | 8.26 | 4.6 |
| Large | 8.17 | 12.25 | 9.4 |
| Large v2 | 8.17 | 13.34 | 9.5 |
| Large v3 | 8.17 | 13.53 | 9.5 |
| Turbo | 11.94 | 3.20 | 5.1 |
As seen in the table above, the WER is lowest for the Medium, Large, Large-v2, and Large-v3 models for this English audio. The WER of the Turbo model is closer to that of the Tiny model. Consistent with the official claims, we also observe the lowest inference time for the Tiny model, which increases as the model size grows. The inference time of the Turbo model falls between that of the Base and Small models.
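For reproducibility, the WER values above can be computed with any standard implementation; below is a minimal sketch using the jiwer package (an assumption, not the only option), applying the normalization described earlier. The file name and the result variable are illustrative.

```python
import string
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    # Drop punctuation and line breaks, lowercase, and collapse whitespace,
    # mirroring the normalization used for the table above.
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.lower().split())

reference = open("ground_truth.txt").read()  # the transcript shown above
hypothesis = result["text"]                  # e.g. output of model.transcribe(...)

wer = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {wer * 100:.2f}%")
```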
Adaptations of Whisper Model
Faster-Whisper
Faster-Whisper is a reimplementation of OpenAI's Whisper built on CTranslate2, a fast inference engine for Transformer models. It is claimed to deliver up to four times faster inference than the original implementation with the same accuracy, while using fewer resources. Both segment-level and word-level timestamps are available in the output.
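A minimal Faster-Whisper sketch, based on the project's README; the model size, device, and file name are illustrative.

```python
from faster_whisper import WhisperModel  # pip install faster-whisper

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Returns a generator of segments plus metadata about the detected language.
segments, info = model.transcribe("audio.mp3", beam_size=5, word_timestamps=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```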
HuggingFace Implementation
The HuggingFace implementation provides users with the option to use batches, which speeds up inference time at the cost of increased GPU resources. It is compatible with all the decoding options of Whisper, giving users control over parameters such as temperature, beam size, etc.
It also offers the option to use Flash Attention 2, which further accelerates inference. Additionally, there is an option to use a chunked long-form approach. This approach chunks the audio into smaller overlapping parts, processes each part separately, and then stitches them together, unlike the sequential processing of the Whisper model. Users can obtain both segment-level and word-level timestamps in the output.
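A minimal sketch of the Hugging Face pipeline with chunked long-form decoding, batching, and word-level timestamps; the checkpoint, device, and file name are illustrative.

```python
import torch
from transformers import pipeline  # pip install transformers

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Chunked long-form transcription with batched inference.
out = asr(
    "audio.mp3",
    chunk_length_s=30,
    batch_size=8,
    return_timestamps="word",
)
print(out["text"])
```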
Insanely Fast Whisper
It is a lightweight CLI that can be used to run Whisper models locally on a user’s device. It is built upon the HuggingFace implementation of Whisper. The use of batching and Flash Attention makes it even faster than other implementations. It also provides support for MPS (Macs with Apple Silicon). It is claimed that 2.5 hours of audio can be transcribed in less than 98 seconds! Users can obtain both segment-level and word-level timestamps in the output.
WhisperX
WhisperX uses Faster-Whisper as the transcription model. Unlike the typical approach where word-level timestamps are based on utterances and spoken sentences, WhisperX employs a separate model (wav2vec2) to align spoken audio with text (a process known as forced alignment). This implementation also supports batched inference, but at the cost of increased GPU resource usage.
However, there are some limitations, such as the inability to provide timestamps for numerical digits like "$70" or "1999," because such terms are not typically part of the alignment model. Additionally, a language-specific alignment model is required for optimal performance.
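A typical two-step WhisperX flow (batched transcription followed by forced alignment), sketched from the project's README; the file name, device, and batch size are illustrative.

```python
import whisperx  # pip install whisperx

device = "cuda"
audio = whisperx.load_audio("audio.mp3")

# 1. Transcribe with the Faster-Whisper backend using batched inference.
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# 2. Force-align with a wav2vec2 model to get accurate word-level timestamps.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(result["segments"][0]["words"])
```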
WhisperLive
WhisperLive is an application designed to provide near-live transcriptions of audio from a microphone or an audio file. It works by running a server on the host machine (on a specific port) and clients on the user machine. The user can set limits on the number of clients a host server can handle simultaneously, as well as the maximum timeout.
By default, the host initializes a separate Whisper model for each client. However, if the user enables the 'single model mode,' it ensures that the same model is shared across all clients.
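A rough client-side sketch based on the WhisperLive README; the import path, host, port, and parameters below are assumptions and may differ between versions.

```python
from whisper_live.client import TranscriptionClient  # assumed import path

# Connect to a WhisperLive server already running on localhost:9090.
client = TranscriptionClient(
    "localhost", 9090,  # host and port of the server (illustrative)
    lang="en",
    model="small",
)

# Transcribe a file; calling client() with no argument streams from the microphone.
client("audio.wav")
```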
Whisper.cpp
Whisper.cpp is a lightweight and efficient implementation, particularly useful for resource-limited applications. It stores the model weights in GGML format, enabling it to run in C/C++ environments, thereby eliminating the need for Python environments or GPU resources. It can be easily executed on macOS, embedded systems, and CPUs.
Comparison
The author of Faster-Whisper conducted a comparative analysis of several models using a 13-minute audio file, as shown below:
It can be seen that Faster-Whisper is the fastest among all the models mentioned in the image.
Conclusion
As discussed throughout the blog, Whisper models have emerged as the state of the art in the ASR domain. Many developers have built their own implementations on top of them. Additionally, several API services are available for Whisper inference: OpenAI itself offers a transcription API that uses the Whisper Large v2 model, and Groq provides APIs for Whisper models with blazing-fast inference speeds. There are also many other third-party paid and free options available.
Various applications have now been built on Whisper models. Developers have even created distilled versions of Whisper and JAX implementations.
In conclusion, each Whisper model is suited to specific use cases. There is a trade-off between accuracy and speed, and users must select models based on their needs. For fast, high-quality results with control over decoding options and ease of implementation, Faster-Whisper would be the ideal choice. If highly accurate word-level timestamps are a priority, WhisperX is a great alternative.