AI in audio: Exploring Whisper and Its Variants with English Audio Samples Analysis


Artificial Intelligence is, at its core, the development of machines capable of performing tasks that previously required human intelligence. AI has transformed industry after industry, and audio is no exception.

 

AI in audio is deeply embedded in our daily lives, enabling powerful applications such as voice assistants, real-time transcription, speech recognition, language translation, and sound processing.

 

Among the various breakthroughs, one of the most powerful models in the audio domain is Whisper. It is an open-source model developed by OpenAI, designed for speech recognition and transcription. 

 

In this blog, we are going to explore Whisper in depth, along with its variants and adaptations. We will analyze them using English-language audio samples and compare their performance.

 

Whisper and Its Variants

The first version of Whisper was released by OpenAI in September 2022 to perform Automatic Speech Recognition (ASR) with near human-level capability, especially for English.

 

Whisper was trained on a massive 680,000 hours of audio, which makes it robust to accents, background noise, and technical language. It is multilingual, i.e., it can convert audio to text in multiple languages. Moreover, Whisper can translate speech from many supported languages into English.

 

As a result, it can process complex audio inputs, handle diverse languages, and perform well even in challenging acoustic conditions. Whisper can also identify the language of an audio clip.

 

In short, it can do three things (a minimal usage sketch follows the list):

1. Perform ASR, i.e., convert audio into text

2. Perform speech translation, i.e., convert audio from any supported language into English text

3. Detect the language of the audio
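Here is a minimal usage sketch of all three capabilities with the open-source `openai-whisper` Python package (the file name `speech.mp3` is a placeholder):

```python
import whisper

model = whisper.load_model("base")

# 1. ASR: transcribe the audio in its original language.
result = model.transcribe("speech.mp3")
print(result["text"])

# 2. Translation: transcribe speech from another language directly into English.
translated = model.transcribe("speech.mp3", task="translate")
print(translated["text"])

# 3. Language identification on the first 30 seconds of audio.
audio = whisper.pad_or_trim(whisper.load_audio("speech.mp3"))
mel = whisper.log_mel_spectrogram(audio).to(model.device)
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")
```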

 

Model Architecture

Whisper is an end-to-end transformer model with an encoder-decoder architecture.

 

Figure: Whisper's encoder-decoder transformer architecture. Source: the paper “Robust Speech Recognition via Large-Scale Weak Supervision”.

 

As shown in the image above, the Whisper model comprises two parts: the encoder and the decoder.

 

1. The encoder takes the audio as input and generates its encodings. It consists of multiple stacked transformer blocks, each with a self-attention mechanism. The audio input is first converted into a log-Mel spectrogram, which captures frequency content over time and is a standard feature representation for audio processing.

 

2. The decoder generates the output sequence for the transcription or translation task, using the encoded audio features from the encoder. It applies multi-head attention: cross-attention to attend to the encoder output, and self-attention over the text generated so far.

 

Rather than training separate models per task, special tokens in the decoder's token sequence tell the single model what to produce: a transcription (ASR), an English translation, or the detected language.
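As an illustration, the tokenizer bundled with the `openai-whisper` package exposes this start-of-transcript prompt:

```python
from whisper.tokenizer import get_tokenizer

# Multilingual tokenizer primed for English transcription.
tokenizer = get_tokenizer(multilingual=True, language="en", task="transcribe")

# Token ids for the decoder prompt, corresponding to
#   <|startoftranscript|> <|en|> <|transcribe|>
# With task="translate", the <|translate|> token takes the place of
# <|transcribe|>.
print(tokenizer.sot_sequence)
```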

 

The Different Variants of Whisper

To let users select a model that matches their requirements and use cases, OpenAI has released multiple variants of Whisper. They differ in size, resource utilization, speed, and accuracy.

 

Six variants have been released so far: tiny, base, small, medium, large (large, large-v2, large-v3), and turbo. A timing sketch for comparing them on your own audio follows the per-variant details below.

Tiny

  • Parameters: 39 Million
  • Required VRAM: ~1GB
  • Relative speed: Approximately 10 times faster than a large model
  • It has an English-only model as well, named tiny.en 

Base

  • Parameters: 74 Million
  • Required VRAM: ~1GB
  • Relative speed: Approximately 7 times faster than a large model
  • It also has an English-only model named base.en

Small

  • Parameters: 244 Million
  • Required VRAM: ~2GB
  • Relative speed: Approximately 4 times faster than a large model
  • Its English-only model is named small.en

Medium

  • Parameters: 769 Million
  • Required VRAM: ~5GB
  • Relative speed: Approximately 2 times faster than a large model
  • An English-only model is also available, named medium.en

Large

  • Parameters: 1550 Million
  • Required VRAM: ~10 GB
  • Relative speed: Slowest among all variants 
  • Two more variants:
    • Large v2:
      • The same model size and parameters as a large model 
      • Trained for 2.5 times more epochs
      • Trained with SpecAugment (a data augmentation method for ASR), stochastic depth, and BPE dropout for regularization
    • Large v3
      • The same model size and parameters as a large model
      • A new language token for Cantonese was introduced
      • 128 Mel frequency bins were used in input instead of 80
      • Trained on 4 million hours of pseudo-labeled audio collected using large-v2 plus 1 million hours of weakly labeled audio, for 2.0 epochs over this mixed dataset

Turbo

  • Parameters: 809 Million
  • Required VRAM: ~6GB
  • Relative speed: Approximately 8 times faster than the large model
  • An optimized version of large v3; it has 4 decoder layers instead of 32 (all large models have 32 decoder layers)
  • It was trained for two additional epochs on the same data used for the large series of models
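A rough way to compare the variants on your own audio, assuming the `openai-whisper` package and a placeholder `speech.mp3`, is a timing loop like this sketch:

```python
import time
import whisper

AUDIO = "speech.mp3"  # placeholder file

# Model names match the officially released variants.
for name in ["tiny", "base", "small", "medium", "large-v3", "turbo"]:
    model = whisper.load_model(name)  # weights are downloaded on first use
    start = time.perf_counter()
    text = model.transcribe(AUDIO)["text"]
    print(f"{name:>8}: {time.perf_counter() - start:6.2f}s  {text[:60]}")
```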

 

Below are the architectural details of these models. 

 

Figure: architectural details of the Whisper model family. Source: the paper “Robust Speech Recognition via Large-Scale Weak Supervision”.

 

Here are some available benchmarks:

 

Figure: per-language WER for the Whisper models. Source: https://github.com/openai/whisper/discussions/2363

 

 

 

Figure: average WER versus relative inference speed for the Whisper variants. Source: https://github.com/openai/whisper/discussions/2363

 

From the available benchmarks by OpenAI, we can see that Whisper Large-v3 delivers the best performance. For most languages, both Large-v2 and Large-v3 have a lower “Word Error Rate” (WER) compared to the Turbo model. However, for some languages, the Turbo model achieves a WER close to that of Large-v2. The second figure indicates that the Turbo model has an average WER similar to Medium and Large-v2, but its inference speed is closer to that of the Base and Tiny models.

 

The Turbo model is a good choice when a user requires high inference speed alongside good ASR accuracy. On the other hand, Tiny and Base models are significantly faster, making them ideal for applications where speed is a higher priority than accuracy. For scenarios involving low-resource languages or cases where high transcription quality is essential, Whisper Large-v2 and Large-v3 are more suitable.

 

Analysis of All Models

Let’s test the models for transcription using English audio.

 

The Ground Truth

The actual transcription of an audio file with a duration of 1 minute and 5 seconds is as follows:

Today, we are going to talk about the power of AI in everyday life. 

 

Hello and welcome!

 

Today, let’s take a moment to explore how artificial intelligence is transforming the way we live, work, and connect.

 

Think about it—every time you unlock your phone with your face, ask a smart assistant for the weather, or get recommendations for your next favorite show, AI is quietly working behind the scenes.

 

But AI isn’t just about convenience. It’s solving big problems too. From diagnosing diseases early to helping farmers grow more food with less waste, the possibilities are endless.

 

As AI evolves, so do the questions we ask: How can we ensure it’s ethical? How do we make it accessible to everyone? These are challenges worth solving.

 

So, as we continue this journey into the future, remember: that AI isn’t just about machines getting smarter. It’s about making our lives better.

 

Thanks for listening—and here’s to the incredible world of possibilities ahead!

 

Experimentation Summary

The results of the experimentation, conducted on a T4 GPU with 15 GB of VRAM, are summarized in the table below. For the WER calculations, all punctuation and line breaks were removed from both the ground truth and the model outputs to ensure a fair comparison (a sketch of this computation follows the table).

Model    | WER (%) | Inference time (secs) | GPU VRAM (GB)
---------|---------|-----------------------|--------------
Tiny     | 11.32   | 2.25                  | 0.4
Base     | 10.69   | 2.41                  | 0.6
Small    | 8.8     | 4.30                  | 1.7
Medium   | 8.17    | 8.26                  | 4.6
Large    | 8.17    | 12.25                 | 9.4
Large v2 | 8.17    | 13.34                 | 9.5
Large v3 | 8.17    | 13.53                 | 9.5
Turbo    | 11.94   | 3.20                  | 5.1
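A minimal sketch of that WER computation, using the third-party `jiwer` library on two illustrative strings:

```python
import string
import jiwer

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and line breaks, collapse whitespace."""
    text = text.replace("\n", " ").lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

reference = "Today, we are going to talk about the power of AI in everyday life."
hypothesis = "today we are going to talk about the power of AI in every day life"

print(f"WER: {jiwer.wer(normalize(reference), normalize(hypothesis)):.2%}")
```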

 

As seen in the table above, the WER is lowest for the Medium, Large, Large-v2, and Large-v3 models for this English audio. The WER of the Turbo model is closer to that of the Tiny model. Consistent with the official claims, we also observe the lowest inference time for the Tiny model, which increases as the model size grows. The inference time of the Turbo model falls between that of the Base and Small models.

 

Adaptations of Whisper Model

Faster-Whisper

Faster-Whisper is a reimplementation of OpenAI's Whisper built on CTranslate2, a fast inference engine for transformer models. It is claimed to be up to four times faster than the original implementation at the same accuracy, while using less memory. Both segment-level and word-level timestamps are available in the output.
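A minimal sketch with the `faster-whisper` package (the file name and device settings are placeholder assumptions):

```python
from faster_whisper import WhisperModel

# "small" on a GPU with 16-bit floats; on a CPU-only machine, use
# device="cpu" and compute_type="int8" instead.
model = WhisperModel("small", device="cuda", compute_type="float16")

# transcribe() returns a generator of segments plus metadata about the audio.
segments, info = model.transcribe("speech.mp3", word_timestamps=True)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```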

HuggingFace Implementation

The HuggingFace implementation lets users batch inputs, which speeds up inference at the cost of extra GPU memory. It is compatible with all of Whisper's decoding options, giving users control over parameters such as temperature and beam size.

 

It also offers the option to use Flash Attention 2, which further accelerates inference. Additionally, there is an option to use a chunked long-form approach. This approach chunks the audio into smaller overlapping parts, processes each part separately, and then stitches them together, unlike the sequential processing of the Whisper model. Users can obtain both segment-level and word-level timestamps in the output.
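A sketch of this HuggingFace pipeline (the model choice, device, and chunk settings are illustrative, not prescriptive):

```python
import torch
from transformers import pipeline

# Whisper via the generic ASR pipeline; float16 and GPU are optional.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small",
    torch_dtype=torch.float16,
    device="cuda:0",
)

# Chunked long-form decoding: 30-second windows, processed 8 at a time.
result = asr(
    "speech.mp3",
    chunk_length_s=30,
    batch_size=8,
    return_timestamps=True,
)
print(result["text"])
```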

Insanely Faster Whisper

It is a lightweight CLI that can be used to run Whisper models locally on a user’s device. It is built upon the HuggingFace implementation of Whisper. The use of batching and Flash Attention makes it even faster than other implementations. It also provides support for MPS (Macs with Apple Silicon). It is claimed that 2.5 hours of audio can be transcribed in less than 98 seconds! Users can obtain both segment-level and word-level timestamps in the output.

WhisperX

WhisperX uses Faster-Whisper as the transcription model. Unlike the typical approach where word-level timestamps are based on utterances and spoken sentences, WhisperX employs a separate model (wav2vec2) to align spoken audio with text (a process known as forced alignment). This implementation also supports batched inference, but at the cost of increased GPU resource usage.

 

However, there are some limitations, such as the inability to provide timestamps for numerical tokens like "$70" or "1999", because such tokens are typically absent from the alignment model's vocabulary. Additionally, a language-specific alignment model is required for optimal performance.
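A short sketch of the typical WhisperX flow per its README (names and settings are illustrative): transcribe with the Faster-Whisper backend, then force-align for word-level timestamps.

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("speech.mp3")  # placeholder file name

# Stage 1: batched transcription with the Faster-Whisper backend.
model = whisperx.load_model("large-v2", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# Stage 2: forced alignment with a language-specific wav2vec2 model.
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
aligned = whisperx.align(result["segments"], align_model, metadata, audio, device)
print(aligned["segments"][0]["words"][:5])  # word-level timestamps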

WhisperLive

WhisperLive is an application designed to provide near-live transcriptions of audio from a microphone or an audio file. It works by running a server on the host machine (on a specific port) and clients on the user machine. The user can set limits on the number of clients a host server can handle simultaneously, as well as the maximum timeout.

 

By default, the host initializes a separate Whisper model for each client. However, if the user enables the 'single model mode,' it ensures that the same model is shared across all clients.
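As a sketch (API names taken from the WhisperLive README; treat them as indicative), a client connecting to a server already running on port 9090 might look like this:

```python
from whisper_live.client import TranscriptionClient

# Connect to a WhisperLive server on localhost:9090.
client = TranscriptionClient(
    "localhost", 9090,
    lang="en",
    translate=False,
    model="small",
)

# Transcribe a file; calling client() with no argument streams from the microphone.
client("speech.mp3")
```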

Whisper.cpp

Whisper.cpp is a lightweight and efficient implementation, particularly useful for resource-limited applications. It stores the model weights in GGML format, enabling it to run in C/C++ environments, thereby eliminating the need for Python environments or GPU resources. It can be easily executed on macOS, embedded systems, and CPUs.

 

Comparison

The author of Faster-Whisper published a comparative benchmark of several implementations using a 13-minute audio file, shown below:

 

Figure: transcription benchmark on a 13-minute audio file. Source: https://github.com/SYSTRAN/faster-whisper

 

It can be seen that Faster-Whisper is the fastest of the implementations included in that benchmark.

 

Conclusion

As discussed throughout the blog, Whisper models have emerged as state-of-the-art in the ASR domain. Many developers have created their implementations based on these models. Additionally, several API services are available for Whisper inference. OpenAI itself offers an API for transcription, which uses the Whisper Large v2 model. Groq provides APIs for using Whisper models, delivering blazing-fast inference speeds. There are also many other third-party paid and free options available.

 

Various applications have now been built on Whisper models. Developers have even created distilled versions of Whisper and JAX implementations.

 

In conclusion, each Whisper model is suited to specific use cases. There is a trade-off between accuracy and speed, and users must select models based on their needs. For fast, accurate results with control over decoding options and ease of implementation, Faster-Whisper would be the ideal choice. And when highly accurate word-level timestamps are a priority, WhisperX is a great alternative.

Ushnah Abbasi

I am a Senior Machine Learning Engineer with over three years of experience in building AI-driven solutions. I have worked with advanced technologies such as deep learning, NLP, and computer vision. I am passionate about tackling complex challenges and continuously pushing the boundaries of what AI can achieve.
