Whisper on the line

Testing transcription with OpenAI Whisper models

By Nicolas Gambardella

Artificial Intelligence is on the front page of all newspapers these days (often for the wrong reasons…). Of course, AI is not new, the field dating from the middle of the last century; neither are cascading neural networks, which I used in the 1990s to predict protein secondary structures. However, the field really exploded with Deep Learning. Some stories made the news over the years, for instance, DeepMind’s AlphaGo beating a Go world champion and AlphaFold solving the long-standing problem of predicting protein 3D structures (well, at least the non-flexible conserved parts…). Still, most successes remained confined to their respective domains (e.g., environment recognition for autonomous vehicles, tumour identification, and machine translation such as Google Translate and DeepL). What brought AI to everyone’s attention, however, was OpenAI’s DALL·E and, more recently, ChatGPT, which can generate images and text following a textual prompt.

OpenAI has actually developed many other Deep Learning models, some of them open source. Among those, the Whisper models promise to be a real improvement in speech recognition and transcription. Therefore, I decided to test Whisper on two different tasks: a scientific presentation containing technical terminology, given by a US speaker, and a song containing cultural references, sung by an American singer but with no accent (i.e., BBC-like…).

More details on OpenAI Whisper can be found on GitHub and in the scientific publication.

Setup

Lenovo ThinkPad T470, no GPU.
Linux Ubuntu 22.10
Python 3.10.7
PyTorch 2.0
I used the models for speech recognition in the video editing tool Kdenlive 23.04
(Update 31 May 2023: I was able to run Whisper on Linux Ubuntu 23.04 with Python 3.11.2 and Kdenlive 23.04.1, although installing all the bits and pieces was not easy.)

Whisper can be used locally or through an API, via Python or R (for instance, with the package openai), although a subscription might be needed if you have already used up your free tokens.
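For reference, here is a minimal sketch of the local route in Python, assuming the open-source whisper package is installed (pip install -U openai-whisper) together with ffmpeg; the file name is only a placeholder.

import whisper

# Load one of the model sizes (the weights are downloaded on first use).
model = whisper.load_model("small")

# Transcribe an audio or video file; ffmpeg handles the decoding.
result = model.transcribe("presentation.mp4", language="en")

print(result["text"])                      # full transcription
for segment in result["segments"]:         # timecoded fragments
    print(segment["start"], segment["end"], segment["text"])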

There are five Whisper models.

Model     Parameters    Required VRAM    Relative speed
Tiny      39 M          ~1 GB            ~32x
Base      74 M          ~1 GB            ~16x
Small     244 M         ~2 GB            ~6x
Medium    769 M         ~5 GB            ~2x
Large     1550 M        ~10 GB           1x

Note that these models run perfectly well on CPUs. While GPUs are necessary for training Deep Learning models, this is not the case when using them (although inference on a CPU will always be slower, since a CPU cannot parallelise the underlying floating-point computations as massively as a GPU).
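For instance, a model can be loaded explicitly on the CPU with the open-source package; fp16 is disabled because half-precision is not supported on CPUs. This is only a sketch, not necessarily the exact setup used here.

import whisper

# Force CPU inference; on a machine without a GPU this is the default anyway.
model = whisper.load_model("medium", device="cpu")

# fp16=False avoids the half-precision warning, as CPUs compute in fp32.
result = model.transcribe("presentation.mp4", fp16=False)
print(result["text"])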

As an external “control”, I used the VOSK model vosk-model-en-us-0.22 (just called VOSK in the rest of this text).
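For completeness, here is a minimal sketch of how VOSK can be run on the same material, assuming the model has been unpacked locally and the audio converted to 16 kHz mono PCM WAV beforehand (file names are placeholders).

import json
import wave
from vosk import Model, KaldiRecognizer

model = Model("vosk-model-en-us-0.22")            # path to the unpacked model
audio = wave.open("presentation_16k.wav", "rb")   # 16 kHz mono PCM WAV
recogniser = KaldiRecognizer(model, audio.getframerate())

pieces = []
while True:
    data = audio.readframes(4000)
    if len(data) == 0:
        break
    if recogniser.AcceptWaveform(data):
        pieces.append(json.loads(recogniser.Result())["text"])
pieces.append(json.loads(recogniser.FinalResult())["text"])

print(" ".join(piece for piece in pieces if piece))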

NB: this is an anecdotal account of my experience and does not replace comprehensive and systematic tests, for instance, using datasets such as TED-LIUM or LibriSpeech test-clean.

I used speech recognition to create subtitles, as it represents the final aim in 95% of my professional transcriptions. Many videos are available to explain how to generate subtitles in Kdenlive, including with Whisper.
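Outside Kdenlive, the segments returned by Whisper can also be turned into an SRT subtitle file directly. Below is a minimal sketch (the whisper command-line tool can do the same with --output_format srt); file names are placeholders.

import whisper

def srt_time(seconds):
    # Convert seconds to the SRT timestamp format HH:MM:SS,mmm.
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"

model = whisper.load_model("small")
result = model.transcribe("presentation.mp4")

with open("presentation.srt", "w", encoding="utf-8") as srt:
    for index, segment in enumerate(result["segments"], start=1):
        srt.write(f"{index}\n")
        srt.write(f"{srt_time(segment['start'])} --> {srt_time(segment['end'])}\n")
        srt.write(segment["text"].strip() + "\n\n")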

Transcription of a scientific presentation

The first thing I observed was how the transcriptions were split into text fragments. I noticed that their placement along the timeline (the timecodes) was systematically off, starting and finishing too early. Since this did not depend on the model, the problem could come from Kdenlive rather than from the models.
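If the offset is indeed systematic, it can be compensated by shifting every segment by a fixed amount before building the subtitles. A rough sketch, reusing the segments returned by Whisper in the examples above (the 0.5-second value is purely illustrative):

def shift_segments(segments, offset):
    # Shift every segment's timecodes by "offset" seconds, clamping at zero.
    return [
        {**segment,
         "start": max(0.0, segment["start"] + offset),
         "end": max(0.0, segment["end"] + offset)}
        for segment in segments
    ]

# For example, delay all fragments by half a second:
# shifted = shift_segments(result["segments"], 0.5)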

The length of the text fragments varied widely. VOSK produced small fragments, perfect for subtitles, but cut anywhere; indeed, VOSK does not introduce punctuation and has no notion of sentence. The best fragments were provided by Tiny and Large: small, perfect for subtitles, and cut at the right places. In contrast, Small produced fragments of consistent length, longer than those of Tiny and Large, and thus too long for subtitles, while the fragments produced by Base and Medium were very heterogeneous, only cut after periods.

The models differed right from the beginning. Tiny produced text that was much worse than VOSK’s (if we ignore the absence of punctuation in the latter): “professor at pathology” instead of “of pathology”, a missing “and”, a period replaced by a comma, and a quite funny “point of care testing” transformed into “point of fear testing”! All the other models produced perfect texts, Small, Medium, and Large even adhering to my beloved Oxford comma.

VOSK: welcome today i'm john smith i'm a professor of pathology microbiology and immunology and medical director of clinical chemistry and point of care testing
Tiny: Welcome today, I'm John Smith. I'm a professor at Pathology, Microbiology, Immunology, and Medical Director of Clinical Chemistry and Point of Fear Testing.
Base: Welcome today. I'm John Smith. I'm a professor of pathology, microbiology and immunology and medical director of clinical chemistry and point of care testing.
Small: Welcome today. I'm John Smith. I'm a professor of pathology, microbiology, and immunology, and medical director of clinical chemistry and point of care testing.
Medium: Welcome today. I'm John Smith. I'm a professor of pathology, microbiology, and immunology and medical director of clinical chemistry and point of care testing.
Large: Welcome today. I’m John Smith. I'm a professor of pathology, microbiology, and immunology, and medical director of clinical chemistry and point of care testing.

A bit further, the speaker described blood gas analysers. VOSK can only analyse speech using dictionaries. As such, it could not guess that pH, pCO2, and pO2 represent acidity, partial pressure of carbon dioxide, and partial pressure of oxygen, respectively, and could not spell them correctly. It also missed the p in pCO2. The result is still understandable for a specialist but would give odd subtitles. The Whisper models consider the whole context and can thus infer the correct meaning of the words. Tiny did not understand pH and pCO2, merging them. All the other models produced perfect texts.

VOSK: [...] instruments came out for p h p c o two and p o two
Tiny: [...] instruments came out for PHPCO2 and PO2.
Base: [...] instruments came out for pH, PCO2 and PO2
Small: [...] instruments came out for PH, PCO2, and PO2.
Medium: [...] instruments came out for pH, PCO2, and PO2.
Large: [...] instruments came out for pH, PCO2, and PO2.

VOSK’s dictionaries sometimes gave it an edge. For instance, it knew that co oximetry was a thing, albeit missing a dash. Only Large got it right here, Base and Medium hearing coaximetry, Small coaxymetry, and Tiny even co-acemetery!

VOSK: [...] we saw co oximetry system
Tiny: [...] we saw co-acemetery systems.
Base: [...] we saw coaximetry systems.
Small: [...] we saw coaxymetry systems.
Medium: [...] we saw coaximetry systems.
Large: [...] we saw co-oxymetry systems.

Another example is the names of chemicals. VOSK recognised benzalkonium, which, among the Whisper models, only Medium got right. Interestingly, the worst performers were Base, which heard benzylconium, and Large, which heard benzoalconium (Tiny and Small heard the word properly but spelt it wrong, with a ‘c’ instead of a ‘k’).

The way Whisper Large can understand mispronounced words is very impressive. In the following example, Small and Large are the only models correctly recognising that there are two separate sentences. More importantly, VOSK and Tiny could not identify the word “micro-clots”. The hyphenation is a matter of debate: while the parsimonious rule should apply (do not use hyphens except when not doing so would generate pronunciation errors), using hyphens after micro and macro is commonplace.

VOSK: [...] to correct the problem like micro plots and and i discussed the injection of a clot busting solution
Tiny: […] to correct the proper microplasks. And I discussed the injection of a clock-busting solution.
Base: […] to correct the problem like microclots and I discussed the injection of a clot busting solution.  (wrong start)
Small: […] to correct the problem, like microclots. And I discussed the injection of a clot-busting solution.
Medium: […] to correct the problem, like microclots, and I discussed the injection of a (wrong start, missed end of sentence.)
Large: […] to correct the problem, like micro-clots. And I discussed the injection of a clot-busting solution.

In conclusion, and unsurprisingly, Whisper Large is the best model, with the surprising exception of the small mistake on “benzalkonium”.

Transcription of a pop culture song

Adding sound on top of speech, such as music, can throw off speech recognition systems. Therefore, I decided to test the models on a song. I chose a very simple song with slow and clearly enunciated lyrics. And frankly, being a nerd, I relished highlighting those fantastic Wizard Rock artists, whose dozens of groups have been celebrating the Harry Potter universe through hundreds of great songs. I used a song by Tonks and the Aurors entitled “what does it means”.

The first striking observation was that I could not compare VOSK to the Whisper models. Indeed, the former was absolutely unable to recognise anything. The “transcription” was made of 27 small fragments of complete gibberish, with barely a few correct words. Only one fragment contained an actual piece of the lyrics: “Everybody says that it’s probably nothing”. However, these words were also identified by all Whisper models.

Now for the Whisper models. The overall recognition was remarkable. Regarding fragment size, only Tiny produced small fragments, perfect for subtitles and cut at the right places. Medium was not too bad. Base and Small produced heterogeneous fragments, only cut after periods. Large produced super short segments, often made up of a single word. Initially, I thought this was very bad. However, I then realised that it was perfect for subtitling the song: just the right rate in characters per second, very readable.

Something odd happened at the start, with some models adding fragments while there was no speech, such as “Add text” (Tiny) or “Music” (Base). Large added the Unicode character “♪” whenever there was music and no speech, which I found nice.

When it came to recognising characters from Harry Potter, the models performed very differently. Tiny and Base did not recognise Mad-Eye Moody, and they also failed to distinguish between Mad-Eye and Molly (Weasley). They also made several other mistakes, e.g., “say” instead of “stay”. Tiny even mistook “vigilant” for “vigilance” and completely ignored the word “here”. It also interpreted “sympathy” as the rather bizarre “said, but me”!

Tiny: Dear, Maddie, I'm trying to say constantly vigilance
Base: Dear, Madhy, I'm trying to say, Constantly vigilant Here, 
Small: Dear Mad Eye, I'm trying to stay constantly vigilant Here, 
Medium: Dear Mad-Eye, I'm trying to stay constantly vigilant here
Large: Dear mad eye I'm trying to stay constantly vigilant Here
Tiny: Dear, Maddie, thanks a lot for your tea and said, but me
Base: Dear, Madhy, thanks a lot for your tea and sympathy
Small: Oh, dear Molly, thanks a lot for your tea and sympathy (“Oh” should be on a segment of its own)
Medium: Oh, dear Molly, thanks a lot for your tea and sympathy (“Oh” should be on a segment of its own)
Large: Dear Molly Thanks a lot for your tea and sympathy

Another example of context-specific knowledge is the Patronus. Note that such a concept is absolutely out of reach for models using dictionaries; only machine learning models trained on Harry Potter material can work it out. And this is the case for Medium and Large. Small is not far off, while Tiny and Base tried to fit the sounds into a common word.

Tiny: But I can't stop thinking about my patrolness 
Base: but I can't stop thinking about my patrolness
Small: Here, but I can't stop thinking about my Patronas
Medium: But I can't stop thinking about my Patronus
Large: But I Can't stop Thinking about My Patronus

Tiny and Base also made several other errors, mistaking “phase” for “face”, or “starlight” for “start light”.

Interestingly, Tiny hallucinated 40 seconds of “Oh, oh, oh, oh” at the end of the song, while Base, Small, and Medium hallucinated a “Thank you for watching”.

Conclusion

While Medium provided an overall slightly better text, Large provided the best subtitles for the song, without hallucinations.

Overall, for both tasks, OpenAI Whisper Large did an excellent job, surpassing many human transcriptions I have had to deal with, in particular when it comes to technical and non-standard terms.
