Audio transcription with Whisper from R

[This article was first published on bnosac :: open analytical helpers, and kindly contributed to R-bloggers.]

Last week, OpenAI released version 2 of Whisper, an updated neural net that approaches human-level robustness and accuracy on speech recognition. You can now call a C/C++ inference engine directly from R, which allows you to transcribe .wav audio files.

To make this easy to do in R, BNOSAC created an R wrapper around the whisper.cpp code. This R package is available at https://github.com/bnosac/audio.whisper and can be installed as follows.

remotes::install_github("bnosac/audio.whisper")

The following code shows how to transcribe an example 16-bit .wav file containing a fragment of a speech by JFK. The sample file is included in the package.

library(audio.whisper)
model <- whisper("tiny")
path  <- system.file(package = "audio.whisper", "samples", "jfk.wav")
trans <- predict(model, newdata = path, language = "en", n_threads = 2)
trans
$n_segments
[1] 1

$data
 segment         from           to                                                                                                       text
       1 00:00:00.000 00:00:11.000  And so my fellow Americans ask not what your country can do for you ask what you can do for your country.

$tokens
 segment      token token_prob
       1        And  0.7476438
       1         so  0.9042299
       1         my  0.6872202
       1     fellow  0.9984470
       1  Americans  0.9589157
       1        ask  0.2573057
       1        not  0.7678108
       1       what  0.6542882
       1       your  0.9386917
       1    country  0.9854987
       1        can  0.9813995
       1         do  0.9937403
       1        for  0.9791515
       1        you  0.9925495
       1        ask  0.3058807
       1       what  0.8303462
       1        you  0.9735528
       1        can  0.9711444
       1         do  0.9616748
       1        for  0.9778513
       1       your  0.9604713
       1    country  0.9923630
       1          .  0.4983074
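
The token_prob column makes it easy to see where the model was unsure. The sketch below is not part of audio.whisper. It rebuilds a few rows of the token table printed above by hand, so it runs without downloading a model, and flags tokens below a cutoff of 0.5. That cutoff is an arbitrary value chosen for illustration. With a real result you would presumably use trans$tokens in place of the hand-built data frame.

```r
# Hand-copied subset of the $tokens table printed above (illustration only)
tokens <- data.frame(
  segment    = 1L,
  token      = c("And", "so", "ask", "not", "ask", "what", "."),
  token_prob = c(0.7476438, 0.9042299, 0.2573057, 0.7678108,
                 0.3058807, 0.8303462, 0.4983074),
  stringsAsFactors = FALSE
)

# Arbitrary cutoff for "uncertain" tokens; tune it for your own audio
threshold <- 0.5
uncertain <- tokens[tokens$token_prob < threshold, ]
print(uncertain)

# Mean token probability per segment as a rough confidence score
segment_conf <- aggregate(token_prob ~ segment, data = tokens, FUN = mean)
print(segment_conf)
```

In this subset both occurrences of "ask" and the final period fall below the cutoff, which suggests these are the places worth checking against the audio first.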

Another example is based on a Micro Machines commercial from the 1980s.

I've always wanted to transcribe the performances of Francis E. Dec available on UbuWeb Sound, such as this one: https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3. This is how you can now do that from R.

library(av)
download.file(url = "https://www.ubu.com/media/sound/dec_francis/Dec-Francis-E_rant1.mp3", 
              destfile = "rant1.mp3", mode = "wb")
av_audio_convert("rant1.mp3", output = "output.wav", format = "wav", sample_rate = 16000)

trans <- predict(model, newdata = "output.wav", language = "en", 
                 duration = 30 * 1000, offset = 7 * 1000, 
                 token_timestamps = TRUE)
trans
$n_segments
[1] 11

$data
segment         from           to                                                             text
      1 00:00:07.000 00:00:09.000                                             Look at the picture.
      2 00:00:09.000 00:00:11.000                                                   See the skull.
      3 00:00:11.000 00:00:13.000                                        The part of bone removed.
      4 00:00:13.000 00:00:16.000                     The master race Frankenstein radio controls.
      5 00:00:16.000 00:00:18.000                           The brain thoughts broadcasting radio.
      6 00:00:18.000 00:00:21.000        The eyesight television. The Frankenstein earphone radio.
      7 00:00:21.000 00:00:25.000  The threshold brain wash radio. The latest new skull reforming.
      8 00:00:25.000 00:00:28.000                            To contain all Frankenstein controls.
      9 00:00:28.000 00:00:31.000                     Even in thin skulls of white pedigree males.
     10 00:00:31.000 00:00:34.000                                   Visible Frankenstein controls.
     11 00:00:34.000 00:00:37.000            The synthetic nerve radio, directional and an alloop.

$tokens
segment         token token_prob   token_from     token_to
      1          Look  0.4281234 00:00:07.290 00:00:07.420
      1            at  0.9485379 00:00:07.420 00:00:07.620
      1           the  0.9758387 00:00:07.620 00:00:07.940
      1       picture  0.9734664 00:00:08.150 00:00:08.580
      1             .  0.9688568 00:00:08.680 00:00:08.910
      2           See  0.9847929 00:00:09.000 00:00:09.420
      2           the  0.7588121 00:00:09.420 00:00:09.840
      2         skull  0.9989663 00:00:09.840 00:00:10.310
      2             .  0.9548351 00:00:10.550 00:00:11.000
      3           The  0.9914295 00:00:11.000 00:00:11.170
      3          part  0.9789217 00:00:11.560 00:00:11.600
      3            of  0.9958754 00:00:11.600 00:00:11.770
      3          bone  0.9759618 00:00:11.770 00:00:12.030
      3       removed  0.9956936 00:00:12.190 00:00:12.710
      3             .  0.9965582 00:00:12.710 00:00:12.940
...
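
In the call above, offset and duration appear to be given in milliseconds (7 * 1000 and 30 * 1000), which matches segments running from 00:00:07 to 00:00:37. To work with the token timestamps numerically, for example to compute how long each token lasts, they first need converting to seconds. The sketch below does this with a hypothetical helper, timestamp_to_seconds(), which is not part of audio.whisper. It assumes the timestamp columns are character strings in the HH:MM:SS.mmm format printed above, and it uses a few hand-copied rows so that it runs without a model.

```r
# Hypothetical helper (not part of audio.whisper):
# convert "HH:MM:SS.mmm" strings to seconds
timestamp_to_seconds <- function(x) {
  parts <- strsplit(x, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p <- as.numeric(p)
    p[1] * 3600 + p[2] * 60 + p[3]
  }, numeric(1))
}

# Hand-copied rows from the $tokens table printed above (illustration only)
tokens <- data.frame(
  token      = c("Look", "at", "the", "picture"),
  token_from = c("00:00:07.290", "00:00:07.420", "00:00:07.620", "00:00:08.150"),
  token_to   = c("00:00:07.420", "00:00:07.620", "00:00:07.940", "00:00:08.580"),
  stringsAsFactors = FALSE
)

# Duration of each token in seconds, rounded to the millisecond
tokens$duration_sec <- round(
  timestamp_to_seconds(tokens$token_to) - timestamp_to_seconds(tokens$token_from),
  3
)
print(tokens)
```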

We may put the package on CRAN in the near future. For now it is only available at https://github.com/bnosac/audio.whisper.

Get in touch if you are interested in this and let us know what you plan to use it for. 
