
Transcribe YouTube videos with Whisper.cpp

I can read more quickly than I can listen. I can search a text file but not an audio file. There are many cases when the written word is superior to the spoken word, but sometimes the source data is spoken. Can I extract spoken word data into something written?

Whisper.cpp is a C/C++ port of OpenAI’s Whisper speech-to-text model: it takes spoken words in, puts written words out, and runs on a Mac laptop. Simon Willison frequently refers to his use of MacWhisper on his blog, so I thought I’d try my hand at a command-line approach to transcribing a YouTube video.

Whisper.cpp is easy to set up, but there’s a learning curve. You need:

  1. ggml-base.en.bin placed in a models/ subdirectory
  2. ggml-metal.metal, the Metal GPU kernel source (a C++-like shader file)
  3. Your audio in WAV format at 16 kHz
$ # Step 1, setup Whisper.cpp
$ brew install whisper-cpp # or install for your machine type
$ mkdir models
$ wget -O models/ggml-base.en.bin "https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.en.bin?download=true"
$ wget -O ggml-metal.metal https://github.com/ggerganov/llama.cpp/raw/master/ggml-metal.metal

Now that Whisper.cpp is ready to run, you need to get your input material. In my case, I got a webm video which had an Opus audio stream. This required some careful but simple conversion:

$ # Step 2, get your input
$ youtube-dl --extract-audio --audio-format best [yt url]
$ ffmpeg -i file_from_youtube_dl_execution.opus -ar 16000 -vn prepared_input.wav
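
If whisper-cpp complains about the sample format, a stricter version of the same conversion spells everything out: mono, 16 kHz, 16-bit PCM. This is just a sketch reusing the placeholder filenames from above.

$ # Optional: be explicit about mono, 16 kHz, 16-bit PCM
$ ffmpeg -i file_from_youtube_dl_execution.opus -vn -ar 16000 -ac 1 -c:a pcm_s16le prepared_input.wav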

Now that you have your input and Whisper.cpp is set up, you can generate your transcript. You can run it bare, as whisper-cpp prepared_input.wav, and copy the output from stdout, or you can pass flags that dictate how it’s output: txt, csv, srt, vtt, lrc, and json.

$ # Step 3, extract your transcript
$ whisper-cpp --output-txt prepared_input.wav
$ less prepared_input.wav.txt # read your transcript!

You can supply as many --output-<fmt> flags as you would like and it will output all formats you request. Pretty neat!
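
For instance, one run can produce a plain transcript plus subtitle files. This is just a sketch using the flags listed above and the input prepared earlier:

$ # Request several formats in a single run
$ whisper-cpp --output-txt --output-srt --output-vtt prepared_input.wav
$ ls prepared_input.wav.* # expect .txt, .srt, and .vtt files next to the input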