Local Speech Recognition with Whisper

jefftk

I've been a heavy user of dictation, off and on, as my wrists have gotten better and worse. I've mostly used the built-in Mac and Android recognition: Mac isn't great, Android is pretty good, neither has improved much over the past ~5y despite large improvements in what should be possible. OpenAI has an open speech recognition model, whisper, and I wanted to have a go at running it on my Mac.

It looks like for good local performance the best version is whisper.cpp, which is a plain C/C++ implementation with support for Mac's ML hardware. To get this installed I needed to install XCode (not just the command line tools, since I needed coremlc) and then run:

$ sudo xcodebuild -license
$ git clone https://github.com/ggerganov/whisper.cpp
$ cd whisper.cpp

$ python3.11 -m venv whisper_v3.11
$ source whisper_v3.11/bin/activate

$ pip install "numpy<2"
$ pip install ane_transformers
$ pip install openai-whisper
$ pip install coremltools
$ brew install sdl2

$ sh ./models/download-ggml-model.sh large-v3-turbo
$ PATH="$PATH:/Applications/Xcode.app/Contents/Developer/usr/bin" \
    ./models/generate-coreml-model.sh large-v3-turbo

$ cmake -B build -DWHISPER_COREML=1 -DWHISPER_SDL2=ON
$ cmake --build build -j --config Release

Note that both older (3.10) and newer (3.13) Python versions gave compilation errors.

While I don't know if these are the ideal arguments, I've been using:

$ ~/code/whisper.cpp/build/bin/whisper-stream \
    --capture 1 \
    --model ~/code/whisper.cpp/models/ggml-large-v3-turbo.bin \
    -t 8 --flash-attn --keep-context --keep 1000 \
    --file output.txt

By default the output is quite repetitive. For example I dictated:

It looks like for good local performance, the best version to use is whisper.cpp, which is a plain C/C++ implementation with support for Mac's machine learning hardware. To get this installed, I needed to install Xcode (not just the command line tools since I needed coremlc), and then run a whole bunch of commands.

The output was:

It looks like for good local performance the best version to use is whisper.cpp, which is a plain C++
to use is whisper.cpp which is a plain C/C++ implementation with support for maximum
machine learning hardware. To get this insight into the
machine learning hardware. To get this installed, I needed to install Xcode, not just the computer
not just the command line tools since I needed core MLC.
not just the command line tools since I needed core MLC, and then run a whole bunch of commands.

The fix for not enough LLM these days is often more LLM, so I used Claude Sonnet 4 to clean it up with the prompt "Could you clean up this Whisper transcript? The things that look like repetitions aren't, it's just that whisper does a lot of 'restating' as it updates it's view of what was said." This gave me:

It looks like for good local performance, the best version to use is whisper.cpp, which is a plain C/C++ implementation with support for maximum machine learning hardware. To get this installed, I needed to install Xcode, not just the command line tools since I needed Core ML, and then run a whole bunch of commands.

This is very good! The only two things it seems to have gotten 'wrong' are "maximum" for "Mac's" and commas where I was thinking parens. And neither of these are very wrong: "Mac's" comes out verbally as "max" and "maximum" is also plausible in context; the commas read fine, perhaps better than my parens.

I set this up a couple weeks ago, and have generally been finding this quite useful.

Comment via: facebook, mastodon, bluesky, substack

LESSWRONG
LW

9

Local Speech Recognition with Whisper

9

New to LessWrong?

9