Project: speech-to-text tools
Command-line speech-to-text utilities built on local models (faster-whisper, Ollama).
Environment
- Debian Bookworm, kernel 6.1, X11
- Conda env: `whisper-ollama` (Python 3.10, CUDA 12.2); mamba must be initialized before use: run `eval "$(micromamba shell hook -s bash)"`
- GPU: NVIDIA (float16 capable)
- xdotool installed for keyboard simulation (X11 only)
Tools
- `assistant.py` / `talk.sh`: transcribe speech, copy to clipboard, optionally send to Ollama
- `voice_to_terminal.py` / `terminal.sh`: voice-controlled terminal via Ollama tool calling
- `voice_to_xdotool.py` / `xdotool.sh`: hands-free voice typing into any focused window (VAD + xdotool)
Shared Library
- `sttlib/`: shared package used by all scripts and importable by other projects
  - `whisper_loader.py`: model loading with GPU→CPU fallback
  - `audio.py`: press-enter recording, PCM conversion
  - `transcription.py`: Whisper transcribe wrapper, hallucination filter
  - `vad.py`: VADProcessor, audio callback, constants
- Other projects import via: `sys.path.insert(0, "/path/to/tool-speechtotext")`
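Since `sttlib/` is a plain package rather than an installed distribution, other projects make it importable by prepending the repo directory to `sys.path`. A minimal sketch (the placeholder path is the one given above; the commented module names mirror the file listing, but any function names would need checking against the actual modules):

```python
import sys

# Make the repo importable from another project (placeholder path as given above).
sys.path.insert(0, "/path/to/tool-speechtotext")

# sttlib's modules then resolve like any other package, e.g.:
# from sttlib import whisper_loader, transcription
```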
Testing
- Run tests: `mamba run -n whisper-ollama python -m pytest tests/`
- Use `--model-size base` for faster iteration during development
- Tests mock hardware (Whisper model, VAD, mic); no GPU/mic is needed to run them
- Audio device is available, so live mic testing is possible
- Test xdotool output by focusing a text editor window
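The hardware mocking the tests rely on can be done with stdlib `unittest.mock`; a minimal illustration of the pattern (the model object, its return shape, and the filename are stand-ins, not the actual sttlib API):

```python
from unittest import mock

# Stand-in Whisper model: no GPU, mic, or model download needed.
# The real tests patch sttlib's own objects; names here are illustrative.
fake_model = mock.Mock()
fake_model.transcribe.return_value = ([], {"language": "en"})

# Code under test would call this exactly as it calls the real model.
segments, info = fake_model.transcribe("speech.wav")  # no real audio is read
```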
Dependencies
- Conda: faster-whisper, sounddevice, numpy, pyperclip, requests, ollama
- Pip (in conda env): webrtcvad
- System: libportaudio2, xdotool
- Dev: pytest
Conventions
- Shell wrappers go in `.sh` files using `mamba run -n whisper-ollama`
- Shared code lives in `sttlib/`; scripts are thin entry points that import from it
- Whisper model loading always has GPU (cuda/float16) -> CPU (cpu/int8) fallback
- `CT2_CUDA_ALLOW_FP16=1` is set by `sttlib.whisper_loader` at import time
- Don't print output for non-actionable events
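The fallback convention above can be sketched as follows, assuming faster-whisper's `WhisperModel(size, device=..., compute_type=...)` constructor; the actual `sttlib.whisper_loader` code may differ in structure:

```python
import os

# Set before any model is constructed, mirroring sttlib.whisper_loader's import-time behavior.
os.environ.setdefault("CT2_CUDA_ALLOW_FP16", "1")

def load_model(model_size: str = "base"):
    """GPU (cuda/float16) first, CPU (cpu/int8) fallback -- sketch of the convention above."""
    from faster_whisper import WhisperModel  # deferred import: keeps this file importable without the lib
    try:
        return WhisperModel(model_size, device="cuda", compute_type="float16")
    except Exception:
        # No CUDA runtime, no GPU, or out of GPU memory: int8 keeps CPU inference usable.
        return WhisperModel(model_size, device="cpu", compute_type="int8")
```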
Preferences
- Prefer packages available via apt over building from source
- Check availability before recommending a dependency
- Prefer snappy/responsive defaults over cautious ones
- Avoid over-engineering — keep scripts simple and focused