Project: speech-to-text tools
Command-line speech-to-text utilities built on local models (faster-whisper, Ollama).
Environment
- Debian Bookworm, kernel 6.1, X11
- Conda env: `whisper-ollama` (Python 3.10, CUDA 12.2); mamba must be initialized before use: run `eval "$(micromamba shell hook -s bash)"`
- GPU: NVIDIA (float16 capable)
- xdotool installed for keyboard simulation (X11 only)
Tools
- `assistant.py` / `talk.sh`: transcribe speech, copy to clipboard, optionally send to Ollama
- `voice_to_terminal.py` / `terminal.sh`: voice-controlled terminal via Ollama tool calling
- `voice_to_xdotool.py` / `xdotool.sh`: hands-free voice typing into any focused window (VAD + xdotool)
Shared Library
- `sttlib/`: shared package used by all scripts and importable by other projects
  - `whisper_loader.py`: model loading with GPU→CPU fallback
  - `audio.py`: press-enter recording, PCM conversion
  - `transcription.py`: Whisper transcribe wrapper, hallucination filter
  - `vad.py`: VADProcessor, audio callback, constants
- Other projects import via: `sys.path.insert(0, "/path/to/tool-speechtotext")`
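Since `sttlib/` is a plain package rather than an installed distribution, other projects make it importable by prepending the repo directory to `sys.path`. A minimal sketch (the placeholder path is the one given above; the commented module names mirror the file listing, but any function names would need checking against the actual modules):

```python
import sys

# Make the repo importable from another project (placeholder path as given above).
sys.path.insert(0, "/path/to/tool-speechtotext")

# sttlib's modules then resolve like any other package, e.g.:
# from sttlib import whisper_loader, transcription
```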
Testing
- Run tests: `mamba run -n whisper-ollama python -m pytest tests/`
- Use `--model-size base` for faster iteration during development
- Tests mock hardware (Whisper model, VAD, mic); no GPU/mic is needed to run them
- Audio device is available, so live mic testing is possible
- Test xdotool output by focusing a text editor window
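The hardware mocking the tests rely on can be done with stdlib `unittest.mock`; a minimal illustration of the pattern (the model object, its return shape, and the filename are stand-ins, not the actual sttlib API):

```python
from unittest import mock

# Stand-in Whisper model: no GPU, mic, or model download needed.
# The real tests patch sttlib's own objects; names here are illustrative.
fake_model = mock.Mock()
fake_model.transcribe.return_value = ([], {"language": "en"})

# Code under test would call this exactly as it calls the real model.
segments, info = fake_model.transcribe("speech.wav")  # no real audio is read
```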
Dependencies
- Conda: faster-whisper, sounddevice, numpy, pyperclip, requests, ollama
- Pip (in conda env): webrtcvad
- System: libportaudio2, xdotool
- Dev: pytest
Conventions
- Shell wrappers go in `.sh` files using `mamba run -n whisper-ollama`
- Shared code lives in `sttlib/`; scripts are thin entry points that import from it
- Whisper model loading always has GPU (cuda/float16) -> CPU (cpu/int8) fallback
- `CT2_CUDA_ALLOW_FP16=1` is set by `sttlib.whisper_loader` at import time
- Don't print output for non-actionable events
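The fallback convention above can be sketched as follows, assuming faster-whisper's `WhisperModel(size, device=..., compute_type=...)` constructor; the actual `sttlib.whisper_loader` code may differ in structure:

```python
import os

# Set before any model is constructed, mirroring sttlib.whisper_loader's import-time behavior.
os.environ.setdefault("CT2_CUDA_ALLOW_FP16", "1")

def load_model(model_size: str = "base"):
    """GPU (cuda/float16) first, CPU (cpu/int8) fallback -- sketch of the convention above."""
    from faster_whisper import WhisperModel  # deferred import: keeps this file importable without the lib
    try:
        return WhisperModel(model_size, device="cuda", compute_type="float16")
    except Exception:
        # No CUDA runtime, no GPU, or out of GPU memory: int8 keeps CPU inference usable.
        return WhisperModel(model_size, device="cpu", compute_type="int8")
```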
Preferences
- Prefer packages available via apt over building from source
- Check availability before recommending a dependency
- Prefer snappy/responsive defaults over cautious ones
- Avoid over-engineering — keep scripts simple and focused