Refactor tool-speechtotext: extract sttlib shared library and add tests
Extract duplicated code (Whisper loading, audio recording, transcription, VAD processing) into reusable sttlib/ package. Rewrite all 3 scripts as thin wrappers. Add 24 unit tests with mocked hardware. Fix GPU fallback bug in assistant.py and args.system assignment bug.
@@ -12,11 +12,20 @@ Speech-to-text command line utilities leveraging local models (faster-whisper, O
## Tools

- `assistant.py` / `talk.sh` — transcribe speech, copy to clipboard, optionally send to Ollama
- `voice_to_terminal.py` / `terminal.sh` — voice-controlled terminal via Ollama tool calling
- `voice_to_xdotool.py` / `xdotool.sh` — hands-free voice typing into any focused window (VAD + xdotool)
## Shared Library

- `sttlib/` — shared package used by all scripts and importable by other projects
  - `whisper_loader.py` — model loading with GPU→CPU fallback
  - `audio.py` — press-enter recording, PCM conversion
  - `transcription.py` — Whisper transcribe wrapper, hallucination filter
  - `vad.py` — VADProcessor, audio callback, constants
- Other projects import via: `sys.path.insert(0, "/path/to/tool-speechtotext")`
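The recording path described above hands float32 sample frames (as sounddevice delivers them) to consumers that expect 16-bit PCM. A minimal sketch of that conversion — the function name `float_to_pcm16` is illustrative, not sttlib's actual API:

```python
import numpy as np

def float_to_pcm16(frames: np.ndarray) -> bytes:
    """Convert float32 samples in [-1.0, 1.0] to 16-bit little-endian PCM bytes."""
    clipped = np.clip(frames, -1.0, 1.0)           # guard against out-of-range samples
    return (clipped * 32767.0).astype(np.int16).tobytes()
```

This byte format (16-bit mono PCM) is what webrtcvad's frame-based API consumes.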
## Testing

- Run a script manually: `mamba run -n whisper-ollama python <script.py> --model-size base`
- Run the unit tests: `mamba run -n whisper-ollama python -m pytest tests/`
- Use `--model-size base` for faster iteration during development
- Tests mock the hardware (Whisper model, VAD, mic) — no GPU or microphone is needed to run them
- An audio device is available, so live mic testing is possible
- Test xdotool output by focusing a text editor window
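The mocked-hardware approach can be illustrated with a `unittest.mock.Mock` standing in for the loaded Whisper model. The `transcribe` wrapper and the segment shape below are simplified illustrations, not sttlib's actual interfaces (faster-whisper's `transcribe` does return a `(segments, info)` pair whose segments carry a `.text` attribute):

```python
from unittest import mock

def transcribe(model, audio_bytes):
    # Simplified stand-in for a transcription wrapper: join segment texts,
    # skipping empty segments.
    segments, _info = model.transcribe(audio_bytes)
    return " ".join(seg.text.strip() for seg in segments if seg.text.strip())

def test_transcribe_without_gpu_or_mic():
    fake_model = mock.Mock()
    seg = mock.Mock(text=" hello world ")
    fake_model.transcribe.return_value = ([seg], None)
    assert transcribe(fake_model, b"") == "hello world"
```

Because the model is a `Mock`, such a test exercises the text-handling logic with no CUDA, no model weights, and no audio device.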
@@ -24,12 +33,13 @@ Speech-to-text command line utilities leveraging local models (faster-whisper, O
- Conda: faster-whisper, sounddevice, numpy, pyperclip, requests, ollama
- Pip (in conda env): webrtcvad
- System: libportaudio2, xdotool
- Dev: pytest
## Conventions

- Shell wrappers go in `.sh` files using `mamba run -n whisper-ollama`
- Shared code lives in `sttlib/` — scripts are thin entry points that import from it
- `CT2_CUDA_ALLOW_FP16=1` is set by `sttlib.whisper_loader` at import time
- Whisper model loading always has GPU (cuda/float16) → CPU (cpu/int8) fallback
- Don't print output for non-actionable events
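The GPU→CPU fallback convention amounts to "try cuda/float16, and on any failure retry with cpu/int8". A sketch with the model class injected so it runs without faster-whisper installed — in sttlib the class would be `faster_whisper.WhisperModel`, and `load_whisper` here is an illustrative name, not the actual function:

```python
def load_whisper(model_size, model_cls):
    """Try GPU first (cuda/float16); fall back to CPU (cpu/int8) on failure."""
    try:
        return model_cls(model_size, device="cuda", compute_type="float16")
    except Exception as exc:
        # Fallback notice is actionable (explains why transcription is slower)
        print(f"GPU load failed ({exc}); falling back to CPU")
        return model_cls(model_size, device="cpu", compute_type="int8")
```

Catching a broad `Exception` is deliberate: CUDA initialization can fail in several distinct ways (missing driver, out of memory, unsupported compute type), and all of them should degrade to CPU rather than crash.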
## Preferences