Code/python/tool-speechtotext/CLAUDE.md
local 104da381fb Refactor tool-speechtotext: extract sttlib shared library and add tests
Extract duplicated code (Whisper loading, audio recording, transcription,
VAD processing) into reusable sttlib/ package. Rewrite all 3 scripts as
thin wrappers. Add 24 unit tests with mocked hardware. Fix GPU fallback
bug in assistant.py and args.system assignment bug.
2026-02-08 00:40:31 +00:00


# Project: speech-to-text tools
Speech-to-text command line utilities leveraging local models (faster-whisper, Ollama).
## Environment
- Debian Bookworm, kernel 6.1, X11
- Conda env: `whisper-ollama` (Python 3.10, CUDA 12.2)
- mamba must be initialized before use — run: `eval "$(micromamba shell hook -s bash)"`
- GPU: NVIDIA (float16 capable)
- xdotool installed for keyboard simulation (X11 only)
## Tools
- `assistant.py` / `talk.sh` — transcribe speech, copy to clipboard, optionally send to Ollama
- `voice_to_terminal.py` / `terminal.sh` — voice-controlled terminal via Ollama tool calling
- `voice_to_xdotool.py` / `xdotool.sh` — hands-free voice typing into any focused window (VAD + xdotool)
## Shared Library
- `sttlib/` — shared package used by all scripts and importable by other projects
  - `whisper_loader.py` — model loading with GPU→CPU fallback
  - `audio.py` — press-enter recording, PCM conversion
  - `transcription.py` — Whisper transcribe wrapper, hallucination filter
  - `vad.py` — VADProcessor, audio callback, constants
- Other projects import via: `sys.path.insert(0, "/path/to/tool-speechtotext")`
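
A minimal sketch of the import pattern above, for use at the top of a script in another project. The helper name `add_sttlib_to_path` is hypothetical, and the path is the same placeholder as in the note; replace it with the real checkout location.

```python
import sys
from pathlib import Path

# Placeholder location; substitute the real path to the tool-speechtotext checkout.
TOOL_DIR = Path("/path/to/tool-speechtotext")


def add_sttlib_to_path(tool_dir: Path = TOOL_DIR) -> bool:
    """Put the directory that contains sttlib/ at the front of sys.path, once.

    Returns True if the entry was added, False if it was already present.
    """
    entry = str(tool_dir)
    if entry in sys.path:
        return False
    sys.path.insert(0, entry)
    return True


add_sttlib_to_path()
# From here on, `import sttlib` resolves, provided tool_dir really contains the package.
```

Guarding against a duplicate entry keeps `sys.path` clean when several modules in the same project run this snippet.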
## Testing
- Run tests: `mamba run -n whisper-ollama python -m pytest tests/`
- Use `--model-size base` for faster iteration during development
- Tests mock hardware (Whisper model, VAD, mic) — no GPU/mic needed to run them
- Audio device is available — live mic testing is possible
- Test xdotool output by focusing a text editor window
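
As an illustration of how a test can run with no GPU or microphone, here is a sketch that swaps the Whisper model for a `MagicMock`. The function `transcribe_text` is a hypothetical stand-in, not the actual wrapper in `sttlib/transcription.py`; the only assumption about faster-whisper is that `model.transcribe(audio)` returns a `(segments, info)` pair whose segments carry a `.text` attribute.

```python
from types import SimpleNamespace
from unittest.mock import MagicMock


def transcribe_text(model, audio) -> str:
    """Hypothetical wrapper: join the text of all segments into one string."""
    segments, _info = model.transcribe(audio)
    return " ".join(seg.text.strip() for seg in segments).strip()


def test_transcribe_text_joins_segments():
    # The mock replaces the real model, so no weights are loaded.
    model = MagicMock()
    model.transcribe.return_value = (
        [SimpleNamespace(text=" hello"), SimpleNamespace(text=" world ")],
        SimpleNamespace(language="en"),
    )

    assert transcribe_text(model, audio=b"\x00\x00") == "hello world"
    model.transcribe.assert_called_once_with(b"\x00\x00")
```

Placed in a file under `tests/`, a function like this would be collected by the pytest command listed above.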
## Dependencies
- Conda: faster-whisper, sounddevice, numpy, pyperclip, requests, ollama
- Pip (in conda env): webrtcvad
- System: libportaudio2, xdotool
- Dev: pytest
## Conventions
- Shell wrappers go in .sh files using `mamba run -n whisper-ollama`
- Shared code lives in `sttlib/` — scripts are thin entry points that import from it
- Whisper model loading always tries GPU (cuda/float16) first and falls back to CPU (cpu/int8)
- `CT2_CUDA_ALLOW_FP16=1` is set by `sttlib.whisper_loader` at import time
- Don't print output for non-actionable events
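
The GPU-to-CPU fallback convention could look roughly like the sketch below. This is not the code in `sttlib/whisper_loader.py`: the names `FALLBACK_CHAIN` and `load_with_fallback` are hypothetical, and the model constructor is passed in as `factory` so the logic can be exercised without faster-whisper installed. Using `setdefault` for the environment variable is also an assumption; the note only says the variable is set at import time.

```python
import os
from typing import Any, Callable

# Set at import time, as the convention describes. setdefault (an assumption)
# leaves any value the user has already exported untouched.
os.environ.setdefault("CT2_CUDA_ALLOW_FP16", "1")

# (device, compute_type) pairs tried in order: GPU first, then CPU.
FALLBACK_CHAIN = [("cuda", "float16"), ("cpu", "int8")]


def load_with_fallback(model_size: str, factory: Callable[..., Any]) -> Any:
    """Return the first model that loads, trying each device in FALLBACK_CHAIN."""
    last_error = None
    for device, compute_type in FALLBACK_CHAIN:
        try:
            return factory(model_size, device=device, compute_type=compute_type)
        except Exception as exc:
            # Fall through silently; a GPU failure is not actionable for the user.
            last_error = exc
    raise RuntimeError(f"could not load {model_size!r} on any device") from last_error
```

With faster-whisper installed, the call would be `load_with_fallback("base", WhisperModel)` after `from faster_whisper import WhisperModel`. The fallback path prints nothing, in line with the rule about non-actionable events.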
## Preferences
- Prefer packages available via apt over building from source
- Check availability before recommending a dependency
- Prefer snappy/responsive defaults over cautious ones
- Avoid over-engineering — keep scripts simple and focused