# Project: speech-to-text tools

Speech-to-text command-line utilities leveraging local models (faster-whisper, Ollama).

## Environment

- Debian Bookworm, kernel 6.1, X11
- Conda env: `whisper-ollama` (Python 3.10, CUDA 12.2)
- mamba must be initialized before use — run: `eval "$(micromamba shell hook -s bash)"`
- GPU: NVIDIA (float16 capable)
- xdotool installed for keyboard simulation (X11 only)

## Tools

- `assistant.py` / `talk.sh` — transcribe speech, copy to clipboard, optionally send to Ollama
- `voice_to_terminal.py` / `terminal.sh` — voice-controlled terminal via Ollama tool calling
- `voice_to_xdotool.py` / `xdotool.sh` — hands-free voice typing into any focused window (VAD + xdotool)

## Shared Library

- `sttlib/` — shared package used by all scripts and importable by other projects
  - `whisper_loader.py` — model loading with GPU→CPU fallback
  - `audio.py` — press-enter recording, PCM conversion
  - `transcription.py` — Whisper transcribe wrapper, hallucination filter
  - `vad.py` — VADProcessor, audio callback, constants
- Other projects import via: `sys.path.insert(0, "/path/to/tool-speechtotext")`

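As a rough illustration of the hallucination filter in `transcription.py`, a minimal sketch — the phrase list and exact-match rule below are assumptions for illustration, not the project's actual implementation:

```python
# Minimal sketch of a hallucination filter like the one in
# sttlib/transcription.py -- the phrase set here is an assumption,
# not the project's actual list.
KNOWN_HALLUCINATIONS = {
    "thank you.",
    "thanks for watching!",
    "subtitles by the amara.org community",
}

def filter_hallucinations(texts):
    """Drop segment texts that exactly match a known Whisper hallucination."""
    return [t for t in texts if t.strip().lower() not in KNOWN_HALLUCINATIONS]
```

Whisper is known to emit such filler phrases on silent or noisy audio, which is why a post-transcription filter is useful at all.
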
## Testing

- Run tests: `mamba run -n whisper-ollama python -m pytest tests/`
- Use `--model-size base` for faster iteration during development
- Tests mock hardware (Whisper model, VAD, mic) — no GPU/mic needed to run them
- Audio device is available — live mic testing is possible
- Test xdotool output by focusing a text editor window

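The mocked-hardware style above can be sketched like this — `transcribe_text` and the fake segment shape are illustrative assumptions, not the project's actual test code:

```python
# Sketch of the mocked-hardware testing style: the Whisper model is replaced
# with a Mock, so transcription logic runs with no GPU or microphone attached.
from unittest.mock import Mock

def transcribe_text(model, audio):
    """Join segment texts the way a thin wrapper around model.transcribe might."""
    segments, _info = model.transcribe(audio)
    return " ".join(seg.text.strip() for seg in segments)

# Fake model mimicking faster-whisper's (segments, info) return shape
fake_model = Mock()
fake_model.transcribe.return_value = ([Mock(text=" hello "), Mock(text="world")], None)
```
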
## Dependencies

- Conda: faster-whisper, sounddevice, numpy, pyperclip, requests, ollama
- Pip (in conda env): webrtcvad
- System: libportaudio2, xdotool
- Dev: pytest

## Conventions

- Shell wrappers go in `.sh` files using `mamba run -n whisper-ollama`
- Shared code lives in `sttlib/` — scripts are thin entry points that import from it
- Whisper model loading always has GPU (cuda/float16) → CPU (cpu/int8) fallback
- `CT2_CUDA_ALLOW_FP16=1` is set by `sttlib.whisper_loader` at import time
- Don't print output for non-actionable events

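The fallback convention can be sketched as follows — the loader callable is injected here so the pattern runs without faster-whisper or a GPU; the real code would pass `WhisperModel`, and the function name is an assumption:

```python
# Sketch of the GPU -> CPU fallback convention: try cuda/float16 first,
# fall back to cpu/int8 on any failure. "loader" stands in for
# faster-whisper's WhisperModel so the pattern is runnable anywhere.
import os

os.environ.setdefault("CT2_CUDA_ALLOW_FP16", "1")  # mirrors sttlib.whisper_loader

def load_with_fallback(model_size, loader):
    try:
        return loader(model_size, device="cuda", compute_type="float16")
    except Exception:
        return loader(model_size, device="cpu", compute_type="int8")
```

Catching a broad `Exception` is deliberate: CUDA failures can surface as several different error types depending on driver and library state.
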
## Preferences

- Prefer packages available via apt over building from source
- Check availability before recommending a dependency
- Prefer snappy/responsive defaults over cautious ones
- Avoid over-engineering — keep scripts simple and focused