Project: speech-to-text tools
Speech-to-text command-line utilities built on local models (faster-whisper, Ollama).
Environment
- Debian Bookworm, kernel 6.1, X11
- Conda env: `whisper-ollama` (Python 3.10, CUDA 12.2)
- mamba must be initialized before use; run `eval "$(micromamba shell hook -s bash)"`
- GPU: NVIDIA (float16 capable)
- xdotool installed for keyboard simulation (X11 only)
Tools
- `assistant.py` / `talk.sh`: transcribe speech, copy it to the clipboard, optionally send it to Ollama
- `voice_to_terminal.py` / `terminal.sh`: voice-controlled terminal via Ollama tool calling
- `voice_to_xdotool.py` / `dotool.sh`: hands-free voice typing into any focused window (webrtcvad VAD + xdotool); supports session-based listening, a configurable silence threshold, and a "full stop" magic word to auto-submit (see the sketch below)
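
A minimal sketch of the VAD-gated capture loop behind the xdotool tool, assuming 16 kHz mono input; the names `listen_once` and `silence_frames` are illustrative, not the script's actual API:

```python
import queue
import subprocess

import numpy as np
import sounddevice as sd
import webrtcvad
from faster_whisper import WhisperModel

SAMPLE_RATE = 16000              # webrtcvad supports 8/16/32/48 kHz
FRAME_MS = 30                    # webrtcvad frames must be 10/20/30 ms
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

vad = webrtcvad.Vad(2)           # aggressiveness 0 (lenient) to 3 (strict)
frames: queue.Queue = queue.Queue()

def _on_audio(indata, frame_count, time_info, status):
    frames.put(bytes(indata))    # raw int16 PCM bytes for the VAD

def listen_once(silence_frames: int = 25) -> bytes:
    """Capture one utterance: voiced frames until ~750 ms of trailing silence."""
    voiced, silent = [], 0
    with sd.RawInputStream(samplerate=SAMPLE_RATE, blocksize=FRAME_SAMPLES,
                           channels=1, dtype="int16", callback=_on_audio):
        while True:
            frame = frames.get()
            if vad.is_speech(frame, SAMPLE_RATE):
                voiced.append(frame)
                silent = 0
            elif voiced:
                silent += 1
                if silent > silence_frames:
                    return b"".join(voiced)

model = WhisperModel("base", device="cpu", compute_type="int8")  # see fallback sketch under Conventions
pcm = np.frombuffer(listen_once(), dtype=np.int16).astype(np.float32) / 32768.0
segments, _ = model.transcribe(pcm)
text = "".join(s.text for s in segments).strip()
if text:
    subprocess.run(["xdotool", "type", "--delay", "0", "--", text])
```

The silence threshold is just a count of consecutive non-speech frames (25 frames × 30 ms ≈ 750 ms here), which is what makes it cheap to expose as a configurable option.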
Testing
- To test a script: `mamba run -n whisper-ollama python <script.py> --model-size base`
- Use `--model-size base` for faster iteration during development
- Audio device is available; live mic testing is possible
- Test xdotool output by focusing a text editor window
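
A hypothetical smoke test for the xdotool path (the two-second pause and marker string are illustrative): run it, then focus a text editor before the pause ends.

```python
import subprocess
import time

time.sleep(2)  # switch focus to a text editor window during this pause
subprocess.run(["xdotool", "type", "--delay", "0", "--", "xdotool smoke test OK"])
```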
Dependencies
- Conda: faster-whisper, sounddevice, numpy, pyperclip, requests, ollama
- Pip (in conda env): webrtcvad
- System: libportaudio2, xdotool
Conventions
- Shell wrappers go in `.sh` files using `mamba run -n whisper-ollama`
- All scripts set `CT2_CUDA_ALLOW_FP16=1`
- Whisper model loading always has a GPU (cuda/float16) -> CPU (cpu/int8) fallback (see the sketch after this list)
- Keep scripts self-contained (no shared module)
- Don't print output for non-actionable events
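
A minimal sketch of the loading convention above, using faster-whisper's `WhisperModel` API; `load_model` is an illustrative name, not a shared helper (scripts stay self-contained):

```python
import os
import sys

from faster_whisper import WhisperModel

os.environ.setdefault("CT2_CUDA_ALLOW_FP16", "1")  # per the convention above

def load_model(model_size: str) -> WhisperModel:
    # Try the GPU first; fall back to int8 on CPU if CUDA is missing or fails.
    try:
        return WhisperModel(model_size, device="cuda", compute_type="float16")
    except Exception as exc:
        # The fallback is worth surfacing: it changes speed and accuracy.
        print(f"GPU load failed ({exc}); falling back to CPU", file=sys.stderr)
        return WhisperModel(model_size, device="cpu", compute_type="int8")
```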
Preferences
- Prefer packages available via apt over building from source
- Check availability before recommending a dependency
- Prefer snappy/responsive defaults over cautious ones
- Avoid over-engineering — keep scripts simple and focused