New tool that uses webrtcvad for voice activity detection, faster-whisper for transcription, and xdotool to type into any focused window. Supports session-based listening, configurable silence threshold, and a "full stop" magic word to auto-submit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Purpose
Speech-to-text command line utilities leveraging local models (faster-whisper, Ollama).
Tools
| Script | Wrapper | Description |
|---|---|---|
| assistant.py | talk.sh | Transcribe speech, copy to clipboard, optionally send to Ollama LLM |
| voice_to_terminal.py | terminal.sh | Voice-controlled terminal: AI suggests and executes bash commands |
| voice_to_dotool.py | dotool.sh | Hands-free voice typing into any focused window via xdotool (VAD-based) |
Setup
# Create the environment with Python 3.10 and CUDA toolkit
mamba create -n whisper-ollama python=3.10 nvidia/label/cuda-12.2.0::cuda-toolkit cudnn -c nvidia -c conda-forge -y
# Activate the environment
mamba activate whisper-ollama
# Install the system audio library and the Python dependencies
# Note: PortAudio (libportaudio2) is required for sounddevice to work on Linux
sudo apt-get update && sudo apt-get install libportaudio2 -y
pip install faster-whisper sounddevice numpy pyperclip requests webrtcvad
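To confirm that the environment picked up every package from the pip line above, a small check along these lines may help. It is a sketch, not part of the repository: the helper name `missing_modules` is hypothetical, and the list holds the import names that correspond to the pip package names (for example, `faster-whisper` is imported as `faster_whisper`).

```python
from importlib.util import find_spec

# Import names matching the packages in the pip install line above.
REQUIRED_MODULES = [
    "faster_whisper",
    "sounddevice",
    "numpy",
    "pyperclip",
    "requests",
    "webrtcvad",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the active environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All Python dependencies found.")
```

Run it inside the activated whisper-ollama environment. It only checks that the modules can be located; it does not test the microphone or CUDA.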
xdotool setup (required for voice_to_dotool.py)
xdotool simulates keyboard input via X11. It may already be installed on your Linux desktop.
# Install if not already present
sudo apt-get install xdotool
Note: xdotool is X11-only. For Wayland, swap to ydotool (sudo apt install ydotool).
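As an illustration of how a script can drive xdotool from Python, the sketch below builds the `xdotool type` and `xdotool key Return` commands and runs them with `subprocess`. The function names are hypothetical and this is not necessarily how voice_to_dotool.py does it. It covers X11 only, since ydotool takes different arguments.

```python
import shutil
import subprocess


def build_type_command(text):
    """Build the argv list that types `text` into the focused window."""
    # "--" ends option parsing, so text that starts with "-" is typed literally.
    return ["xdotool", "type", "--clearmodifiers", "--", text]


def build_submit_command():
    """Build the argv list that presses the Enter key."""
    return ["xdotool", "key", "Return"]


def type_text(text, submit=False):
    """Type text via xdotool, optionally pressing Enter afterwards."""
    if shutil.which("xdotool") is None:
        raise RuntimeError("xdotool not found on PATH; install it first")
    subprocess.run(build_type_command(text), check=True)
    if submit:
        subprocess.run(build_submit_command(), check=True)
```

Passing the command as a list avoids shell quoting problems when the transcribed text contains quotes or other special characters.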
Usage: voice_to_dotool.py
Hands-free speech input: VAD detects when you start and stop speaking, Whisper transcribes the utterance, and xdotool types the text into the focused window.
# Basic: type transcribed text (you press Enter to submit)
./dotool.sh
# Auto-submit: also presses Enter after typing
./dotool.sh --submit
# Adjust silence threshold (seconds of silence to end an utterance)
./dotool.sh --silence-threshold 2.0
# Use a smaller/faster Whisper model
./dotool.sh --model-size base
# All options
./dotool.sh --submit --silence-threshold 1.5 --model-size medium --vad-aggressiveness 3
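The flags shown above could be parsed with `argparse` roughly as follows. This is a sketch under assumptions: the flag names come from the examples, but the default values here are placeholders and may differ from those in voice_to_dotool.py. The 0 to 3 range for aggressiveness is the range webrtcvad accepts.

```python
import argparse


def build_parser():
    """Build a parser for the options shown in the usage examples."""
    parser = argparse.ArgumentParser(description="Voice typing via xdotool")
    parser.add_argument("--submit", action="store_true",
                        help="press Enter after typing the text")
    # Default values below are assumptions for illustration only.
    parser.add_argument("--silence-threshold", type=float, default=1.5,
                        help="seconds of silence that end an utterance")
    parser.add_argument("--model-size", default="medium",
                        help="Whisper model size, e.g. base or medium")
    parser.add_argument("--vad-aggressiveness", type=int, choices=range(4),
                        default=3, help="webrtcvad mode, 0 (least) to 3 (most)")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```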
Workflow
- Press Enter to start a listening session
- Speak — VAD detects speech automatically
- Pause — after the silence threshold, text is transcribed and typed
- Keep speaking for more utterances, or press Enter to end the session
- Ctrl+C to quit
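The speak-then-pause steps above amount to grouping audio frames into utterances that end once silence has lasted for the threshold. The sketch below shows that grouping on a list of per-frame speech flags. It is an assumption about the general technique, not the script's actual code, and `split_utterances` is a hypothetical name. In a real loop the flags would come from `webrtcvad.Vad(aggressiveness).is_speech(frame_bytes, sample_rate)`, which accepts 10, 20, or 30 ms frames of 16-bit mono PCM.

```python
def split_utterances(speech_flags, frame_ms=30, silence_threshold=1.5):
    """Group per-frame speech flags into (start, end) frame ranges.

    `end` is exclusive and points just past the last speech frame.
    An utterance ends once silence has lasted `silence_threshold` seconds.
    """
    silence_frames_needed = int(silence_threshold * 1000 / frame_ms)
    utterances = []
    start = None        # index of the first speech frame of the open utterance
    last_speech = None  # index of the most recent speech frame
    for i, is_speech in enumerate(speech_flags):
        if is_speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech >= silence_frames_needed:
            # Enough trailing silence: close the utterance.
            utterances.append((start, last_speech + 1))
            start = None
    if start is not None:
        # The stream ended while an utterance was still open.
        utterances.append((start, last_speech + 1))
    return utterances
```

Short pauses below the threshold stay inside one utterance, which is why raising `--silence-threshold` helps if sentences are being cut off mid-thought.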