Add voice-to-xdotool: hands-free speech typing via VAD + Whisper + xdotool
New tool that uses webrtcvad for voice activity detection, faster-whisper for transcription, and xdotool to type into any focused window. Supports session-based listening, a configurable silence threshold, and a "full stop" magic word to auto-submit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -1,6 +1,14 @@
# Purpose
speech to text command line utility by leveraging off ollama a local speech-to-text model
Speech-to-text command line utilities leveraging local models (faster-whisper, Ollama).
## Tools
| Script | Wrapper | Description |
|--------|---------|-------------|
| `assistant.py` | `talk.sh` | Transcribe speech, copy to clipboard, optionally send to Ollama LLM |
| `voice_to_terminal.py` | `terminal.sh` | Voice-controlled terminal — AI suggests and executes bash commands |
| `voice_to_dotool.py` | `dotool.sh` | Hands-free voice typing into any focused window via xdotool (VAD-based) |
## Setup
@@ -15,5 +23,44 @@ mamba activate whisper-ollama
# Note: portaudio is required for sounddevice to work on Linux
sudo apt-get update && sudo apt-get install libportaudio2 -y
pip install faster-whisper sounddevice numpy pyperclip requests
pip install faster-whisper sounddevice numpy pyperclip requests webrtcvad
```
## xdotool setup (required for voice_to_dotool.py)
xdotool simulates keyboard input via X11; it is already installed on most Linux desktops.
```bash
# Install if not already present
sudo apt-get install xdotool
```
Note: xdotool is X11-only. For Wayland, swap to ydotool (`sudo apt install ydotool`), which additionally requires its `ydotoold` daemon to be running.
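The typing step itself reduces to a subprocess call. Below is a minimal sketch using hypothetical helper names (`build_type_command`, `type_text` are illustrative, not the actual API of `voice_to_dotool.py`), which also shows what the ydotool swap would look like:

```python
import subprocess

def build_type_command(text: str, wayland: bool = False) -> list[str]:
    """Build the command that types `text` into the focused window.

    Illustrative helper, not the real internals of voice_to_dotool.py.
    """
    if wayland:
        # ydotool has no --clearmodifiers flag; it types the text as-is
        return ["ydotool", "type", text]
    # --clearmodifiers releases any held modifier keys before typing
    return ["xdotool", "type", "--clearmodifiers", text]

def type_text(text: str, wayland: bool = False) -> None:
    # Raises CalledProcessError if the typing tool is missing or fails
    subprocess.run(build_type_command(text, wayland), check=True)
```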
## Usage: voice_to_dotool.py
Hands-free speech input — uses VAD to auto-detect when you start/stop speaking, transcribes with Whisper, and types the text into the focused window via xdotool.
```bash
# Basic: type transcribed text (you press Enter to submit)
./dotool.sh
# Auto-submit: also presses Enter after typing
./dotool.sh --submit
# Adjust silence threshold (seconds of silence to end an utterance)
./dotool.sh --silence-threshold 2.0
# Use a smaller/faster Whisper model
./dotool.sh --model-size base
# All options
./dotool.sh --submit --silence-threshold 1.5 --model-size medium --vad-aggressiveness 3
```
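The flags above map naturally onto `argparse`. A sketch of that interface follows; the defaults shown here are assumptions for illustration, not necessarily the script's actual defaults:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Flags mirror the usage examples above; defaults are assumed
    p = argparse.ArgumentParser(
        description="Hands-free voice typing via VAD + Whisper + xdotool")
    p.add_argument("--submit", action="store_true",
                   help="also press Enter after typing each utterance")
    p.add_argument("--silence-threshold", type=float, default=1.5,
                   help="seconds of silence that end an utterance")
    p.add_argument("--model-size", default="medium",
                   help="faster-whisper model size (tiny/base/small/medium/large)")
    p.add_argument("--vad-aggressiveness", type=int, default=3, choices=range(4),
                   help="webrtcvad aggressiveness, 0 (least) to 3 (most)")
    return p

# Example: parse the flags from the second usage example above
args = build_parser().parse_args(["--submit", "--silence-threshold", "2.0"])
```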
### Workflow
1. Press Enter to start a listening session
2. Speak — VAD detects speech automatically
3. Pause — after the silence threshold, text is transcribed and typed
4. Keep speaking for more utterances, or press Enter to end the session
5. Press Ctrl+C to quit