# Purpose

Speech-to-text command-line utilities leveraging local models (faster-whisper, Ollama).

## Tools

| Script | Wrapper | Description |
|--------|---------|-------------|
| `assistant.py` | `talk.sh` | Transcribe speech, copy to clipboard, optionally send to Ollama LLM |
| `voice_to_terminal.py` | `terminal.sh` | Voice-controlled terminal: AI suggests and executes bash commands |
| `voice_to_dotool.py` | `dotool.sh` | Hands-free voice typing into any focused window via xdotool (VAD-based) |

## Setup

```bash
# Create the environment with Python 3.10 and CUDA toolkit
mamba create -n whisper-ollama python=3.10 nvidia/label/cuda-12.2.0::cuda-toolkit cudnn -c nvidia -c conda-forge -y

# Activate the environment
mamba activate whisper-ollama

# Install Audio and Logic dependencies
# Note: portaudio is required for sounddevice to work on Linux
sudo apt-get update && sudo apt-get install libportaudio2 -y
pip install faster-whisper sounddevice numpy pyperclip requests webrtcvad
```

## xdotool setup (required for voice_to_dotool.py)

xdotool simulates keyboard input via X11. Already installed on most Linux desktops.

```bash
# Install if not already present
sudo apt-get install xdotool
```

Note: xdotool is X11-only. For Wayland, swap to ydotool (`sudo apt install ydotool`).

## Usage: voice_to_dotool.py

Hands-free speech input: uses VAD to auto-detect when you start/stop speaking, transcribes with Whisper, and types the text into the focused window via xdotool.

```bash
# Basic: type transcribed text (you press Enter to submit)
./dotool.sh

# Auto-submit: also presses Enter after typing
./dotool.sh --submit

# Adjust silence threshold (seconds of silence to end an utterance)
./dotool.sh --silence-threshold 2.0

# Use a smaller/faster Whisper model
./dotool.sh --model-size base

# All options
./dotool.sh --submit --silence-threshold 1.5 --model-size medium --vad-aggressiveness 3
```

### Workflow

1. Press Enter to start a listening session
2. Speak: VAD detects speech automatically
3. Pause: after the silence threshold, text is transcribed and typed
4. Keep speaking for more utterances, or press Enter to end the session
5. Ctrl+C to quit
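
The speak/pause steps of the workflow can be pictured as a small state machine over per-frame VAD decisions. The sketch below is an illustration only, not the code in `voice_to_dotool.py`: the function name `segment_utterances`, its parameters, and the default values are hypothetical. It assumes each audio frame has already been classified as speech or silence (for example by webrtcvad, which works on 10, 20, or 30 ms frames) and shows how a silence threshold like `--silence-threshold` could close an utterance.

```python
from typing import Iterable, Iterator, Optional, Tuple


def segment_utterances(
    frame_flags: Iterable[bool],
    frame_ms: int = 30,                # assumed frame length in milliseconds
    silence_threshold_s: float = 1.5,  # assumed default, mirrors --silence-threshold
) -> Iterator[Tuple[int, int]]:
    """Yield (first_speech_frame, last_speech_frame) index pairs, one per utterance.

    frame_flags holds one boolean per audio frame: True means the VAD
    classified that frame as speech.
    """
    # Number of consecutive silent frames that ends an utterance.
    max_silent = max(1, round(silence_threshold_s * 1000 / frame_ms))

    start: Optional[int] = None        # index of the first speech frame, if any
    last_speech: Optional[int] = None  # index of the most recent speech frame
    silent_run = 0                     # consecutive silent frames inside an utterance

    for i, is_speech in enumerate(frame_flags):
        if is_speech:
            if start is None:
                start = i              # speech begins: open a new utterance
            last_speech = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= max_silent:
                # Pause reached the threshold: close the utterance.
                yield (start, last_speech)
                start, silent_run = None, 0

    # Audio ended while an utterance was still open.
    if start is not None:
        yield (start, last_speech)


if __name__ == "__main__":
    flags = [False, True, True, False, True, False, False, False, True, True, False]
    print(list(segment_utterances(flags, frame_ms=100, silence_threshold_s=0.3)))
```

In a real script, each yielded span would presumably select the audio frames handed to Whisper for transcription. A short pause below the threshold (the single silent frame in the demo) stays inside the same utterance, which is why raising the threshold makes the tool more tolerant of mid-sentence hesitation.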