Alexa, Siri, and Google Assistant are convenient β but every word you say gets shipped to a server you donβt control. What if you could build the same thing, running entirely on your own hardware, with zero cloud dependency?
Thatβs exactly what weβre building today. A local voice assistant that listens through your microphone, transcribes speech with OpenAIβs Whisper, thinks with a local LLM via Ollama, and speaks the answer back to you. No API keys, no subscriptions, no data leaving your machine.
Architecture
The pipeline is straightforward:
π€ Microphone
β
π Whisper (speech-to-text, runs locally)
β
π§ Ollama (LLM generates a response)
β
π pyttsx3 / edge-tts (text-to-speech)
β
π§ Speaker
Each piece is swappable. You could replace Whisper with faster-whisper for speed, swap Ollama models on the fly, or switch TTS engines depending on your platform. The glue between them is a short Python script.
Prerequisites
Before we start, make sure you have:
- Python 3.10+ installed
- Ollama installed and running β if youβre new to it, check out our complete Ollama guide first
- A working microphone
- Around 8 GB of RAM minimum (16 GB recommended for comfortable model loading β see best AI models under 16 GB VRAM)
Pull a model in Ollama before continuing:
ollama pull llama3.2
Step 1: Install Whisper
OpenAIβs Whisper runs locally and handles speech-to-text. Weβll also need sounddevice and scipy to capture audio from the microphone.
pip install openai-whisper sounddevice scipy numpy
Whisper comes in several sizes. For a voice assistant where latency matters, base or small are the sweet spot. The base model is about 140 MB and transcribes in near real-time on most machines:
import whisper
model = whisper.load_model("base")
result = model.transcribe("audio.wav")
print(result["text"])
If you have a decent GPU, you can bump up to small or medium for better accuracy. On CPU-only setups, stick with base β itβs surprisingly good for English.
Tip: For even faster transcription, consider
faster-whisperwhich uses CTranslate2 under the hood and can be 4x faster than the original.
Step 2: Set Up Ollama
Ollama serves as the brain. It takes the transcribed text and generates a response. If you followed the prerequisite, you already have a model pulled.
Test it from the command line:
ollama run llama3.2 "What is the capital of France?"
In Python, we talk to Ollama through its local HTTP API:
import requests
def ask_ollama(prompt, model="llama3.2"):
response = requests.post("http://localhost:11434/api/generate", json={
"model": model,
"prompt": prompt,
"stream": False
})
return response.json()["response"]
Thatβs it. No API keys, no tokens, no rate limits. Just a local HTTP call. For a deeper dive into what models work best for this kind of task, see our guide on the cheapest way to run AI locally.
Step 3: Text-to-Speech
For speaking the response back, pyttsx3 is the simplest option β it works offline and cross-platform with no extra downloads:
pip install pyttsx3
import pyttsx3
engine = pyttsx3.init()
engine.say("Hello, I am your local voice assistant.")
engine.runAndWait()
pyttsx3 uses your OSβs built-in speech engine (SAPI5 on Windows, NSSpeechSynthesizer on macOS, espeak on Linux). The voice quality is functional but robotic.
If you want more natural-sounding speech and donβt mind a slightly heavier dependency, edge-tts is an excellent alternative:
pip install edge-tts playsound
import edge_tts, asyncio
async def speak(text):
communicate = edge_tts.Communicate(text, "en-US-AriaNeural")
await communicate.save("response.mp3")
asyncio.run(speak("Hello from edge TTS"))
Note that edge-tts does make network calls to Microsoftβs edge services, so itβs not fully local. For a 100% offline setup, stick with pyttsx3.
Step 4: Wire It All Together
Hereβs the complete script. It records from your microphone, transcribes with Whisper, sends the text to Ollama, and speaks the response:
import whisper
import sounddevice as sd
import scipy.io.wavfile as wav
import numpy as np
import requests
import pyttsx3
import tempfile
import os
# --- Config ---
WHISPER_MODEL = "base"
OLLAMA_MODEL = "llama3.2"
SAMPLE_RATE = 16000
RECORD_SECONDS = 5
OLLAMA_URL = "http://localhost:11434/api/generate"
# --- Load models ---
print("Loading Whisper model...")
whisper_model = whisper.load_model(WHISPER_MODEL)
tts_engine = pyttsx3.init()
def record_audio(duration=RECORD_SECONDS):
"""Record audio from the microphone."""
print(f"π€ Listening for {duration} seconds...")
audio = sd.rec(int(duration * SAMPLE_RATE), samplerate=SAMPLE_RATE,
channels=1, dtype="float32")
sd.wait()
return audio.flatten()
def transcribe(audio):
"""Transcribe audio using Whisper."""
tmp = tempfile.mktemp(suffix=".wav")
wav.write(tmp, SAMPLE_RATE, (audio * 32767).astype(np.int16))
result = whisper_model.transcribe(tmp)
os.remove(tmp)
return result["text"].strip()
def ask_ollama(prompt):
"""Send prompt to Ollama and return the response."""
response = requests.post(OLLAMA_URL, json={
"model": OLLAMA_MODEL,
"prompt": prompt,
"stream": False
})
return response.json()["response"]
def speak(text):
"""Speak text using pyttsx3."""
print(f"π {text}")
tts_engine.say(text)
tts_engine.runAndWait()
def main():
print("Voice assistant ready. Press Ctrl+C to quit.\n")
while True:
try:
audio = record_audio()
text = transcribe(audio)
if not text or len(text) < 2:
continue
print(f"π You said: {text}")
response = ask_ollama(text)
speak(response)
print()
except KeyboardInterrupt:
print("\nGoodbye!")
break
if __name__ == "__main__":
main()
Save this as voice_assistant.py and run it:
python voice_assistant.py
The assistant will record 5 seconds of audio each loop, transcribe it, get a response from Ollama, and read it back to you. Simple, private, and entirely local.
Improvements
This basic version works, but thereβs plenty of room to make it better:
Wake word detection β Instead of recording in a fixed loop, use a library like pvporcupine or openwakeword to trigger recording only when you say a keyword like βHey assistant.β This makes it feel much more natural.
Streaming responses β Right now we wait for Ollama to finish generating before speaking. You can stream the Ollama response token by token and start TTS as soon as the first sentence is complete. This cuts perceived latency dramatically.
Conversation memory β The current script sends each prompt in isolation. Wrap it with a conversation history that passes previous exchanges to Ollama for context-aware responses.
Voice activity detection β Replace the fixed 5-second recording with webrtcvad or silero-vad to detect when you stop talking and automatically end the recording.
Better TTS β Look into Coqui TTS or Piper for high-quality, fully offline text-to-speech with natural-sounding voices.
For a more polished take on this concept with additional features, check out our private voice assistant with Ollama tutorial.
Wrapping Up
You now have a fully functional voice assistant that never phones home. Whisper handles the ears, Ollama provides the brain, and pyttsx3 gives it a voice β all running on your hardware.
The beauty of this setup is modularity. Swap base for medium when you need better transcription. Switch from llama3.2 to mistral or phi-3 depending on your task. Replace pyttsx3 with a neural TTS engine when you want a more human voice.
No cloud bills. No privacy concerns. Just your machine, doing what you tell it to.
Related Links
- Ollama Complete Guide (2026)
- Build a Private Voice Assistant with Ollama
- Best AI Models Under 16 GB VRAM
- Cheapest Way to Run AI Locally (2026)