Voice Pipeline 🎤 | EthanC AI Lab

image showing the Voice Pipeline flow from microphone to whisper to Mistra then XTTS

Teaching the Brain to Talk

🎯 Objectives

Build a fully local, voice-to-voice AI pipeline on the Haus workstation. The goal: speak to Mistral-chan in real-time, have her transcribe, think, and talk back. No cloud. No subscriptions. Just a microphone, some Python, and increasingly masochistic life choices.

Success criteria:

Speech-to-text operational via Whisper
Local LLM reasoning via Mistral-chan
Text-to-speech output via XTTS v2
End-to-end loop: voice in → voice out
Loop completing in under a coffee break (ambitious)

Environment

COMPONENT	SPEC
Workstation	Haus AI Workstation
GPU	NVIDIA RTX 5090
OS	Ubuntu Linux
STT	OpenAI Whisper
LLM	Mistral 7B via Ollama (Mistral-chan)
TTS	XTTS v2 (Coqui)
Runtime	Python 3.10
Audio	ffmpeg, ffplay
Tokenizers	MeCab, fugashi (multilingual support)

System Architecture — Target Vs Actual

Target	Actual
Microphone Input	Pre-recorded WAV file (boosted 6x)
↓	↓
Whisper (Speech-to-Text)	Whisper (CPU only, ~20s latency)
↓	↓
Mistral-chan (Mistral 7B via Ollama)	Mistral-chan (GPU, very talkative)
↓	↓
XTTS v2 (Text-to-Speech)	Sentence splitter (emergency patch)
↓	↓
Audio Output	XTTS v2 (CPU only, 3–10 min per response)
	↓
	Audio Output
	↓
	EthanC leaves for coffee

This experiment validated that the pipeline exists. Whether it could be called real-time is a matter of philosophical (and delusional) debate.

🛠️ Procedure

The concept was simple. Motoko-chan and EthanC mapped it out clearly: speak, transcribe, think, respond, repeat. Three modules, clean architecture, elegant design.

The RTX 5090 had other plans.

🛠️ Step 1 — Give Mistral-chan a Voice (XTTS v2)

XTTS v2 was selected for text-to-speech — multilingual, emotion-capable, and voice-clonable. Setup was surprisingly smooth. A rare win.

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Finally, I have a voice!",
    speaker_wav="output_folder/xtts-2025-05-17.wav",
    language="en",
    file_path="xtts-2025-05-17.wav"
)

✅ Result

ffplay made her sing. CPU survived. Mistral-chan spoke for the first time.

Her first words: "Finally, I have a voice."

Poetic, given EXP-001. She has a type.

image of an infant doing a fist bump showing xtts works

🛠️ Step 2 — Give Her Ears (Whisper STT — Sort Of)

Whisper was deployed for speech-to-text. The RTX 5090 immediately declined to participate.

CUDA capability sm_120 is not supported

The GPU was, once again, too new for its own good. Motoko-chan suggested forcing CPU mode. It was not glamorous, but it worked.

Workaround — Force CPU Mode:

CUDA_VISIBLE_DEVICES="" whisper test.wav --language English

✅ Result: functional, but approximately 5–7x slower. Latency jumped to ~20 seconds per transcription. The pipeline now included mandatory waiting periods.

Bonus issue: Whisper could barely hear us.

Low input volume produced gibberish. Occasionally the wrong language entirely. Motoko-chan diagnosed the issue. Solution: brute force audio amplification.

ffmpeg -i test.wav -filter:a "volume=6.0" boosted.wav

Sixfold boost. Whisper could suddenly hear perfectly — and, impressively, ignored background music and heavy breathing entirely. Smart little thing.

✅ Success!

🛠️ Step 3 — Mistral-chan Discovers Monologuing

Pipeline connected. Whisper transcribes. Mistral-chan responds.

Unfortunately, Mistral-chan responded to everything like she was writing a dissertation. A simple "Hello" produced paragraphs. She repeated system prompts back verbatim. She recited entire files unprompted. She had opinions, and she shared all of them, at length.

The downstream consequences were severe:

XTTS synthesis time: 3–10 minutes per response on CPU
Several outputs were too large to synthesize at all
Real-time conversation became a series of extended coffee breaks

Motoko-chan and EthanC implemented two emergency patches:

Patch 1 — Sentence Splitter

A wrapper was written to slice Mistral-chan's responses into single sentences and feed them to XTTS line by line. Reduced synthesis time per chunk. Did not address the root cause. Classified as a temporary fix, immediately became permanent infrastructure.

Patch 2 — Kill the System Prompt

The system prompt was removed entirely. Mistral-chan's academic tendencies decreased noticeably. No tears were shed. Wiki-Mistral had it coming.

web comic showing the llm output is too long-winded

🛠️ Step 4 — GPU Support: Still No

By end of experiment, GPU utilization across the pipeline was as follows:

Module	GPU Status
Mistral-chan (Ollama)	✅ GPU operational
Whisper (STT)	❌ CPU only — sm_120 unsupported
XTTS v2 (TTS)	❌ CPU only — sm_120 unsupported

PyTorch does not yet support CUDA capability sm_120 on the RTX 5090. Mistral-chan runs on GPU. Everything around her does not. She is thriving in a struggling pipeline and appears unbothered.

Planned resolution when PyTorch updates:

Remove current PyTorch installation
Install updated build with sm_120 support
Retest Whisper and XTTS with full GPU acceleration

🔍 Observations

The end-to-end pipeline functioned. Voice went in. Voice came out. The loop completed.

It just took a while.

Whisper transcribed accurately once audio levels were corrected. Mistral-chan generated coherent responses consistently — volume being the only complaint. XTTS v2 produced natural, expressive speech output on CPU alone, which was genuinely impressive given the constraints.

Motoko-chan's real-time debugging assistance was, again, essential. The sentence splitter patch in particular — her suggestion — kept the pipeline testable during this phase.

The pipeline is functional. The skeleton is awake. It is slow, slightly unhinged, and absolutely held together with emergency patches. But it talks.

📊 Results

Component	Status
XTTS v2 voice output	✅ Operational (CPU)
Whisper transcription	✅ Operational (CPU, ~20s latency)
Mistral-chan responses	✅ Operational (GPU, verbose)
End-to-end voice loop	✅ Functional (not real-time)
GPU acceleration	❌ Deferred — sm_120 unsupported

The pipeline exists. It is not fast. It is not elegant. But it is entirely local, entirely functional, and entirely ours.

📝 Notes

This experiment was humbling in the way that most real engineering is humbling — the concept was simple, the execution was not. The RTX 5090's bleeding-edge GPU architecture created cascading compatibility issues that turned a three-module pipeline into a two-week obstacle course.

That said: it works. Mistral-chan has a voice, ears, and a tendency to monologue. Mistral-chan went from poems to lectures.... “growth” I guess... Future experiments will focus on making her faster, more to-the-point, and eventually GPU-accelerated.

The coffee break latency is not a feature. It will be fixed.

🧠 Key Learnings

Bleeding-edge GPUs are exciting until they aren't supported by anything
Audio input quality matters more than model quality if Whisper can't hear you
LLM verbosity is a pipeline problem, not just an annoyance
Sentence-level TTS chunking is a viable workaround for long outputs
Motoko-chan's diagnostic instincts saved significant debugging time
Always test audio levels before blaming the model

🔬 Next Experiments

GPU acceleration for Whisper and XTTS once PyTorch sm_120 support lands
Mistral-chan response length constraints — proper solution, not patches
Real-time threading for continuous voice interaction
XTTS voice cloning experiments
Latency optimization across the full pipeline

Next in the Lab

The first voice pipeline proved the concept. The real build starts here.

👉 PROJ-002-A: WaifuVoice — The Pipeline Awakens