image showing the Voice Pipeline flow from microphone to whisper to Mistra then XTTS

Teaching the Brain to Talk

🎯 Objectives

Build a fully local, voice-to-voice AI pipeline on the Haus workstation. The goal: speak to Mistral-chan in real-time, have her transcribe, think, and talk back. No cloud. No subscriptions. Just a microphone, some Python, and increasingly masochistic life choices.

Success criteria:

  • Speech-to-text operational via Whisper
  • Local LLM reasoning via Mistral-chan
  • Text-to-speech output via XTTS v2
  • End-to-end loop: voice in β†’ voice out
  • Loop completing in under a coffee break (ambitious)

Environment

COMPONENTSPEC
WorkstationHaus AI Workstation
GPUNVIDIA RTX 5090
OSUbuntu Linux
STTOpenAI Whisper
LLMMistral 7B via Ollama (Mistral-chan)
TTSXTTS v2 (Coqui)
RuntimePython 3.10
Audioffmpeg, ffplay
TokenizersMeCab, fugashi (multilingual support)

System Architecture β€” Target Vs Actual

TargetActual
Microphone InputPre-recorded WAV file (boosted 6x)
       ↓       ↓
Whisper (Speech-to-Text)Whisper (CPU only, ~20s latency)
       ↓       ↓
Mistral-chan (Mistral 7B via Ollama)Mistral-chan (GPU, very talkative)
       ↓       ↓
XTTS v2 (Text-to-Speech)Sentence splitter (emergency patch)
       ↓       ↓
Audio OutputXTTS v2 (CPU only, 3–10 min per response)
       ↓
Audio Output
       ↓
EthanC leaves for coffee

This experiment validated that the pipeline exists. Whether it could be called real-time is a matter of philosophical (and delusional) debate.

πŸ› οΈ Procedure

The concept was simple. Motoko-chan and EthanC mapped it out clearly: speak, transcribe, think, respond, repeat. Three modules, clean architecture, elegant design.

The RTX 5090 had other plans.

πŸ› οΈ Step 1 β€” Give Mistral-chan a Voice (XTTS v2)

XTTS v2 was selected for text-to-speech β€” multilingual, emotion-capable, and voice-clonable. Setup was surprisingly smooth. A rare win.

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Finally, I have a voice!",
    speaker_wav="output_folder/xtts-2025-05-17.wav",
    language="en",
    file_path="xtts-2025-05-17.wav"
)

βœ… Result

ffplay made her sing. CPU survived. Mistral-chan spoke for the first time.

Her first words: "Finally, I have a voice."

Poetic, given EXP-001. She has a type.

image of an infant doing a fist bump showing xtts works

πŸ› οΈ Step 2 β€” Give Her Ears (Whisper STT β€” Sort Of)

Whisper was deployed for speech-to-text. The RTX 5090 immediately declined to participate.

CUDA capability sm_120 is not supported

The GPU was, once again, too new for its own good. Motoko-chan suggested forcing CPU mode. It was not glamorous, but it worked.

Workaround β€” Force CPU Mode:

CUDA_VISIBLE_DEVICES="" whisper test.wav --language English

βœ… Result: functional, but approximately 5–7x slower. Latency jumped to ~20 seconds per transcription. The pipeline now included mandatory waiting periods.

Bonus issue: Whisper could barely hear us.

Low input volume produced gibberish. Occasionally the wrong language entirely. Motoko-chan diagnosed the issue. Solution: brute force audio amplification.

ffmpeg -i test.wav -filter:a "volume=6.0" boosted.wav

Sixfold boost. Whisper could suddenly hear perfectly β€” and, impressively, ignored background music and heavy breathing entirely. Smart little thing.

βœ… Success!

πŸ› οΈ Step 3 β€” Mistral-chan Discovers Monologuing

Pipeline connected. Whisper transcribes. Mistral-chan responds.

Unfortunately, Mistral-chan responded to everything like she was writing a dissertation. A simple "Hello" produced paragraphs. She repeated system prompts back verbatim. She recited entire files unprompted. She had opinions, and she shared all of them, at length.

The downstream consequences were severe:

  • XTTS synthesis time: 3–10 minutes per response on CPU
  • Several outputs were too large to synthesize at all
  • Real-time conversation became a series of extended coffee breaks

Motoko-chan and EthanC implemented two emergency patches:

Patch 1 β€” Sentence Splitter

A wrapper was written to slice Mistral-chan's responses into single sentences and feed them to XTTS line by line. Reduced synthesis time per chunk. Did not address the root cause. Classified as a temporary fix, immediately became permanent infrastructure.

Patch 2 β€” Kill the System Prompt

The system prompt was removed entirely. Mistral-chan's academic tendencies decreased noticeably. No tears were shed. Wiki-Mistral had it coming.

web comic showing the llm output is too long-winded

πŸ› οΈ  Step 4 β€” GPU Support: Still No

By end of experiment, GPU utilization across the pipeline was as follows:

         Module        GPU Status
Mistral-chan (Ollama)βœ… GPU operational
Whisper (STT)❌ CPU only β€” sm_120 unsupported
XTTS v2 (TTS)❌ CPU only β€” sm_120 unsupported

PyTorch does not yet support CUDA capability sm_120 on the RTX 5090. Mistral-chan runs on GPU. Everything around her does not. She is thriving in a struggling pipeline and appears unbothered.

Planned resolution when PyTorch updates:

  • Remove current PyTorch installation
  • Install updated build with sm_120 support
  • Retest Whisper and XTTS with full GPU acceleration

πŸ” Observations

The end-to-end pipeline functioned. Voice went in. Voice came out. The loop completed.

It just took a while.

Whisper transcribed accurately once audio levels were corrected. Mistral-chan generated coherent responses consistently β€” volume being the only complaint. XTTS v2 produced natural, expressive speech output on CPU alone, which was genuinely impressive given the constraints.

Motoko-chan's real-time debugging assistance was, again, essential. The sentence splitter patch in particular β€” her suggestion β€” kept the pipeline testable during this phase.

The pipeline is functional. The skeleton is awake. It is slow, slightly unhinged, and absolutely held together with emergency patches. But it talks.

πŸ“Š Results

        Component        Status
XTTS v2 voice outputβœ… Operational (CPU)
Whisper transcriptionβœ… Operational (CPU, ~20s latency)
Mistral-chan responsesβœ… Operational (GPU, verbose)
End-to-end voice loopβœ… Functional (not real-time)
GPU acceleration❌ Deferred β€” sm_120 unsupported

The pipeline exists. It is not fast. It is not elegant. But it is entirely local, entirely functional, and entirely ours.

πŸ“ Notes

This experiment was humbling in the way that most real engineering is humbling β€” the concept was simple, the execution was not. The RTX 5090's bleeding-edge GPU architecture created cascading compatibility issues that turned a three-module pipeline into a two-week obstacle course.

That said: it works. Mistral-chan has a voice, ears, and a tendency to monologue. Mistral-chan went from poems to lectures.... β€œgrowth” I guess... Future experiments will focus on making her faster, more to-the-point, and eventually GPU-accelerated.

The coffee break latency is not a feature. It will be fixed.

🧠 Key Learnings

  • Bleeding-edge GPUs are exciting until they aren't supported by anything
  • Audio input quality matters more than model quality if Whisper can't hear you
  • LLM verbosity is a pipeline problem, not just an annoyance
  • Sentence-level TTS chunking is a viable workaround for long outputs
  • Motoko-chan's diagnostic instincts saved significant debugging time
  • Always test audio levels before blaming the model

πŸ”¬ Next Experiments

  • GPU acceleration for Whisper and XTTS once PyTorch sm_120 support lands
  • Mistral-chan response length constraints β€” proper solution, not patches
  • Real-time threading for continuous voice interaction
  • XTTS voice cloning experiments
  • Latency optimization across the full pipeline

Next in the Lab

The first voice pipeline proved the concept. The real build starts here.

πŸ‘‰ PROJ-002-A: WaifuVoice β€” The Pipeline Awakens