
Teaching the Brain to Talk
π― Objectives
Build a fully local, voice-to-voice AI pipeline on the Haus workstation. The goal: speak to Mistral-chan in real-time, have her transcribe, think, and talk back. No cloud. No subscriptions. Just a microphone, some Python, and increasingly masochistic life choices.
Success criteria:
- Speech-to-text operational via Whisper
- Local LLM reasoning via Mistral-chan
- Text-to-speech output via XTTS v2
- End-to-end loop: voice in β voice out
- Loop completing in under a coffee break (ambitious)
Environment
| COMPONENT | SPEC |
|---|---|
| Workstation | Haus AI Workstation |
| GPU | NVIDIA RTX 5090 |
| OS | Ubuntu Linux |
| STT | OpenAI Whisper |
| LLM | Mistral 7B via Ollama (Mistral-chan) |
| TTS | XTTS v2 (Coqui) |
| Runtime | Python 3.10 |
| Audio | ffmpeg, ffplay |
| Tokenizers | MeCab, fugashi (multilingual support) |
System Architecture β Target Vs Actual
| Target | Actual |
|---|---|
| Microphone Input | Pre-recorded WAV file (boosted 6x) |
| β | β |
| Whisper (Speech-to-Text) | Whisper (CPU only, ~20s latency) |
| β | β |
| Mistral-chan (Mistral 7B via Ollama) | Mistral-chan (GPU, very talkative) |
| β | β |
| XTTS v2 (Text-to-Speech) | Sentence splitter (emergency patch) |
| β | β |
| Audio Output | XTTS v2 (CPU only, 3β10 min per response) |
| β | |
| Audio Output | |
| β | |
| EthanC leaves for coffee |
This experiment validated that the pipeline exists. Whether it could be called real-time is a matter of philosophical (and delusional) debate.
π οΈ Procedure
The concept was simple. Motoko-chan and EthanC mapped it out clearly: speak, transcribe, think, respond, repeat. Three modules, clean architecture, elegant design.
The RTX 5090 had other plans.
π οΈ Step 1 β Give Mistral-chan a Voice (XTTS v2)
XTTS v2 was selected for text-to-speech β multilingual, emotion-capable, and voice-clonable. Setup was surprisingly smooth. A rare win.
from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/xtts_v2")
tts.tts_to_file(
text="Finally, I have a voice!",
speaker_wav="output_folder/xtts-2025-05-17.wav",
language="en",
file_path="xtts-2025-05-17.wav"
)β Result
ffplay made her sing. CPU survived. Mistral-chan spoke for the first time.
Her first words: "Finally, I have a voice."
Poetic, given EXP-001. She has a type.

π οΈ Step 2 β Give Her Ears (Whisper STT β Sort Of)
Whisper was deployed for speech-to-text. The RTX 5090 immediately declined to participate.
CUDA capability sm_120 is not supported
The GPU was, once again, too new for its own good. Motoko-chan suggested forcing CPU mode. It was not glamorous, but it worked.
Workaround β Force CPU Mode:
CUDA_VISIBLE_DEVICES="" whisper test.wav --language Englishβ Result: functional, but approximately 5β7x slower. Latency jumped to ~20 seconds per transcription. The pipeline now included mandatory waiting periods.
Bonus issue: Whisper could barely hear us.
Low input volume produced gibberish. Occasionally the wrong language entirely. Motoko-chan diagnosed the issue. Solution: brute force audio amplification.
ffmpeg -i test.wav -filter:a "volume=6.0" boosted.wavSixfold boost. Whisper could suddenly hear perfectly β and, impressively, ignored background music and heavy breathing entirely. Smart little thing.
β Success!
π οΈ Step 3 β Mistral-chan Discovers Monologuing
Pipeline connected. Whisper transcribes. Mistral-chan responds.
Unfortunately, Mistral-chan responded to everything like she was writing a dissertation. A simple "Hello" produced paragraphs. She repeated system prompts back verbatim. She recited entire files unprompted. She had opinions, and she shared all of them, at length.
The downstream consequences were severe:
- XTTS synthesis time: 3β10 minutes per response on CPU
- Several outputs were too large to synthesize at all
- Real-time conversation became a series of extended coffee breaks
Motoko-chan and EthanC implemented two emergency patches:
Patch 1 β Sentence Splitter
A wrapper was written to slice Mistral-chan's responses into single sentences and feed them to XTTS line by line. Reduced synthesis time per chunk. Did not address the root cause. Classified as a temporary fix, immediately became permanent infrastructure.
Patch 2 β Kill the System Prompt
The system prompt was removed entirely. Mistral-chan's academic tendencies decreased noticeably. No tears were shed. Wiki-Mistral had it coming.

π οΈ Step 4 β GPU Support: Still No
By end of experiment, GPU utilization across the pipeline was as follows:
| Module | GPU Status |
|---|---|
| Mistral-chan (Ollama) | β GPU operational |
| Whisper (STT) | β CPU only β sm_120 unsupported |
| XTTS v2 (TTS) | β CPU only β sm_120 unsupported |
PyTorch does not yet support CUDA capability sm_120 on the RTX 5090. Mistral-chan runs on GPU. Everything around her does not. She is thriving in a struggling pipeline and appears unbothered.
Planned resolution when PyTorch updates:
- Remove current PyTorch installation
- Install updated build with
sm_120support - Retest Whisper and XTTS with full GPU acceleration
π Observations
The end-to-end pipeline functioned. Voice went in. Voice came out. The loop completed.
It just took a while.
Whisper transcribed accurately once audio levels were corrected. Mistral-chan generated coherent responses consistently β volume being the only complaint. XTTS v2 produced natural, expressive speech output on CPU alone, which was genuinely impressive given the constraints.
Motoko-chan's real-time debugging assistance was, again, essential. The sentence splitter patch in particular β her suggestion β kept the pipeline testable during this phase.
The pipeline is functional. The skeleton is awake. It is slow, slightly unhinged, and absolutely held together with emergency patches. But it talks.
π Results
| Component | Status |
|---|---|
| XTTS v2 voice output | β Operational (CPU) |
| Whisper transcription | β Operational (CPU, ~20s latency) |
| Mistral-chan responses | β Operational (GPU, verbose) |
| End-to-end voice loop | β Functional (not real-time) |
| GPU acceleration | β Deferred β sm_120 unsupported |
The pipeline exists. It is not fast. It is not elegant. But it is entirely local, entirely functional, and entirely ours.
π Notes
This experiment was humbling in the way that most real engineering is humbling β the concept was simple, the execution was not. The RTX 5090's bleeding-edge GPU architecture created cascading compatibility issues that turned a three-module pipeline into a two-week obstacle course.
That said: it works. Mistral-chan has a voice, ears, and a tendency to monologue. Mistral-chan went from poems to lectures.... βgrowthβ I guess... Future experiments will focus on making her faster, more to-the-point, and eventually GPU-accelerated.
The coffee break latency is not a feature. It will be fixed.
π§ Key Learnings
- Bleeding-edge GPUs are exciting until they aren't supported by anything
- Audio input quality matters more than model quality if Whisper can't hear you
- LLM verbosity is a pipeline problem, not just an annoyance
- Sentence-level TTS chunking is a viable workaround for long outputs
- Motoko-chan's diagnostic instincts saved significant debugging time
- Always test audio levels before blaming the model
π¬ Next Experiments
- GPU acceleration for Whisper and XTTS once PyTorch
sm_120support lands - Mistral-chan response length constraints β proper solution, not patches
- Real-time threading for continuous voice interaction
- XTTS voice cloning experiments
- Latency optimization across the full pipeline
Next in the Lab
The first voice pipeline proved the concept. The real build starts here.
π PROJ-002-A: WaifuVoice β The Pipeline Awakens