A person speaking into a microphone facing an anime AI companion, connected by a glowing sound wave in a neon-lit lab environment.
The WaifuVoice vision: human speaks, AI listens, AI speaks back. All local. All real.

Phase 1: Environment, Stack Selection & First Breakthrough

🎯 Objectives

Build a fully local, real-time conversational voice loop on Haus. Not a demo. Not a proof of concept. A working companion pipeline that listens, thinks, and speaks, running entirely on local hardware.

Target interaction loop: Mic → STT → LLM → TTS → audio playback

The goal was not just "speech in, speech out." The goal was presence: a companion-like conversational entity that feels alive, responds naturally, and runs without cloud dependency.

📚 A Note on History

This project didn't start in January 2026.

It started in May 2025.

EXP-002 documented the first voice pipeline attempt: Whisper for STT, Mistral-chan for LLM, XTTS for TTS. It partially worked. The RTX 5090's bleeding-edge GPU architecture meant that sm_120 support was missing across most of the ML stack. Whisper and XTTS were CPU-bound. Latency was measured in coffee breaks. Mistral-chan monologued. The pipeline technically existed but was not something anyone would describe as real-time.

We shelved it. We moved on to other projects. We waited.

August 2025: re-tested. sm_120 support still not ready. Shelved again.

January 2026: sm_120 support finally matured across the stack. Haus was ready. We came back.

This is the story of what happened when we did.

💻 Environment

| Component | Spec |
|---|---|
| Workstation | Haus AI Workstation |
| GPU | NVIDIA RTX 5090 Laptop GPU |
| VRAM | 24GB |
| OS | Ubuntu (Conda environment) |
| CUDA | 12.8 |
| Python | 3.10 (see notes) |
| LLM Runtime | Ollama |
| LLM Model | Qwen 2.5 14B quantized |
| STT | Whisper (local) |
| TTS | XTTS v2 (Coqui) |

💡🤔 Design Principles:

  • Python orchestrates, GPU services do the heavy lifting
  • Everything logged and reproducible
  • Modular: each component lives in its own file
  • Local first, GPU-accelerated wherever possible

💬 The Model Decision

Before a single line of code was written, there was a model decision to make.

The original ambition was to run a significantly more powerful model: OSS20, a heavier architecture that would have given WaifuVoice considerably more capability. Motoko-chan and EthanC evaluated it seriously.

However, Haus couldn't load OSS20 at all. Insufficient VRAM meant the model never got off the ground. We didn't have a latency problem. We had a 'won't start' problem.

So the decision was made: Qwen 2.5 14B quantized via Ollama.

Not the most powerful option available. Not the most exciting headline. But the model that made a real-time conversational loop actually viable on actual hardware. Sometimes the right engineering decision is the unsexy one.

Qwen 2.5 14B-chan it was.
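For a rough feel of why quantization matters on a 24GB card, here is a back-of-the-envelope sketch. It is not how the decision was actually measured. The helper name `weight_footprint_gib` is made up for illustration, and the 4-bit figure is an assumption, since the exact quantization level isn't recorded here. It counts model weights only and ignores the KV cache, runtime overhead, and the VRAM that Whisper and XTTS need alongside the LLM.

```python
def weight_footprint_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Illustrative only: a 14B-parameter model at full 16-bit vs. an assumed 4-bit quant.
for bits in (16, 4):
    print(f"14B at {bits}-bit: ~{weight_footprint_gib(14, bits):.1f} GiB")
```

At 16 bits the weights alone overshoot the card. At 4 bits they leave headroom for the rest of the voice stack. Real quantized files usually come out somewhat larger than this estimate.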

💻 Environment Stabilization: The Invisible War ⚔️

Before anything ran, the environment had to work. This took longer than it should have and is documented here because it's instructive.

Python Version

Python 3.13.x was available. Python 3.13.x was not usable. The ML stack (Whisper, XTTS, Transformers) was not stable on the newer interpreter. The decision was made to stay on Python 3.10 under Conda. Stability over modernity.

Transformers / Tokenizers Version Mismatch

A concrete example of why "just install it" is never the whole story:

pip install -U "transformers==4.44.2" "tokenizers==0.19.1" "accelerate"
python -c "from transformers import BeamSearchScorer; print('BeamSearchScorer OK')"

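Installing a package is not enough. Version compatibility is the real issue: even after a clean install, failures persist unless the versions align precisely. The second command is a quick smoke test that the pinned transformers build still exposes BeamSearchScorer; if that import fails, the pin is wrong. Finding this out cost hours.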
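One way to catch this kind of drift early is a small startup check that compares what is installed against the pins. This is a sketch, not part of the original pipeline. The `PINS` dict and both function names are illustrative, and the versions are copied from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the install command above.
PINS = {"transformers": "4.44.2", "tokenizers": "0.19.1"}


def installed_version(package: str):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def find_mismatches(pins, lookup=installed_version):
    """Map each drifted package to an (expected, found) pair."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for package, (expected, found) in find_mismatches(PINS).items():
        print(f"{package}: expected {expected}, found {found}")
```

Run at the top of the orchestration script, this turns a cryptic import error into a one-line message naming the offending package.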

Conda vs venv

A practical decision was made to stay with Conda rather than migrate to venv. The voice pipeline had previously worked in a Conda environment family. Stability mattered more than environment ideology.

Motoko-chan's note: when a working environment exists, don't rewrite it. Migrate incrementally.

๐Ÿ› ๏ธ Procedure

The project became modular rather than one giant script. Each component earned its own file:

waifuvoice/
├── llm_ollama.py        # LLM inference via Ollama HTTP API
├── stt_whisper.py       # Speech-to-text via Whisper
├── tts_xtts.py          # Voice generation via XTTS v2
├── mic_to_llm_to_tts.py # Master orchestration loop
└── waifumemory/
    └── memory.py        # Short-term convo memory (Phase 3)
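To give a sense of how small a module like llm_ollama.py can be, here is a minimal sketch of a non-streaming call to Ollama's local HTTP API. It is not the project's actual file. The model tag `qwen2.5:14b`, the function names, and the timeout are assumptions; the endpoint is Ollama's default local address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint
MODEL_TAG = "qwen2.5:14b"  # assumed tag; use whatever `ollama list` reports


def build_payload(prompt: str, model: str = MODEL_TAG) -> bytes:
    """Encode a non-streaming generate request as JSON bytes."""
    body = {"model": model, "prompt": prompt, "stream": False}
    return json.dumps(body).encode("utf-8")


def generate(prompt: str, timeout_s: float = 60.0) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt),  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read())["response"]
```

With the Ollama server running, `generate("Say hello in one sentence.")` returns the model's reply as a plain string, which is all the TTS stage needs.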

๐Ÿ› ๏ธ  Step 1 โ€” LLM โ†’ TTS: First Proof of Life

The first milestone wasn't the full loop. It was simpler: get Qwen to generate text, get XTTS to speak it, confirm GPU was being used.

speech processing pipeline flowchart showing Qwen 2.5 14B-chan LLM text input through XTTS v2, to audio playback

It worked. GPU confirmed active. Voice output confirmed clean. XTTS was warm-loaded and kept hot after first initialization, a deliberate optimization with a clear trade-off:

  • Longer initial load time
  • Significantly faster subsequent turns
  • Better conversational rhythm
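The warm-loading idea boils down to "load once, reuse forever." A minimal sketch of that pattern follows, assuming the Coqui `TTS` Python API. The `WarmModel` wrapper and `load_xtts` are illustrative names, not the project's actual code.

```python
import threading


class WarmModel:
    """Load a model on first use, then hand back the same instance every time."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            if self._model is None:
                self._model = self._loader()  # slow path, paid exactly once
            return self._model


def load_xtts():
    # Imported lazily so this module can be imported without Coqui TTS installed.
    from TTS.api import TTS

    return TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")


xtts = WarmModel(load_xtts)  # nothing is loaded until xtts.get() is first called
```

Calling `xtts.get()` once at startup moves the load cost out of the first conversational turn entirely.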

This was the moment the concept proved viable again. After two failed attempts in 2025, WaifuVoice had a heartbeat.

๐Ÿ› ๏ธ Step 2 โ€” Full Loop

The next milestone: close the full loop.

speech processing pipeline flowchart showing microphone input through Whisper STT, Qwen 2.5 14B-chan LLM, XTTS v2, to audio playback.
The WaifuVoice pipeline: local, modular, and now with personality

The first working version was rough. Unstable. Held together with duct tape and optimism. But it completed the loop. Input went in, voice came out, the system didn't crash immediately.

That was enough to build on.
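Structurally, one turn of the loop is just four stages chained together. The sketch below shows that shape with each stage passed in as a plain function, so it can be exercised with stubs. The name `run_turn` and the stage signatures are assumptions for illustration; the real mic_to_llm_to_tts.py may be organized differently.

```python
from typing import Callable


def run_turn(
    record: Callable[[], str],
    transcribe: Callable[[str], str],
    generate: Callable[[str], str],
    speak: Callable[[str], None],
) -> str:
    """Run one mic -> STT -> LLM -> TTS turn and return the reply text."""
    audio_path = record()
    user_text = transcribe(audio_path).strip()
    if not user_text:
        return ""  # nothing intelligible was heard; skip the LLM and TTS calls
    reply = generate(user_text)
    speak(reply)
    return reply


if __name__ == "__main__":
    # Stub stages so the control flow can be exercised without a mic or GPU.
    spoken = []
    reply = run_turn(
        record=lambda: "turn.wav",
        transcribe=lambda path: " hello ",
        generate=lambda text: f"You said: {text}",
        speak=spoken.append,
    )
    print(reply)  # You said: hello
```

Keeping the stages injectable is what lets each file in the layout be swapped or tested on its own.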

๐Ÿ” Observations

The environment stabilization phase was the least glamorous and most important part of this project. Getting Python, CUDA, Whisper, XTTS, and Ollama all functioning together on the RTX 5090 required specific version pinning and deliberate toolchain choices that aren't obvious from documentation alone.
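A quick way to tell whether a given PyTorch build is ready for the card is to compare the GPU's compute capability against the architectures the build was compiled for. The sketch below uses `torch.cuda.get_device_capability` and `torch.cuda.get_arch_list`. The helper names are illustrative, and the exact-match test is deliberately strict: it ignores PTX forward compatibility, so treat a False as "investigate," not as proof of failure.

```python
def arch_supported(capability, arch_list):
    """True if the build ships kernels for exactly this compute capability."""
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


def check_local_gpu():
    """Check GPU 0 against the installed PyTorch build (needs a CUDA build of torch)."""
    import torch  # imported lazily so the pure helper above works without torch

    if not torch.cuda.is_available():
        return False
    return arch_supported(
        torch.cuda.get_device_capability(0),
        torch.cuda.get_arch_list(),
    )
```

A capability of (12, 0) maps to sm_120, the architecture string that was missing from the stack through 2025.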

The model decision, choosing Qwen 2.5 14B quantized over heavier alternatives, was the right call. Real-time voice interaction has hard latency requirements. A more powerful model that can't respond within 2-3 seconds is less useful than a capable model that can.
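Holding a pipeline to a latency target is easier when every stage is timed separately. Here is a minimal sketch of per-stage timing with a budget check. The names `timed` and `over_budget` are illustrative, and the 3-second default simply mirrors the upper end of the 2-3 second window mentioned above.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(stage: str, timings: dict):
    """Record the wall-clock duration of a block under timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


def over_budget(timings: dict, budget_s: float = 3.0) -> bool:
    """True if the summed stage times exceed the response budget."""
    return sum(timings.values()) > budget_s


if __name__ == "__main__":
    timings = {}
    with timed("llm", timings):
        time.sleep(0.01)  # stand-in for a real inference call
    print(timings, "over budget:", over_budget(timings))
```

Logging the `timings` dict per turn also shows which stage to blame when a turn runs long.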

The XTTS warm-loading optimization had an outsized impact on perceived conversational quality. Cold load time is acceptable. Mid-conversation lag is not.

📊 Results

| Milestone | Status |
|---|---|
| Environment stabilized on Haus | ✅ |
| Qwen 2.5 14B-chan via Ollama | ✅ Operational |
| XTTS v2 GPU-accelerated | ✅ Operational |
| LLM → TTS proof of life | ✅ Confirmed |
| Full Mic → STT → LLM → TTS loop | ✅ First working version |
| Real-time conversational quality | 🔜 PROJ-002-B |

The pipeline is alive. Rough, unstable, held together with careful version pinning and deliberate choices, but alive.

Eight months after EXP-002 shelved the first attempt, WaifuVoice has a heartbeat again. Happy Valentine's Day 🌹, Haus.

💡 Key Learnings

  • GPU architecture support (sm_120) was the blocking factor for 8 months; timing your re-entry correctly matters
  • Python version stability matters more than recency for ML pipelines
  • Version pinning is not optional: transitive dependency conflicts are silent and destructive
  • Keep XTTS warm: initial load cost is worth the conversational rhythm improvement
  • The right model is the one that works within your actual hardware constraints, not the one with the best benchmark
  • Modular architecture from day one saves significant pain later

Next in the Lab

The pipeline exists. Now: make it not terrible.

👉 PROJ-002-B: WaifuVoice - VAD, Audio Hell & The Chair Scrape