
Phase 1 โ Environment, Stack Selection & First Breakthrough
๐ฏ Objectives
Build a fully local, real-time conversational voice loop on Haus. Not a demo. Not a proof of concept. A working companion pipeline that listens, thinks, and speaks โ running entirely on local hardware.
Target interaction loop:

The goal was not just "speech in, speech out." The goal was presence โ a companion-like conversational entity that feels alive, responds naturally, and runs without cloud dependency.
๐ A Note on History
This project didn't start in January 2026.
It started in May 2025.
EXP-002 documented the first voice pipeline attempt โ Whisper for STT, Mistral-chan for LLM, XTTS for TTS. It partially worked. The RTX 5090's bleeding-edge GPU architecture meant that sm_120 support was missing across most of the ML stack. Whisper and XTTS were CPU-bound. Latency was measured in coffee breaks. Mistral-chan monologued. The pipeline technically existed but was not something anyone would describe as real-time.
We shelved it. We moved on to other projects. We waited.
August 2025 โ re-tested. sm_120 support still not ready. Shelved again.
January 2026 โ sm_120 support finally matured across the stack. Haus was ready. We came back.
This is the story of what happened when we did.
๐ป Environment
| Component | Spec |
|---|---|
| Workstation | Haus AI Workstation |
| GPU | NVIDIA RTX 5090 Laptop GPU |
| VRAM | 24GB |
| OS | Ubuntu (Conda environment) |
| CUDA | 12.8 |
| Python | 3.10 (see notes) |
| LLM Runtime | Ollama |
| LLM Model | Qwen 2.5 14B quantized |
| STT | Whisper (local) |
| TTS | XTTS v2 (Coqui) |
๐ก๐ค Design Principles:
- Python orchestrates, GPU services do the heavy lifting
- Everything logged and reproducible
- Modular โ each component lives in its own file
- Local first, GPU-accelerated wherever possible
๐ฌ The Model Decision
Before a single line of code was written, there was a model decision to make.
The original ambition was to run a significantly more powerful model โ OSS20, a heavier architecture that would have given WaifuVoice considerably more capability. Motoko-chan and EthanC evaluated it seriously.
However, Haus couldn't load OSS20 at all โ insufficient VRAM meant the model never got off the ground. We didn't have a latency problem. We had a 'won't start' problem.
So the decision was made: Qwen 2.5 14B quantized via Ollama.
Not the most powerful option available. Not the most exciting headline. But the model that made a real-time conversational loop actually viable on actual hardware. Sometimes the right engineering decision is the unsexy one.
Qwen 2.5 14B-chan it was.
๐ป Environment Stabilization โ The Invisible War โ๏ธ
Before anything ran, the environment had to work. This took longer than it should have and is documented here because it's instructive.
Python Version
Python 3.13.x was available. Python 3.13.x was not usable. The ML stack โ Whisper, XTTS, Transformers โ was not stable on the newer interpreter. The decision was made to stay on Python 3.10 under Conda. Stability over modernity.
Transformers / Tokenizers Version Mismatch
A concrete example of why "just install it" is never the whole story:
pip install -U "transformers==4.44.2" "tokenizers==0.19.1" "accelerate"
python -c "from transformers import BeamSearchScorer; print('BeamSearchScorer OK')"Installing a package is not enough. Version compatibility is the real issue. Even after installation, failures persist if the versions don't align precisely. This cost hours.
Conda vs venv
A practical decision was made to stay with Conda rather than migrate to venv. The voice pipeline had previously worked in a Conda environment family. Stability mattered more than environment ideology.
Motoko-chan's note: when a working environment exists, don't rewrite it. Migrate incrementally.
๐ ๏ธ Procedure
The project became modular rather than one giant script. Each component earned its own file:
waifuvoice/
โโโ llm_ollama.py # LLM inference via Ollama HTTP API
โโโ stt_whisper.py # Speech-to-text via Whisper
โโโ tts_xtts.py # Voice generation via XTTS v2
โโโ mic_to_llm_to_tts.py # Master orchestration loop
โโโ waifumemory/
โโโ memory.py # Short-term convo memory (Phase 3)๐ ๏ธ Step 1 โ LLM โ TTS: First Proof of Life
The first milestone wasn't the full loop. It was simpler: get Qwen to generate text, get XTTS to speak it, confirm GPU was being used.

It worked. GPU confirmed active. Voice output confirmed clean. XTTS warm-loaded and kept hot after first initialization โ a deliberate optimization decision:
- Longer initial load time
- Significantly faster subsequent turns
- Better conversational rhythm
This was the moment the concept proved viable again. After two failed attempts in 2025, WaifuVoice had a heartbeat.
๐ ๏ธ Step 2 โ Full Loop
The next milestone: close the full loop.

The first working version was rough. Unstable. Held together with duct tape and optimism. But it completed the loop. Input went in, voice came out, the system didn't crash immediately.
That was enough to build on.
๐ Observations
The environment stabilization phase was the least glamorous and most important part of this project. Getting Python, CUDA, Whisper, XTTS, and Ollama all functioning together on the RTX 5090 required specific version pinning and deliberate toolchain choices that aren't obvious from documentation alone.
The model decision โ choosing Qwen 2.5 14B quantized over heavier alternatives โ was the right call. Real-time voice interaction has hard latency requirements. A more powerful model that can't respond within 2-3 seconds is less useful than a capable model that can.
The XTTS warm-loading optimization had an outsized impact on perceived conversational quality. Cold load time is acceptable. Mid-conversation lag is not.
๐ Results
| Milestone | Status |
|---|---|
| Environment stabilized on Haus | โ |
| Qwen 2.5 14B-chan via Ollama | โ Operational |
| XTTS v2 GPU-accelerated | โ Operational |
| LLM โ TTS proof of life | โ Confirmed |
| Full Mic โ STT โ LLM โ TTS loop | โ First working version |
| Real-time conversational quality | ๐ PROJ-002-B |
The pipeline is alive. Rough, unstable, held together with careful version pinning and deliberate choices โ but alive.
Eight months after EXP-002 shelved the first attempt, WaifuVoice has a heartbeat again. Happy Valentine's Day ๐น, Haus.
๐กKey Learnings
- GPU architecture support (
sm_120) was the blocking factor for 8 months โ timing your re-entry correctly matters - Python version stability matters more than recency for ML pipelines
- Version pinning is not optional โ transitive dependency conflicts are silent and destructive
- Keep XTTS warm โ initial load cost is worth the conversational rhythm improvement
- The right model is the one that works within your actual hardware constraints, not the one with the best benchmark
- Modular architecture from day one saves significant pain later
Next in the Lab
The pipeline exists. Now: make it not terrible.
๐ PROJ-002-B: WaifuVoice โ VAD, Audio Hell & The Chair Scrape