WaifuVoice: VAD, Audio Hell & The Chair Scrape

A person speaking into a microphone facing an anime AI companion, both connected by a glowing sound wave, while buildings burn dramatically in the background. WaifuVoice Phase 2. The pipeline is fine. Everything is fine.

Phase 2 — Voice Activity Detection, Audio Pipeline & TTS Debugging

🎯 Objectives

Transform a working but rough voice loop into something that actually feels like a conversation.

PROJ-002-A proved the pipeline was alive. PROJ-002-B is about making it not terrible. That meant solving everything that stood between "technically functional" and "actually usable" — which turned out to be quite a lot.

💻 Environment

Same base environment as PROJ-002-A (Haus RTX 5090 + Ollama + Whisper + XTTS pipeline).
This phase focuses on audio pipeline stabilization and VAD behavior.

🎤 The VAD Evolution

Voice Activity Detection — knowing when someone is speaking and when they've stopped — sounds simple. It is not simple. It took three distinct phases to get right.

🛠️ Phase 1 — Manual / Button-Based Interaction

The first version worked like a command tool:

Press Enter to start recording
Capture fixed duration of audio
Send to Whisper → Qwen → XTTS

Functional. Deeply unnatural. Nobody wants to press Enter to talk to their AI companion. This was scaffolding, not a solution.

🛠️ Phase 2 — VAD-Based Segmentation

The next version detected speech automatically:

Monitor audio stream continuously
Detect energy above threshold → speech started
Detect silence below threshold → speech ended
Send captured segment to pipeline

Better. But the edges were wrong. Speech got cut at the front. Speech got cut at the back. Whisper received damaged audio and produced creative interpretations of what was said.
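A minimal sketch of this naive segmentation, assuming the audio has already been reduced to one level reading per frame. The function name and the -35 dBFS default are illustrative, not the values used on Haus.

```python
from typing import List, Sequence, Tuple


def naive_segments(
    levels_db: Sequence[float], threshold_db: float = -35.0
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive.

    A segment opens on the first frame above the threshold and closes on
    the very next frame at or below it. No pre-roll, no hangover.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    for i, level in enumerate(levels_db):
        if level > threshold_db and start is None:
            start = i                      # speech started
        elif level <= threshold_db and start is not None:
            segments.append((start, i))    # speech ended on first quiet frame
            start = None
    if start is not None:                  # stream ended mid-speech
        segments.append((start, len(levels_db)))
    return segments
```

Feeding it `[-60, -50, -20, -18, -45, -19, -60]` splits one utterance in two at the single quiet frame, which is the "edges were wrong" behavior in miniature.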

🛠️ Phase 3 — Full Conversational Loop

The final version:

Listens continuously
Detects speech start with pre-roll buffer
Records through speech with silence inertia
Processes turn
Speaks back
Returns to listening automatically

The moment Phase 3 worked, WaifuVoice stopped feeling like a command tool and started feeling like a presence. That distinction is harder to describe than it sounds — but it was immediately obvious when it happened.
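The loop shape can be sketched as below. This is an outline under assumptions, not the actual WaifuVoice code: `utterances`, `transcribe`, `respond`, and `speak` are hypothetical stand-ins for the VAD capture, Whisper, Qwen, and XTTS playback stages.

```python
from typing import Callable, Iterator, List


def conversation_loop(
    utterances: Iterator[bytes],
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    speak: Callable[[str], None],
) -> List[str]:
    """Run listen -> process -> speak turns until the utterance source ends."""
    transcript: List[str] = []
    for audio in utterances:           # each item is one complete VAD-captured turn
        text = transcribe(audio).strip()
        if not text:                   # nothing intelligible: return to listening
            continue
        reply = respond(text)
        speak(reply)                   # reply is spoken before the next turn is read
        transcript.append(f"user: {text}")
        transcript.append(f"ai: {reply}")
    return transcript
```

Passing the stages in as callables keeps the loop testable with stubs, without a microphone or a GPU attached.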

The Problems — A Field Guide to Audio Hell

Ten problems. Documented in full. Because suffering should be shared and also because future EthanC will need this.

🔥 Problem A — Speech Start Getting Cut Off

⚠️ Symptom: First word missing. First syllable gone. Whisper misheard opening phrases consistently. Greetings were reliably butchered.

🕵️ Root cause: VAD only started buffering audio after detecting speech energy. The first syllables happened before the trigger fired.

🛠️ Fix: Pre-roll buffer. Keep a small rolling audio buffer running continuously. When speech is detected, prepend the pre-roll to the captured audio. First syllable restored.
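A rough sketch of the pre-roll idea, using a fixed-size `collections.deque` as the rolling buffer. The class name, the frame-as-bytes representation, and the default of 10 frames are assumptions for illustration.

```python
from collections import deque
from typing import Deque, List


class PreRollRecorder:
    """Keeps the last few idle frames so they can be prepended on trigger."""

    def __init__(self, preroll_frames: int = 10) -> None:
        # deque with maxlen silently drops the oldest frame when full
        self._preroll: Deque[bytes] = deque(maxlen=preroll_frames)
        self._captured: List[bytes] = []
        self.recording = False

    def push(self, frame: bytes, is_speech: bool) -> None:
        if self.recording:
            self._captured.append(frame)
        elif is_speech:
            # Trigger: seed the capture with audio heard just before it fired
            self._captured = list(self._preroll) + [frame]
            self._preroll.clear()
            self.recording = True
        else:
            self._preroll.append(frame)

    def finish(self) -> bytes:
        """Return the captured audio and go back to idle listening."""
        audio = b"".join(self._captured)
        self._captured = []
        self.recording = False
        return audio
```

With 20 ms frames, 10 frames of pre-roll would cover about 200 ms before the trigger, which is in the range where a clipped first syllable tends to live. The right length depends on how late the detector fires.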

Whisper's first impression of EthanC: someone who greets AI companions with "'Allo?" Like a confused Frenchman who wandered into the wrong pipeline.

🔥 Problem B — Speech End Getting Cut Off Too Early

⚠️ Symptom: Natural pauses mid-sentence triggered end-of-speech detection. Whisper received half a sentence. Responses made no sense.

🛠️ Fix: Silence hangover + tail padding. When silence threshold triggers, don't stop immediately — continue capturing for a defined tail period. Brief pauses in natural speech are preserved. Sentence integrity restored.
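One way to sketch the hangover logic: only close a segment after a run of consecutive silent frames, and keep that silent run as tail padding. The function name and the default of 25 frames (500 ms at an assumed 20 ms per frame) are illustrative.

```python
from typing import Iterable, List, Tuple


def segments_with_hangover(
    speech_flags: Iterable[bool], hangover_frames: int = 25
) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive.

    A segment ends only after `hangover_frames` consecutive silent frames,
    and those silent frames stay inside the segment as tail padding.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    silent_run = 0
    total = 0
    for i, is_speech in enumerate(speech_flags):
        total = i + 1
        if is_speech:
            if start is None:
                start = i
            silent_run = 0                 # any speech resets the countdown
        elif start is not None:
            silent_run += 1
            if silent_run >= hangover_frames:
                segments.append((start, i + 1))
                start = None
                silent_run = 0
    if start is not None:                  # stream ended mid-utterance
        segments.append((start, total))
    return segments
```

A one-frame pause inside a sentence now survives: with a hangover of 3, the flags `F T T F T F F F T` produce `(1, 8)` and `(8, 9)` instead of three fragments.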

There is a specific kind of madness that comes from watching a system cut you off mid-thought. Repeatedly. While you are trying to have a conversation with it. WaifuVoice was, briefly, the rudest AI in the lab.

🔥 Problem C — dBFS Confusion and Threshold Tuning

This one deserves its own section because it was a genuine learning curve, not just a bug fix.

VAD threshold tuning required understanding dBFS — decibels relative to full scale:

0 dBFS = maximum possible loudness
Silence = very negative, approaching -inf
Speech = somewhere in between, depending on microphone, environment, and how loudly EthanC talks at 2am

The threshold must sit cleanly between speech and silence. Too sensitive: background noise triggers the pipeline constantly. Not sensitive enough: actual speech gets ignored.
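For reference, the level of a frame can be computed as 20 * log10(RMS / full scale). The sketch below assumes signed 16-bit PCM samples; the function name is illustrative.

```python
import math
from typing import Sequence

INT16_FULL_SCALE = 32768.0  # magnitude of the largest 16-bit sample


def rms_dbfs(samples: Sequence[int], full_scale: float = INT16_FULL_SCALE) -> float:
    """RMS level of a frame in dBFS; digital silence returns -inf."""
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms == 0:
        return float("-inf")
    return 20.0 * math.log10(rms / full_scale)
```

Halving the amplitude lowers the reading by about 6 dB. A practical way to place the threshold is to log this value for a few seconds of room noise and a few seconds of speech, then pick a point between the two ranges.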

This is not arbitrary tuning. It is signal interpretation. Getting it wrong in either direction breaks the pipeline in ways that look like model failures but are actually microphone geometry.

Time spent on this phase: significant. Worth it.

🔥 Problem D — ALSA Device Weirdness

Linux audio routing. The gift that keeps giving.

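plughw:1,0 vs hw:1,0: the first goes through ALSA's plug layer, which converts sample rate, format, and channel count automatically; the second talks to the hardware directly and only accepts what the device natively supports. Different behavior, inconsistent results, and audio capture artifacts that appeared and disappeared depending on which path was used. Repeated chunks. Corrupt segments. Headset-specific quirks that made no sense until they suddenly did.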

No elegant solution here. Methodical testing of device paths until the right combination was found. Documented in config. Never touched again.

A note on scope: extensive ALSA tuning was deliberately avoided. The long-term solution is likely running WaifuVoice on a mobile device or a laptop with a properly configured system mic — bypassing Linux audio routing entirely. No point over-engineering a problem we plan to route around.

🔥 Problem E — The Headset Feedback Spike

⚠️ Symptom: Amplified self-voice in headset for approximately 1 second after pipeline started. Weird initial feedback. First-second recording instability.

🕵️ Likely cause: Headset monitoring / sidetone interaction with duplex record+playback audio stack. Linux duplex behavior on this specific hardware was not cooperative.

🛠️ Solution: Warm-up discard, timing guards, VAD tuning around actual hardware behavior. Not fully solved at a hardware level. Worked around sufficiently for practical use.
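The warm-up discard can be as small as a generator that drops the first stretch of the stream. A sketch, assuming 20 ms frames and a 1000 ms warm-up to match the roughly one-second spike described above; both numbers and the function name are illustrative.

```python
from typing import Iterable, Iterator


def discard_warmup(
    frames: Iterable[bytes], frame_ms: int = 20, warmup_ms: int = 1000
) -> Iterator[bytes]:
    """Yield frames only after the warm-up window has passed."""
    skip = -(-warmup_ms // frame_ms)   # ceiling division: whole frames to drop
    for i, frame in enumerate(frames):
        if i >= skip:
            yield frame
```

Placing this between the capture stream and the VAD means the detector never sees the unstable first second, so it cannot trigger on it.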

EthanC's note: some problems are solved. Some problems are managed. This one is managed.

📝 Worth noting: Haus has no usable built-in speakers or microphone. Aftershock, once again, features in this story. The hardware that arrived was not exactly equipped for voice interaction. This is being documented without further comment.

🔥 Problem F — The Chair Scrape

This one has a name because it earned a name.

⚠️ Symptom: A long scraping noise artifact appearing inside XTTS-generated audio. Present between clauses. Present between sentences. Deeply unpleasant. Immediately obvious to any human listener.

🕵️ Investigation: The scrape was confirmed present in the saved WAV file itself — not just in playback. ffplay verified it was baked into the generated audio. This ruled out ALSA playback artifacts entirely.

📜 Conclusion: XTTS generation artifact. Not a playback problem. A TTS hallucination — non-speech audio bursts generated during longer outputs, likely at sentence boundaries.

🟡 Status: Partially mitigated. Full solution points toward chunk-and-stitch generation in a future phase. The chair scrape is documented, understood, and on the roadmap.
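Chunk-and-stitch is still on the roadmap, so the following is only one possible shape for it: split the reply into sentences, synthesize each one separately, and join the results with a short controlled silence. The `synthesize` callable, the 24000 Hz sample rate, and the 150 ms gap are all assumptions.

```python
import re
from typing import Callable, List


def split_into_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_and_stitch(
    text: str,
    synthesize: Callable[[str], List[float]],
    sample_rate: int = 24000,
    gap_ms: int = 150,
) -> List[float]:
    """Synthesize sentence by sentence and join with fixed silent gaps."""
    gap = [0.0] * (sample_rate * gap_ms // 1000)
    audio: List[float] = []
    for i, sentence in enumerate(split_into_sentences(text)):
        if i > 0:
            audio.extend(gap)          # silence we control, not silence XTTS invents
        audio.extend(synthesize(sentence))
    return audio
```

The point of the design is that each generation stays short, and the space between sentences is plain zeros rather than model output, so there is nowhere for a scrape to appear at the boundary.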

🔥 Problem G — XTTS Internal Sentence Splitting

This was the most satisfying debugging session of the entire project.

⚠️ Symptom: Weird pauses at sentence boundaries. Artifacty transitions. Timing felt wrong in ways that were hard to pin down.

🕵️ Investigation path:

Suspected internal sentence splitting behavior in XTTS

Searched the repo — nothing obvious in the wrapper

Traced deeper into Coqui TTS internals

Found it: TTS/utils/synthesizer.py — split_sentences behavior

🛠️ Fix:

split_sentences=False

One parameter. Passed at the wrapper call level. Removed:

Extra pauses between sentences
Artifacty transitions
Sentence-level fragmentation

Motoko-chan's reaction: "It was always one parameter."

It is always one parameter.
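For the record, a sketch of where the parameter goes, assuming the high-level Coqui `TTS.api.TTS` wrapper and its `tts_to_file` method. The helper name, file paths, and language default are placeholders, and the actual WaifuVoice wrapper may call into Coqui differently.

```python
def synthesize_unsplit(
    tts, text: str, speaker_wav: str, out_path: str, language: str = "en"
) -> str:
    """Synthesize `text` in one pass with Coqui's sentence splitter turned off.

    `tts` is expected to behave like a TTS.api.TTS instance.
    """
    tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,   # reference clip for voice cloning
        language=language,
        file_path=out_path,
        split_sentences=False,     # the one parameter
    )
    return out_path


# Usage (requires the Coqui TTS package and the XTTS v2 model):
#   from TTS.api import TTS
#   tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
#   synthesize_unsplit(tts, "Hello. It was always one parameter.",
#                      "reference_voice.wav", "reply.wav")
```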

🔥 Problem H — Whisper Hallucination & The Repetition Loops

This section contains real examples from the turn logs. They are presented without further comment because they require none.

Observed STT outputs from damaged or short audio input:

"hello hello hello hello hello"
"best with the best with the best with the best"
"favorite condom" (intended: "favorite color")
"wife of voice" (intended: "WaifuVoice")

Whisper, when given poor quality audio, does not fail gracefully. It hallucinates with confidence and commitment.

🛠️ Fix directions:

Better audio capture stability (see Problems A-F)
Post-STT hotword correction layer (planned)
Whisper model upgrade (planned)
Do not trust STT output on obviously broken audio
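The hotword layer is still planned, so this is a sketch of what the last two fix directions could look like rather than what exists: a check for repetition loops, and a phrase-level correction table. Function names, thresholds, and the table contents are hypothetical.

```python
import re
from typing import Dict


def has_repetition_loop(text: str, max_ngram: int = 4, min_repeats: int = 3) -> bool:
    """True if any 1..max_ngram word sequence repeats back to back min_repeats times."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    for n in range(1, max_ngram + 1):
        for start in range(len(words) - n * min_repeats + 1):
            gram = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == gram
                for k in range(1, min_repeats)
            ):
                return True
    return False


def apply_hotwords(text: str, hotwords: Dict[str, str]) -> str:
    """Replace known mishearings with the intended phrase, case-insensitively."""
    for wrong, right in hotwords.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        text = re.sub(pattern, lambda _m, r=right: r, text, flags=re.IGNORECASE)
    return text
```

A turn that trips `has_repetition_loop` can be dropped or re-prompted instead of being sent to the LLM, and a table such as `{"wife of voice": "WaifuVoice"}` handles the mishearings that keep coming back.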

"wife of voice" has since been adopted as an unofficial alternate project name. Internally.
And "favorite condom" will be our brand name if we get into that industry.

Screenshot showing a transcript where Whisper STT heard "favorite condom" instead of "favorite color". Now I am a perv.

🔥 Problem I — Pathing, Package Structure & Module Confusion

Less dramatic than audio artifacts. Equally time-consuming.

Multiple instances of:

Module not found
Relative paths breaking depending on execution root
Root-level package vs script-local package conflict
Old scripts and new scripts with similar names doing different things

🛠️ Fix: Memory module moved into a proper top-level package path. Project layout cleaned and standardized. Old scripts archived.

💡 Lesson: Software hygiene is not glamorous. Neglecting it is more expensive than maintaining it.

📊 Results

By the end of this phase:

Issue	Status
Speech start cut-off	✅ Pre-roll buffer implemented
Speech end cut-off	✅ Silence hangover + tail padding
dBFS threshold tuning	✅ Calibrated for Haus hardware
ALSA device routing	✅ Stable configuration found
Headset feedback spike	⚠️ Managed via warm-up discard
Chair scrape artifact	⚠️ Mitigated, chunk-stitch planned
XTTS sentence splitting	✅ Fixed via split_sentences=False
Whisper hallucinations	⚠️ Improved, hotword correction planned
Package structure	✅ Cleaned and standardized
VAD full conversational loop	✅ Operational

Ten problems. Seven resolved. Three managed. Zero ignored.

🏎️ Performance at end of phase:

Turn-around time: 1.0 – 2.5 seconds (felt "very nice")
STT accuracy in good conditions: ~90%
Full 2-sentence outputs: clean, no sentence-break scraping
Conversational rhythm: natural enough to forget it's a pipeline

🔍 Observations

The gap between "technically working" and "actually usable" was enormous. PROJ-002-A proved the pipeline could exist. PROJ-002-B is where the real engineering happened — ten overlapping problems across audio capture, signal processing, TTS generation, and software architecture, solved in sequence and sometimes in parallel.

The most important insight from this phase: most problems that looked like model failures were actually pipeline failures. Whisper wasn't hallucinating because Whisper is bad. It was hallucinating because it was receiving damaged audio. Fix the audio, fix the output.

"wife of voice" remains funnier than intended.

💡 Key Learnings

Pre-roll buffer is non-negotiable for natural speech capture
Silence hangover prevents mid-sentence cutoffs — humans pause when they think
dBFS threshold tuning is signal engineering, not guesswork
Always verify artifacts in the saved file before blaming playback
Coqui TTS internals contain default behavior that is easy to miss at the wrapper level: read the source
split_sentences=False — tattoo this somewhere visible
Whisper hallucination is an audio quality problem before it is a model problem
"wife of voice" — unintentional. Unforgettable.

Next in the Lab

The audio is stable. Time to give her memory and a personality.

👉 PROJ-002-C: WaifuVoice — Memory, Persona & The Architecture of Presence