
Phase 2: Voice Activity Detection, Audio Pipeline & TTS Debugging
🎯 Objectives
Transform a working but rough voice loop into something that actually feels like a conversation.
PROJ-002-A proved the pipeline was alive. PROJ-002-B is about making it not terrible. That meant solving everything that stood between "technically functional" and "actually usable", which turned out to be quite a lot.
💻 Environment
Same base environment as PROJ-002-A (Haus RTX 5090 + Ollama + Whisper + XTTS pipeline).
This phase focuses on audio pipeline stabilization and VAD behavior.
🎤 The VAD Evolution
Voice Activity Detection (knowing when someone is speaking and when they've stopped) sounds simple. It is not simple. It took three distinct phases to get right.
🛠️ Phase 1: Manual / Button-Based Interaction
The first version worked like a command tool:
- Press Enter to start recording
- Capture fixed duration of audio
- Send to Whisper → Qwen → XTTS
Functional. Deeply unnatural. Nobody wants to press Enter to talk to their AI companion. This was scaffolding, not a solution.
🛠️ Phase 2: VAD-Based Segmentation
The next version detected speech automatically:
- Monitor audio stream continuously
- Detect energy above threshold → speech started
- Detect silence below threshold → speech ended
- Send captured segment to pipeline
Better. But the edges were wrong. Speech got cut at the front. Speech got cut at the back. Whisper received damaged audio and produced creative interpretations of what was said.
More on that shortly.
🛠️ Phase 3: Full Conversational Loop
The final version:
- Listens continuously
- Detects speech start with pre-roll buffer
- Records through speech with silence inertia
- Processes turn
- Speaks back
- Returns to listening automatically
The moment Phase 3 worked, WaifuVoice stopped feeling like a command tool and started feeling like a presence. That distinction is harder to describe than it sounds, but it was immediately obvious when it happened.
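The Phase 3 steps can be read as one small loop. What follows is a minimal sketch, not the WaifuVoice code: the four callables (`capture_utterance`, `transcribe`, `generate_reply`, `speak`) are hypothetical stand-ins for the VAD capture, Whisper, Qwen, and XTTS stages.

```python
from typing import Callable, Optional


def run_conversation(
    capture_utterance: Callable[[], Optional[bytes]],
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    speak: Callable[[str], None],
    max_turns: Optional[int] = None,
) -> int:
    """Listen -> transcribe -> reply -> speak, then go back to listening.

    All four callables are hypothetical stand-ins for the real stages.
    capture_utterance returns None when the audio source is exhausted.
    Returns the number of completed turns.
    """
    turns = 0
    while max_turns is None or turns < max_turns:
        audio = capture_utterance()       # blocks until VAD closes a segment
        if audio is None:                 # stream ended: leave the loop
            break
        text = transcribe(audio).strip()
        if not text:                      # nothing intelligible: keep listening
            continue
        speak(generate_reply(text))       # blocking playback ends the turn
        turns += 1
    return turns
```

Keeping playback blocking inside the turn is one simple design choice, assumed here: the loop is not listening while it speaks, so it cannot capture its own voice.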
The Problems: A Field Guide to Audio Hell
Ten problems. Documented in full. Because suffering should be shared and also because future EthanC will need this.
🔥 Problem A: Speech Start Getting Cut Off
⚠️ Symptom: First word missing. First syllable gone. Whisper misheard opening phrases consistently. Greetings were reliably butchered.
🕵️ Root cause: VAD only started buffering audio after detecting speech energy. The first syllables happened before the trigger fired.
🛠️ Fix: Pre-roll buffer. Keep a small rolling audio buffer running continuously. When speech is detected, prepend the pre-roll to the captured audio. First syllable restored.
Whisper's first impression of EthanC: someone who greets AI companions with "'Allo?" Like a confused Frenchman who wandered into the wrong pipeline.
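The pre-roll idea from Problem A fits in a few lines. This is a sketch under assumptions: audio arrives as fixed-size frames, `is_speech` is a hypothetical per-frame detector, and the default of 10 frames is illustrative rather than the value used on Haus.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List


def capture_with_preroll(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    preroll_frames: int = 10,
) -> List[bytes]:
    """Return frames from the first speech onset onward, with pre-roll prepended."""
    # Rolling buffer: once full, appending drops the oldest frame.
    preroll: Deque[bytes] = deque(maxlen=preroll_frames)
    captured: List[bytes] = []
    triggered = False
    for frame in frames:
        if triggered:
            captured.append(frame)
        elif is_speech(frame):
            triggered = True
            captured.extend(preroll)   # audio from just before the trigger
            captured.append(frame)
        else:
            preroll.append(frame)
    return captured
```

With 30 ms frames, 10 frames would keep roughly 300 ms of audio from before the trigger. How much is enough depends on how late the detector fires.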
🔥 Problem B: Speech End Getting Cut Off Too Early
⚠️ Symptom: Natural pauses mid-sentence triggered end-of-speech detection. Whisper received half a sentence. Responses made no sense.
🛠️ Fix: Silence hangover + tail padding. When the silence threshold triggers, don't stop immediately. Continue capturing for a defined tail period. Brief pauses in natural speech are preserved. Sentence integrity restored.
There is a specific kind of madness that comes from watching a system cut you off mid-thought. Repeatedly. While you are trying to have a conversation with it. WaifuVoice was, briefly, the rudest AI in the lab.
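The silence hangover from Problem B amounts to a counter that resets whenever speech resumes. A minimal sketch, assuming frame-based capture and a hypothetical `is_speech` detector; the default of 25 frames is illustrative, not the tuned value.

```python
from typing import Callable, Iterable, List


def capture_until_silence(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    hangover_frames: int = 25,
) -> List[bytes]:
    """Collect frames until `hangover_frames` consecutive non-speech frames are seen.

    Assumes the first frame is already inside speech (onset handled elsewhere).
    The trailing silent frames are kept, which doubles as tail padding.
    """
    segment: List[bytes] = []
    silent_run = 0
    for frame in frames:
        segment.append(frame)
        if is_speech(frame):
            silent_run = 0            # a pause shorter than the hangover is forgiven
        else:
            silent_run += 1
            if silent_run >= hangover_frames:
                break                 # sustained silence: the turn is over
    return segment
```

A longer hangover tolerates longer thinking pauses but adds the same amount of latency to every turn, so the value is a trade-off rather than a constant.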
🔥 Problem C: dBFS Confusion and Threshold Tuning
This one deserves its own section because it was a genuine learning curve, not just a bug fix.
VAD threshold tuning required understanding dBFS (decibels relative to full scale):
- 0 dBFS = maximum possible loudness
- Silence = very negative, approaching -inf
- Speech = somewhere in between, depending on microphone, environment, and how loudly EthanC talks at 2am
The threshold must sit cleanly between speech and silence. Too sensitive: background noise triggers the pipeline constantly. Not sensitive enough: actual speech gets ignored.
This is not arbitrary tuning. It is signal interpretation. Getting it wrong in either direction breaks the pipeline in ways that look like model failures but are actually microphone geometry.
Time spent on this phase: significant. Worth it.
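For reference, the dBFS level of a frame can be computed from its RMS amplitude. The sketch below assumes signed 16-bit PCM samples and an RMS convention in which a full-scale square wave reads 0 dBFS (a full-scale sine reads about -3 dBFS). The -35 dBFS default threshold is purely illustrative; the calibrated value for Haus is not given here.

```python
import math
from typing import Sequence

INT16_FULL_SCALE = 32768.0  # magnitude of the most negative 16-bit sample


def rms_dbfs(samples: Sequence[int]) -> float:
    """RMS level of signed 16-bit PCM samples in dBFS (0 = full scale)."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0:
        return float("-inf")          # digital silence
    return 20.0 * math.log10(math.sqrt(mean_square) / INT16_FULL_SCALE)


def is_speech(samples: Sequence[int], threshold_dbfs: float = -35.0) -> bool:
    """Energy gate: True when the frame is louder than the threshold."""
    return rms_dbfs(samples) > threshold_dbfs
```

Logging `rms_dbfs` for a few seconds of room noise and a few seconds of normal speech gives two clusters of values. The threshold belongs in the gap between them.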
🔥 Problem D: ALSA Device Weirdness
Linux audio routing. The gift that keeps giving.
plughw:1,0 versus hw:1,0 gave different behavior, inconsistent results, and audio capture artifacts that appeared and disappeared depending on which path was used. Repeated chunks. Corrupt segments. Headset-specific quirks that made no sense until they suddenly did.
No elegant solution here. Methodical testing of device paths until the right combination was found. Documented in config. Never touched again.
A note on scope: extensive ALSA tuning was deliberately avoided. The long-term solution is likely running WaifuVoice on a mobile device or a laptop with a properly configured system mic, bypassing Linux audio routing entirely. No point over-engineering a problem we plan to route around.
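For anyone repeating the device-path testing, a short probe script keeps it methodical. `hw:` talks to the hardware directly with no conversion, while `plughw:` goes through ALSA's plug layer, which converts sample rate and format automatically. The sketch below only builds and runs `arecord` calls; the 16 kHz mono format, the clip length, and the output file names are assumptions, not the project's recorded settings.

```python
import subprocess
from typing import List


def arecord_command(device: str, out_path: str, seconds: int = 2) -> List[str]:
    """Build an arecord call: 16-bit mono 16 kHz capture from one ALSA device."""
    return [
        "arecord",
        "-D", device,          # ALSA device string, e.g. "hw:1,0" or "plughw:1,0"
        "-f", "S16_LE",        # signed 16-bit little-endian samples
        "-r", "16000",         # sample rate in Hz (an assumed value)
        "-c", "1",             # mono
        "-d", str(seconds),    # stop after this many seconds
        out_path,
    ]


def probe_devices(devices: List[str]) -> None:
    """Record a short clip per candidate so each path can be compared by ear."""
    for device in devices:
        out_path = "probe_" + device.replace(":", "_").replace(",", "_") + ".wav"
        result = subprocess.run(arecord_command(device, out_path))
        print(device, "ok" if result.returncode == 0 else "failed", out_path)
```

Calling `probe_devices(["hw:1,0", "plughw:1,0"])` leaves one WAV per path, which makes repeated chunks or corrupt segments easy to compare side by side.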
🔥 Problem E: The Headset Feedback Spike
⚠️ Symptom: Amplified self-voice in headset for approximately 1 second after pipeline started. Weird initial feedback. First-second recording instability.
🕵️ Likely cause: Headset monitoring / sidetone interaction with duplex record+playback audio stack. Linux duplex behavior on this specific hardware was not cooperative.
🛠️ Solution: Warm-up discard, timing guards, VAD tuning around actual hardware behavior. Not fully solved at a hardware level. Worked around sufficiently for practical use.
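The warm-up discard is the simplest of the three workarounds: throw away the first second of captured frames before the VAD ever sees them. A minimal sketch, assuming frame-based capture; the 30 ms frame size is an assumption, and the 1 second default mirrors the spike duration described above.

```python
from itertools import islice
from typing import Iterable, Iterator


def discard_warmup(
    frames: Iterable[bytes],
    warmup_seconds: float = 1.0,
    frame_ms: int = 30,
) -> Iterator[bytes]:
    """Skip the first `warmup_seconds` of audio, then pass frames through unchanged."""
    skip = int(round(warmup_seconds * 1000 / frame_ms))
    return islice(frames, skip, None)
```

Wrapping the raw frame source once, at stream start, keeps the rest of the pipeline unaware that the workaround exists.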
EthanC's note: some problems are solved. Some problems are managed. This one is managed.
Worth noting: Haus has no built-in speakers or microphone with reasonable function. Aftershock, once again, features in this story. The hardware that arrived was not exactly fully equipped for voice interaction. This is being documented without further comment.
🔥 Problem F: The Chair Scrape
This one has a name because it earned a name.
⚠️ Symptom: A long scraping noise artifact appearing inside XTTS-generated audio. Present between clauses. Present between sentences. Deeply unpleasant. Immediately obvious to any human listener.
🕵️ Investigation: The scrape was confirmed present in the saved WAV file itself, not just in playback. ffplay verified it was baked into the generated audio. This ruled out ALSA playback artifacts entirely.
Conclusion: XTTS generation artifact. Not a playback problem. A TTS hallucination: non-speech audio bursts generated during longer outputs, likely at sentence boundaries.
🟡 Status: Partially mitigated. Full solution points toward chunk-and-stitch generation in a future phase. The chair scrape is documented, understood, and on the roadmap.
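Chunk-and-stitch has not been built yet, so the following is only one possible shape for it: synthesize each sentence on its own and join the results with a fixed, known-silent gap. `synthesize` is a hypothetical stand-in for a single-sentence TTS call returning float samples, the regex split is deliberately naive, and the 24000 Hz default is an assumption about the model's output rate.

```python
import re
from typing import Callable, List


def split_into_chunks(text: str) -> List[str]:
    """Naive sentence split on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def chunk_and_stitch(
    text: str,
    synthesize: Callable[[str], List[float]],
    sample_rate: int = 24000,
    gap_ms: int = 150,
) -> List[float]:
    """Synthesize each sentence separately and join them with a short silent gap.

    `synthesize` is a hypothetical stand-in for a single-sentence TTS call.
    """
    gap = [0.0] * (sample_rate * gap_ms // 1000)
    stitched: List[float] = []
    for index, chunk in enumerate(split_into_chunks(text)):
        if index > 0:
            stitched.extend(gap)       # controlled pause instead of model-generated filler
        stitched.extend(synthesize(chunk))
    return stitched
```

The appeal of this design is that the model never generates audio across a sentence boundary, which is where the scrape seemed to live. Whether that removes it entirely is untested.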
🔥 Problem G: XTTS Internal Sentence Splitting
This was the most satisfying debugging session of the entire project.
⚠️ Symptom: Weird pauses at sentence boundaries. Artifacty transitions. Timing felt wrong in ways that were hard to pin down.
🕵️ Investigation path:
- Suspected internal sentence splitting behavior in XTTS
- Searched the repo: nothing obvious in the wrapper
- Traced deeper into Coqui TTS internals
- Found it: TTS/utils/synthesizer.py → split_sentences behavior
🛠️ Fix:
split_sentences=False
One parameter. Passed at the wrapper call level. Removed:
- Extra pauses between sentences
- Artifacty transitions
- Sentence-level fragmentation
Motoko-chan's reaction: "It was always one parameter."
It is always one parameter.
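In context, the one parameter looks roughly like this. The sketch assumes the wrapper calls Coqui's `tts_to_file` on a `TTS.api.TTS` instance and that the installed release accepts `split_sentences` as a keyword. The helper name, model string, and speaker reference clip are illustrative, not taken from the WaifuVoice code.

```python
from typing import Any


def synthesize_unsplit(tts: Any, text: str, out_path: str,
                       speaker_wav: str, language: str = "en") -> Any:
    """Call tts_to_file with Coqui's internal sentence splitting turned off.

    `tts` is expected to be a Coqui `TTS.api.TTS` instance, for example:
        from TTS.api import TTS
        tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    The model name and the speaker_wav reference clip are assumptions here.
    """
    return tts.tts_to_file(
        text=text,
        speaker_wav=speaker_wav,
        language=language,
        file_path=out_path,
        split_sentences=False,   # the one parameter
    )
```

Taking the `tts` object as an argument keeps model loading out of the helper, so the expensive load happens once and the call can be exercised with a stub.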
🔥 Problem H: Whisper Hallucination & The Repetition Loops
This section contains real examples from the turn logs. They are presented without further comment because they require none.
Observed STT outputs from damaged or short audio input:
- "hello hello hello hello hello"
- "best with the best with the best with the best"
- "favorite condom" (intended: "favorite color")
- "wife of voice" (intended: "WaifuVoice")
Whisper, when given poor quality audio, does not fail gracefully. It hallucinates with confidence and commitment.
🛠️ Fix directions:
- Better audio capture stability (see Problems A-F)
- Post-STT hotword correction layer (planned)
- Whisper model upgrade (planned)
- Do not trust STT output on obviously broken audio
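Two of these directions are cheap to prototype in plain Python. The sketch below is one possible shape, not the planned implementation: a heuristic that flags back-to-back repeated phrases (so the turn can be dropped instead of answered), and a hotword table that maps known mishearings to the intended term. The function names and thresholds are assumptions.

```python
import re
from typing import Dict, List


def looks_like_repetition_loop(text: str, max_ngram: int = 4, min_repeats: int = 3) -> bool:
    """True when some 1..max_ngram word phrase repeats back-to-back min_repeats times."""
    words: List[str] = re.findall(r"[\w']+", text.lower())
    for n in range(1, max_ngram + 1):
        for start in range(len(words) - n * min_repeats + 1):
            phrase = words[start:start + n]
            if all(
                words[start + k * n:start + (k + 1) * n] == phrase
                for k in range(1, min_repeats)
            ):
                return True
    return False


def apply_hotwords(text: str, hotwords: Dict[str, str]) -> str:
    """Replace known mishearings with the intended term, case-insensitively."""
    for heard, intended in hotwords.items():
        pattern = r"\b" + re.escape(heard) + r"\b"
        # A function replacement avoids backslash handling in `intended`.
        text = re.sub(pattern, lambda _m, s=intended: s, text, flags=re.IGNORECASE)
    return text
```

A hotword table only suits mishearings with a single sensible target, such as the project name. Context-dependent errors need something smarter than string replacement.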
"wife of voice" has since been adopted as an unofficial alternate project name. Internally.
And "favorite condom" will be our brand name if we get into that industry.

🔥 Problem I: Pathing, Package Structure & Module Confusion
Less dramatic than audio artifacts. Equally time-consuming.
Multiple instances of:
- Module not found
- Relative paths breaking depending on execution root
- Root-level package vs script-local package conflict
- Old scripts and new scripts with similar names doing different things
🛠️ Fix: Memory module moved into a proper top-level package path. Project layout cleaned and standardized. Old scripts archived.
💡 Lesson: Software hygiene is not glamorous. Neglecting it is more expensive than maintaining it.
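For the "relative paths breaking depending on execution root" class of bug, one common remedy is to anchor paths to the module's own file instead of the current working directory. A minimal sketch; the helper name and the example directory names are hypothetical.

```python
from pathlib import Path


def resolve_from(anchor_file: str, *parts: str) -> Path:
    """Build a path relative to the directory containing `anchor_file`.

    Typical use inside a module: resolve_from(__file__, "assets", "speaker.wav").
    The result does not depend on the current working directory.
    """
    return Path(anchor_file).resolve().parent.joinpath(*parts)
```

This covers data files. For the import side, running scripts as `python -m package.module` from the project root is the usual companion habit once the layout is a proper package.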
Results
By the end of this phase:
| Issue | Status |
|---|---|
| Speech start cut-off | ✅ Pre-roll buffer implemented |
| Speech end cut-off | ✅ Silence hangover + tail padding |
| dBFS threshold tuning | ✅ Calibrated for Haus hardware |
| ALSA device routing | ✅ Stable configuration found |
| Headset feedback spike | ⚠️ Managed via warm-up discard |
| Chair scrape artifact | ⚠️ Mitigated, chunk-stitch planned |
| XTTS sentence splitting | ✅ Fixed via split_sentences=False |
| Whisper hallucinations | ⚠️ Improved, hotword correction planned |
| Package structure | ✅ Cleaned and standardized |
| VAD full conversational loop | ✅ Operational |
Ten problems. Seven resolved. Three managed. Zero ignored.
Performance at end of phase:
- Turn-around time: 1.0 to 2.5 seconds (felt "very nice")
- STT accuracy in good conditions: ~90%
- Full 2-sentence outputs: clean, no sentence-break scraping
- Conversational rhythm: natural enough to forget it's a pipeline
Observations
The gap between "technically working" and "actually usable" was enormous. PROJ-002-A proved the pipeline could exist. PROJ-002-B is where the real engineering happened: ten overlapping problems across audio capture, signal processing, TTS generation, and software architecture, solved in sequence and sometimes in parallel.
The most important insight from this phase: most problems that looked like model failures were actually pipeline failures. Whisper wasn't hallucinating because Whisper is bad. It was hallucinating because it was receiving damaged audio. Fix the audio, fix the output.
"wife of voice" remains funnier than intended.
💡 Key Learnings
- Pre-roll buffer is non-negotiable for natural speech capture
- Silence hangover prevents mid-sentence cutoffs: humans pause when they think
- dBFS threshold tuning is signal engineering, not guesswork
- Always verify artifacts in the saved file before blaming playback
- Coqui TTS internals contain behavior that isn't obvious from the public API, so read the source
- split_sentences=False: tattoo this somewhere visible
- Whisper hallucination is an audio quality problem before it is a model problem
- "wife of voice": unintentional. Unforgettable.
Next in the Lab
The audio is stable. Time to give her memory and a personality.
PROJ-002-C: WaifuVoice - Memory, Persona & The Architecture of Presence