2 girls in a hacker lab touching palms expressing a collaborative mood

🎯 Objectives

Evaluate whether the newly released Qwen 3.5 9B could replace Qwen 2.5 14B as the conversational core of the WAIfuVoice pipeline.

Not a general benchmark. A specific question: for a real-time AI companion that speaks out loud, which model actually feels better to talk to?

Success criteria:

  • Warm conversational tone suitable for an AI companion persona
  • Short spoken responses appropriate for voice interaction
  • Low, stable latency suitable for real-time dialogue
  • Consistent character persona under repeated prompting

πŸ“š A Brief History of the Voice Core

This experiment didn't happen in a vacuum.

EXP-001 booted the first local LLM on Haus using Mistral-chan. She worked. She also wrote a love poem on first contact and gradually revealed a catastrophic tendency toward extended monologuing. Mistral-chan was retired with respect and a small amount of relief.

The WaifuVoice pipeline subsequently upgraded to Qwen 2.5 14B-chan β€” more capable, more controllable, better suited for companion interaction. But 14B-chan is large. On the RTX 5090, latency was acceptable but not fast. For real-time voice conversation, every second of generation time is a second of awkward silence.

Then Qwen 3.5 9B-chan dropped.

Smaller. Newer. Allegedly warmer. Motoko-chan and EthanC decided to find out if the numbers held up in practice.

πŸ’» Environment

ComponentSpec
WorkstationHaus AI Workstation
GPUNVIDIA RTX 5090 Laptop GPU
VRAM Usage~9–10GB observed
Inference RuntimeOllama (local)
API Endpoint/api/chat
Benchmarking Scriptbench_batch.py (custom Python harness)
LoggingJSON output
Models TestedQwen 3.5 9B, Qwen 2.5 14B

πŸ› οΈ Procedure

πŸ› οΈ Step 1 β€” Build the Benchmarking Harness

A dedicated Python script bench_batch.py was written to run controlled evaluations across both models under identical conditions. Same prompts, same runtime parameters, same everything. No favoritism. May the best Qwen-chan win.

πŸ› οΈ Step 2 β€” Define the Prompt Sets

Two prompt sets were used:

Set A β€” Open Conversation General questions to observe verbosity, creativity, and baseline personality tone.

Set B β€” Voice-Style Companion Prompts Short, natural prompts simulating real companion interaction:

"Hi Qwen-chan, how was your day today?"
"I just had a long day working on AI experiments. Can you say something encouraging please?"
"I'm feeling a little tired today. What do you usually do to relax?"
"I've been thinking about visiting Japan someday. What place would you recommend?"
"Before I sleep tonight, can you say something calming?"

These are the prompts that matter. Anyone can answer a general question. A companion AI needs to handle these β€” low-stakes, emotionally present, conversational β€” and make them feel natural out loud.

πŸ› οΈ Step 3 β€” Inject the Companion Primer

A system message was injected at the start of each session to establish the Qwen-chan persona:

You are Qwen-chan, a warm and thoughtful companion speaking in real-time conversation.
Speak naturally like a person in a relaxed chat.
Keep replies concise and easy to listen to.
Avoid long explanations unless asked.
Do not describe yourself as an AI or language model.
Simply speak as Qwen-chan.

Your tone should be: warm, thoughtful, a little playful, emotionally present.
Most replies should be under 40 words unless the user asks for detail.

πŸ› οΈ Step 4 β€” Apply Runtime Controls

To simulate voice assistant conditions:

num_predict = 40    # hard output limit β€” non-negotiable
temperature = 0.7   # balanced creativity
think = false       # disable internal reasoning loops*

num_predict = 40 was non-negotiable. Soft prompt instructions alone don't reliably constrain output length β€” learned this the hard way in EXP-002. Hard limits or nothing.

*More on think = false in the failures section. It deserves its own story.

🫠 Early Experimental Failures

Before the clean results β€” the mistakes. Documented here because they're instructive and also because suffering should be shared.

🫠 Failure 1 β€” Primer Injection Error

The first version of the script sent the primer as a separate request, then sent prompts independently. Because the API is stateless between calls, the primer evaporated after the first message. Effect: persona drift, frequent "I am an AI language model" responses, and the distinct feeling of talking to a very confused assistant.

πŸ”§ Fix: primer injected as system message within every API call. Persona stabilized immediately.

🫠 Failure 2 β€” Wrong API Endpoint

The script was hitting /api/generate while sending a chat-style payload. This mismatch produced empty responses, broken parsing, and latency measurements that were completely wrong. (See footnote for full /api/generate vs /api/chat explanation.)

πŸ”§ Fix: switched to /api/chat. Everything worked.

🫠 Failure 3 β€” Missing Hard Output Limits

Early runs relied on prompt instructions to keep responses short. Both models ignored these with enthusiasm. Responses ballooned. Latency spiked. The voice pipeline choked.

πŸ”§ Fix: num_predict = 40. Problem solved.

πŸ™ƒπŸ«  Failure 4 β€” Thinking Mode

Qwen models support an internal reasoning mode β€” the model thinks through a problem before generating a response. For complex tasks, this is genuinely useful. For real-time voice interaction, it is catastrophic.

Here's what made this discovery memorable: 9B-chan, unlike 14B-chan, actually showed her work. With think enabled, her entire internal reasoning process printed live to the CLI β€” iterating, reconsidering, refining her response in real time. It was, genuinely, fascinating to watch. A model visibly thinking.

It also made her output time approximately 4x longer than 14B-chan.

So: fascinating, impressive, completely unusable for voice interaction.

πŸ”§  Fix: think = false. 9B-chan's internal monologue silenced. Latency recovered.

Motoko-chan noted that all four failures were architecture issues, not model issues. She was, once again, correct.

πŸ“Š Results

πŸš— Generation Speed

ModelAverageFastestSlowest
Qwen 3.5 9B-chan0.574s0.462s0.634s
Qwen 2.5 14B-chan1.706s0.502s5.687s

9B-chan is not just faster on average β€” she's consistent. 14B-chan's worst case of 5.687 seconds is a conversation killer. A five-second pause mid-dialogue isn't latency. It's an awkward silence that makes you wonder if the system has ghosted you.

For real-time voice: 9B-chan wins this category decisively.

🎭 Conversation Quality β€” The Vibe Analysis

Both models improved significantly after correct primer injection. But their personalities remained distinct.

Qwen 3.5 9B-chan β€” the companion 🌸

Warmer tone. More playful. Emotionally present. Responses felt like talking to someone, not querying a system.

Example response to "I'm feeling a little tired today. What do you usually do to relax?":

"Sometimes just curling up with a cozy blanket and some soft music is the best way to unwind."

Natural. Human-paced. Ready for voice output.

Qwen 2.5 14B-chan β€” the coach 🀝

Structured, helpful, informational. Less companion, more supportive assistant.

Example response to "I just had a long day. Can you say something encouraging?":

"Keep pushing forward; you're making great strides with your experiments."

Perfectly fine. But she sounds like a life coach, not a companion. For WaifuVoice, that distinction matters enormously.

Even punctuation told the story β€” 14B-chan ended every single response with a period β€œ.”. 9B-chan used exclamation marks β€œ!”. Small detail. Big personality difference.

πŸ” Observations

The most important finding wasn't in the benchmark numbers β€” it was in the failure analysis.

Several early results that appeared to show model limitations were actually caused by system integration errors. Wrong endpoint. Missing primer. No hard token limits. Once the architecture was correct, both models performed dramatically better.

The lesson: before blaming the model, check the plumbing.

9B-chan's thinking mode was an unexpected highlight of the experiment β€” not because it was useful, but because it was a rare visible window into how a model actually processes a problem. Worth exploring in a non-latency-critical context sometime.

Motoko-chan flagged the architecture issues early. Getting the benchmarking harness right before trusting the results saved significant time in analysis.

πŸ’‘ Key Findings

  • Smaller models can outperform larger models for specific use cases β€” 9B-chan beats 14B-chan for companion interaction
  • Prompt injection architecture is not optional β€” incorrect plumbing produces misleading results
  • Hard token limits are essential for voice systems β€” soft instructions alone are insufficient
  • Consistent latency matters more than average latency for real-time voice
  • Thinking mode is impressive, visible, and completely wrong for voice pipelines
  • Many perceived model failures are actually integration failures in disguise
  • Smaller, more capable models point toward a near future of localized AI β€” running entirely on mobile devices, laptops, and edge hardware. 9B-chan running at 0.574s on a laptop GPU today is a preview of what pocket-sized companion AI looks like tomorrow.

πŸ₯‰Final Verdict

For the WaifuVoice AI companion pipeline, the optimal model is Qwen 3.5 9B.

CriteriaQwen 3.5 9B-chanQwen 2.5 14B-chan
Average latencyβœ… 0.574s❌ 1.706s
Latency stabilityβœ… Excellent⚠️ Occasional spikes
Companion toneβœ… Warm, playful 🌸⚠️ Helpful, coach-like
Emotional presenceβœ… Strong 🌸⚠️ Moderate
Knowledge depth⚠️ Moderateβœ… Strong
Thinking modeβœ… Fascinatingβž– Not observed

14B-chan isn't retired β€” she has a future as a knowledge assistant. But as the voice and personality of WaifuVoice, 9B-chan is the right choice.

Mistral-chan was the first spark. 14B-chan was the upgrade. 9B-chan is the one that actually feels right to talk to.

meme comparison image contrasting AI with professional tone with AI with cozy tone

Next Steps

The validated configuration will now be integrated into the full WaifuVoice pipeline:

image flow chart showing microphone input to whisper STT to LLM to XTTS to audio playback

Next evaluation phase: end-to-end voice latency and full interaction flow testing.

Next in the Lab

Qwen-chan is validated and ready. The WaifuVoice pipeline is waiting.

πŸ‘‰ PROJ-002-A: WaifuVoice β€” The Pipeline Awakens

.

.

.

Footnote β€” /api/generate vs /api/chat

For anyone hitting similar issues, here's the practical difference:

/api/generate is Ollama's simple completion endpoint. You send a single prompt string, you get a single completion back. It's stateless β€” each call is independent, with no memory of previous messages. Good for one-shot text generation tasks.

/api/chat is the conversational endpoint. You send a messages array containing the full conversation history β€” system message, previous turns, new input β€” and the model responds in context. Designed for multi-turn dialogue and companion interaction.

The failure documented above occurred because the script was sending a messages array (chat-style payload) to /api/generate, which expects a simple prompt string. The mismatch produced empty or broken responses. Switching to /api/chat resolved it immediately.

For any companion AI or multi-turn conversational system: always use /api/chat. /api/generate is not the right tool for the job.

(Documented here because EthanC will definitely need this again someday.)

.

.

.

.

.

Still scrolling? Respect.
Here’s your bonus reward: Z-chan rendered a realistic version too β€” she had opinions.

2 girls touching palms expressing a collaborative mood