ImageGen Awakens: SDXL vs Z-Image

An image of 3 ai-generated women — Z-Image output — facial consistency across multiple generations

🎯 Objectives

Build a fully local AI image generation system on Haus — zero cloud dependency, full GPU acceleration, production-grade output. Then evaluate which model actually deserves to run on it.

No subscriptions. No censorship. No latency. Just raw creative power inside the Haus.

This is EXP-004 — the experiment where the Haus stopped just thinking, and started to Imagine.

📚 Background — The Vidgen Detour

Before imagegen, there was vidgen. Two weeks of it. The results were horrific — a 7B model producing animations barely better than a kindergarten child's doodles, rendered at 2fps, with the visual complexity of 2 alternating frame swaps. Two weeks of effort, troubleshooting, and model downgrades to get... this. Morale was thoroughly destroyed.

Motoko-chan put things in perspective: vidgen is frontier-tech, level 10 difficulty. Imagegen is level 3 — achievable on Haus. Pivot now.

She was right. Again.

The pivot happened. And within days, everything changed.

💻 Environment

Component	Spec
Workstation	Haus AI Workstation
GPU	NVIDIA RTX 5090 Laptop GPU
VRAM	24GB
OS	Ubuntu Linux
CUDA	12.8
Frontend	SwarmUI
Execution Engine	ComfyUI
Models Tested	Stable Diffusion XL 1.0 Base (SDXL-chan), Z-Image Turbo FP8 Mixed (Z-chan)

⚙️ The Stack:

Flowchart showing the process from SwarmUI to ComfyUI to Pytorch to RTX5090 to Local Image

🛠️ Procedure

🛠️ Step 1 — Environment Setup

With past experience, setup this time was significantly smoother.

Prior experiments across EXP-001 through PROJ-002 had already established the stable foundation: Python 3.10, CUDA 12.8, correct PyTorch builds. No exploration needed. Straight to the working configuration.

SwarmUI with ComfyUI backend handles GPU detection automatically via PyTorch — no manual conda or venv management required. It uses its own internal environment. The GPU was recognized immediately.

Lesson compounding: suffering in EXP-001 paid dividends in EXP-004.

🛠️ Step 2 — SwarmUI Setup

Initial SwarmUI install had minor dependency issues — dotnet missing, some ComfyUI dependencies requiring manual resolution. Nothing dramatic. Fixed quickly with standard package installation.

The system was running within the same session.

🛠️ Step 3 — Model Downloads: Taking Control

SwarmUI's default model download behavior was the first real friction point — models downloading silently in the background with no progress indication. The system appeared frozen. Nothing visible was happening.

This produces a very specific anxiety: is it downloading or is it broken?

The fix: bypass SwarmUI's download entirely and use wget directly.

wget -c <model_url>

The -c flag enables resume on interrupted downloads. Immediate benefits:

Visible download progress and speed
File size confirmation
Safe stop and resume
No more "is it stuck?" anxiety

For larger models, aria2c provides multi-connection acceleration:

aria2c -x 16 -s 16 <model_url>

Personal notes : Download your own models. Don't let the UI do it silently. This is now lab standard.

🛠️ Step 4 — First Render

After setup, after the downloads, after everything — the first image rendered locally on Haus.

No fanfare. No dramatic alert. Just an image appearing on screen.

Heart thumping. Genuinely.

After the vidgen detour, after weeks of GPU driver battles across previous experiments — the GPU spun up, the fans kicked in, and the Haus produced something. Not perfect. But real. Local. Ours.

That was the moment the Image Factory came online.

🛠️ Step 5 — SDXL-chan Baseline Testing

Stable Diffusion XL 1.0 Base deployed first as the baseline.

Parameter	Value
Sampler	Euler / DPM++
Steps	20–30
CFG	5–7
Resolution	1024x1024

Results were not encouraging. Attempting to improve outputs with Stable Diffusion XL 1.0 Refiner produced minimal improvement. The problems were not refinement problems. They were foundational.

🛠️ Step 6 — Z-chan Evaluation

Z-chan (Z-Image Turbo, FP8 Mixed) was spotted during the experiment. The decision to test immediately was instantaneous.

Notably — Z-chan came with her own instructions:

"Set CFG Scale to 1, set steps to 4, 8, or 12."

She knew exactly what she needed. We complied.

Parameter	Value
Sampler	DPM++ 2M (preferred)
Steps	4 / 8 / 12 (model specified)
CFG	1 (model specified)
Resolution	1024 / 1536 / 2048

💡 Lower steps. Lower CFG. The behavior was fundamentally different from SDXL-chan before the first image finished rendering.

🫠💀 SDXL-chan — The Horror Report

SDXL-chan is not broken. She is simply not good enough.

The failure modes were not subtle. They were not occasional edge cases. They were consistent, repeatable, and in several instances — deeply unsettling.

Fingers: Mangled. Displaced thumbnails. Med school car crash references for the human hand. Attempted improvement with the Refiner. No discernible improvement - at least not by EthanC...

AI generated image showing mangled fingers. — Fingers of car crash victim

Faces: This is where SDXL-chan truly distinguished herself, in the worst possible way:

A second lower lip, growing from the first
A metallic protrusion emerging from the side of the nose
etc… (see for yourself)

images of badly AI generated female faces — Alternative beauty

EthanC forced himself not to delete the evidence. It exists in the logs. It will remain in the logs.

Poses and clothing: Rigid poses. Deformed shoes. Clothing cuts ranging from "unusual fashion choices" to "huh ? "

📜 The verdict:

The feeling was not disappointment. It was revulsion.

"I don't feel excited looking at these" would have been generous. These images produced a visceral negative reaction that prompted EthanC to terminate the experiment.

SDXL-chan served her purpose as a baseline. That purpose was to make Z-chan's results look even better by contrast. In this, she succeeded completely.

⭐ Z-chan — First Contact

The first thing checked on Z-chan: fingers.

Clean. Correct. Beautiful. Ten fingers, correctly jointed, anatomically present. The relief was immediate and genuine.

The reaction to the first full output: "Incredible."

Not "good." Not "better than SDXL-chan." Incredible.

Area	SDXL-chan	Z-chan
Fingers	Mangled	Correct
Faces	Horror	Consistent, clean
Poses	Rigid	Natural
Atmosphere	Flat	Present
Scene cohesion	Pasted	Integrated
Steps needed	20-30	4-12

Z-chan achieved superior results at a fraction of the step count. Her own recommended settings — CFG 1, steps 4-12 — produced outputs that SDXL-chan couldn't match at maximum settings.

👀 The Reference Shift

Something worth documenting explicitly.

SDXL-chan tested first — results felt acceptable. Z-chan tested next — everything clicked. Going back to SDXL-chan afterward:

"Wait... why does this look so bad now?"

This is reference shift. The brain recalibrates its standard once it sees what good actually looks like. Going back feels like regression because it is regression.

Motoko-chan's note: always test baseline first, before evaluating improvements. Once you've seen the ceiling, the floor looks further away.

EthanC had leveled up his eyes. There was no going back.

Image of a man with the representation of SDXL and distracted by the better looking Z-image representation

🔬 Unexpected Research Findings

During Z-chan testing, a recurring phenomenon was observed.

During testing, Z-chan occasionally produced outputs with significantly reduced clothing coverage - *ahem* when prompts were ambiguous. This is likely influenced by training data from fashion and editorial photography, where visibility of the human form is emphasized.

Interestingly, these outputs often showed stronger lighting, skin detail, material realism, and certain 'anatomy detail' - *ahem* suggesting the model prioritizes visual clarity and surface interaction when constraints are not explicitly defined.

Explicit wardrobe constraints in prompts effectively mitigate this behavior.

Research motivation during this phase of testing remained consistently high. The reward loop was notably enhanced. Further documentation is available but will not be included here.

🔍 Observations

The imagegen pivot was the correct decision. Within days of setup, the system was producing outputs that the vidgen experiments couldn't approach in two weeks of effort.

Z-chan's release timing was fortunate — a brand new model, evaluated on brand new hardware, producing results that immediately set a new quality bar for the lab.

SDXL-chan's contribution was to establish a baseline so clearly inadequate that the decision to move to Z-chan required no deliberation. Sometimes the most useful thing a tool can do is demonstrate its own limitations thoroughly.

The Image Factory is online. The Haus can Imagine. And what it imagines is considerably better than what we started with.

💡 Key Learnings

Past environment experience compounds — EXP-001's suffering paid dividends in EXP-004. (Python 3.10 + CUDA 12.8 became the stable foundation for our Haus AI stack — once that worked, there was no value in chasing newer Python versions just for the sake of it. )
SwarmUI handles GPU automatically — no manual PyTorch/conda management needed
Download models manually with wget -c or aria2c — never trust silent background downloads
SDXL-chan's finger and face deformities are consistent and not fixable with Refiner
Z-chan's self-specified settings delivered superior results at dramatically lower step counts (CFG=1, steps 4-12) — CFG around 1 felt unusual at first, but proved correct for this model.
Reference shift is real — always establish baseline before testing improvements
With Z-Image, clothing constraints need to be explicit. If left ambiguous, the model tends to optimize toward fashion/editorial aesthetics and clearer body-form rendering.
Vidgen is level 10. Imagegen is level 3. Motoko-chan was right. Again.

📊 Results

Component	Status
Local image generation	✅ Operational
GPU acceleration	✅ RTX 5090 active
SwarmUI + ComfyUI stack	✅ Functional
SDXL-chan baseline	✅ Evaluated — revulsion achieved
Z-chan evaluation	✅ Evaluated — excellent, sparks motivation 🤤
Primary model decision	✅ Z-chan selected
Unexpected findings	✅ Thoroughly documented 😄
Extended environment testing	🔜 EXP-005

📝 Conclusion

Z-chan is the primary model going forward. SDXL-chan remains available for baseline comparisons. She will not be used voluntarily.

EXP-005 documents Z-chan's extended environment testing — studio lighting, Shibuya neon, rain and volumetric scenes, anime vs realistic modes, and camera composition discoveries. That is where Z-chan's full capabilities become clear.

The Image Factory is online. The Haus can now Imagine.

And what it Imagines is ... gorgeous. 💙🧡

image of a woman in Shibuya at night in the rain with dramatic lighting

Next in the Lab

SDXL-chan has been evaluated. Z-chan has been selected. Now: find out what she's really capable of.

👉 EXP-005: Z-Image — Environments, Lighting & The Cinematography Discovery