UPSTREAM PR #1299: Circular tiling #69

Open
loci-dev wants to merge 4 commits into main from loci/pr-1299-circular-tiling

Conversation

@loci-dev
Note

Source pull request: leejet/stable-diffusion.cpp#1299

VAE tiling did not work well with the --circular flags. This PR aims to improve that by making the tiling itself circular, rather than the VAE.

Example with sd-cli.exe --diffusion-model ..\ComfyUI\models\unet\Flux\klein\flux-2-klein-9b-Q6_K.gguf --vae ..\ComfyUI\models\vae\flux\Flux.2-VAE.safetensors --llm ..\ComfyUI\models\clip\llms\Qwen3-8b.Q4_K_XL.gguf --cfg-scale 1.0 -p '360 HDRI panorama image, capturing a wide, 360-degree view of living room. Intimate atmosphere<lora:Klein_9B - HDRI_360_panoramic:1.3>' --scheduler smoothstep --steps 8 -W 1024 -H 512 --rng cpu --lora-model-dir ..\ComfyUI\models\loras\flux\klein9\ --circularx --vae-tiling --vae-tile-size 16

[Image comparison, three columns: master | no circular VAE (1cdcdd5) | this PR. Each column shows the full output and a close-up of the seam.]

(lora: https://civitai.green/models/2413837/hdri-flux2klein-9b)

Example with sd-cli.exe --diffusion-model ..\ComfyUI\models\unet\Flux\klein\flux-2-klein-9b-Q6_K.gguf --vae ..\ComfyUI\models\vae\flux\Flux.2-VAE.safetensors --llm ..\ComfyUI\models\clip\llms\Qwen3-8b.Q4_K_XL.gguf --cfg-scale 1.0 -p 'Aerial view of a very busy city' --preview proj --steps 8 -W 512 -H 512 --rng cpu --circular --scheduler smoothstep --vae-tiling --vae-tile-size 16

[Image comparison, two columns: master | this PR. Each column shows the output and the same output tiled.]

@loci-dev loci-dev deployed to stable-diffusion-cpp-prod February 28, 2026 04:05 — with GitHub Actions Active
@loci-review
loci-review bot commented Feb 28, 2026

Overview

Analysis of stable-diffusion.cpp compared 48,327 functions across two binaries: build.bin.sd-cli and build.bin.sd-server. The codebase shows 88 modified functions, 12 new, 12 removed, and 48,215 unchanged. Power consumption increased marginally: sd-cli by +0.087% (483,492.36 nJ → 483,912.72 nJ) and sd-server by +0.051% (518,368.34 nJ → 518,632.06 nJ).

Function Analysis

Critical Regression:

  • get_thread_range (sd-cli, GGML CPU backend): Response time increased from 172.47ns to 258.23ns (+85.76ns, +49.72%); throughput time from 132.26ns to 218.02ns (+85.76ns, +64.84%). This low-level threading utility calculates thread distribution for parallel tensor operations and is called frequently in hot paths. The regression appears in the GGML submodule without accessible source changes, but the identical absolute increase in both metrics suggests that the added cost sits in the function body itself rather than in its callees. Because the function is invoked frequently during CPU tensor operations, the added latency could accumulate into a noticeable impact on diffusion inference performance.

Intentional Quality Tradeoff:

  • ggml_ext_tensor_split_2d (sd-server): Response time increased from 874.31ns to 935.08ns (+60.77ns, +6.95%); throughput time from 311.16ns to 371.93ns (+60.77ns, +19.53%). Source changes added the modulo operations (ix + x) % input_width and (iy + y) % input_height for circular coordinate wrapping in VAE tiling. This eliminates edge artifacts in tiled VAE decoding at the cost of modest computational overhead, which is a justified trade for the improved output quality.

Standard Library Variations:
Multiple STL functions show performance changes without source modifications: std::vector::end() regressed (+306.60% throughput, +183.29ns), while std::list::end() improved (-74.32% throughput, -180.81ns) and std::_Rb_tree::find() improved (-37.11% throughput, -62.86ns). These likely stem from compiler optimization differences between builds. Smart pointer operations (std::make_shared, std::__shared_ptr::operator=) show throughput regressions of 102-179% but occur only during initialization/cleanup with negligible cumulative impact.

Additional Findings

The VAE circular tiling feature is the primary intentional change, improving visual quality at an acceptable performance cost. The get_thread_range regression in GGML's threading infrastructure is the only performance-critical concern that warrants investigation, as it affects CPU parallelization across all tensor operations. Container operation improvements in sd-server partially offset the regressions, resulting in a smaller power consumption increase than in sd-cli.
