UPSTREAM PR #1299: Circular tiling by loci-dev · Pull Request #69 · auroralabs-loci/stable-diffusion.cpp

loci-dev · 2026-02-28T04:04:57Z

Note

Source pull request: leejet/stable-diffusion.cpp#1299

VAE tiling did not work well with the --circular flags. This PR addresses that by making the tiling itself circular rather than the VAE.

Example with sd-cli.exe --diffusion-model ..\ComfyUI\models\unet\Flux\klein\flux-2-klein-9b-Q6_K.gguf --vae ..\ComfyUI\models\vae\flux\Flux.2-VAE.safetensors --llm ..\ComfyUI\models\clip\llms\Qwen3-8b.Q4_K_XL.gguf --cfg-scale 1.0 -p '360 HDRI panorama image, capturing a wide, 360-degree view of living room. Intimate atmosphere<lora:Klein_9B - HDRI_360_panoramic:1.3>' --scheduler smoothstep --steps 8 -W 1024 -H 512 --rng cpu --lora-model-dir ..\ComfyUI\models\loras\flux\klein9\ --circularx --vae-tiling --vae-tile-size 16

[Comparison images, left to right: master | no circular VAE (1cdcdd5) | this PR]

(lora: https://civitai.green/models/2413837/hdri-flux2klein-9b)

Example with sd-cli.exe --diffusion-model ..\ComfyUI\models\unet\Flux\klein\flux-2-klein-9b-Q6_K.gguf --vae ..\ComfyUI\models\vae\flux\Flux.2-VAE.safetensors --llm ..\ComfyUI\models\clip\llms\Qwen3-8b.Q4_K_XL.gguf --cfg-scale 1.0 -p 'Aerial view of a very busy city' --preview proj --steps 8 -W 512 -H 512 --rng cpu --circular --scheduler smoothstep --vae-tiling --vae-tile-size 16

[Comparison images, left to right: master | this PR]

loci-review · 2026-02-28T05:00:25Z

Overview

Analysis of stable-diffusion.cpp compared 48,327 functions across two binaries: build.bin.sd-cli and build.bin.sd-server. The codebase shows 88 modified functions, 12 new, 12 removed, and 48,215 unchanged. Power consumption increased marginally: sd-cli by +0.087% (483,492.36 nJ → 483,912.72 nJ) and sd-server by +0.051% (518,368.34 nJ → 518,632.06 nJ).

Function Analysis

Critical Regression:

get_thread_range (sd-cli, GGML CPU backend): Response time increased from 172.47ns to 258.23ns (+85.76ns, +49.72%); throughput time increased from 132.26ns to 218.02ns (+85.76ns, +64.84%). This low-level threading utility calculates how work is distributed across threads for parallel tensor operations and is called frequently in hot paths. The regression appears in the GGML submodule, whose source changes were not accessible, but the identical absolute increase in both metrics suggests logic was added directly in the function body. Because the function is invoked so often during CPU tensor operations, the added cost could accumulate into a significant impact on diffusion inference performance.
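The arithmetic behind the get_thread_range figures can be re-checked with a small sketch. The helper names below (timing_delta, same_delta) are hypothetical and are not part of GGML or the Loci tooling. The reading that an equal absolute delta in both metrics points at the function's own body assumes that response time includes callees while throughput time covers only the function itself; the review does not state this explicitly.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper types for illustration only.
struct Delta {
    double abs_ns;   // absolute change in nanoseconds
    double rel_pct;  // relative change in percent of the "before" value
};

// Compute the absolute and relative change between two timings.
Delta timing_delta(double before_ns, double after_ns) {
    assert(before_ns > 0.0);
    const double d = after_ns - before_ns;
    return {d, 100.0 * d / before_ns};
}

// True when two deltas have the same absolute size within a tolerance.
// Under the assumption stated above, this suggests the extra time was
// spent in the function body rather than in its callees.
bool same_delta(const Delta& a, const Delta& b, double tol_ns) {
    return std::fabs(a.abs_ns - b.abs_ns) <= tol_ns;
}
```

With the reported values, timing_delta(172.47, 258.23) and timing_delta(132.26, 218.02) both yield an absolute change of about 85.76ns, while the relative changes differ only because the baselines differ.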

Intentional Quality Tradeoff:

ggml_ext_tensor_split_2d (sd-server): Response time increased from 874.31ns to 935.08ns (+60.77ns, +6.95%); throughput time increased from 311.16ns to 371.93ns (+60.77ns, +19.53%). The source changes added the modulo operations (ix + x) % input_width and (iy + y) % input_height, which wrap coordinates circularly during VAE tiling. This eliminates edge artifacts in tiled VAE decoding at a modest computational cost, which is a justified design decision.
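A minimal sketch of the wrapping described above, operating on a plain row-major float buffer instead of a ggml tensor. The function split_tile_circular and its parameters are hypothetical and do not reproduce the actual signature or body of ggml_ext_tensor_split_2d; only the two modulo expressions are taken from the review.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration, not the real ggml_ext_tensor_split_2d.
// Copies a tile_w x tile_h tile whose top-left corner is at (x, y) out of
// a row-major image of size input_width x input_height. Source coordinates
// that run past the right or bottom edge wrap around to the opposite side.
std::vector<float> split_tile_circular(const std::vector<float>& input,
                                       int input_width, int input_height,
                                       int x, int y,
                                       int tile_w, int tile_h) {
    assert(input_width > 0 && input_height > 0);
    assert(x >= 0 && y >= 0 && tile_w > 0 && tile_h > 0);
    assert(input.size() ==
           static_cast<std::size_t>(input_width) * input_height);

    std::vector<float> tile(static_cast<std::size_t>(tile_w) * tile_h);
    for (int iy = 0; iy < tile_h; ++iy) {
        const int src_y = (iy + y) % input_height;  // wrap vertically
        for (int ix = 0; ix < tile_w; ++ix) {
            const int src_x = (ix + x) % input_width;  // wrap horizontally
            tile[static_cast<std::size_t>(iy) * tile_w + ix] =
                input[static_cast<std::size_t>(src_y) * input_width + src_x];
        }
    }
    return tile;
}
```

In this sketch a tile that starts near the right or bottom edge picks up its remaining pixels from the opposite edge, so the tiles overlap across the seam the same way they overlap in the interior. The two extra modulo operations per element are the kind of overhead the measurements above attribute to the change.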

Standard Library Variations:

Multiple STL functions show performance changes without source modifications: std::vector::end() regressed (+306.60% throughput, +183.29ns), while std::list::end() improved (-74.32% throughput, -180.81ns) and std::_Rb_tree::find() improved (-37.11% throughput, -62.86ns). These likely stem from compiler optimization differences between builds. Smart pointer operations (std::make_shared, std::__shared_ptr::operator=) show throughput regressions of 102-179% but occur only during initialization/cleanup, so their cumulative impact is negligible.

Additional Findings

The VAE circular tiling feature is the primary intentional change, improving visual quality at an acceptable performance cost. The get_thread_range regression in GGML's threading infrastructure is the sole performance-critical concern that warrants investigation, as it affects CPU parallelization across all tensor operations. Container operation improvements in sd-server partially offset the regressions, resulting in a smaller power consumption increase than in sd-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

stduhpf added 4 commits February 27, 2026 19:13

disable vae circular padding when tiling

1cdcdd5

Enable circular tiling instead of circular conv/padding for tiled vae

e1e7334

Fix circular tensor merge

248ac9b

Use circular VAE if tiling isn't needed

a09ae3f

loci-dev deployed to stable-diffusion-cpp-prod February 28, 2026 04:05 — with GitHub Actions Active

loci-dev wants to merge 4 commits into main from loci/pr-1299-circular-tiling
