
UPSTREAM PR #1261: refactor: move VAE tiling parameters to SDGenerationParams #63

Open
loci-dev wants to merge 1 commit into main from loci/pr-1261-sd_refactor_vae_tiling


@loci-dev

Note

Source pull request: leejet/stable-diffusion.cpp#1261

loci-dev temporarily deployed to stable-diffusion-cpp-prod on February 20, 2026 04:17 with GitHub Actions (Inactive)
@loci-review

loci-review bot commented Feb 20, 2026

Overview

Analysis of stable-diffusion.cpp refactoring commit (2367bc7: "move VAE tiling parameters to SDGenerationParams") across 48,374 functions shows minimal performance impact. Modified: 78 functions; New: 80; Removed: 80; Unchanged: 48,136.

Binaries Analyzed:

  • build.bin.sd-server: +0.614% power consumption (515,491.29 nJ → 518,655.31 nJ)
  • build.bin.sd-cli: +0.706% power consumption (480,109.60 nJ → 483,500.24 nJ)

The refactoring successfully moves VAE tiling parameters from context initialization to per-generation configuration, enabling flexible memory management with acceptable performance trade-offs.

Function Analysis

Configuration Parsing (Initialization Only):

SDContextParams::get_options() improved across both binaries: response time -6.6% (sd-server: 279,572ns → 261,119ns; sd-cli: 280,187ns → 261,795ns), throughput time -7.6% to -9.6% due to removing 4 VAE tiling options. This simplification reduced branching and parsing overhead.

SDGenerationParams::get_options() regressed consistently: response time +5.95% to +5.96% (sd-server: 306,582ns → 324,830ns; sd-cli: 307,317ns → 325,643ns), throughput time +6.11% due to adding the same 4 options with complex parsing logic. The ~200ns self-time increase reflects additional option registration overhead.

SDGenerationParams::to_string() (sd-cli) regressed by +17.4% in throughput time (1,714ns → 2,012ns) from serializing 6 additional vae_tiling_params fields, which is expected for a diagnostic function.

GGML Backend (Model Loading/Inference):

make_block_q4_Kx8 (sd-server) regressed +7.9% (8,126ns → 8,768ns) in both response and throughput time; because the increase is entirely self-time, it points to intrinsic overhead in quantization repacking. This affects model loading, not the inference hot path.

forward_mul_mat for block_iq4_nl (sd-server) shows a +5.38% response time regression (12,916ns → 13,611ns) while throughput time remains stable (2,390ns), indicating a slowdown in child functions rather than a change to its own implementation. This matrix multiplication function is inference-critical, though the stable self-time suggests the impact is indirect.

Standard Library Optimizations:

Multiple functions improved significantly: std::make_move_iterator -58.6% response time (287ns → 119ns), __gnu_cxx::__normal_iterator::operator+ -42.1% (165ns → 95ns), std::swap -11% (112ns → 100ns), std::__unique -5.8% response time. These compiler optimizations partially offset the regressions.

Other analyzed functions (JSON access, regex compilation, vector reallocation) showed minor self-time variations with negligible total execution impact.

Additional Findings

The architectural refactoring achieves its goal of enabling per-generation VAE tiling control at minimal cost. The SDContextParams::get_options() speedup offsets the SDGenerationParams::get_options() slowdown, leaving initialization performance roughly balanced. Most performance changes affect initialization rather than inference hot paths. The forward_mul_mat regression warrants monitoring in production, although its stable self-time suggests the function's own implementation is unchanged and the slowdown lies in its GGML dependencies. Power consumption increases (<1%) are negligible for image generation workloads that take seconds to minutes per image.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.
