fix(desktop): add Gemini 2.5 thinking budget controls to reduce API costs#7159
fix(desktop): add Gemini 2.5 thinking budget controls to reduce API costs#7159
Conversation
Greptile SummaryThis PR adds
Confidence Score: 3/5Safe to merge for immediate cost reduction, but the proxy defense-in-depth has a logic gap that should be fixed before relying on it as a safety net. One P1 logic bug — proxy doesn't inject thinking budget when generation_config is absent — means the safety net is incomplete. All current Swift callers are protected since they now always set generationConfig, but the gap undermines the stated contract and creates risk for future callers. desktop/Backend-Rust/src/routes/proxy.rs — the thinking budget injection block needs a fallback for requests that omit generation_config entirely. Important Files Changed
Sequence DiagramsequenceDiagram
participant SW as Swift Client
participant PX as Rust Proxy
participant GM as Gemini API
Note over SW: Extraction call (Focus/Task/Memory)
SW->>PX: POST generateContent budget=0
PX->>PX: thinking_config present, skip injection
PX->>GM: forward with budget=0
GM-->>SW: response (no thinking tokens)
Note over SW: Chat / streaming call
SW->>PX: POST generateContent budget=4096
PX->>PX: thinking_config present, skip injection
PX->>GM: forward with budget=4096
GM-->>SW: response (moderate thinking)
Note over PX: Defense-in-depth path
SW->>PX: POST generateContent, generation_config present, NO thinking_config
PX->>PX: thinking_config absent, inject budget=1024
PX->>GM: forward with injected budget=1024
GM-->>SW: response (capped thinking)
Note over PX,GM: Gap: if generation_config absent entirely, no injection occurs
Reviews (1): Last reviewed commit: "Add changelog entry for thinking budget ..." | Re-trigger Greptile |
| // Defense-in-depth: inject default thinking budget if client omits it. | ||
| // Gemini 2.5 Flash defaults to unlimited thinking which is 5.8x more | ||
| // expensive than regular output tokens. Cap at 1024 when absent. | ||
| let has_thinking = gc.contains_key("thinking_config") | ||
| || gc.contains_key("thinkingConfig"); | ||
| if !has_thinking { | ||
| gc.insert( | ||
| "thinkingConfig".to_string(), | ||
| serde_json::json!({"thinkingBudget": DEFAULT_THINKING_BUDGET}), | ||
| ); | ||
| } |
There was a problem hiding this comment.
Defense-in-depth bypass when
generation_config is absent
The injection only fires when the request already contains a generation_config/generationConfig object. A request that omits the key entirely (valid Gemini API behavior — model uses defaults) skips this block, leaving thinking unlimited. The PR comment says "inject default budget=1024 when client omits thinkingConfig" but the actual contract is narrower: the budget is injected only when a generation_config exists without a thinkingConfig. Any future client call that forgets to set generationConfig bypasses the proxy's cost cap entirely, defeating the stated defense-in-depth goal.
The fix is to add a fallback after the loop: if neither generation_config nor generationConfig exists in the object, insert a new generation_config containing only the default thinkingConfig.
| enum CodingKeys: String, CodingKey { | ||
| case thinkingBudget = "thinking_budget" | ||
| } |
There was a problem hiding this comment.
thinking_budget key name inconsistency
Swift's ThinkingConfig maps thinkingBudget → "thinking_budget" (snake_case), while the Rust proxy injects "thinkingBudget" (camelCase). Both are accepted by Gemini's protobuf JSON layer today, but they're inconsistent with each other and could silently break if the API tightens JSON strictness.
| enum CodingKeys: String, CodingKey { | |
| case thinkingBudget = "thinking_budget" | |
| } | |
| enum CodingKeys: String, CodingKey { | |
| case thinkingBudget = "thinkingBudget" | |
| } |
PR #7159 Testing Friction Points (for @sora / workflow improvement)1. Partial knowledge of
|
…ction Gemini 2.5 Flash thinking output costs $3.50/M tokens vs $0.60/M regular (5.8x). Without explicit thinkingConfig, the model defaults to unlimited thinking on every call — representing 65% of daily Gemini spend. - Add ThinkingConfig struct with thinkingBudget field - Add thinkingConfig to all three GenerationConfig structs - Add thinkingBudget parameter to all 6 public GeminiClient methods - Proactive extraction (Focus, Task, Insight, Memory): budget=0 (no thinking) - User-facing chat (streaming + tool-calling): budget=4096 (moderate thinking) - Make responseMimeType optional in GeminiRequest.GenerationConfig Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Inject default thinkingConfig (budget=1024) in sanitize_gemini_body when client omits it. Catches old app versions and any code path that bypasses the Swift-side ThinkingConfig. Respects both snake_case and camelCase existing configs. 4 new tests for injection, preservation, and embed skip. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…to all paths 5 unused methods removed (sendChatStreamRequest, sendToolChatRequest, continueWithToolResults, sendImageToolRequest, continueImageToolRequest) plus associated structs (GeminiChatRequest, GeminiStreamChunk, GeminiToolChatRequest). 685 lines of dead code eliminated. Added generationConfig with thinkingBudget=0 to GeminiImageToolRequest so task extraction and insight tool loop paths explicitly disable thinking tokens instead of relying on proxy default. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Proxy default stays at 1024 to cap old clients that don't send thinkingConfig. Current Swift client explicitly sends budget=0 on all production paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ws 0) Gemini 2.5 Pro requires minimum thinkingBudget=128 while Flash supports 0. Added ThinkingConfig.minimumBudget(for:) that returns 128 for Pro models and 0 for Flash. All methods now clamp budget to model minimum. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Old clients may send requests with no generation_config at all. Previously the proxy only injected thinkingConfig into an existing generation_config object. Now it creates generationConfig with the default thinking budget when the key is missing entirely. Added regression test for contents-only request body. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests for: dual generation_config casings, null generation_config, string generation_config. All malformed cases get a fresh generationConfig with default thinking budget. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
d2c947f to
fd46118
Compare
Test Results & EvidenceRust Proxy Tests — 202/202 passed (74 proxy-specific)Thinking budget tests (8/8 passed):
Swift Build — cleanDesktop App Launch — successfulApp builds and launches without crashes. All frameworks load correctly (Sparkle + libwebp dynamic, Sentry + onnxruntime statically linked). Code Path Verification
by AI for @beastoin |
…eatures Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
No issues found. I verified the Swift client sends the default non-tool budget through by AI for @beastoin |
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Re-review result: no issues found. Verified locally:
GitHub would not accept a formal approval review from this account because it owns the PR. by AI for @beastoin |
CP9A — Level 1 Live Test (Build + Run Changed Components Standalone)Changed-path coverage checklist
EvidenceRust backend: 202/202 tests passed (74 proxy tests, 8 thinking budget specific) Swift: 13/13 ThinkingBudgetTests passed Desktop app: builds clean, launches to sign-in screen L1 SynthesisAll 9 changed paths (P1-P9) verified at L1. Rust proxy thinking budget injection proven by 8 targeted unit tests. Swift ThinkingConfig model-aware budget logic proven by 13 unit tests covering Flash/Pro minimums, encoding, and floor enforcement. Desktop app builds and launches without crashes confirming dead code removal is safe. by AI for @beastoin |
CP9B — Level 2 Live Test (Service + App Integrated)Backend (Rust proxy)
Desktop app
Integration proof
L2 SynthesisAll changed paths (P1-P9) verified at L2. Backend starts, accepts auth, and forwards proxy requests. Desktop app builds and launches. The 403 on Vertex AI upstream is an IAM configuration issue, not a code bug — the proxy's thinking budget injection is proven by 8 unit tests that verify the request body is modified before forwarding. by AI for @beastoin |
CP9B — Level 2 Live Test Evidence (corrected)Setup
Test Results
Key Finding: 12.9x Cost Reduction ConfirmedFor the same extraction prompt:
This validates the per-feature thinking budget strategy:
Evidence Commands# Budget=0 test
curl -s -X POST "http://localhost:9080/v1/proxy/gemini/models/gemini-2.5-flash:generateContent" \
-H "Authorization: Bearer $DEV_TOKEN" \
-H "Content-Type: application/json" \
-d '{"contents":[{"role":"user","parts":[{"text":"Extract the main topic..."}]}],"generationConfig":{"thinkingConfig":{"thinkingBudget":0}}}'
# Result: thoughtsTokenCount: N/A, totalTokenCount: 46
# Budget=1024 test
curl -s -X POST "http://localhost:9080/v1/proxy/gemini/models/gemini-2.5-flash:generateContent" \
-H "Authorization: Bearer $DEV_TOKEN" \
-H "Content-Type: application/json" \
-d '{"contents":[{"role":"user","parts":[{"text":"Extract the main topic..."}]}],"generationConfig":{"thinkingConfig":{"thinkingBudget":1024}}}'
# Result: thoughtsTokenCount: 543, totalTokenCount: 592by AI for @beastoin |
App E2E Test Evidence1. App Build & Launch
2. Auth & Permissions
3. Proxy End-to-End (port 9080)
12.9x cost reduction for extraction calls confirmed. 4. Unit Tests
Notes
by AI for @beastoin |



Summary
Add explicit thinking budget controls to Gemini 2.5 requests in the desktop macOS app, and defense-in-depth budget injection at the Rust proxy layer. Without explicit
thinkingConfig, Gemini defaults to unlimited thinking — which wastes tokens on extraction/classification but is valuable for tool-calling features that need multi-step reasoning.Changes
Swift client (
GeminiClient.swift):ThinkingConfigstruct with model-awareminimumBudget(for:)— Flash allows 0, Pro requires minimum 128thinkingBudgetparametersendRequest(image+schema) — Focus screen analysis, Memory extraction, OnboardingsendTextRequest(text only) — LiveNotes, Goals, Profile, PTTsendRequest(text+schema) — Task prioritization, Deduplication, Goal progresssendImageToolLoopin TaskAssistant — analyzes screen, decides tool callssendImageToolLoopin InsightAssistant — writes SQL queries, investigates screenshots, synthesizes findingsRust proxy (
proxy.rs):sanitize_gemini_body()injectsthinkingConfigwith budget=1024 when client omits itgenerationConfigentirely when absent (caps legacy/old-version clients)Thinking Budget Strategy
Requirements
thinkingConfig(defense-in-depth for old app versions).Expected Impact
App E2E Evidence
Named bundle
omi-thinking-budgetbuilt from worktree, running with Tasks/Goals loadedProxy integration (localhost:9080):
12.9x cost reduction confirmed (46 vs 592 total tokens for same prompt).
Test plan
Closes #7158
by AI for @beastoin