Implement dual-rate billing for Gemini image-output models #90
Open
adambalogh wants to merge 2 commits into
Google bills nano banana / nano banana 2 output at two rates: image-modality tokens at a high rate ($30/MTok for 2.5-flash-image, $60/MTok for 3.1-flash-image) and text + thinking tokens at a much lower rate ($1.50/MTok and $3/MTok, respectively). The gateway previously billed the entire output_tokens count (image + text + thinking) at the single image rate, overcharging thinking tokens up to 20x and inflating the cost of a typical generation with reasoning by ~50-70%.

Add image_output_price_usd to ModelConfig and split billing: reasoning tokens (broken out by langchain via output_token_details) are charged at output_price_usd (the text/thinking rate), the remainder at image_output_price_usd. langchain folds image + text + thinking into one output_tokens count and does not expose the per-modality breakdown, so the small text caption rides the image rate. This is conservative (it never undercharges) and never more expensive than the previous behavior; it is strictly cheaper whenever thinking tokens are present.

Plumb reasoning_tokens through extract_usage and both streaming paths; keep the OpenAI usage triple on responses clean.

For gemini-3.1-flash-image, output_price_usd is now the text/thinking rate ($3/MTok); the image rate ($60/MTok) moved to image_output_price_usd.
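The split described above can be sketched as a small function. This is a minimal illustration, not the gateway's actual compute_session_cost(): the helper name dual_rate_output_cost, the clamping of reasoning tokens, and the token counts in the worked example are all assumptions; only the rates come from the description.

```python
from typing import Optional

MTOK = 1_000_000  # prices are quoted in USD per million tokens


def dual_rate_output_cost(
    output_tokens: int,
    reasoning_tokens: int,
    output_price_usd: float,
    image_output_price_usd: Optional[float] = None,
) -> float:
    """Return the output-side cost in USD (hypothetical helper, not the gateway API)."""
    if image_output_price_usd is None:
        # Single-rate model: every output token is billed at the one output rate.
        return output_tokens * output_price_usd / MTOK
    # Assumption: clamp so a malformed usage report cannot yield a negative remainder.
    reasoning = max(0, min(reasoning_tokens, output_tokens))
    remainder = output_tokens - reasoning  # image tokens plus any text caption
    return (reasoning * output_price_usd + remainder * image_output_price_usd) / MTOK


# Worked example: 3.1-flash-image rates from the description, made-up token counts.
new_cost = dual_rate_output_cost(2_000, 800, 3.0, 60.0)
old_cost = dual_rate_output_cost(2_000, 0, 60.0)  # previous behavior: all at $60/MTok
```

With these illustrative counts, 800 thinking tokens cost $0.0024 and the remaining 1,200 tokens cost $0.072, for $0.0744 in total, against $0.12 when everything is billed at the image rate.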
Pull request overview
Implements dual-rate billing for Gemini image-output models by splitting output tokens into (1) thinking/text tokens billed at the standard output rate and (2) remaining output tokens billed at the premium image-modality rate. This aligns cost settlement with Google’s actual pricing for gemini-2.5-flash-image and gemini-3.1-flash-image while keeping OpenAI-compatible usage fields on chat responses.
Changes:
- Add image_output_price_usd to ModelConfig and update Gemini image model pricing to dual-rate values.
- Extract and propagate “reasoning/thinking” token counts through usage so billing can split output tokens.
- Update compute_session_cost() to apply dual-rate billing when image_output_price_usd is set, and expand regression tests accordingly.
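To make the first change concrete, here is a rough sketch of how an optional image rate might sit next to the existing output rate. ModelConfigSketch, its field set, the is_dual_rate property, and the text-only entry are hypothetical stand-ins; the real ModelConfig likely carries more fields. The prices are the ones quoted in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfigSketch:
    """Illustrative stand-in for ModelConfig; not the real class."""

    name: str
    output_price_usd: float  # USD per million text/thinking output tokens
    image_output_price_usd: Optional[float] = None  # USD per million image tokens

    @property
    def is_dual_rate(self) -> bool:
        # Dual-rate billing applies only when an image rate is configured.
        return self.image_output_price_usd is not None


FLASH_IMAGE_2_5 = ModelConfigSketch("gemini-2.5-flash-image", 1.50, 30.0)
FLASH_IMAGE_3_1 = ModelConfigSketch("gemini-3.1-flash-image", 3.0, 60.0)
TEXT_ONLY = ModelConfigSketch("hypothetical-text-model", 2.0)  # made-up single-rate entry
```

Leaving the image rate as None by default means existing single-rate entries need no change and fall through to the old single-rate path.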
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/test_pricing.py | Updates pricing assertions for Gemini 3.1 flash image dual-rate output. |
| tee_gateway/test/test_price_feed.py | Extends mocked model config to include image-output fields for single-rate models. |
| tee_gateway/test/test_image_billing.py | Updates/expands tests to validate dual-rate billing and reasoning token handling. |
| tee_gateway/pricing.py | Splits output billing into reasoning vs non-reasoning tokens for image-output models. |
| tee_gateway/model_registry.py | Adds image_output_price_usd and updates Gemini image-output model prices. |
| tee_gateway/llm_backend.py | Extends extract_usage() to surface reasoning tokens from output_token_details. |
| tee_gateway/controllers/chat_controller.py | Keeps OpenAI usage triple in responses while still passing reasoning tokens into cost calculation (streaming + non-streaming). |
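The chat_controller.py change described above separates what the client sees from what the cost calculator sees. A minimal sketch of that separation, assuming a hypothetical helper named split_usage (the actual controller code may be structured differently):

```python
from typing import Dict, Tuple

# The three usage fields an OpenAI-compatible response carries.
OPENAI_USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def split_usage(usage: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (response_usage, cost_usage); hypothetical helper for illustration."""
    # Only the standard triple goes back to the client.
    response_usage = {key: usage.get(key, 0) for key in OPENAI_USAGE_KEYS}
    # The cost calculator additionally receives the reasoning-token count.
    cost_usage = dict(response_usage)
    cost_usage["reasoning_tokens"] = usage.get("reasoning_tokens", 0)
    return response_usage, cost_usage


response_usage, cost_usage = split_usage(
    {"prompt_tokens": 10, "completion_tokens": 20, "total_tokens": 30, "reasoning_tokens": 5}
)
```

Building cost_usage as a copy keeps the two dicts independent, so adding billing-only fields later cannot leak into the response payload.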
Comment on lines +424 to 433

    # Thinking tokens, when present, are folded into output_tokens but also
    # broken out here. Image-output models bill them at the cheaper
    # text/thinking rate (see compute_session_cost), so surface them.
    details = meta.get("output_token_details") or {}
    return {
        "prompt_tokens": meta.get("input_tokens", 0),
        "completion_tokens": meta.get("output_tokens", 0),
        "total_tokens": meta.get("total_tokens", 0),
        "reasoning_tokens": details.get("reasoning", 0),
    }
Comment on lines +634 to +636

    # Thinking tokens are billed at the cheaper text rate; they
    # live in the nested output_token_details dict (skipped by
    # the int/float loop above), so pull them out explicitly.
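The streaming comment above implies a loop that sums numeric usage fields and therefore skips the nested dict. A rough sketch of that pattern, assuming each streamed chunk carries a usage-metadata dict whose counts are deltas to be summed; the function name accumulate_stream_usage and the chunk shape are assumptions, not the controller's actual code:

```python
from typing import Any, Dict, Iterable


def accumulate_stream_usage(chunks: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Sum per-chunk usage metadata (hypothetical helper for illustration)."""
    totals: Dict[str, float] = {}
    for meta in chunks:
        for key, value in meta.items():
            # Only flat numeric fields are summed here; nested dicts are skipped.
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[key] = totals.get(key, 0) + value
        # Reasoning tokens live in the nested dict, so read them explicitly.
        details = meta.get("output_token_details") or {}
        totals["reasoning_tokens"] = (
            totals.get("reasoning_tokens", 0) + details.get("reasoning", 0)
        )
    return totals


totals = accumulate_stream_usage(
    [
        {"input_tokens": 10, "output_tokens": 5, "total_tokens": 15,
         "output_token_details": {"reasoning": 3}},
        {"output_tokens": 7, "total_tokens": 7, "output_token_details": None},
    ]
)
```

The `or {}` guard mirrors the extract_usage() snippet in this review, so a missing or None output_token_details contributes zero reasoning tokens.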
dixitaniket approved these changes Jun 5, 2026
Summary
Gemini image-output models (gemini-2.5-flash-image, gemini-3.1-flash-image) bill output tokens at two different rates: image-modality tokens at a premium rate ($30–$60/MTok) and text/thinking tokens at the standard rate ($1.50–$3/MTok). This PR implements split-rate billing by:
- Adding an image_output_price_usd field to ModelConfig for dual-rate models
- Propagating reasoning_tokens (thinking) through the usage pipeline
- Splitting output billing in compute_session_cost() based on the reasoning count

Key Changes
model_registry.py:
- Add an optional image_output_price_usd field to ModelConfig for image-modality token pricing
- Update GEMINI_2_5_FLASH_IMAGE and GEMINI_3_1_FLASH_IMAGE to use correct dual-rate pricing (text/thinking at output_price_usd, images at image_output_price_usd)

llm_backend.py:
- Extend extract_usage() to extract and return reasoning_tokens from output_token_details nested in usage metadata

pricing.py:
- Update compute_session_cost(): when image_output_price_usd is set, reasoning tokens are billed at output_price_usd and remaining output tokens (images + captions) at image_output_price_usd

chat_controller.py:
- Update _create_non_streaming_response() to surface only the standard OpenAI usage triple (prompt/completion/total tokens) while passing reasoning tokens separately to the cost calculator
- Update generate() to extract and accumulate reasoning tokens from output_token_details in both non-streaming and streaming paths
- Pass reasoning_tokens to compute_session_cost() via a separate cost_usage dict to avoid polluting the OpenAI-compatible response

test_image_billing.py:
- Update _usage_dict() to use extract_usage(), which now carries reasoning tokens
- Rename test_generated_image_is_charged_as_output_tokens() → test_generated_image_is_charged_at_image_rate() with updated assertions
- Add test_thinking_tokens_billed_at_text_rate() to verify thinking tokens use the cheaper text rate
- Add test_thinking_is_cheaper_than_billing_all_at_image_rate() regression test ensuring the fix doesn't revert to the old buggy behavior

test_price_feed.py:
- Set image_output=False and image_output_price_usd=None for single-rate test models

Implementation Details
langchain reports thinking tokens in output_token_details.reasoning but folds them into the main output_tokens count. The billing split uses this breakdown to charge thinking at the text rate and the remainder (image + any caption) at the image rate.

https://claude.ai/code/session_01GDGKRki93xtXCFUNDkcEyq