Implement dual-rate billing for Gemini image-output models by adambalogh · Pull Request #90 · OpenGradient/tee-gateway

adambalogh · 2026-06-03T19:21:42Z

Summary

Google bills output tokens for the Gemini image-output models (gemini-2.5-flash-image, gemini-3.1-flash-image) at two different rates: image-modality tokens at a premium rate (~$30–$60/MTok) and text/thinking tokens at the standard rate (~$1.50–$3/MTok). This PR implements split-rate billing by:

- Adding an image_output_price_usd field to ModelConfig for dual-rate models
- Extracting and surfacing reasoning_tokens (thinking) through the usage pipeline
- Splitting output token billing in compute_session_cost() based on the reasoning token count
- Updating Gemini image model pricing to reflect Google's actual dual-rate structure
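As a rough illustration of the first and last points, the registry change might look like the sketch below. It is a trimmed, hypothetical stand-in for the real ModelConfig: the unit (USD per million tokens), the `is_dual_rate` helper, and the single-rate example model are assumptions, while the field names and the Gemini rates are the ones quoted in this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    """Trimmed, hypothetical stand-in for the gateway's ModelConfig."""

    name: str
    output_price_usd: float  # text/thinking rate; assumed to be USD per MTok here
    image_output: bool = False
    image_output_price_usd: Optional[float] = None  # None means single-rate

    @property
    def is_dual_rate(self) -> bool:
        # Hypothetical helper (not from the PR): dual-rate billing applies
        # only when an image-modality price is configured.
        return self.image_output_price_usd is not None


# Rates as quoted in this PR's description.
GEMINI_2_5_FLASH_IMAGE = ModelConfig("gemini-2.5-flash-image", 1.50, True, 30.0)
GEMINI_3_1_FLASH_IMAGE = ModelConfig("gemini-3.1-flash-image", 3.0, True, 60.0)

# Hypothetical single-rate model: the image fields keep their defaults.
TEXT_ONLY_MODEL = ModelConfig("example-text-model", 3.0)
```

Keeping the new field optional means existing single-rate entries need no changes beyond the default.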

Key Changes

model_registry.py:
- Added image_output_price_usd optional field to ModelConfig for image-modality token pricing
- Updated GEMINI_2_5_FLASH_IMAGE and GEMINI_3_1_FLASH_IMAGE to use correct dual-rate pricing (text/thinking at output_price_usd, images at image_output_price_usd)
llm_backend.py:
- Modified extract_usage() to extract and return reasoning_tokens from output_token_details nested in usage metadata
pricing.py:
- Implemented dual-rate billing logic in compute_session_cost(): when image_output_price_usd is set, reasoning tokens are billed at output_price_usd and remaining output tokens (images + captions) at image_output_price_usd
- Conservative approach: never undercharges image tokens, and whenever thinking tokens are present it charges less than the previous behavior of billing all output at the image rate
chat_controller.py:
- Modified _create_non_streaming_response() to surface only the standard OpenAI usage triple (prompt/completion/total tokens) while passing reasoning tokens separately to the cost calculator
- Updated streaming response handling in generate() to extract and accumulate reasoning tokens from output_token_details in both non-streaming and streaming paths
- Pass reasoning_tokens to compute_session_cost() via a separate cost_usage dict to avoid polluting the OpenAI-compatible response
test_image_billing.py:
- Updated test documentation to reflect dual-rate billing model
- Modified _usage_dict() to use extract_usage() which now carries reasoning tokens
- Renamed test_generated_image_is_charged_as_output_tokens() → test_generated_image_is_charged_at_image_rate() with updated assertions
- Added test_thinking_tokens_billed_at_text_rate() to verify thinking tokens use the cheaper text rate
- Added test_thinking_is_cheaper_than_billing_all_at_image_rate() regression test ensuring the fix doesn't revert to the old buggy behavior
test_price_feed.py:
- Updated mock model config to include image_output=False and image_output_price_usd=None for single-rate test models
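The chat_controller.py change keeps two views of the same usage data: the OpenAI-style triple returned to the client, and a cost_usage dict that additionally carries reasoning tokens. A minimal sketch of that separation, assuming usage arrives as a flat dict shaped like the output of extract_usage() (the `split_usage` helper name is hypothetical):

```python
from typing import Dict, Tuple

OPENAI_USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def split_usage(usage: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (response_usage, cost_usage) from one flat usage dict.

    Hypothetical helper: the real controller may build these dicts inline.
    """
    # Only the standard triple is exposed in the API response.
    response_usage = {key: usage.get(key, 0) for key in OPENAI_USAGE_KEYS}
    # The cost calculator also receives the reasoning count.
    cost_usage = dict(response_usage)
    cost_usage["reasoning_tokens"] = usage.get("reasoning_tokens", 0)
    return response_usage, cost_usage


response_usage, cost_usage = split_usage(
    {
        "prompt_tokens": 12,
        "completion_tokens": 1500,
        "total_tokens": 1512,
        "reasoning_tokens": 200,
    }
)
```

Here `response_usage` would go into the response body and `cost_usage` into compute_session_cost().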

Implementation Details

- Reasoning token extraction: LangChain breaks out thinking tokens in output_token_details.reasoning but also folds them into the main output_tokens count. The billing split uses this breakdown to charge thinking at the text rate and the remainder (image + any caption) at the image rate.
- Conservative billing: The split never undercharges image tokens and, whenever thinking tokens are present, costs less than the previous behavior of billing all output at the image rate, so we don't lose revenue on image generation.
- OpenAI compatibility: The response keeps the standard OpenAI usage triple; reasoning tokens are passed only to the cost calculator and are not exposed in the API response.
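The split described above can be sketched as a small function. This is not the actual compute_session_cost() code: the function name, the per-MTok units, and the clamping of the reasoning count to the output total are assumptions made for illustration.

```python
from typing import Optional

MTOK = 1_000_000


def output_cost_usd(
    completion_tokens: int,
    reasoning_tokens: int,
    output_price_per_mtok: float,
    image_output_price_per_mtok: Optional[float] = None,
) -> float:
    """Output-side cost in USD under single- or dual-rate billing (sketch)."""
    if image_output_price_per_mtok is None:
        # Single-rate model: every output token is billed at the standard rate.
        return completion_tokens * output_price_per_mtok / MTOK

    # Assumption: clamp so a malformed reasoning count can never exceed the
    # output total or go negative.
    reasoning = min(max(reasoning_tokens, 0), completion_tokens)
    # Everything that is not thinking (image + any caption) rides the image rate.
    remainder = completion_tokens - reasoning
    return (
        reasoning * output_price_per_mtok
        + remainder * image_output_price_per_mtok
    ) / MTOK


# 2,000 output tokens, 1,000 of them thinking, at the gemini-3.1-flash-image rates.
dual = output_cost_usd(2_000, 1_000, 3.0, 60.0)
all_image = output_cost_usd(2_000, 0, 3.0, 60.0)
```

With no reasoning tokens the dual-rate result equals the old all-image-rate charge, which is why the split can only lower the bill, never raise it.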

https://claude.ai/code/session_01GDGKRki93xtXCFUNDkcEyq

Google bills nano banana / nano banana 2 output at two rates: image-modality tokens at a high rate ($30/MTok for 2.5-flash-image, $60/MTok for 3.1-flash-image) and text + thinking tokens at a much lower rate ($1.50 and $3/MTok). The gateway previously billed the entire output_tokens count (image + text + thinking) at the single image rate, overcharging thinking tokens up to 20x and inflating a typical generation with reasoning by ~50-70%.

Add image_output_price_usd to ModelConfig and split billing: reasoning tokens (broken out by langchain via output_token_details) are charged at output_price_usd (text/thinking rate), the remainder at image_output_price_usd. langchain folds image+text+thinking into one output_tokens count and does not expose the per-modality breakdown, so the small text caption rides the image rate, which is conservative (never undercharges) and strictly cheaper than the previous behavior.

Plumb reasoning_tokens through extract_usage and both streaming paths; keep the OpenAI usage triple on responses clean.
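To make the overcharge figures concrete, here is a small worked calculation using the gemini-2.5-flash-image rates quoted above. The token counts are illustrative assumptions, not measurements from the gateway.

```python
MTOK = 1_000_000
IMAGE_RATE = 30.0  # USD per MTok, image-modality rate for gemini-2.5-flash-image
TEXT_RATE = 1.50   # USD per MTok, text/thinking rate for the same model

image_tokens = 1_000    # illustrative: one generated image
thinking_tokens = 600   # illustrative: reasoning done before generating it

# Previous behavior: every output token billed at the image rate.
old_cost = (image_tokens + thinking_tokens) * IMAGE_RATE / MTOK
# Dual-rate behavior: thinking at the text rate, the rest at the image rate.
new_cost = (image_tokens * IMAGE_RATE + thinking_tokens * TEXT_RATE) / MTOK

overcharge_factor = IMAGE_RATE / TEXT_RATE  # per thinking token
inflation = old_cost / new_cost - 1         # relative overcharge on the whole request

print(f"old=${old_cost:.4f} new=${new_cost:.4f} "
      f"factor={overcharge_factor:.0f}x inflation={inflation:.0%}")
```

The per-token factor is the same for gemini-3.1-flash-image ($60 vs $3), while the whole-request inflation depends on how much thinking accompanies each image.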

The output_price_usd is now the text/thinking rate ($3/MTok); the image rate ($60/MTok) moved to image_output_price_usd.

Copilot

Pull request overview

Implements dual-rate billing for Gemini image-output models by splitting output tokens into (1) thinking/text tokens billed at the standard output rate and (2) remaining output tokens billed at the premium image-modality rate. This aligns cost settlement with Google’s actual pricing for gemini-2.5-flash-image and gemini-3.1-flash-image while keeping OpenAI-compatible usage fields on chat responses.

Changes:

Add image_output_price_usd to ModelConfig and update Gemini image model pricing to dual-rate values.
Extract and propagate “reasoning/thinking” token counts through usage so billing can split output tokens.
Update compute_session_cost() to apply dual-rate billing when image_output_price_usd is set, and expand regression tests accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/test_pricing.py` | Updates pricing assertions for Gemini 3.1 flash image dual-rate output. |
| `tee_gateway/test/test_price_feed.py` | Extends mocked model config to include image-output fields for single-rate models. |
| `tee_gateway/test/test_image_billing.py` | Updates/expands tests to validate dual-rate billing and reasoning token handling. |
| `tee_gateway/pricing.py` | Splits output billing into reasoning vs non-reasoning tokens for image-output models. |
| `tee_gateway/model_registry.py` | Adds `image_output_price_usd` and updates Gemini image-output model prices. |
| `tee_gateway/llm_backend.py` | Extends `extract_usage()` to surface reasoning tokens from `output_token_details`. |
| `tee_gateway/controllers/chat_controller.py` | Keeps OpenAI usage triple in responses while still passing reasoning tokens into cost calculation (streaming + non-streaming). |

+        # Thinking tokens, when present, are folded into output_tokens but also
+        # broken out here. Image-output models bill them at the cheaper
+        # text/thinking rate (see compute_session_cost), so surface them.
+        details = meta.get("output_token_details") or {}
        return {
            "prompt_tokens": meta.get("input_tokens", 0),
            "completion_tokens": meta.get("output_tokens", 0),
            "total_tokens": meta.get("total_tokens", 0),
+            "reasoning_tokens": details.get("reasoning", 0),
        }
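For readers who want to try the hunk above in isolation, a standalone sketch follows. It assumes extract_usage() receives the LangChain usage_metadata dict directly (the real function may take a message object instead), and the sample payload is made up.

```python
from typing import Any, Dict, Optional


def extract_usage(meta: Optional[Dict[str, Any]]) -> Dict[str, int]:
    """Flatten LangChain-style usage metadata into the gateway's usage dict (sketch)."""
    meta = meta or {}
    # output_token_details may be absent or None for models without thinking.
    details = meta.get("output_token_details") or {}
    return {
        "prompt_tokens": meta.get("input_tokens", 0),
        "completion_tokens": meta.get("output_tokens", 0),
        "total_tokens": meta.get("total_tokens", 0),
        # Thinking tokens are already included in output_tokens; this is the
        # breakdown the billing split relies on.
        "reasoning_tokens": details.get("reasoning", 0),
    }


# Made-up payload in the shape of LangChain's usage metadata.
sample_meta = {
    "input_tokens": 12,
    "output_tokens": 1500,
    "total_tokens": 1512,
    "output_token_details": {"reasoning": 200},
}
usage = extract_usage(sample_meta)
no_thinking = extract_usage({"input_tokens": 5, "output_tokens": 7, "total_tokens": 12})
```

The `or {}` guard matters because a missing or null output_token_details would otherwise raise on `.get`.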


+                        # Thinking tokens are billed at the cheaper text rate; they
+                        # live in the nested output_token_details dict (skipped by
+                        # the int/float loop above), so pull them out explicitly.
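The comment above refers to a loop that sums numeric usage fields across streamed chunks and therefore skips the nested dict. A minimal sketch of that pattern, assuming per-chunk usage values are additive and using a hypothetical `accumulate_usage` helper (the real generate() code is not shown in this PR excerpt):

```python
from typing import Any, Dict


def accumulate_usage(totals: Dict[str, float], chunk_meta: Dict[str, Any]) -> Dict[str, float]:
    """Add one chunk's usage metadata into running totals (sketch)."""
    for key, value in chunk_meta.items():
        # Only top-level numbers are summed; nested dicts such as
        # output_token_details are skipped by this check.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            totals[key] = totals.get(key, 0) + value

    # Reasoning tokens live one level down, so pull them out explicitly.
    details = chunk_meta.get("output_token_details") or {}
    totals["reasoning_tokens"] = totals.get("reasoning_tokens", 0) + (
        details.get("reasoning") or 0
    )
    return totals


totals: Dict[str, float] = {}
for chunk in (
    {"input_tokens": 12, "output_tokens": 200, "total_tokens": 212,
     "output_token_details": {"reasoning": 200}},
    {"input_tokens": 0, "output_tokens": 1300, "total_tokens": 1300},
):
    accumulate_usage(totals, chunk)
```

Without the explicit lookup, the reasoning count would silently stay at zero and all streamed output would again be billed at the image rate.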


claude added 2 commits June 3, 2026 19:20

Update gemini-3.1-flash-image pricing assertion for dual-rate split

43c69ce

adambalogh marked this pull request as ready for review June 5, 2026 12:34

dixitaniket requested a review from Copilot June 5, 2026 12:46

Copilot started reviewing on behalf of dixitaniket June 5, 2026 12:46

Copilot AI reviewed Jun 5, 2026

dixitaniket approved these changes Jun 5, 2026

adambalogh wants to merge 2 commits into main from claude/practical-mccarthy-Bs3vD

4 participants