Implement dual-rate billing for Gemini image-output models#90

Open
adambalogh wants to merge 2 commits into main from claude/practical-mccarthy-Bs3vD

Conversation

@adambalogh
Contributor

Summary

Gemini image-output models (gemini-2.5-flash-image, gemini-3.1-flash-image) bill output tokens at two different rates: image-modality tokens at a premium rate ($30/MTok for gemini-2.5-flash-image, $60/MTok for gemini-3.1-flash-image) and text/thinking tokens at the standard rate ($1.50/MTok and $3/MTok, respectively). This PR implements split-rate billing by:

  1. Adding image_output_price_usd field to ModelConfig for dual-rate models
  2. Extracting and surfacing reasoning_tokens (thinking) through the usage pipeline
  3. Splitting output token billing in compute_session_cost() based on reasoning count
  4. Updating Gemini image model pricing to reflect Google's actual dual-rate structure
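
As a rough illustration of items 1 and 4, the sketch below shows how an optional second rate might sit on the model config. The reduced `ModelConfig` shape, the per-MTok unit, the `is_dual_rate` helper, and the single-rate `EXAMPLE_TEXT_MODEL` are assumptions made for illustration; only the field name `image_output_price_usd` and the $3 / $60 figures come from this PR.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass(frozen=True)
class ModelConfig:
    # Hypothetical reduced shape; the real class in model_registry.py
    # is not shown in this PR page and likely has more fields.
    name: str
    output_price_usd: Decimal  # text/thinking rate, USD per MTok (assumed unit)
    image_output_price_usd: Optional[Decimal] = None  # set only for dual-rate models

    @property
    def is_dual_rate(self) -> bool:
        # Illustrative helper, not part of the PR.
        return self.image_output_price_usd is not None


# Rates quoted in the commit message for gemini-3.1-flash-image.
GEMINI_3_1_FLASH_IMAGE = ModelConfig(
    name="gemini-3.1-flash-image",
    output_price_usd=Decimal("3"),
    image_output_price_usd=Decimal("60"),
)

# Hypothetical single-rate model: the optional field stays None.
EXAMPLE_TEXT_MODEL = ModelConfig(
    name="example-text-model",
    output_price_usd=Decimal("3"),
)
```

Leaving the field as `None` for single-rate models lets the cost calculator branch on its presence without a separate flag.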

Key Changes

  • model_registry.py:

    • Added image_output_price_usd optional field to ModelConfig for image-modality token pricing
    • Updated GEMINI_2_5_FLASH_IMAGE and GEMINI_3_1_FLASH_IMAGE to use correct dual-rate pricing (text/thinking at output_price_usd, images at image_output_price_usd)
  • llm_backend.py:

    • Modified extract_usage() to extract and return reasoning_tokens from output_token_details nested in usage metadata
  • pricing.py:

    • Implemented dual-rate billing logic in compute_session_cost(): when image_output_price_usd is set, reasoning tokens are billed at output_price_usd and remaining output tokens (images + captions) at image_output_price_usd
    • Conservative approach: never undercharges image tokens and stays well below the previous behavior of billing all output at the image rate
  • chat_controller.py:

    • Modified _create_non_streaming_response() to surface only the standard OpenAI usage triple (prompt/completion/total tokens) while passing reasoning tokens separately to the cost calculator
    • Updated streaming response handling in generate() to extract and accumulate reasoning tokens from output_token_details in both non-streaming and streaming paths
    • Pass reasoning_tokens to compute_session_cost() via a separate cost_usage dict to avoid polluting the OpenAI-compatible response
  • test_image_billing.py:

    • Updated test documentation to reflect dual-rate billing model
    • Modified _usage_dict() to use extract_usage() which now carries reasoning tokens
    • Renamed test_generated_image_is_charged_as_output_tokens() to test_generated_image_is_charged_at_image_rate() with updated assertions
    • Added test_thinking_tokens_billed_at_text_rate() to verify thinking tokens use the cheaper text rate
    • Added test_thinking_is_cheaper_than_billing_all_at_image_rate() regression test ensuring the fix doesn't revert to the old buggy behavior
  • test_price_feed.py:

    • Updated mock model config to include image_output=False and image_output_price_usd=None for single-rate test models
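
The chat_controller.py change above keeps two views of the same usage data. A minimal sketch of that separation follows; the helper name `build_usage_views` is hypothetical, and the real controller code may structure this differently.

```python
from typing import Dict, Tuple

OPENAI_USAGE_KEYS = ("prompt_tokens", "completion_tokens", "total_tokens")


def build_usage_views(usage: Dict[str, int]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Return (openai_usage, cost_usage) from one extracted usage dict.

    Hypothetical helper: openai_usage is what the API response would
    expose, cost_usage is what would be handed to the cost calculator.
    """
    # Only the standard triple goes on the OpenAI-compatible response.
    openai_usage = {key: usage.get(key, 0) for key in OPENAI_USAGE_KEYS}
    # The cost calculator additionally receives the reasoning count.
    cost_usage = dict(openai_usage)
    cost_usage["reasoning_tokens"] = usage.get("reasoning_tokens", 0)
    return openai_usage, cost_usage


openai_usage, cost_usage = build_usage_views(
    {
        "prompt_tokens": 12,
        "completion_tokens": 2290,
        "total_tokens": 2302,
        "reasoning_tokens": 1000,
    }
)
```

Building `cost_usage` as a copy means later changes to the billing inputs cannot leak into the response payload.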

Implementation Details

  • Reasoning token extraction: LangChain breaks out thinking tokens in output_token_details.reasoning but folds them into the main output_tokens count. The billing split uses this breakdown to charge thinking at the text rate and the remainder (image + any caption) at the image rate.
  • Conservative billing: The split never undercharges image tokens and is far below the previous behavior of billing all output at the image rate, ensuring we don't lose revenue on image generation.
  • OpenAI compatibility: The response surface maintains the standard OpenAI usage triple; reasoning tokens ride along to the cost calculator only, not exposed in the API response.
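
The split described above can be sketched as follows. This is not the code in pricing.py: the function name `split_output_cost`, the per-MTok unit, and the clamping of the reasoning count to the output total are assumptions made for illustration.

```python
from decimal import Decimal
from typing import Optional

MTOK = Decimal(1_000_000)


def split_output_cost(
    completion_tokens: int,
    reasoning_tokens: int,
    output_price_per_mtok: Decimal,
    image_output_price_per_mtok: Optional[Decimal],
) -> Decimal:
    """Cost in USD of the output side of one response (illustrative)."""
    if image_output_price_per_mtok is None:
        # Single-rate model: every output token uses the standard rate.
        return Decimal(completion_tokens) * output_price_per_mtok / MTOK

    # Reasoning tokens are already included in completion_tokens, so
    # subtract them rather than add. Clamping is an assumption here,
    # guarding against a reported breakdown larger than the total.
    reasoning = max(0, min(reasoning_tokens, completion_tokens))
    image_side = completion_tokens - reasoning  # image plus any caption

    return (
        Decimal(reasoning) * output_price_per_mtok
        + Decimal(image_side) * image_output_price_per_mtok
    ) / MTOK


# gemini-2.5-flash-image rates from this PR; token counts are made up.
cost = split_output_cost(2290, 1000, Decimal("1.50"), Decimal("30"))
```

With these illustrative counts, 1,000 reasoning tokens are charged at $1.50/MTok and the remaining 1,290 at $30/MTok, for $0.0402 in total.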

https://claude.ai/code/session_01GDGKRki93xtXCFUNDkcEyq

claude added 2 commits June 3, 2026 19:20
Google bills nano banana / nano banana 2 output at two rates: image-modality
tokens at a high rate ($30/MTok for 2.5-flash-image, $60/MTok for 3.1-flash-image)
and text + thinking tokens at a much lower rate ($1.50 and $3/MTok). The gateway
previously billed the entire output_tokens count (image + text + thinking) at the
single image rate, overcharging thinking tokens up to 20x and inflating a typical
generation with reasoning by ~50-70%.

Add image_output_price_usd to ModelConfig and split billing: reasoning tokens
(broken out by langchain via output_token_details) are charged at output_price_usd
(text/thinking rate), the remainder at image_output_price_usd. langchain folds
image+text+thinking into one output_tokens count and does not expose the per-
modality breakdown, so the small text caption rides the image rate — conservative
(never undercharges) and strictly cheaper than the previous behavior.

Plumb reasoning_tokens through extract_usage and both streaming paths; keep the
OpenAI usage triple on responses clean.
The output_price_usd is now the text/thinking rate ($3/MTok); the image
rate ($60/MTok) moved to image_output_price_usd.
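
The overcharge figures quoted in the first commit message can be checked with a short calculation. The token counts below (1,290 image-side tokens, 700 to 1,000 thinking tokens) are illustrative assumptions, not measurements from the gateway; the rates are the ones quoted above.

```python
from decimal import Decimal


def inflation(
    image_tokens: int,
    reasoning_tokens: int,
    text_rate: Decimal,
    image_rate: Decimal,
) -> Decimal:
    """Fractional overcharge of single-rate billing versus the split."""
    # Old behavior: every output token billed at the image rate.
    old = Decimal(image_tokens + reasoning_tokens) * image_rate
    # New behavior: thinking at the text rate, the rest at the image rate.
    new = Decimal(image_tokens) * image_rate + Decimal(reasoning_tokens) * text_rate
    return old / new - 1


# Per-token overcharge on thinking tokens is the ratio of the two rates.
ratio_2_5 = Decimal("30") / Decimal("1.50")
ratio_3_1 = Decimal("60") / Decimal("3")

low = inflation(1290, 700, Decimal("1.50"), Decimal("30"))
high = inflation(1290, 1000, Decimal("1.50"), Decimal("30"))
```

Under these assumed counts the whole-response inflation lands at roughly 50% and 71%, and both models show a 20x ratio on the thinking tokens themselves.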
@adambalogh adambalogh marked this pull request as ready for review June 5, 2026 12:34
@dixitaniket dixitaniket requested a review from Copilot June 5, 2026 12:46
Copilot AI left a comment

Pull request overview

Implements dual-rate billing for Gemini image-output models by splitting output tokens into (1) thinking/text tokens billed at the standard output rate and (2) remaining output tokens billed at the premium image-modality rate. This aligns cost settlement with Google’s actual pricing for gemini-2.5-flash-image and gemini-3.1-flash-image while keeping OpenAI-compatible usage fields on chat responses.

Changes:

  • Add image_output_price_usd to ModelConfig and update Gemini image model pricing to dual-rate values.
  • Extract and propagate “reasoning/thinking” token counts through usage so billing can split output tokens.
  • Update compute_session_cost() to apply dual-rate billing when image_output_price_usd is set, and expand regression tests accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| tests/test_pricing.py | Updates pricing assertions for Gemini 3.1 flash image dual-rate output. |
| tee_gateway/test/test_price_feed.py | Extends mocked model config to include image-output fields for single-rate models. |
| tee_gateway/test/test_image_billing.py | Updates/expands tests to validate dual-rate billing and reasoning token handling. |
| tee_gateway/pricing.py | Splits output billing into reasoning vs non-reasoning tokens for image-output models. |
| tee_gateway/model_registry.py | Adds image_output_price_usd and updates Gemini image-output model prices. |
| tee_gateway/llm_backend.py | Extends extract_usage() to surface reasoning tokens from output_token_details. |
| tee_gateway/controllers/chat_controller.py | Keeps OpenAI usage triple in responses while still passing reasoning tokens into cost calculation (streaming + non-streaming). |

Comment on lines +424 to 433
    # Thinking tokens, when present, are folded into output_tokens but also
    # broken out here. Image-output models bill them at the cheaper
    # text/thinking rate (see compute_session_cost), so surface them.
    details = meta.get("output_token_details") or {}
    return {
        "prompt_tokens": meta.get("input_tokens", 0),
        "completion_tokens": meta.get("output_tokens", 0),
        "total_tokens": meta.get("total_tokens", 0),
        "reasoning_tokens": details.get("reasoning", 0),
    }
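
To see what the excerpt above yields on a concrete payload, here is a self-contained sketch. It assumes `extract_usage` receives the usage-metadata dict directly, which may differ from the real signature in llm_backend.py, and the sample numbers are invented.

```python
def extract_usage(meta: dict) -> dict:
    """Stand-in for the excerpt above; the real signature is assumed."""
    # output_token_details may be missing or None, hence the `or {}`.
    details = meta.get("output_token_details") or {}
    return {
        "prompt_tokens": meta.get("input_tokens", 0),
        "completion_tokens": meta.get("output_tokens", 0),
        "total_tokens": meta.get("total_tokens", 0),
        "reasoning_tokens": details.get("reasoning", 0),
    }


# Invented usage metadata: reasoning is part of output_tokens.
sample_meta = {
    "input_tokens": 12,
    "output_tokens": 2290,
    "total_tokens": 2302,
    "output_token_details": {"reasoning": 1000},
}

usage = extract_usage(sample_meta)
# What remains after removing thinking tokens is image plus caption.
image_side_tokens = usage["completion_tokens"] - usage["reasoning_tokens"]

# Models that report no breakdown fall back to zero reasoning tokens.
no_details = extract_usage({"input_tokens": 5, "output_tokens": 7, "total_tokens": 12})
```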
Comment thread tee_gateway/llm_backend.py
Comment on lines +634 to +636
# Thinking tokens are billed at the cheaper text rate; they
# live in the nested output_token_details dict (skipped by
# the int/float loop above), so pull them out explicitly.
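
The comment above refers to a loop that sums only numeric usage fields, which is why the nested dict needs separate handling. A minimal sketch of that pattern, assuming each streamed chunk carries additive usage deltas (the actual semantics in generate() are not shown here) and using a hypothetical helper name `accumulate_usage`:

```python
def accumulate_usage(totals: dict, chunk_usage: dict) -> dict:
    """Fold one chunk's usage metadata into running totals (illustrative)."""
    for key, value in chunk_usage.items():
        # Nested dicts such as output_token_details are skipped here.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            totals[key] = totals.get(key, 0) + value

    # Pull the thinking count out of the nested dict explicitly.
    details = chunk_usage.get("output_token_details") or {}
    totals["reasoning_tokens"] = totals.get("reasoning_tokens", 0) + details.get(
        "reasoning", 0
    )
    return totals


# Invented chunks: the first carries thinking, the second the image.
chunks = [
    {
        "input_tokens": 12,
        "output_tokens": 1000,
        "total_tokens": 1012,
        "output_token_details": {"reasoning": 1000},
    },
    {
        "input_tokens": 0,
        "output_tokens": 1290,
        "total_tokens": 1290,
        "output_token_details": {},
    },
]

totals: dict = {}
for chunk in chunks:
    accumulate_usage(totals, chunk)
```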
4 participants