Skip to content

feat(adapters): adapter middleware framework for Model servers#1386

Open
Glorf wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
Glorf:suggest/adapter-middleware
Open

feat(adapters): adapter middleware framework for Model servers#1386
Glorf wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
Glorf:suggest/adapter-middleware

Conversation

@Glorf
Copy link
Copy Markdown

@Glorf Glorf commented May 21, 2026

Summary

Introduces an adapter middleware framework for Model servers: a FastAPI middleware that runs a configurable interceptor chain on every request and response. Lets evaluation runs and training rollouts share a single observability +behavior-shaping layer (token logging, response caching, system-prompt injection, reasoning normalization, turn budgeting, …) without changing the host server.

adapters: list[dict] | None = None lifts onto BaseResponsesAPIModelConfig and is installed by SimpleResponsesAPIModel.setup_webserver, so everyModel server that inherits the base — vllm_model, openai_model, azure_openai_model, genrm_model, local_vllm_model, local_vllm_model_proxy — accepts an adapters block automatically. Omitting it leaves behavior identical to today.

What's in this PR

Framework

  • nemo_gym/adapters/pipeline.py — async interceptor chain with REQUEST → REQUEST_TO_RESPONSE → RESPONSE stage ordering validated at build time, reverse-order response phase, best_effort exception isolation
  • nemo_gym/adapters/middleware.py — FastAPI middleware that wraps call_next rather than replacing it; the host server's own routing still performs the upstream call. Handles body replay via Starlette's _body cache + content-length rewrite, list-of-bytes header passthrough (preserves duplicate Set-Cookie), /s/<hex>/... session-id prefix stripping, and GracefulError → 429translation
  • nemo_gym/adapters/registry.py — short-name → Interceptor class registry with runtime register() for plugins
  • nemo_gym/adapters/types.pyAdapterRequest, AdapterResponse, three Interceptor ABCs, Stage enum, GracefulError, ContextVar-backed per-request context
  • nemo_gym/adapters/cache/disk_cache.py — sqlite-backed disk cache keyed by canonicalized request body (+ optional session prefix)

14 built-in interceptors

Name Stage Purpose
logging request + response Log body keys and response status/latency
drop_params request Remove named params from outbound body
payload_modifier request Add / remove / rename body fields
system_message request Inject system message (prepend / append / replace)
consolidate_system request Merge displaced system messages into one at position 0
modify_tools request Strip or add properties on tools[].function.parameters
turn_counter request Per-session turn budget; GracefulError on exhaustion
caching request → response Disk-backed cache, session-prefix aware
endpoint request → response Drive the upstream HTTP call directly (standalone-only)
raise_client_errors response Non-retriable 4xx → RuntimeError
log_tokens response Log usage token counts + latency
response_stats response Accumulate request count / total tokens / latency
reasoning response Normalize <think>...</think> content or reasoning field into reasoning_content
progress_tracking response Optional webhook ping every N responses

Safety

  • Stage ordering validated at startup (raises on out-of-order)
  • Unknown interceptor name raises at config-validation time
  • install_middleware raises ValueError at startup if the chain contains endpoint — it would otherwise double-forward inside a middleware-hosted server (the host already does upstream forwarding via call_next)
  • All chains where any interceptor sets best_effort = True swallow exceptions and continue; strict interceptors propagate

Tests — 176 unit tests, 99% line coverage

  • test_adapter_framework.py — registry surface, ABCs
  • test_adapter_pipeline.py — stage ordering, reverse-order response execution, short-circuit, upstream_call hook
  • test_adapter_registry.py — resolve, register, import-failure, available
  • test_adapter_interceptors.py — per-interceptor unit tests (all 14)
  • test_adapter_interceptors_smoke.py — all 14 instantiate or require_config
  • test_adapter_consolidate_system.py — displaced system-message merging
  • test_adapter_cache_keys.py — golden SHA-256 cache keys
  • test_adapter_disk_cache.py — sqlite round-trip
  • test_adapter_middleware_behaviors.py — multi-Set-Cookie preservation, hop-by-hop header stripping, body cache replay
  • test_adapter_middleware_integration.py — end-to-end via Starlette TestClient
  • test_adapter_parity_replay.py — captured-fixture parity over 12 scenarios
  • test_adapter_coverage.py — endpoint retry / timeout / auth, turn_counter GC, middleware helper branches, endpoint-in-chain install guard
  • adapter_fixtures/*.json (12 files) — captured request/response pairs
  • generate_adapter_fixtures.py — regeneration script

The 5 remaining uncovered lines are hard error paths (sqlite init failure, unknown-stage assertion, bytes-body Response constructor).

Docs

  • docs/model-server/adapters.md — interceptor catalog, per-interceptor YAML config, custom-interceptor registration, configuration reference, and caveat notes on dual session-id systems / progress_tracking webhook inline-await/ content-encoding stripping / Hydra _inherit_from list semantics
  • docs/model-server/index.md — adds a Middleware section linking to the new page
  • docs/index.md — adds the page to the global toctree

Example config

responses_api_models/local_vllm_model_proxy/configs/local_vllm_model_proxy_with_adapters.yaml — demonstrates a logging + log_tokens + reasoning chain.

Test plan

  • pytest tests/unit_tests/test_adapter_*.py — 176 tests, ~3s
  • coverage run -m pytest && coverage report — adapter coverage at 99% (overall back over 96% threshold)
  • ruff check nemo_gym/adapters tests/unit_tests/test_adapter_*.py
  • ruff format --check nemo_gym/adapters
  • End-to-end live verification ran during development against the real aws/anthropic/bedrock-claude-sonnet-4-6 endpoint: 43-test battery (per-interceptor, corner cases, load 100-concurrent / 200-sequential, stress 500-burst with fd/mem-leak check, cache stampede, mixed status traffic, slow upstream concurrency) — all green
  • Backward compatibility: Model server configs that omit adapters behave exactly as before. The architect-flagged guard prevents the one well-knownfootgun (endpoint in a middleware-hosted chain) at startup rather than at request time.

🤖 Generated with Claude Code

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 21, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@Glorf Glorf requested a review from bxyu-nvidia May 21, 2026 20:33
Adapter chains attach at four boundaries of the Gym ecosystem and run
the same pipeline through two host modes:

  Boundaries (server-hosted via install_middleware):
    Model Server     /v1/chat/completions, /v1/responses
    Agent Server     /v1/responses, /run, /aggregate_metrics
    Resources Server /seed_session, /verify, /aggregate_metrics

  Boundary (standalone host via start_adapter_proxy):
    External-inference proxy   localhost uvicorn in front of an
                               arbitrary upstream (Anthropic, OpenAI, …)

The framework

    nemo_gym/adapters/
      pipeline.py       async chain, REQUEST → REQUEST_TO_RESPONSE →
                        RESPONSE stage validation at build time,
                        reverse-order response phase, best_effort
                        exception isolation
      middleware.py     FastAPI middleware that wraps `call_next` rather
                        than replacing it; body replay via Starlette
                        `_body` cache + content-length rewrite,
                        multi-Set-Cookie preservation, /s/<hex>/...
                        session-id prefix, GracefulError → 429
      proxy.py          start_adapter_proxy(upstream_url, adapters) host
                        — localhost uvicorn with explicit adapted-route
                        set; non-adapted paths pass through to upstream
                        for SDK pre-flight (/v1/models, batches, …)
      registry.py       short-name → InterceptorClass with runtime
                        register() for plugins
      types.py          AdapterRequest, AdapterResponse, three Interceptor
                        ABCs, Stage enum, GracefulError, ContextVar ctx,
                        InterceptorSpec + AdapterProxyConfig typed models
      cache/disk_cache.py
                        sqlite-backed disk cache keyed by canonicalized
                        request body (+ optional session prefix)

  14 built-in interceptors:
    logging, drop_params, payload_modifier, system_message,
    consolidate_system, modify_tools, turn_counter, caching, endpoint,
    raise_client_errors, log_tokens, response_stats, reasoning,
    progress_tracking

Server wire-up

`adapters: list[dict] | None` lifted onto BaseResponsesAPIModelConfig,
BaseResponsesAPIAgentConfig, BaseResourcesServerConfig.
`install_middleware(app, self.config.adapters)` called at the tail of
each base's `setup_webserver`. All in-tree servers (vllm_model,
openai_model, azure_openai_model, genrm_model, local_vllm_model,
local_vllm_model_proxy, and every agent/resources server) accept an
`adapters` block automatically.

External-inference proxy

`adapter_proxy: Optional[AdapterProxyConfig]` lifted onto
BaseResponsesAPIAgentConfig. When set, SimpleResponsesAPIAgent.
setup_webserver starts a localhost uvicorn proxy in a daemon thread,
stores the ProxyHandle on self._proxy_handle, registers atexit
cleanup. Subclasses (e.g. ClaudeCodeAgent) read self._proxy_handle.url
when constructing their SDK client.

ClaudeCodeAgent: when in proxy mode, sets ANTHROPIC_BASE_URL to the
proxy URL and does NOT set ANTHROPIC_AUTH_TOKEN. The SDK uses
ANTHROPIC_API_KEY via x-api-key; the proxy forwards the header
verbatim. Setting AUTH_TOKEN would flip the SDK to Bearer auth which
api.anthropic.com rejects.

Safety

  - Stage ordering validated at startup
  - Unknown interceptor name raises at config-validation time
  - install_middleware rejects `endpoint` in chain (the host already
    forwards via call_next)
  - start_adapter_proxy rejects user-supplied `endpoint` (it forwards
    itself), and rejects host="0.0.0.0" unless unsafe_allow_remote=True
    (otherwise leaks upstream API key)
  - best_effort=True interceptors swallow exceptions; strict ones
    propagate

Tests — 187 unit tests, 97% coverage on the new code

    tests/unit_tests/
      test_adapter_framework.py              registry surface, ABCs
      test_adapter_pipeline.py               stage ordering, short-circuit
      test_adapter_registry.py               resolve, register, available
      test_adapter_interceptors.py           per-interceptor unit suite
      test_adapter_interceptors_smoke.py     all 14 instantiate / require_config
      test_adapter_consolidate_system.py     displaced-system-msg merging
      test_adapter_cache_keys.py             golden SHA-256 cache keys
      test_adapter_disk_cache.py             sqlite round-trip
      test_adapter_middleware_behaviors.py   multi Set-Cookie, hop-by-hop,
                                             body cache replay
      test_adapter_middleware_integration.py end-to-end via TestClient
      test_adapter_parity_replay.py          12 captured-fixture scenarios
      test_adapter_coverage.py               endpoint retry/timeout/auth,
                                             turn_counter GC, middleware
                                             helper branches
      test_adapter_base_class_wiring.py      agent + resources base lift
      test_adapter_proxy.py                  proxy modes: adapted routes,
                                             passthrough, multi Set-Cookie,
                                             host=0.0.0.0 rejection,
                                             endpoint-rejection
      adapter_fixtures/*.json (12 files)     captured req/resp pairs
      generate_adapter_fixtures.py           regeneration script

Docs

    docs/model-server/adapters.md          interceptor catalog, per-config
                                           examples, proxy mode, custom
                                           interceptors, caveats
    docs/model-server/index.md             Middleware section + link
    docs/index.md                          adds page to global toctree

Example config

    responses_api_models/local_vllm_model_proxy/configs/
      local_vllm_model_proxy_with_adapters.yaml

Signed-off-by: Michal Bien <mbien@nvidia.com>
@Glorf Glorf force-pushed the suggest/adapter-middleware branch from e8c77c9 to e6210b1 Compare May 25, 2026 16:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant