feat(adapters): adapter middleware framework for Model servers by Glorf · Pull Request #1386 · NVIDIA-NeMo/Gym

Glorf · 2026-05-21T20:31:54Z

Summary

Introduces an adapter middleware framework for Model servers: a FastAPI middleware that runs a configurable interceptor chain on every request and response. Lets evaluation runs and training rollouts share a single observability +behavior-shaping layer (token logging, response caching, system-prompt injection, reasoning normalization, turn budgeting, …) without changing the host server.

adapters: list[dict] | None = None lifts onto BaseResponsesAPIModelConfig and is installed by SimpleResponsesAPIModel.setup_webserver, so everyModel server that inherits the base — vllm_model, openai_model, azure_openai_model, genrm_model, local_vllm_model, local_vllm_model_proxy — accepts an adapters block automatically. Omitting it leaves behavior identical to today.

What's in this PR

Framework

nemo_gym/adapters/pipeline.py — async interceptor chain with REQUEST → REQUEST_TO_RESPONSE → RESPONSE stage ordering validated at build time, reverse-order response phase, best_effort exception isolation
nemo_gym/adapters/middleware.py — FastAPI middleware that wraps call_next rather than replacing it; the host server's own routing still performs the upstream call. Handles body replay via Starlette's _body cache + content-length rewrite, list-of-bytes header passthrough (preserves duplicate Set-Cookie), /s/<hex>/... session-id prefix stripping, and GracefulError → 429translation
nemo_gym/adapters/registry.py — short-name → Interceptor class registry with runtime register() for plugins
nemo_gym/adapters/types.py — AdapterRequest, AdapterResponse, three Interceptor ABCs, Stage enum, GracefulError, ContextVar-backed per-request context
nemo_gym/adapters/cache/disk_cache.py — sqlite-backed disk cache keyed by canonicalized request body (+ optional session prefix)

14 built-in interceptors

Name	Stage	Purpose
`logging`	request + response	Log body keys and response status/latency
`drop_params`	request	Remove named params from outbound body
`payload_modifier`	request	Add / remove / rename body fields
`system_message`	request	Inject system message (prepend / append / replace)
`consolidate_system`	request	Merge displaced system messages into one at position 0
`modify_tools`	request	Strip or add properties on `tools[].function.parameters`
`turn_counter`	request	Per-session turn budget; `GracefulError` on exhaustion
`caching`	request → response	Disk-backed cache, session-prefix aware
`endpoint`	request → response	Drive the upstream HTTP call directly (standalone-only)
`raise_client_errors`	response	Non-retriable 4xx → `RuntimeError`
`log_tokens`	response	Log `usage` token counts + latency
`response_stats`	response	Accumulate request count / total tokens / latency
`reasoning`	response	Normalize `<think>...</think>` content or `reasoning` field into `reasoning_content`
`progress_tracking`	response	Optional webhook ping every N responses

Safety

Stage ordering validated at startup (raises on out-of-order)
Unknown interceptor name raises at config-validation time
install_middleware raises ValueError at startup if the chain contains endpoint — it would otherwise double-forward inside a middleware-hosted server (the host already does upstream forwarding via call_next)
All chains where any interceptor sets best_effort = True swallow exceptions and continue; strict interceptors propagate

Tests — 176 unit tests, 99% line coverage

test_adapter_framework.py — registry surface, ABCs
test_adapter_pipeline.py — stage ordering, reverse-order response execution, short-circuit, upstream_call hook
test_adapter_registry.py — resolve, register, import-failure, available
test_adapter_interceptors.py — per-interceptor unit tests (all 14)
test_adapter_interceptors_smoke.py — all 14 instantiate or require_config
test_adapter_consolidate_system.py — displaced system-message merging
test_adapter_cache_keys.py — golden SHA-256 cache keys
test_adapter_disk_cache.py — sqlite round-trip
test_adapter_middleware_behaviors.py — multi-Set-Cookie preservation, hop-by-hop header stripping, body cache replay
test_adapter_middleware_integration.py — end-to-end via Starlette TestClient
test_adapter_parity_replay.py — captured-fixture parity over 12 scenarios
test_adapter_coverage.py — endpoint retry / timeout / auth, turn_counter GC, middleware helper branches, endpoint-in-chain install guard
adapter_fixtures/*.json (12 files) — captured request/response pairs
generate_adapter_fixtures.py — regeneration script

The 5 remaining uncovered lines are hard error paths (sqlite init failure, unknown-stage assertion, bytes-body Response constructor).

Docs

docs/model-server/adapters.md — interceptor catalog, per-interceptor YAML config, custom-interceptor registration, configuration reference, and caveat notes on dual session-id systems / progress_tracking webhook inline-await/ content-encoding stripping / Hydra _inherit_from list semantics
docs/model-server/index.md — adds a Middleware section linking to the new page
docs/index.md — adds the page to the global toctree

Example config

responses_api_models/local_vllm_model_proxy/configs/local_vllm_model_proxy_with_adapters.yaml — demonstrates a logging + log_tokens + reasoning chain.

Test plan

pytest tests/unit_tests/test_adapter_*.py — 176 tests, ~3s
coverage run -m pytest && coverage report — adapter coverage at 99% (overall back over 96% threshold)
ruff check nemo_gym/adapters tests/unit_tests/test_adapter_*.py
ruff format --check nemo_gym/adapters
End-to-end live verification ran during development against the real aws/anthropic/bedrock-claude-sonnet-4-6 endpoint: 43-test battery (per-interceptor, corner cases, load 100-concurrent / 200-sequential, stress 500-burst with fd/mem-leak check, cache stampede, mixed status traffic, slow upstream concurrency) — all green
Backward compatibility: Model server configs that omit adapters behave exactly as before. The architect-flagged guard prevents the one well-knownfootgun (endpoint in a middleware-hosted chain) at startup rather than at request time.

🤖 Generated with Claude Code

copy-pr-bot · 2026-05-21T20:31:58Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Adapter chains attach at four boundaries of the Gym ecosystem and run the same pipeline through two host modes: Boundaries (server-hosted via install_middleware): Model Server /v1/chat/completions, /v1/responses Agent Server /v1/responses, /run, /aggregate_metrics Resources Server /seed_session, /verify, /aggregate_metrics Boundary (standalone host via start_adapter_proxy): External-inference proxy localhost uvicorn in front of an arbitrary upstream (Anthropic, OpenAI, …) The framework nemo_gym/adapters/ pipeline.py async chain, REQUEST → REQUEST_TO_RESPONSE → RESPONSE stage validation at build time, reverse-order response phase, best_effort exception isolation middleware.py FastAPI middleware that wraps `call_next` rather than replacing it; body replay via Starlette `_body` cache + content-length rewrite, multi-Set-Cookie preservation, /s/<hex>/... session-id prefix, GracefulError → 429 proxy.py start_adapter_proxy(upstream_url, adapters) host — localhost uvicorn with explicit adapted-route set; non-adapted paths pass through to upstream for SDK pre-flight (/v1/models, batches, …) registry.py short-name → InterceptorClass with runtime register() for plugins types.py AdapterRequest, AdapterResponse, three Interceptor ABCs, Stage enum, GracefulError, ContextVar ctx, InterceptorSpec + AdapterProxyConfig typed models cache/disk_cache.py sqlite-backed disk cache keyed by canonicalized request body (+ optional session prefix) 14 built-in interceptors: logging, drop_params, payload_modifier, system_message, consolidate_system, modify_tools, turn_counter, caching, endpoint, raise_client_errors, log_tokens, response_stats, reasoning, progress_tracking Server wire-up `adapters: list[dict] | None` lifted onto BaseResponsesAPIModelConfig, BaseResponsesAPIAgentConfig, BaseResourcesServerConfig. `install_middleware(app, self.config.adapters)` called at the tail of each base's `setup_webserver`. All in-tree servers (vllm_model, openai_model, azure_openai_model, genrm_model, local_vllm_model, local_vllm_model_proxy, and every agent/resources server) accept an `adapters` block automatically. External-inference proxy `adapter_proxy: Optional[AdapterProxyConfig]` lifted onto BaseResponsesAPIAgentConfig. When set, SimpleResponsesAPIAgent. setup_webserver starts a localhost uvicorn proxy in a daemon thread, stores the ProxyHandle on self._proxy_handle, registers atexit cleanup. Subclasses (e.g. ClaudeCodeAgent) read self._proxy_handle.url when constructing their SDK client. ClaudeCodeAgent: when in proxy mode, sets ANTHROPIC_BASE_URL to the proxy URL and does NOT set ANTHROPIC_AUTH_TOKEN. The SDK uses ANTHROPIC_API_KEY via x-api-key; the proxy forwards the header verbatim. Setting AUTH_TOKEN would flip the SDK to Bearer auth which api.anthropic.com rejects. Safety - Stage ordering validated at startup - Unknown interceptor name raises at config-validation time - install_middleware rejects `endpoint` in chain (the host already forwards via call_next) - start_adapter_proxy rejects user-supplied `endpoint` (it forwards itself), and rejects host="0.0.0.0" unless unsafe_allow_remote=True (otherwise leaks upstream API key) - best_effort=True interceptors swallow exceptions; strict ones propagate Tests — 187 unit tests, 97% coverage on the new code tests/unit_tests/ test_adapter_framework.py registry surface, ABCs test_adapter_pipeline.py stage ordering, short-circuit test_adapter_registry.py resolve, register, available test_adapter_interceptors.py per-interceptor unit suite test_adapter_interceptors_smoke.py all 14 instantiate / require_config test_adapter_consolidate_system.py displaced-system-msg merging test_adapter_cache_keys.py golden SHA-256 cache keys test_adapter_disk_cache.py sqlite round-trip test_adapter_middleware_behaviors.py multi Set-Cookie, hop-by-hop, body cache replay test_adapter_middleware_integration.py end-to-end via TestClient test_adapter_parity_replay.py 12 captured-fixture scenarios test_adapter_coverage.py endpoint retry/timeout/auth, turn_counter GC, middleware helper branches test_adapter_base_class_wiring.py agent + resources base lift test_adapter_proxy.py proxy modes: adapted routes, passthrough, multi Set-Cookie, host=0.0.0.0 rejection, endpoint-rejection adapter_fixtures/*.json (12 files) captured req/resp pairs generate_adapter_fixtures.py regeneration script Docs docs/model-server/adapters.md interceptor catalog, per-config examples, proxy mode, custom interceptors, caveats docs/model-server/index.md Middleware section + link docs/index.md adds page to global toctree Example config responses_api_models/local_vllm_model_proxy/configs/ local_vllm_model_proxy_with_adapters.yaml Signed-off-by: Michal Bien <mbien@nvidia.com>

Glorf requested a review from bxyu-nvidia May 21, 2026 20:33

Glorf force-pushed the suggest/adapter-middleware branch from e8c77c9 to e6210b1 Compare May 25, 2026 16:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(adapters): adapter middleware framework for Model servers#1386

feat(adapters): adapter middleware framework for Model servers#1386
Glorf wants to merge 1 commit into
NVIDIA-NeMo:mainfrom
Glorf:suggest/adapter-middleware

Glorf commented May 21, 2026

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Glorf commented May 21, 2026

Summary

What's in this PR

Framework

14 built-in interceptors

Safety

Tests — 176 unit tests, 99% line coverage

Docs

Example config

Test plan

Uh oh!

copy-pr-bot Bot commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant