feat(adapters): adapter middleware framework for Model servers #1386
Open
Glorf wants to merge 1 commit into
Adapter chains attach at four boundaries of the Gym ecosystem and run
the same pipeline through two host modes:
Boundaries (server-hosted via install_middleware):
  Model Server              /v1/chat/completions, /v1/responses
  Agent Server              /v1/responses, /run, /aggregate_metrics
  Resources Server          /seed_session, /verify, /aggregate_metrics
Boundary (standalone host via start_adapter_proxy):
  External-inference proxy  localhost uvicorn in front of an
                            arbitrary upstream (Anthropic, OpenAI, …)
The framework
  nemo_gym/adapters/
    pipeline.py     async chain, REQUEST → REQUEST_TO_RESPONSE →
                    RESPONSE stage validation at build time,
                    reverse-order response phase, best_effort
                    exception isolation
    middleware.py   FastAPI middleware that wraps `call_next` rather
                    than replacing it; body replay via Starlette
                    `_body` cache + content-length rewrite,
                    multi-Set-Cookie preservation, /s/<hex>/...
                    session-id prefix, GracefulError → 429
    proxy.py        start_adapter_proxy(upstream_url, adapters) host:
                    localhost uvicorn with explicit adapted-route
                    set; non-adapted paths pass through to upstream
                    for SDK pre-flight (/v1/models, batches, …)
    registry.py     short-name → Interceptor class with runtime
                    register() for plugins
    types.py        AdapterRequest, AdapterResponse, three Interceptor
                    ABCs, Stage enum, GracefulError, ContextVar ctx,
                    InterceptorSpec + AdapterProxyConfig typed models
    cache/disk_cache.py
                    sqlite-backed disk cache keyed by canonicalized
                    request body (+ optional session prefix)
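
The pipeline behaviour listed for pipeline.py (stage ordering checked at build time, response interceptors run in reverse order, best_effort failures isolated) can be sketched roughly as follows. This is an illustration only, not the code in this PR: `Interceptor`, `Tag`, `validate_order`, and `run_chain` are hypothetical names, and a single dict stands in for the real AdapterRequest / AdapterResponse types.

```python
import asyncio
from enum import IntEnum


class Stage(IntEnum):
    REQUEST = 0
    REQUEST_TO_RESPONSE = 1
    RESPONSE = 2


class Interceptor:
    """Hypothetical base class; the PR's real ABCs may look different."""

    stage = Stage.REQUEST
    best_effort = False

    async def handle(self, payload: dict) -> dict:
        return payload


class Tag(Interceptor):
    """Toy interceptor that appends its name to a trace list."""

    def __init__(self, name, stage, fail=False, best_effort=False):
        self.name, self.stage = name, stage
        self.fail, self.best_effort = fail, best_effort

    async def handle(self, payload: dict) -> dict:
        if self.fail:
            raise RuntimeError(self.name)
        return {**payload, "trace": payload.get("trace", []) + [self.name]}


def validate_order(chain) -> None:
    """Build-time check: stages must never go backwards in the chain."""
    stages = [i.stage for i in chain]
    if stages != sorted(stages):
        raise ValueError(f"stages out of order: {[s.name for s in stages]}")


async def _call(interceptor, payload: dict) -> dict:
    try:
        return await interceptor.handle(payload)
    except Exception:
        if interceptor.best_effort:
            return payload  # isolate the failure and keep going
        raise  # strict interceptors propagate


async def run_chain(chain, request: dict) -> dict:
    validate_order(chain)
    payload = request
    for i in chain:  # request phase, in configured order
        if i.stage is Stage.REQUEST:
            payload = await _call(i, payload)
    for i in chain:  # the step that turns a request into a response
        if i.stage is Stage.REQUEST_TO_RESPONSE:
            payload = await _call(i, payload)
    for i in reversed(chain):  # response phase, in reverse order
        if i.stage is Stage.RESPONSE:
            payload = await _call(i, payload)
    return payload
```

Running the response phase in reverse gives the usual onion shape: the interceptor configured last is the first to see the response.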
14 built-in interceptors:
logging, drop_params, payload_modifier, system_message,
consolidate_system, modify_tools, turn_counter, caching, endpoint,
raise_client_errors, log_tokens, response_stats, reasoning,
progress_tracking
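
As one illustration of what a built-in interceptor does, the PR summary describes `reasoning` as moving `<think>...</think>` content or a `reasoning` field into `reasoning_content`. A minimal sketch of that normalization on a single message dict might look like this; `normalize_reasoning` is a hypothetical helper, and the real interceptor's edge-case handling is not shown in this PR text.

```python
import re

# Matches every <think>...</think> span, including multi-line ones.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def normalize_reasoning(message: dict) -> dict:
    """Return a copy of `message` with reasoning moved to reasoning_content."""
    out = dict(message)
    # Case 1: the upstream already exposes a separate `reasoning` field.
    if "reasoning" in out and "reasoning_content" not in out:
        out["reasoning_content"] = out.pop("reasoning")
        return out
    # Case 2: reasoning is inlined in the content as <think> tags.
    content = out.get("content")
    if isinstance(content, str):
        parts = _THINK.findall(content)
        if parts:
            out["reasoning_content"] = "\n".join(p.strip() for p in parts)
            out["content"] = _THINK.sub("", content).strip()
    return out
```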
Server wire-up
`adapters: list[dict] | None` lifted onto BaseResponsesAPIModelConfig,
BaseResponsesAPIAgentConfig, BaseResourcesServerConfig.
`install_middleware(app, self.config.adapters)` called at the tail of
each base's `setup_webserver`. All in-tree servers (vllm_model,
openai_model, azure_openai_model, genrm_model, local_vllm_model,
local_vllm_model_proxy, and every agent/resources server) accept an
`adapters` block automatically.
External-inference proxy
`adapter_proxy: Optional[AdapterProxyConfig]` lifted onto
BaseResponsesAPIAgentConfig. When set,
SimpleResponsesAPIAgent.setup_webserver starts a localhost uvicorn proxy
in a daemon thread, stores the ProxyHandle on self._proxy_handle, and
registers atexit cleanup. Subclasses (e.g. ClaudeCodeAgent) read
self._proxy_handle.url when constructing their SDK client.
ClaudeCodeAgent: when in proxy mode, sets ANTHROPIC_BASE_URL to the
proxy URL and does NOT set ANTHROPIC_AUTH_TOKEN. The SDK uses
ANTHROPIC_API_KEY via x-api-key; the proxy forwards the header
verbatim. Setting AUTH_TOKEN would flip the SDK to Bearer auth, which
api.anthropic.com rejects.
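
The environment handling described for ClaudeCodeAgent can be sketched as a small pure function. `build_sdk_env` is a hypothetical helper, not a function from this PR; dropping an inherited ANTHROPIC_AUTH_TOKEN (rather than merely not setting one) is an assumption made for the sketch.

```python
from typing import Optional


def build_sdk_env(base_env: dict, proxy_url: Optional[str]) -> dict:
    """Return the environment to hand to the SDK subprocess or client."""
    env = dict(base_env)
    if proxy_url is not None:
        # Point the SDK at the local adapter proxy.
        env["ANTHROPIC_BASE_URL"] = proxy_url
        # Assumption: also remove any inherited AUTH_TOKEN so the SDK keeps
        # sending ANTHROPIC_API_KEY via x-api-key instead of Bearer auth.
        env.pop("ANTHROPIC_AUTH_TOKEN", None)
    return env
```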
Safety
- Stage ordering validated at startup
- Unknown interceptor name raises at config-validation time
- install_middleware rejects `endpoint` in chain (the host already
forwards via call_next)
- start_adapter_proxy rejects user-supplied `endpoint` (it forwards
itself), and rejects host="0.0.0.0" unless unsafe_allow_remote=True
(a remotely reachable proxy would otherwise leak the upstream API key)
- best_effort=True interceptors swallow exceptions; strict ones
propagate
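
The startup guards in the list above could be expressed along these lines. This is a sketch under assumptions: `check_middleware_chain`, `check_proxy_bind`, and the `KNOWN` subset are hypothetical, and only the rejected conditions themselves come from the commit message.

```python
# Illustrative subset of registered short names (the PR ships 14).
KNOWN = {"logging", "caching", "endpoint", "reasoning", "log_tokens"}


def check_middleware_chain(names: list) -> None:
    """Guards for a middleware-hosted chain, run once at startup."""
    for name in names:
        if name not in KNOWN:
            raise ValueError(f"unknown interceptor: {name!r}")
    if "endpoint" in names:
        # The host server already forwards upstream via call_next.
        raise ValueError("'endpoint' is not allowed in a middleware-hosted chain")


def check_proxy_bind(host: str, unsafe_allow_remote: bool = False) -> None:
    """Guard for the standalone proxy's bind address."""
    if host == "0.0.0.0" and not unsafe_allow_remote:
        raise ValueError("refusing to bind 0.0.0.0 without unsafe_allow_remote=True")
```

Failing at startup keeps a misconfigured chain from ever serving a request.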
Tests — 187 unit tests, 97% coverage on the new code
  tests/unit_tests/
    test_adapter_framework.py               registry surface, ABCs
    test_adapter_pipeline.py                stage ordering, short-circuit
    test_adapter_registry.py                resolve, register, available
    test_adapter_interceptors.py            per-interceptor unit suite
    test_adapter_interceptors_smoke.py      all 14 instantiate / require_config
    test_adapter_consolidate_system.py      displaced-system-msg merging
    test_adapter_cache_keys.py              golden SHA-256 cache keys
    test_adapter_disk_cache.py              sqlite round-trip
    test_adapter_middleware_behaviors.py    multi Set-Cookie, hop-by-hop,
                                            body cache replay
    test_adapter_middleware_integration.py  end-to-end via TestClient
    test_adapter_parity_replay.py           12 captured-fixture scenarios
    test_adapter_coverage.py                endpoint retry/timeout/auth,
                                            turn_counter GC, middleware
                                            helper branches
    test_adapter_base_class_wiring.py       agent + resources base lift
    test_adapter_proxy.py                   proxy modes: adapted routes,
                                            passthrough, multi Set-Cookie,
                                            host=0.0.0.0 rejection,
                                            endpoint-rejection
    adapter_fixtures/*.json (12 files)      captured req/resp pairs
    generate_adapter_fixtures.py            regeneration script
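
The golden SHA-256 cache-key tests rely on the key being a deterministic function of the canonicalized request body plus an optional session prefix. One plausible canonicalization is sketched below; `cache_key` is a hypothetical helper, and the exact canonical form used by disk_cache.py is not shown in this PR text.

```python
import hashlib
import json


def cache_key(body: dict, session_prefix: str = "") -> str:
    """SHA-256 over a canonical JSON rendering of the request body."""
    # Sorted keys and fixed separators make the rendering independent of
    # dict insertion order and whitespace.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    # Assumption: the session prefix is simply prepended before hashing.
    return hashlib.sha256((session_prefix + canonical).encode("utf-8")).hexdigest()
```

With a canonical form like this, two requests that differ only in key order share a cache entry, while the same request under different session prefixes does not.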
Docs
  docs/model-server/adapters.md   interceptor catalog, per-config
                                  examples, proxy mode, custom
                                  interceptors, caveats
  docs/model-server/index.md      Middleware section + link
  docs/index.md                   adds page to global toctree
Example config
responses_api_models/local_vllm_model_proxy/configs/
local_vllm_model_proxy_with_adapters.yaml
Signed-off-by: Michal Bien <mbien@nvidia.com>
Summary
Introduces an adapter middleware framework for Model servers: a FastAPI middleware that runs a configurable interceptor chain on every request and response. It lets evaluation runs and training rollouts share a single observability + behavior-shaping layer (token logging, response caching, system-prompt injection, reasoning normalization, turn budgeting, …) without changing the host server.
`adapters: list[dict] | None = None` is lifted onto `BaseResponsesAPIModelConfig` and is installed by `SimpleResponsesAPIModel.setup_webserver`, so every Model server that inherits the base (`vllm_model`, `openai_model`, `azure_openai_model`, `genrm_model`, `local_vllm_model`, `local_vllm_model_proxy`) accepts an `adapters` block automatically. Omitting it leaves behavior identical to today.
What's in this PR
Framework
- `nemo_gym/adapters/pipeline.py`: async interceptor chain with `REQUEST → REQUEST_TO_RESPONSE → RESPONSE` stage ordering validated at build time, reverse-order response phase, `best_effort` exception isolation
- `nemo_gym/adapters/middleware.py`: FastAPI middleware that wraps `call_next` rather than replacing it; the host server's own routing still performs the upstream call. Handles body replay via Starlette's `_body` cache + content-length rewrite, list-of-bytes header passthrough (preserves duplicate `Set-Cookie`), `/s/<hex>/...` session-id prefix stripping, and `GracefulError → 429` translation
- `nemo_gym/adapters/registry.py`: short-name → `Interceptor` class registry with runtime `register()` for plugins
- `nemo_gym/adapters/types.py`: `AdapterRequest`, `AdapterResponse`, three Interceptor ABCs, `Stage` enum, `GracefulError`, ContextVar-backed per-request context
- `nemo_gym/adapters/cache/disk_cache.py`: sqlite-backed disk cache keyed by canonicalized request body (+ optional session prefix)
14 built-in interceptors
- `logging`
- `drop_params`
- `payload_modifier`
- `system_message`
- `consolidate_system`
- `modify_tools`: `tools[].function.parameters`
- `turn_counter`: `GracefulError` on exhaustion
- `caching`
- `endpoint`
- `raise_client_errors`: `RuntimeError`
- `log_tokens`: `usage` token counts + latency
- `response_stats`
- `reasoning`: `<think>...</think>` content or `reasoning` field into `reasoning_content`
- `progress_tracking`
Safety
- `install_middleware` raises `ValueError` at startup if the chain contains `endpoint`; it would otherwise double-forward inside a middleware-hosted server (the host already does upstream forwarding via `call_next`)
- Interceptors with `best_effort = True` swallow exceptions and continue; strict interceptors propagate
Tests: 176 unit tests, 99% line coverage
- `test_adapter_framework.py`: registry surface, ABCs
- `test_adapter_pipeline.py`: stage ordering, reverse-order response execution, short-circuit, upstream_call hook
- `test_adapter_registry.py`: resolve, register, import-failure, available
- `test_adapter_interceptors.py`: per-interceptor unit tests (all 14)
- `test_adapter_interceptors_smoke.py`: all 14 instantiate or require_config
- `test_adapter_consolidate_system.py`: displaced system-message merging
- `test_adapter_cache_keys.py`: golden SHA-256 cache keys
- `test_adapter_disk_cache.py`: sqlite round-trip
- `test_adapter_middleware_behaviors.py`: multi-Set-Cookie preservation, hop-by-hop header stripping, body cache replay
- `test_adapter_middleware_integration.py`: end-to-end via Starlette TestClient
- `test_adapter_parity_replay.py`: captured-fixture parity over 12 scenarios
- `test_adapter_coverage.py`: endpoint retry / timeout / auth, turn_counter GC, middleware helper branches, endpoint-in-chain install guard
- `adapter_fixtures/*.json` (12 files): captured request/response pairs
- `generate_adapter_fixtures.py`: regeneration script
The 5 remaining uncovered lines are hard error paths (sqlite init failure, unknown-stage assertion, bytes-body Response constructor).
Docs
- `docs/model-server/adapters.md`: interceptor catalog, per-interceptor YAML config, custom-interceptor registration, configuration reference, and caveat notes on dual session-id systems / `progress_tracking` webhook inline-await / `content-encoding` stripping / Hydra `_inherit_from` list semantics
- `docs/model-server/index.md`: adds a Middleware section linking to the new page
- `docs/index.md`: adds the page to the global toctree
Example config
- `responses_api_models/local_vllm_model_proxy/configs/local_vllm_model_proxy_with_adapters.yaml`: demonstrates a `logging + log_tokens + reasoning` chain.
Test plan
- `pytest tests/unit_tests/test_adapter_*.py`: 176 tests, ~3s
- `coverage run -m pytest && coverage report`: adapter coverage at 99% (overall back over 96% threshold)
- `ruff check nemo_gym/adapters tests/unit_tests/test_adapter_*.py`
- `ruff format --check nemo_gym/adapters`
- `aws/anthropic/bedrock-claude-sonnet-4-6` endpoint: 43-test battery (per-interceptor, corner cases, load 100-concurrent / 200-sequential, stress 500-burst with fd/mem-leak check, cache stampede, mixed status traffic, slow upstream concurrency), all green
Servers without `adapters` behave exactly as before. The architect-flagged guard prevents the one well-known footgun (`endpoint` in a middleware-hosted chain) at startup rather than at request time.
🤖 Generated with Claude Code