Wire SessionStorage into MCPRemoteProxy for HA support#5237
Conversation
When MCPRemoteProxy runs with multiple replicas behind a load balancer
that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple
AZs), every non-initialize request fails with `Session not found` because
the transparent proxy validates `Mcp-Session-Id` against pod-local
in-memory state on every hop. From transparent_proxy.go:
// Guard: reject non-initialize requests with unknown session IDs.
// When multiple proxyrunner replicas share a Redis session store,
// a valid session will always be found.
The transport layer already supports a Redis-backed session store via
runner.ScalingConfig.SessionRedis — MCPServer and VirtualMCPServer wire
it through. MCPRemoteProxy simply never populated it.
This change ports the symmetric work from MCPServer (PR stacklok#4368) and
VirtualMCPServer (PR stacklok#4367) to MCPRemoteProxy:
- Add MCPRemoteProxySpec.SessionStorage field (same SessionStorageConfig
shape used by MCPServer / VirtualMCPServer)
- populateScalingConfigForRemoteProxy: write the non-sensitive Redis
parameters (address/db/keyPrefix) into runner.ScalingConfig.SessionRedis
- buildRedisPasswordEnvVarForRemoteProxy: inject THV_SESSION_REDIS_PASSWORD
on the proxy Deployment via SecretKeyRef when sessionStorage.passwordRef
is set, so the password never lands in the RunConfig ConfigMap
Tests:
- TestPopulateScalingConfigForRemoteProxy mirrors TestPopulateScalingConfig
from mcpserver_runconfig_test.go (4 cases including a check that the
password never leaks into the serialized SessionRedis)
- TestBuildRedisPasswordEnvVarForRemoteProxy mirrors TestBuildRedisPasswordEnvVar
from virtualmcpserver_deployment_test.go (4 cases covering the matrix
of nil/memory/redis-no-pwd/redis-with-pwd)
Generated:
- zz_generated.deepcopy.go (controller-gen v0.17.3)
- toolhive.stacklok.dev_mcpremoteproxies.yaml CRD schema (controller-gen v0.17.3)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #5237 +/- ##
==========================================
+ Coverage 67.86% 67.88% +0.02%
==========================================
Files 610 610
Lines 62522 62550 +28
==========================================
+ Hits 42431 42465 +34
+ Misses 16910 16902 -8
- Partials 3181 3183 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Wires Redis-backed session storage into MCPRemoteProxy so multi-replica proxy deployments behind load balancers without client-IP affinity can share session state, closing the parity gap with MCPServer (#4368) and VirtualMCPServer (#4367).
Changes:
- Adds
MCPRemoteProxySpec.SessionStorage(mirrors the existingSessionStorageConfigschema) and regenerates DeepCopy/CRD/docs. - Adds
populateScalingConfigForRemoteProxyto copy address/db/keyPrefix intorunner.ScalingConfig.SessionRedis(password intentionally excluded). - Adds
buildRedisPasswordEnvVarForRemoteProxyto injectTHV_SESSION_REDIS_PASSWORDviaSecretKeyRefon the proxy Deployment, with corresponding unit tests.
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
cmd/thv-operator/api/v1beta1/mcpremoteproxy_types.go |
New optional SessionStorage field. |
cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go |
Generated DeepCopy for the new field. |
cmd/thv-operator/controllers/mcpremoteproxy_runconfig.go |
Populates ScalingConfig.SessionRedis from spec. |
cmd/thv-operator/controllers/mcpremoteproxy_runconfig_test.go |
Unit tests for the new populator. |
cmd/thv-operator/controllers/mcpremoteproxy_deployment.go |
Injects Redis password env var via SecretKeyRef. |
cmd/thv-operator/controllers/mcpremoteproxy_deployment_test.go |
Unit tests covering env-var injection cases. |
deploy/charts/operator-crds/files/crds/toolhive.stacklok.dev_mcpremoteproxies.yaml |
Regenerated CRD with sessionStorage subschema. |
deploy/charts/operator-crds/templates/toolhive.stacklok.dev_mcpremoteproxies.yaml |
Template counterpart of the regenerated CRD. |
docs/operator/crd-api.md |
Doc additions for the new field. |
Files not reviewed (1)
- cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Populate middleware configs from the configuration fields | ||
| // This ensures that middleware_configs is properly set for serialization | ||
| if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { | ||
| return nil, fmt.Errorf("failed to populate middleware configs: %w", err) | ||
| } | ||
|
|
||
| // Populate ScalingConfig.SessionRedis from spec.sessionStorage so the | ||
| // proxy runner has the address/db/keyPrefix needed to construct a | ||
| // shared Redis-backed session store. The Redis password is intentionally | ||
| // excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env | ||
| // var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go. | ||
| populateScalingConfigForRemoteProxy(runConfig, proxy) |
yrobla
left a comment
There was a problem hiding this comment.
Functional Review — Wire SessionStorage into MCPRemoteProxy
Context
The fix addresses a real architectural gap. When MCPRemoteProxy runs with >1 replica behind a load balancer that lacks client-IP affinity (AWS ALB across AZs, GCP NEGs, etc.), every non-initialize request fails with Session not found because the transparent proxy validates Mcp-Session-Id against pod-local in-memory state. MCPServer and VirtualMCPServer already had the Redis-backed escape hatch; MCPRemoteProxy was the only component missing it.
What changed and does it make sense?
The PR adds three wiring pieces:
MCPRemoteProxySpec.SessionStorage— same optional field shape used by MCPServer/VirtualMCPServer. Additive and backward-compatible. ✅populateScalingConfigForRemoteProxy— writes non-sensitive Redis params intorunner.ScalingConfig.SessionRedisin the RunConfig ConfigMap. ✅buildRedisPasswordEnvVarForRemoteProxy— injectsTHV_SESSION_REDIS_PASSWORDas aSecretKeyRefpod env var, so the password never lands in the ConfigMap. ✅
The security split (connection params in ConfigMap, password as pod secret) matches the established pattern and is correct.
Bug: populateScalingConfigForRemoteProxy is called after PopulateMiddlewareConfigs
In mcpserver_runconfig.go the analogous call order is:
// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfig(runConfig, m)
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { ... }PopulateMiddlewareConfigs → addRateLimitMiddleware reads config.ScalingConfig.SessionRedis and errors when nil and rate limiting is configured. The PR reverses this in mcpremoteproxy_runconfig.go. MCPRemoteProxy has no rateLimiting field today so this is benign now, but it diverges from the documented invariant and will break the moment rate limiting is added to MCPRemoteProxy. See inline comment for suggested fix.
Design note: sessionAffinity / sessionStorage interaction needs documentation
MCPRemoteProxySpec.SessionAffinity defaults to ClientIP. The two fields now coexist without guidance:
sessionAffinity |
sessionStorage |
Result |
|---|---|---|
ClientIP (default) |
nil / memory | ✅ works (pod-sticky routing) |
ClientIP |
redis | Redis wired but Service stickiness makes it a no-op |
None |
redis | ✅ intended HA configuration |
None |
nil / memory | ❌ Session not found on any cross-pod request |
The last row is the exact failure mode this PR fixes — but users can still hit it if they set sessionAffinity: None without configuring Redis. The SessionStorage field comment should mention setting sessionAffinity: None for the HA case, or a kubebuilder XValidation could enforce it at admission time (similar to how MCPServer validates that rate limiting requires Redis).
Alternative approaches
- Consistent hashing at the ingress (hash
Mcp-Session-Idto a specific pod): avoids Redis but requires custom LB config outside ToolHive's control. - Session migration between pods: more complex with no clear benefit over Redis here.
Redis is the right call — same decision already made for MCPServer and VirtualMCPServer, and the operational overhead is low (one Redis shared across all three CRD types).
Summary
| Finding | Severity | Action |
|---|---|---|
populateScalingConfigForRemoteProxy called after PopulateMiddlewareConfigs |
Medium — latent bug | Reorder above PopulateMiddlewareConfigs |
No guidance on sessionAffinity + sessionStorage interaction |
Low — docs/UX | Add cross-reference in field comments |
| Implementation correctness, security split, API compatibility | ✅ | No action needed |
| // shared Redis-backed session store. The Redis password is intentionally | ||
| // excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env | ||
| // var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go. | ||
| populateScalingConfigForRemoteProxy(runConfig, proxy) |
There was a problem hiding this comment.
Ordering bug (latent): populateScalingConfigForRemoteProxy must be called before PopulateMiddlewareConfigs, not after.
PopulateMiddlewareConfigs → addRateLimitMiddleware reads config.ScalingConfig.SessionRedis and returns an error when it is nil and a rate-limit config is present. The equivalent call in mcpserver_runconfig.go explicitly documents this:
// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfig(runConfig, m)
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { ... }MCPRemoteProxy has no rateLimiting field today so this is harmless right now, but the moment rate limiting is added (a natural follow-on given MCPServer already has it) this ordering will cause a confusing runtime error. Please swap the two blocks:
| populateScalingConfigForRemoteProxy(runConfig, proxy) | |
| // Populate ScalingConfig.SessionRedis from spec.sessionStorage so the | |
| // proxy runner has the address/db/keyPrefix needed to construct a | |
| // shared Redis-backed session store. The Redis password is intentionally | |
| // excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env | |
| // var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go. | |
| // Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis. | |
| populateScalingConfigForRemoteProxy(runConfig, proxy) | |
| // Populate middleware configs from the configuration fields | |
| // This ensures that middleware_configs is properly set for serialization | |
| if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { | |
| return nil, fmt.Errorf("failed to populate middleware configs: %w", err) | |
| } | |
| return runConfig, nil |
| // across replicas. | ||
| // | ||
| // Mirrors MCPServer.spec.sessionStorage and VirtualMCPServer.spec.sessionStorage. | ||
| // +optional |
There was a problem hiding this comment.
Consider documenting the sessionAffinity + sessionStorage interaction.
These two fields now coexist without any cross-reference, which creates a footgun. When sessionStorage.provider == redis the correct HA setup also requires sessionAffinity: None — otherwise the k8s Service's ClientIP stickiness makes the Redis store redundant. Conversely, sessionAffinity: None without Redis is the broken configuration this PR was written to fix.
At minimum, the SessionStorage field comment should say: "When using the Redis provider, also set sessionAffinity: None so the Service routes requests round-robin and all replicas rely on the shared session store rather than pod-local state."
A stronger option is a kubebuilder XValidation that requires sessionAffinity == "None" when sessionStorage.provider == "redis", catching the misconfiguration at admission time.
Summary
When MCPRemoteProxy runs with multiple replicas behind a load balancer that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple AZs), every non-initialize request fails with
Session not found. The transparent proxy validatesMcp-Session-Idagainst pod-local in-memory state on every hop —transparent_proxy.goeven calls this out:The transport layer already supports a Redis-backed session store via
runner.ScalingConfig.SessionRedis. MCPServer (#4368) and VirtualMCPServer (#4367) both wire it through. MCPRemoteProxy was the only one missing the symmetric work, leaving HA deployments architecturally broken.What changed:
MCPRemoteProxySpec.SessionStoragefield — sameSessionStorageConfigshape used byMCPServerandVirtualMCPServer.populateScalingConfigForRemoteProxy— write the non-sensitive Redis parameters (address/db/keyPrefix) intorunner.ScalingConfig.SessionRedis.buildRedisPasswordEnvVarForRemoteProxy— injectTHV_SESSION_REDIS_PASSWORDon the proxy Deployment viaSecretKeyRefwhensessionStorage.passwordRefis set, so the password never lands in the RunConfig ConfigMap. Mirrors VirtualMCPServer'sbuildRedisPasswordEnvVarexactly.The change is intentionally a near-verbatim mirror of the MCPServer / VirtualMCPServer implementations to keep review easy — it's the same pattern reviewers have already accepted twice.
Type of change
Test plan
go test ./cmd/thv-operator/controllers/...(4 + 4 = 8 new test cases pass; no regressions in existing MCPRemoteProxy suite)go build ./...New tests:
TestPopulateScalingConfigForRemoteProxymirrorsTestPopulateScalingConfigfrommcpserver_runconfig_test.go. 4 cases including a serialization check that verifies the password never leaks into the RunConfig.TestBuildRedisPasswordEnvVarForRemoteProxymirrorsTestBuildRedisPasswordEnvVarfromvirtualmcpserver_deployment_test.go. 4 cases covering nil / memory / redis-no-pwd / redis-with-pwd.API Compatibility
v1beta1API. The added field (spec.sessionStorage) is optional and behaves identically to the existing nil case when omitted.Changes
cmd/thv-operator/api/v1beta1/mcpremoteproxy_types.goSessionStorage *SessionStorageConfigwith kubebuilder docs explaining the transparent_proxy session-validation requirementcmd/thv-operator/api/v1beta1/zz_generated.deepcopy.gocmd/thv-operator/controllers/mcpremoteproxy_runconfig.gopopulateScalingConfigForRemoteProxyhelper, called before runConfig is returnedcmd/thv-operator/controllers/mcpremoteproxy_deployment.gobuildRedisPasswordEnvVarForRemoteProxy+ call site appending it to the proxy envcmd/thv-operator/controllers/mcpremoteproxy_runconfig_test.goTestPopulateScalingConfigForRemoteProxy(4 cases)cmd/thv-operator/controllers/mcpremoteproxy_deployment_test.goTestBuildRedisPasswordEnvVarForRemoteProxy(4 cases)deploy/charts/operator-crds/files/crds/toolhive.stacklok.dev_mcpremoteproxies.yamlsessionStoragesubschema in CRDDoes this introduce a user-facing change?
Yes. New optional field
MCPRemoteProxy.spec.sessionStorageenabling Redis-backed shared session state across replicas — required for HA when the upstream load balancer doesn't preserve client-IP affinity. Same shape asMCPServer.spec.sessionStorageso existing operators already know the pattern.🤖 Generated with Claude Code