Skip to content

Wire SessionStorage into MCPRemoteProxy for HA support#5237

Open
aron-muon wants to merge 2 commits into
stacklok:mainfrom
aron-muon:add-mcpremoteproxy-session-storage
Open

Wire SessionStorage into MCPRemoteProxy for HA support#5237
aron-muon wants to merge 2 commits into
stacklok:mainfrom
aron-muon:add-mcpremoteproxy-session-storage

Conversation

@aron-muon
Copy link
Copy Markdown
Contributor

@aron-muon aron-muon commented May 9, 2026

Summary

When MCPRemoteProxy runs with multiple replicas behind a load balancer that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple AZs), every non-initialize request fails with Session not found. The transparent proxy validates Mcp-Session-Id against pod-local in-memory state on every hop — transparent_proxy.go even calls this out:

// Guard: reject non-initialize requests with unknown session IDs.
// When multiple proxyrunner replicas share a Redis session store,
// a valid session will always be found.

The transport layer already supports a Redis-backed session store via runner.ScalingConfig.SessionRedis. MCPServer (#4368) and VirtualMCPServer (#4367) both wire it through. MCPRemoteProxy was the only one missing the symmetric work, leaving HA deployments architecturally broken.

What changed:

  • Add MCPRemoteProxySpec.SessionStorage field — same SessionStorageConfig shape used by MCPServer and VirtualMCPServer.
  • populateScalingConfigForRemoteProxy — write the non-sensitive Redis parameters (address/db/keyPrefix) into runner.ScalingConfig.SessionRedis.
  • buildRedisPasswordEnvVarForRemoteProxy — inject THV_SESSION_REDIS_PASSWORD on the proxy Deployment via SecretKeyRef when sessionStorage.passwordRef is set, so the password never lands in the RunConfig ConfigMap. Mirrors VirtualMCPServer's buildRedisPasswordEnvVar exactly.

The change is intentionally a near-verbatim mirror of the MCPServer / VirtualMCPServer implementations to keep review easy — it's the same pattern reviewers have already accepted twice.

Type of change

  • Bug fix
  • New feature

Test plan

  • Unit tests — go test ./cmd/thv-operator/controllers/... (4 + 4 = 8 new test cases pass; no regressions in existing MCPRemoteProxy suite)
  • Build verification — go build ./...
  • CRD regenerated with controller-gen v0.17.3

New tests:

  • TestPopulateScalingConfigForRemoteProxy mirrors TestPopulateScalingConfig from mcpserver_runconfig_test.go. 4 cases including a serialization check that verifies the password never leaks into the RunConfig.
  • TestBuildRedisPasswordEnvVarForRemoteProxy mirrors TestBuildRedisPasswordEnvVar from virtualmcpserver_deployment_test.go. 4 cases covering nil / memory / redis-no-pwd / redis-with-pwd.

API Compatibility

  • This PR does not break the v1beta1 API. The added field (spec.sessionStorage) is optional and behaves identically to the existing nil case when omitted.

Changes

File Lines Change
cmd/thv-operator/api/v1beta1/mcpremoteproxy_types.go +16 Add SessionStorage *SessionStorageConfig with kubebuilder docs explaining the transparent_proxy session-validation requirement
cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go +5 Generated DeepCopyInto for the new field (controller-gen v0.17.3)
cmd/thv-operator/controllers/mcpremoteproxy_runconfig.go +28 New populateScalingConfigForRemoteProxy helper, called before runConfig is returned
cmd/thv-operator/controllers/mcpremoteproxy_deployment.go +30 New buildRedisPasswordEnvVarForRemoteProxy + call site appending it to the proxy env
cmd/thv-operator/controllers/mcpremoteproxy_runconfig_test.go +88 TestPopulateScalingConfigForRemoteProxy (4 cases)
cmd/thv-operator/controllers/mcpremoteproxy_deployment_test.go +63 TestBuildRedisPasswordEnvVarForRemoteProxy (4 cases)
deploy/charts/operator-crds/files/crds/toolhive.stacklok.dev_mcpremoteproxies.yaml +114 Generated sessionStorage subschema in CRD

Does this introduce a user-facing change?

Yes. New optional field MCPRemoteProxy.spec.sessionStorage enabling Redis-backed shared session state across replicas — required for HA when the upstream load balancer doesn't preserve client-IP affinity. Same shape as MCPServer.spec.sessionStorage so existing operators already know the pattern.

🤖 Generated with Claude Code

When MCPRemoteProxy runs with multiple replicas behind a load balancer
that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple
AZs), every non-initialize request fails with `Session not found` because
the transparent proxy validates `Mcp-Session-Id` against pod-local
in-memory state on every hop. From transparent_proxy.go:

    // Guard: reject non-initialize requests with unknown session IDs.
    // When multiple proxyrunner replicas share a Redis session store,
    // a valid session will always be found.

The transport layer already supports a Redis-backed session store via
runner.ScalingConfig.SessionRedis — MCPServer and VirtualMCPServer wire
it through. MCPRemoteProxy simply never populated it.

This change ports the symmetric work from MCPServer (PR stacklok#4368) and
VirtualMCPServer (PR stacklok#4367) to MCPRemoteProxy:

- Add MCPRemoteProxySpec.SessionStorage field (same SessionStorageConfig
  shape used by MCPServer / VirtualMCPServer)
- populateScalingConfigForRemoteProxy: write the non-sensitive Redis
  parameters (address/db/keyPrefix) into runner.ScalingConfig.SessionRedis
- buildRedisPasswordEnvVarForRemoteProxy: inject THV_SESSION_REDIS_PASSWORD
  on the proxy Deployment via SecretKeyRef when sessionStorage.passwordRef
  is set, so the password never lands in the RunConfig ConfigMap

Tests:
- TestPopulateScalingConfigForRemoteProxy mirrors TestPopulateScalingConfig
  from mcpserver_runconfig_test.go (4 cases including a check that the
  password never leaks into the serialized SessionRedis)
- TestBuildRedisPasswordEnvVarForRemoteProxy mirrors TestBuildRedisPasswordEnvVar
  from virtualmcpserver_deployment_test.go (4 cases covering the matrix
  of nil/memory/redis-no-pwd/redis-with-pwd)

Generated:
- zz_generated.deepcopy.go (controller-gen v0.17.3)
- toolhive.stacklok.dev_mcpremoteproxies.yaml CRD schema (controller-gen v0.17.3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label May 9, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.88%. Comparing base (9211a36) to head (c52a373).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5237      +/-   ##
==========================================
+ Coverage   67.86%   67.88%   +0.02%     
==========================================
  Files         610      610              
  Lines       62522    62550      +28     
==========================================
+ Hits        42431    42465      +34     
+ Misses      16910    16902       -8     
- Partials     3181     3183       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 9, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 9, 2026
@aron-muon aron-muon marked this pull request as ready for review May 9, 2026 06:27
@yrobla yrobla requested a review from Copilot May 14, 2026 08:25
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Wires Redis-backed session storage into MCPRemoteProxy so multi-replica proxy deployments behind load balancers without client-IP affinity can share session state, closing the parity gap with MCPServer (#4368) and VirtualMCPServer (#4367).

Changes:

  • Adds MCPRemoteProxySpec.SessionStorage (mirrors the existing SessionStorageConfig schema) and regenerates DeepCopy/CRD/docs.
  • Adds populateScalingConfigForRemoteProxy to copy address/db/keyPrefix into runner.ScalingConfig.SessionRedis (password intentionally excluded).
  • Adds buildRedisPasswordEnvVarForRemoteProxy to inject THV_SESSION_REDIS_PASSWORD via SecretKeyRef on the proxy Deployment, with corresponding unit tests.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
cmd/thv-operator/api/v1beta1/mcpremoteproxy_types.go New optional SessionStorage field.
cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go Generated DeepCopy for the new field.
cmd/thv-operator/controllers/mcpremoteproxy_runconfig.go Populates ScalingConfig.SessionRedis from spec.
cmd/thv-operator/controllers/mcpremoteproxy_runconfig_test.go Unit tests for the new populator.
cmd/thv-operator/controllers/mcpremoteproxy_deployment.go Injects Redis password env var via SecretKeyRef.
cmd/thv-operator/controllers/mcpremoteproxy_deployment_test.go Unit tests covering env-var injection cases.
deploy/charts/operator-crds/files/crds/toolhive.stacklok.dev_mcpremoteproxies.yaml Regenerated CRD with sessionStorage subschema.
deploy/charts/operator-crds/templates/toolhive.stacklok.dev_mcpremoteproxies.yaml Template counterpart of the regenerated CRD.
docs/operator/crd-api.md Doc additions for the new field.
Files not reviewed (1)
  • cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 170 to +181
// Populate middleware configs from the configuration fields
// This ensures that middleware_configs is properly set for serialization
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil {
return nil, fmt.Errorf("failed to populate middleware configs: %w", err)
}

// Populate ScalingConfig.SessionRedis from spec.sessionStorage so the
// proxy runner has the address/db/keyPrefix needed to construct a
// shared Redis-backed session store. The Redis password is intentionally
// excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env
// var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go.
populateScalingConfigForRemoteProxy(runConfig, proxy)
Copy link
Copy Markdown
Contributor

@yrobla yrobla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Review — Wire SessionStorage into MCPRemoteProxy

Context

The fix addresses a real architectural gap. When MCPRemoteProxy runs with >1 replica behind a load balancer that lacks client-IP affinity (AWS ALB across AZs, GCP NEGs, etc.), every non-initialize request fails with Session not found because the transparent proxy validates Mcp-Session-Id against pod-local in-memory state. MCPServer and VirtualMCPServer already had the Redis-backed escape hatch; MCPRemoteProxy was the only component missing it.

What changed and does it make sense?

The PR adds three wiring pieces:

  1. MCPRemoteProxySpec.SessionStorage — same optional field shape used by MCPServer/VirtualMCPServer. Additive and backward-compatible. ✅
  2. populateScalingConfigForRemoteProxy — writes non-sensitive Redis params into runner.ScalingConfig.SessionRedis in the RunConfig ConfigMap. ✅
  3. buildRedisPasswordEnvVarForRemoteProxy — injects THV_SESSION_REDIS_PASSWORD as a SecretKeyRef pod env var, so the password never lands in the ConfigMap. ✅

The security split (connection params in ConfigMap, password as pod secret) matches the established pattern and is correct.

Bug: populateScalingConfigForRemoteProxy is called after PopulateMiddlewareConfigs

In mcpserver_runconfig.go the analogous call order is:

// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfig(runConfig, m)
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { ... }

PopulateMiddlewareConfigsaddRateLimitMiddleware reads config.ScalingConfig.SessionRedis and errors when nil and rate limiting is configured. The PR reverses this in mcpremoteproxy_runconfig.go. MCPRemoteProxy has no rateLimiting field today so this is benign now, but it diverges from the documented invariant and will break the moment rate limiting is added to MCPRemoteProxy. See inline comment for suggested fix.

Design note: sessionAffinity / sessionStorage interaction needs documentation

MCPRemoteProxySpec.SessionAffinity defaults to ClientIP. The two fields now coexist without guidance:

sessionAffinity sessionStorage Result
ClientIP (default) nil / memory ✅ works (pod-sticky routing)
ClientIP redis Redis wired but Service stickiness makes it a no-op
None redis ✅ intended HA configuration
None nil / memory Session not found on any cross-pod request

The last row is the exact failure mode this PR fixes — but users can still hit it if they set sessionAffinity: None without configuring Redis. The SessionStorage field comment should mention setting sessionAffinity: None for the HA case, or a kubebuilder XValidation could enforce it at admission time (similar to how MCPServer validates that rate limiting requires Redis).

Alternative approaches

  • Consistent hashing at the ingress (hash Mcp-Session-Id to a specific pod): avoids Redis but requires custom LB config outside ToolHive's control.
  • Session migration between pods: more complex with no clear benefit over Redis here.

Redis is the right call — same decision already made for MCPServer and VirtualMCPServer, and the operational overhead is low (one Redis shared across all three CRD types).

Summary

Finding Severity Action
populateScalingConfigForRemoteProxy called after PopulateMiddlewareConfigs Medium — latent bug Reorder above PopulateMiddlewareConfigs
No guidance on sessionAffinity + sessionStorage interaction Low — docs/UX Add cross-reference in field comments
Implementation correctness, security split, API compatibility No action needed

// shared Redis-backed session store. The Redis password is intentionally
// excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env
// var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go.
populateScalingConfigForRemoteProxy(runConfig, proxy)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordering bug (latent): populateScalingConfigForRemoteProxy must be called before PopulateMiddlewareConfigs, not after.

PopulateMiddlewareConfigsaddRateLimitMiddleware reads config.ScalingConfig.SessionRedis and returns an error when it is nil and a rate-limit config is present. The equivalent call in mcpserver_runconfig.go explicitly documents this:

// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfig(runConfig, m)
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { ... }

MCPRemoteProxy has no rateLimiting field today so this is harmless right now, but the moment rate limiting is added (a natural follow-on given MCPServer already has it) this ordering will cause a confusing runtime error. Please swap the two blocks:

Suggested change
populateScalingConfigForRemoteProxy(runConfig, proxy)
// Populate ScalingConfig.SessionRedis from spec.sessionStorage so the
// proxy runner has the address/db/keyPrefix needed to construct a
// shared Redis-backed session store. The Redis password is intentionally
// excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env
// var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go.
// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfigForRemoteProxy(runConfig, proxy)
// Populate middleware configs from the configuration fields
// This ensures that middleware_configs is properly set for serialization
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil {
return nil, fmt.Errorf("failed to populate middleware configs: %w", err)
}
return runConfig, nil

// across replicas.
//
// Mirrors MCPServer.spec.sessionStorage and VirtualMCPServer.spec.sessionStorage.
// +optional
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider documenting the sessionAffinity + sessionStorage interaction.

These two fields now coexist without any cross-reference, which creates a footgun. When sessionStorage.provider == redis the correct HA setup also requires sessionAffinity: None — otherwise the k8s Service's ClientIP stickiness makes the Redis store redundant. Conversely, sessionAffinity: None without Redis is the broken configuration this PR was written to fix.

At minimum, the SessionStorage field comment should say: "When using the Redis provider, also set sessionAffinity: None so the Service routes requests round-robin and all replicas rely on the shared session store rather than pod-local state."

A stronger option is a kubebuilder XValidation that requires sessionAffinity == "None" when sessionStorage.provider == "redis", catching the misconfiguration at admission time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Medium PR: 300-599 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants