Skip to content

RFC: Graceful Degradation for Multi-Upstream OAuth in vMCP#72

Merged
tgrunnagle merged 5 commits into
mainfrom
vmcp_graceful_deg_2026-05-13
May 15, 2026
Merged

RFC: Graceful Degradation for Multi-Upstream OAuth in vMCP#72
tgrunnagle merged 5 commits into
mainfrom
vmcp_graceful_deg_2026-05-13

Conversation

@tgrunnagle
Copy link
Copy Markdown
Contributor

@tgrunnagle tgrunnagle commented May 13, 2026

Adds an RFC proposing opt-in partial-completion semantics for the
embedded authorization server's multi-upstream OAuth chain in
VirtualMCPServer.

Related issue: stacklok/toolhive#5162

Note: This PR body was updated post-merge to reflect the
RFC's final design after review feedback. See follow-up commits
fde11e0 and 031eda4 on the same branch for the details.

Why

Today the embedded auth server's multi-upstream chain is all-or-nothing:
one upstream IdP outage or one declined consent screen invalidates every
collected token and locks the user out of every backend on the vMCP —
including backends that have nothing to do with the failed upstream.
Operators aggregating backends across multiple SaaS IdPs (e.g. github +
slack + google) see the whole vMCP appear down for what should be a
single-provider problem.

Key design decisions

  • Single per-upstream optional flag (default false) on
    UpstreamProviderConfig. Marking any provider optional: true opts
    the deployment into partial completion for that provider; leaving
    every provider at the default reproduces today's all-or-nothing
    behavior exactly. No top-level mode toggle.
  • Primary identity provider — already named by
    authzConfig.inline.primaryUpstreamProvider — is always required;
    admission webhook rejects configs that set optional: true on it.
  • Chain completion walks every upstream and completes when every
    required upstream has a token; optional upstreams that error or are
    declined are recorded as session-scoped skip rows.
  • Skip rows reuse the existing UpstreamTokens storage row. Two
    new scalar fields (Skipped bool, SkippedReason string) on the
    existing row; the UpstreamTokenStorage interface is unchanged. The
    existing InProcessService.GetAllValidTokens / GetValidToken
    filter skip rows out of the live-token map, so identity.UpstreamTokens
    consumers behave exactly as today when no rows are skipped.
  • No /authorize restart-all branch. The existing handler already
    mints a fresh SessionID per call, so re-running /authorize is
    structurally a new session — prior session rows age out under their
    existing TTL. Per-upstream retry is explicitly rejected — identity-
    binding hazards outweigh the round-trip savings.
  • vMCP filters by skipping unauthorized backend init. vMCP freezes
    the per-session capability set at MCP initialize time, so the
    filter hooks into the pre-init backend filter loop of
    makeBaseSession: backends whose required upstream is missing from
    identity.UpstreamTokens never run initOneBackend at all — no HTTP
    connection, no handshake, no capability listing, no entry into the
    routing table or per-session SDK tool store.
  • No new dispatcher gate. Out-of-band revocation between
    initialize and call is already caught by the existing per-request
    identity.UpstreamTokens re-hydration plus each backend auth
    strategy's ErrUpstreamTokenNotFound path.
  • Recovery is whole-session. A fresh /authorize produces a new
    JWT with a new tsid; reconnecting to vMCP with the new bearer
    triggers a fresh initialize and re-aggregates capabilities.
  • vMCP does not extend the MCP protocol to signal filtered state.

Out of scope

Per-upstream retry, MCP-protocol-level signaling of filtered state,
silent backend dropping on RT expiry, proactive token introspection,
dynamic upstream addition.

Notes for reviewers

  • Implementation is sequenced runtime-first → CRD-last so chain-logic
    and filtering can land and be tested before the CRD surface
    solidifies.
  • All initial open questions (revocation detection, skip-row TTL,
    metric cardinality, status conditions) were resolved during drafting;
    see the body for the decisions.
  • The RFC originated in the toolhive repo via Claude Code (referencing
    source line numbers); file-path references are plain text so they
    remain readable when the RFC is moved to its destination repo.

🤖 Generated with Claude Code

Comment thread rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md Outdated
Comment thread rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md Outdated
Comment thread rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md Outdated
Comment thread rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md Outdated
Comment thread rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md Outdated
Comment thread rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md Outdated
Comment thread rfcs/THV-0072-vmcp-multi-upstream-partial-auth.md Outdated
tgrunnagle and others added 3 commits May 14, 2026 08:33
Fold the partial-auth toggle into a single per-upstream `optional` flag
(default false, preserves today's all-or-nothing default), reuse the
existing `UpstreamTokens` row with `Skipped`/`SkippedReason` fields
instead of a parallel storage interface, drop the invented restart-all
branch (every `/authorize` already mints a fresh SessionID), and move
backend filtering from list-method time to session-creation time —
vMCP freezes the capability snapshot at MCP initialize, so re-auth
recovery requires a new MCP session via reconnect with the new JWT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ions

Move the per-session filter into the pre-init backend filter loop of
makeBaseSession so unauthorized backends never run initOneBackend at
all (no connection, handshake, or capability listing), rather than
filtering post-aggregation. Drop the invented dispatcher gate — the
existing strategy ErrUpstreamTokenNotFound path already covers
out-of-band revocation. Remove the standalone refresh-token-expiry
and fresh-MCP-session subsections (covered concisely elsewhere or
pre-existing behavior). Fold Phase 2's re-authorization pinning
tests into Phase 1 and renumber subsequent phases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@tgrunnagle tgrunnagle requested a review from jhrozek May 14, 2026 21:04
Copy link
Copy Markdown
Contributor

@jhrozek jhrozek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very nice thank you

@tgrunnagle tgrunnagle merged commit 414931f into main May 15, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants