Skip to content

tshttpproxy: degrade gracefully when WPAD is unavailable#118

Draft
deansheather wants to merge 2 commits into
mainfrom
wpad-6hzh
Draft

tshttpproxy: degrade gracefully when WPAD is unavailable#118
deansheather wants to merge 2 commits into
mainfrom
wpad-6hzh

Conversation

@deansheather
Copy link
Copy Markdown
Member

Problem

On Windows, tailscaled invokes WinHttpGetProxyForUrl on every outbound HTTP request to discover a proxy via WPAD (DHCP option 252 / DNS A wpad.<domain>). On networks where WPAD is not configured, this call blocks for up to 5s and returns ERROR_WINHTTP_AUTODETECTION_FAILED or ERROR_WINHTTP_UNABLE_TO_DOWNLOAD_SCRIPT.

The existing 10s negative-cache window was repeatedly torn down by InvalidateCache() on every netmon link change, so control / DERP / log uploads spent most of their time waiting on dead WPAD probes on hosts without WPAD.

Goal: use WPAD if it's available; fall back to direct (or env-var/static proxy) if it's not. No WPAD requirement.

References upstream: tailscale#17055, tailscale#10215.

Changes

All in net/tshttpproxy/tshttpproxy_windows.go:

  1. Backoff bumped from 10s → 5 minutes for ERROR_WINHTTP_AUTODETECTION_FAILED and ERROR_WINHTTP_UNABLE_TO_DOWNLOAD_SCRIPT. InvalidateCache() still resets it on real link changes, so legitimate WPAD setups keep working — we just stop re-blocking on a 5s probe for every individual HTTP request.
  2. Context-timeout no longer returns stale proxy. Previously the 5s deadline path returned cachedProxy.val from a prior network — wrong answer when roaming onto a network without WPAD. Now returns (nil, nil) with a 30s backoff.
  3. Unknown WinHTTP errors no longer propagate. Mapped to (nil, nil) + 5min backoff. The package-level ProxyFromEnvironment already swallows sysProxyFromEnv errors via shadowed-err in its wrapper, but making the contract explicit at the platform layer documents intent and prevents regressions.

WPAD continues to work unchanged on networks where it is configured; this PR only changes how we degrade when it is not.

Test

Adds a cross-platform regression test (TestProxyFromEnvironment_sysProxyError) that injects a failing sysProxyFromEnv hook and asserts ProxyFromEnvironment returns (nil, nil), respects an active noProxyUntil backoff, and re-probes after InvalidateCache().

Validation

  • go test ./net/tshttpproxy/...
  • go build ./... (Linux) ✅
  • GOOS=windows go build ./cmd/tailscaled/... ./cmd/tailscale/...
  • GOOS=windows go vet ./net/tshttpproxy/...

Scope intentionally not included

  • WinHttpGetIEProxyConfigForCurrentUser fast-path that would skip WinHTTP entirely when WPAD is administratively disabled in IE settings / group policy.
  • Honoring static IE proxy (lpszProxy) when only "Use a proxy server" is configured.

Both are pure optimizations and require new syscall bindings + zsyscall_windows.go regeneration. The user-visible goal (don't require WPAD) is achieved by this PR; those are candidates for a follow-up if telemetry shows it's worth the diff.

Behavior change to call out

If a customer today is failing fast due to an unmapped WinHTTP error, they will now silently dial direct instead. The error is still logged (rate-limited) and winhttp_proxy_err_other still increments, so it's observable.

On Windows, tailscaled invokes WinHttpGetProxyForUrl on every outbound
HTTP request to discover a proxy via WPAD (DHCP option 252 / DNS A
wpad.<domain>). On networks where WPAD is not configured this call
blocks for up to 5s and then returns ERROR_WINHTTP_AUTODETECTION_FAILED
or ERROR_WINHTTP_UNABLE_TO_DOWNLOAD_SCRIPT. The previous 10s
negative-cache window was repeatedly torn down by InvalidateCache() on
every netmon link change, so control / DERP / log uploads spent most of
their time waiting on dead WPAD probes.

This change:

  * Bumps the negative-cache backoff from 10s to 5 minutes for
    autodetection-failed / script-download-failed, and to 30s for the
    overall context timeout. InvalidateCache() still resets it on link
    changes, so we re-probe WPAD whenever the network actually changes.

  * Stops returning a stale cachedProxy.val on context-deadline timeout.
    A previous network's PAC-resolved proxy is the wrong answer when
    roaming to a network without WPAD; return (nil, nil) so the caller
    dials direct.

  * Maps every unmapped WinHTTP error to (nil, nil) plus a backoff
    instead of returning the raw error. The package-level
    ProxyFromEnvironment already discards sysProxyFromEnv errors, but
    making the contract explicit at the platform layer prevents future
    regressions and makes the intent obvious in the source.

WPAD continues to work unchanged on networks where it is configured;
this change only affects how we degrade when it is not.

Adds a cross-platform regression test that injects a failing
sysProxyFromEnv hook and asserts ProxyFromEnvironment returns
(nil, nil), respects an active noProxyUntil backoff, and re-probes after
InvalidateCache().

References: tailscale#17055, tailscale#10215
Copy link
Copy Markdown
Member Author

@deansheather deansheather left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid diagnosis of the WPAD-on-non-WPAD-network problem, and the longer backoffs + named constants land cleanly. The deep-review found a couple of structural concerns worth a second look before this merges, plus some test coverage gaps in the exact paths the PR changes.

This review contains findings that may need attention before merge. 1 P1, 4 P2, 2 P3, 4 nits across 9 inline comments.

Comment thread net/tshttpproxy/tshttpproxy_windows.go
Comment thread net/tshttpproxy/tshttpproxy_windows.go
Comment thread net/tshttpproxy/tshttpproxy_windows.go
Comment thread net/tshttpproxy/tshttpproxy_windows.go Outdated
Comment thread net/tshttpproxy/tshttpproxy_test.go
Comment thread net/tshttpproxy/tshttpproxy_windows.go
Comment thread net/tshttpproxy/tshttpproxy_windows.go Outdated
Comment thread net/tshttpproxy/tshttpproxy_windows.go Outdated
Comment thread net/tshttpproxy/tshttpproxy_test.go
Summary of changes in response to review:

* Add an InvalidateCache epoch counter (tshttpproxy.go) and snapshot it
  at the start of every WPAD probe. The Windows error/timeout branches
  only commit negative-cache state when the epoch hasn't advanced, so a
  stale probe from a previous network can no longer stomp noProxyUntil
  and lock out WPAD on a freshly-changed network for 5 minutes.

* Restore an epoch-gated cachedProxy fallback on context-deadline
  timeout. When the cached proxy was discovered under the current epoch
  and a probe times out, return the cached value instead of dialing
  direct — this keeps users on networks that legitimately require a
  proxy from silently bypassing it during transient WinHTTP slowness.
  When the cache is stale or empty, we still fall through to direct
  with a short backoff, which addresses the original 'stranded on a
  stale proxy after roaming' concern.

* Make ProxyFromEnvironment's contract explicit: it now drops
  sysProxyFromEnv errors via an explicit '_' instead of relying on
  variable shadowing. This was load-bearing for the Windows fix; a
  future refactor could have silently re-propagated WinHTTP errors and
  broken outbound traffic on WPAD-less hosts.

* Refactor proxyFromWinHTTPOrCache to take an injectable winHTTPLookup
  package var and add Windows-targeted tests covering: success path
  (with epoch stamp), autodetect-failed, unmapped error, timeout with
  no cached proxy, timeout with fresh cached proxy, timeout with
  stale-epoch cached proxy, and an explicit regression test for the
  'stale probe stomps fresh InvalidateCache' race.

* Rename backoff constants for consistency
  (wpadTimeoutBackoff -> wpadTimeoutFailedBackoff, etc.), fold the
  ERROR_INVALID_PARAMETER 1-hour value into the named-constant block,
  and drop the unmapped-error backoff from 5 minutes to 1 minute so a
  one-off transient WinHTTP error doesn't suppress autodetection for
  longer than necessary.

* Trim the duplicate Coder fork: block comment inside the function
  body; the rationale lives once on the const block now.

* Rename the cross-platform test's misleading wantErr to syntheticErr
  (the value is intentionally swallowed, not asserted on).
@deansheather
Copy link
Copy Markdown
Member Author

/coder-agents-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant