Conversation

@sarum90 commented on Jan 16, 2026

Which issue does this PR close?

N/A - no existing issue. Happy to create one if preferred.

Rationale for this change

If a reqwest timeout occurs after consuming all bytes from an S3 response body but before receiving the EOF signal, retry_stream attempts to retry with a Range header like bytes=5-4 (where 5 is the file size). This is an invalid/backwards range.

Per RFC 7233, servers MUST ignore syntactically invalid Range headers. S3 follows this spec and returns 200 OK with the full file instead of 206 Partial Content or 416 Range Not Satisfiable.

Without validation, retry_stream would read the full file again, causing data duplication (e.g., "hellohello" instead of "hello").
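
For illustration, a minimal sketch (not the actual retry_stream code; the function name and signature here are invented) of how the resume offset turns into a backwards Range header once every byte has been delivered:

```rust
// Hypothetical sketch: how a resume offset produces a backwards Range header.
// `resume_range_header` is an illustrative name, not an object_store function.
fn resume_range_header(bytes_received: u64, total_len: u64) -> String {
    // HTTP Range headers use an inclusive end, so the header covers
    // bytes_received..=total_len-1. Once bytes_received == total_len (all 5
    // bytes of a 5-byte object delivered), this yields "bytes=5-4".
    format!("bytes={}-{}", bytes_received, total_len - 1)
}

fn main() {
    // All 5 bytes already consumed, then the connection times out before EOF:
    assert_eq!(resume_range_header(5, 5), "bytes=5-4"); // backwards range
}
```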

What changes are included in this PR?

  1. Early EOF check: If range.start >= range.end before retrying, return EOF immediately rather than sending an invalid range request. (The checks in this list are sketched in code after the list.)

  2. 206 status validation: Verify the retry response is 206 Partial Content. If the server returns 200 OK (indicating it ignored our Range header), fail the retry and return the original error.

  3. Content-Range validation: Verify the Content-Range header matches the requested range. This catches cases where the server returns a different range than requested.

  4. Warning logs: Added warn! logs for the new validation failures to aid debugging.
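
A rough sketch of the checks described above; the function names, error type, and overall shape are hypothetical, not the actual object_store retry code:

```rust
use std::ops::Range;

/// Check 1 (before retrying): if every requested byte has already been
/// delivered, treat the stream as complete instead of sending a backwards
/// range such as `bytes=5-4`.
fn should_resume(remaining: &Range<u64>) -> bool {
    remaining.start < remaining.end
}

/// Checks 2 and 3 (after retrying): only splice the retry response into the
/// stream if the server actually honoured the range request.
fn validate_partial_response(
    remaining: &Range<u64>,
    status: u16,
    content_range: Option<&str>,
) -> Result<(), String> {
    if status != 206 {
        // A 200 here means the server ignored the Range header and returned
        // the whole object; appending that would duplicate data. The real
        // change also emits a warn! log at this point.
        return Err(format!("expected 206 Partial Content, got {status}"));
    }
    // Content-Range looks like "bytes 5-9/10"; it must describe the range
    // that was actually requested.
    let expected_prefix = format!("bytes {}-{}/", remaining.start, remaining.end - 1);
    match content_range {
        Some(cr) if cr.starts_with(&expected_prefix) => Ok(()),
        other => Err(format!(
            "Content-Range {other:?} does not match requested {}-{}",
            remaining.start,
            remaining.end - 1
        )),
    }
}
```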

Are there any user-facing changes?

No breaking changes. Users may see different error behavior in edge cases where retries were previously succeeding incorrectly (by silently duplicating data).

Test Coverage

Automated test coverage for this specific bug is limited for two reasons:

  1. Requires real S3: LocalStack returns 416 Range Not Satisfiable for invalid ranges, which is actually non-compliant with RFC 7233. Real S3 correctly ignores the invalid range and returns 200 OK with the full file, which is what triggers the bug.

  2. Hard to trigger: The bug requires a timeout (or other error) from reqwest after all data is read but before EOF is signaled. This is timing-dependent and would require integration tests with long sleeps to force a timeout.

A manual reproduction script is available in this gist: https://gist.github.com/sarum90/cb95633750390db690ef701f08aa20ed

The script demonstrates the bug against real S3 and shows the data duplication.

- Add defensive checks + warning logs ensuring that partial range
  requests receive partial range responses (206 status + the expected
  content_range)
@tustvold (Contributor) left a comment

Fun bug, well done for tracking this one down. Perhaps we could use the HttpServer machinery used for testing retries to test this?

@sarum90 force-pushed the fix_stream_eof_timeout_bug branch from f280218 to 6e17d91 on January 18, 2026 at 23:09
@sarum90 (Author) commented on Jan 19, 2026

Thanks for the suggestion! I initially tried using HttpBuilder directly, but I couldn't get the 206 validation test to fail on main; it turns out the HTTP client has defensive checks that the S3 client lacks. So I created a minimal TestGetClient structured like the cloud store clients, which also gave me a way to inject client-side errors after Content-Length bytes have been read (something you can't trigger from the server side without actual timeouts).

I've added tests for:

  1. 206 status validation - verifies we reject retries that return 200 instead of 206
  2. Content-Range validation - verifies we reject retries with mismatched Content-Range headers
  3. No retry after complete delivery - injects an error after all Content-Length bytes are received, verifying we don't attempt a spurious retry (the shape of these assertions is sketched after this list)
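
For a rough sense of what these assertions look like, here is a hypothetical continuation of the sketch earlier in this thread; it exercises the illustrative should_resume / validate_partial_response helpers, not the actual TestGetClient / ErrorAfterBody code added in this PR:

```rust
// Assumes the `should_resume` / `validate_partial_response` sketch from the
// PR description is in scope; these are not the real tests in the PR.
#[cfg(test)]
mod sketch_tests {
    use super::*;

    #[test]
    fn rejects_200_for_a_range_retry() {
        // Retry asked for bytes 5..10, server answered 200 OK with the full
        // body: the retry must fail rather than duplicate data.
        assert!(validate_partial_response(&(5..10), 200, None).is_err());
    }

    #[test]
    fn rejects_mismatched_content_range() {
        // 206, but the server returned a different range than requested.
        assert!(validate_partial_response(&(5..10), 206, Some("bytes 0-9/10")).is_err());
    }

    #[test]
    fn no_retry_after_complete_delivery() {
        // All bytes delivered before the error: the remaining range is empty,
        // so no resume request should be issued at all.
        assert!(!should_resume(&(10..10)));
    }
}
```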

Let me know if the TestGetClient + ErrorAfterBody infrastructure seems like overkill - happy to simplify if there's a lighter-weight approach, or skip some test coverage if we'd rather avoid potentially brittle test complexity.

@sarum90 (Author) commented on Jan 26, 2026

Hi @tustvold

Any thoughts on this test coverage? Happy to work on it more as needed. I forgot about this PR last week, but I'd love to get this fixed.
