fix(skills): offload blocking filesystem IO in update_skill and serialize writes#3565
Open
ly-wang19 wants to merge 1 commit into
Open
fix(skills): offload blocking filesystem IO in update_skill and serialize writes#3565ly-wang19 wants to merge 1 commit into
ly-wang19 wants to merge 1 commit into
Conversation
…lize writes
PUT /api/skills/{skill_name} ran its config read-modify-write directly on the
asyncio event loop: load_skills(), ExtensionsConfig.resolve_config_path(),
get_extensions_config(), the open()+json.dump write to extensions_config.json,
and reload_extensions_config() are all blocking filesystem IO. make
detect-blocking-io flagged the config write as DIRECT_ASYNC.
Move the read-modify-write into a worker thread via asyncio.to_thread (a
"not_found" return signals 404), guarded by a module-level asyncio.Lock.
Offloading the RMW removes the implicit serialization the single-threaded event
loop gave it, so without the lock two concurrent PUTs could interleave and
clobber each other's toggle on the shared extensions_config singleton; the lock
restores that serialization (same pattern as bytedance#3552). Behavior is unchanged.
Per the blocking-io-guard SOP:
- Candidate: update_skill config write (FILE_OPEN, DIRECT_ASYNC) -> FIX+ANCHOR.
- Re-scan: the finding no longer appears for update_skill.
- Anchors (tests/blocking_io/test_skills_update_router.py):
- test_update_skill_does_not_block_event_loop: teeth verified red (pre-fix) ->
green (post-fix) under make test-blocking-io.
- test_update_skill_serializes_concurrent_writes: proves the lock; max in-flight
RMW is 1 with it, 2 without (red). Gate-exempt because the strict gate
serializes worker threads, masking the overlap the test needs.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
make detect-blocking-io(theblocking-io-guardSOP, #3503) flagsPUT /api/skills/{skill_name}:update_skillruns its config read-modify-writedirectly on the asyncio event loop —
load_skills(),ExtensionsConfig.resolve_config_path(),get_extensions_config(), theopen()+json.dumpwrite toextensions_config.json, andreload_extensions_config()are all blocking filesystem IO. Under concurrency aslow disk stalls every other in-flight request on that worker. Same class as the
offloads merged for the custom-agent router (#3457) and channels (#3529).
What changed
No caller-visible change. The read-modify-write now runs inside a nested sync
function via
asyncio.to_thread(a"not_found"return signals the 404),guarded by a module-level
asyncio.Lock.The lock matters: offloading the RMW removes the implicit serialization the
single-threaded event loop gave it, so without it two concurrent
PUTs couldinterleave and clobber each other's toggle on the shared
extensions_configsingleton (lost update). The lock restores that serialization — the same pattern
reviewed and accepted on #3552. Status codes and the response body are unchanged.
Surface area
PUT /api/skills/{name}(internal perf + concurrency-safety; request/response shape unchanged)Bug fix verification
Following the
blocking-io-guardSOP:update_skillconfig write —FILE_OPEN,DIRECT_ASYNC→ FIX+ANCHOR.make detect-blocking-iono longer listsupdate_skill.backend/tests/blocking_io/test_skills_update_router.pytest_update_skill_does_not_block_event_loop— drives the real handler; teeth verified RED (pre-fixBlockingError) → GREEN (offloaded), undermake test-blocking-io.test_update_skill_serializes_concurrent_writes— fires two concurrent calls and asserts a max in-flight RMW count of 1; teeth verified RED (max 2) when the lock is removed → GREEN with it. Marked@pytest.mark.allow_blocking_iobecause the strict gate serializes worker threads, which would otherwise mask the overlap the test needs to observe.Validation
In
backend/:python scripts/detect_blocking_io_static.py backend/app/gateway/routers/skills.py→update_skillfinding gone.uv run pytest tests/blocking_io/test_skills_update_router.py tests/test_skills_custom_router.py tests/test_skills_validation.py→ 32 passed (incl. the existingtest_update_skill_refreshes_prompt_cache_before_return; behavior preserved).uv run ruff check+uv run ruff format --check→ clean.AI assistance
Tool(s) used: Claude Code
How you used it: Ran the
blocking-io-guardSOP —make detect-blocking-ioto find the candidate, implemented theasyncio.to_threadoffload +asyncio.Lock, and wrote the two runtime anchors with their red→green teeth checks. I reviewed every line and verified locally with the detector, ruff, and the test runs above.