Skip to content

fix(skills): offload blocking filesystem IO in update_skill and serialize writes#3565

Open
ly-wang19 wants to merge 1 commit into
bytedance:mainfrom
ly-wang19:fix/skills-update-blocking-io
Open

fix(skills): offload blocking filesystem IO in update_skill and serialize writes#3565
ly-wang19 wants to merge 1 commit into
bytedance:mainfrom
ly-wang19:fix/skills-update-blocking-io

Conversation

@ly-wang19

Copy link
Copy Markdown
Contributor

Why

make detect-blocking-io (the blocking-io-guard SOP, #3503) flags
PUT /api/skills/{skill_name}: update_skill runs its config read-modify-write
directly on the asyncio event loop — load_skills(),
ExtensionsConfig.resolve_config_path(), get_extensions_config(), the
open() + json.dump write to extensions_config.json, and
reload_extensions_config() are all blocking filesystem IO. Under concurrency a
slow disk stalls every other in-flight request on that worker. Same class as the
offloads merged for the custom-agent router (#3457) and channels (#3529).

What changed

No caller-visible change. The read-modify-write now runs inside a nested sync
function via asyncio.to_thread (a "not_found" return signals the 404),
guarded by a module-level asyncio.Lock.

The lock matters: offloading the RMW removes the implicit serialization the
single-threaded event loop gave it, so without it two concurrent PUTs could
interleave and clobber each other's toggle on the shared extensions_config
singleton (lost update). The lock restores that serialization — the same pattern
reviewed and accepted on #3552. Status codes and the response body are unchanged.

Surface area

  • Backend APIPUT /api/skills/{name} (internal perf + concurrency-safety; request/response shape unchanged)

Bug fix verification

Following the blocking-io-guard SOP:

  • Candidate: update_skill config write — FILE_OPEN, DIRECT_ASYNCFIX+ANCHOR.
  • Re-scan (Step 2): after the fix make detect-blocking-io no longer lists update_skill.
  • Anchors + teeth (Step 5): backend/tests/blocking_io/test_skills_update_router.py
    • test_update_skill_does_not_block_event_loop — drives the real handler; teeth verified RED (pre-fix BlockingError) → GREEN (offloaded), under make test-blocking-io.
    • test_update_skill_serializes_concurrent_writes — fires two concurrent calls and asserts a max in-flight RMW count of 1; teeth verified RED (max 2) when the lock is removed → GREEN with it. Marked @pytest.mark.allow_blocking_io because the strict gate serializes worker threads, which would otherwise mask the overlap the test needs to observe.

Validation

In backend/:

  • python scripts/detect_blocking_io_static.py backend/app/gateway/routers/skills.pyupdate_skill finding gone.
  • uv run pytest tests/blocking_io/test_skills_update_router.py tests/test_skills_custom_router.py tests/test_skills_validation.py32 passed (incl. the existing test_update_skill_refreshes_prompt_cache_before_return; behavior preserved).
  • uv run ruff check + uv run ruff format --check → clean.

AI assistance

Tool(s) used: Claude Code

How you used it: Ran the blocking-io-guard SOP — make detect-blocking-io to find the candidate, implemented the asyncio.to_thread offload + asyncio.Lock, and wrote the two runtime anchors with their red→green teeth checks. I reviewed every line and verified locally with the detector, ruff, and the test runs above.

  • I've read and understand every line of this change and take responsibility for it — it's not unreviewed AI output.

…lize writes

PUT /api/skills/{skill_name} ran its config read-modify-write directly on the
asyncio event loop: load_skills(), ExtensionsConfig.resolve_config_path(),
get_extensions_config(), the open()+json.dump write to extensions_config.json,
and reload_extensions_config() are all blocking filesystem IO. make
detect-blocking-io flagged the config write as DIRECT_ASYNC.

Move the read-modify-write into a worker thread via asyncio.to_thread (a
"not_found" return signals 404), guarded by a module-level asyncio.Lock.
Offloading the RMW removes the implicit serialization the single-threaded event
loop gave it, so without the lock two concurrent PUTs could interleave and
clobber each other's toggle on the shared extensions_config singleton; the lock
restores that serialization (same pattern as bytedance#3552). Behavior is unchanged.

Per the blocking-io-guard SOP:
- Candidate: update_skill config write (FILE_OPEN, DIRECT_ASYNC) -> FIX+ANCHOR.
- Re-scan: the finding no longer appears for update_skill.
- Anchors (tests/blocking_io/test_skills_update_router.py):
  - test_update_skill_does_not_block_event_loop: teeth verified red (pre-fix) ->
    green (post-fix) under make test-blocking-io.
  - test_update_skill_serializes_concurrent_writes: proves the lock; max in-flight
    RMW is 1 with it, 2 without (red). Gate-exempt because the strict gate
    serializes worker threads, masking the overlap the test needs.
@github-actions github-actions Bot added area:backend Gateway / runtime / core backend under backend/ needs-validation Touches front/back contract surface; needs real-path validation risk:high High risk: backend API, agents, sandbox, auth, deps, CI size/M PR changes 100-300 lines labels Jun 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:backend Gateway / runtime / core backend under backend/ needs-validation Touches front/back contract surface; needs real-path validation risk:high High risk: backend API, agents, sandbox, auth, deps, CI size/M PR changes 100-300 lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant