Skip to content

v2.5.0 production hardening — 16 P0/P1/P2 bugs from 4 audit passes#23

Merged
vietnguyentuan2019 merged 11 commits into
mainfrom
refactor/p0-p1-production-hardening
May 18, 2026
Merged

v2.5.0 production hardening — 16 P0/P1/P2 bugs from 4 audit passes#23
vietnguyentuan2019 merged 11 commits into
mainfrom
refactor/p0-p1-production-hardening

Conversation

@vietnguyentuan2019
Copy link
Copy Markdown
Contributor

Summary

Late-cycle proactive review on the cancellation × lock-order × retry-contract × data-integrity axes surfaced 16 production bugs that the existing 1300+ test suite missed. Each fix is pinned by a V250…Test.kt regression test verified to fail under a temporary revert.

  • 858 iOS + 493 Android = 1351 tests, 0 failures after fixes
  • ./gradlew check clean across all 225 tasks (lint + tests + verification)
  • Maven publish dry-run succeeds for all 5 targets (KMP common + Android + 3 iOS)

Bug breakdown

P0 — data loss / crash (6) · BaseKmpWorker overflow-file delete on cancel · iOS dynamic+dedicated dispatcher dropped Retry · replaceChainAtomic mutex bypass race · legacy queue migration OOM · built-in worker shouldRetry cascade

P1 — correctness / silent state corruption (5) · NSFileCoordinator bypass on timeout · nested withTimeout outer-cancel misattribution (chain + task layer) · wall-clock → monotonic for elapsed math · StorageMigration swallowed CancellationException

P2 — quota waste / API contract (3) · DynamicTaskDispatcher placebo scope · Failure(shouldRetry=false) chain-abandon (+ latent deleteChainProgress buffer-eviction) · iOS battery guard race with host UI

BAD — code smell / future risk (2) · NPE/IAE classified as permanent → silent data loss · IosFileStorage.close() skipped cancel+join on flush throw

Full table in release notes. Migration guidance for behavioral changes in docs/MIGRATION_V2.5.0.md §5–§12.

Verification protocol

Each fix verified TWICE:

  1. With fix → regression test passes
  2. With TEMP-REVERT → regression test fails with exact bug-symptom message
  3. Restore → passes again

One cascade caught and fixed during the audit itself: BUG 13's Failure(shouldRetry=false) semantic broke 4 built-in workers that used the default — caught and fixed before merge as L3b-1.

Deferred to v2.5.1 patch (in roadmap)

Two structural improvements documented in docs/ROADMAP.md — neither blocks v2.5.0:

  1. withTimeoutOrNull over the elapsedMs < timeout heuristic (provably correct vs CPU-starvation-defeatable)
  2. File-backed OverflowFileRegistry for multi-process Android hosts

Test plan

  • ./gradlew :kmpworker:allTests — 1351 tests pass
  • ./gradlew check — full project verification, 225 tasks, no lint warnings
  • ./gradlew :kmpworker:publishAllPublicationsToMavenCentralLocalRepository — Maven artifacts build + sign
  • Risk-matrix review — every behavioral change cross-checked against all callers (surfaced 1 additional sibling bug in IosWorkerDiagnostics, also fixed)
  • Maven Central upload (manual after merge)

…tentCodes

String.hashCode() produced collisions for UUID-style task IDs — birthday paradox
gives ~50% collision at ~65k tasks on a 32-bit hash. Both NativeTaskScheduler and
AlarmBootReceiver now call PendingIntentCodes.forTaskId() so FLAG_UPDATE_CURRENT
resolves to the same slot after reboot instead of double-arming or silently missing.

fix(android): BaseAlarmReceiver — structured scope + withTimeout(8s) lifecycle

Replace bare CoroutineScope(IO).launch (leaked work past BroadcastReceiver lifetime)
with a per-call SupervisorJob scope cancelled in finally. Add withTimeout(workTimeoutMs=8s)
so a hung worker cannot exhaust the ~10s receiver budget. pendingResult.finish() and
overflow-file cleanup are both guarded against secondary exceptions.

fix(ios): FileCompressionWorker fail-fast by default; opt-in fallback

The iOS implementation was silently copying the input file uncompressed. Default
behavior now returns WorkerResult.Failure with an actionable message. Set
FileCompressionConfig.allowIosUncompressedFallback = true to restore the copy
behavior (for demo chains / dev builds). Adds FileCompressionWorkerIosTest (3 cases).

feat(common): WorkerResult.Retry(reason, delayMs, attemptCap)

New sealed variant alongside the legacy Failure(shouldRetry=true). Android maps to
Result.retry() with attempt-cap ceiling; iOS ChainExecutor and SingleTaskExecutor
handle the new branch (telemetry only for now — per-step delay/cap on iOS tracked
in ROADMAP.md). Fail-fast for programming errors (SerializationException,
IllegalArgumentException) separated from transient network errors in HttpDownloadWorker.

feat(http): HttpDownloadWorker resumable downloads via Range/partial file

Partial in-progress download persisted to <savePath>.partial across attempts.
Next invocation sends Range: bytes=N- and handles 206/200/416/5xx distinctly:
- 206 → append to partial
- 200 → server ignored Range, overwrite (log warning)
- 416 → partial already complete, finalize
- 5xx → WorkerResult.Retry(delayMs=30s), partial preserved
HttpDownloadConfig gains resumable=true (default) and optional maxBytes cap.
Adds 5 test cases to BuiltinWorkersTest.

fix(security): block CGNAT 100.64.0.0/10 and 0.0.0.0/8 in SSRF validator

100.64.0.0/10 (RFC 6598) covers Tailscale and ISP carrier-grade NAT — reachable
from a device on mobile networks. 0.0.0.0/8 closes the "this network" gap.
Boundary tests confirm 100.63.x.x and 100.128.x.x remain unblocked.
…ackground URLSession (v2.5.0)

New built-in workers:
- ParallelHttpDownloadWorker — N-chunk concurrent RFC 7233 byte-range downloads with per-chunk
  .partN resume, SHA-256/MD5/SHA-1/SHA-512 checksum verification, DuplicatePolicy
  (OVERWRITE/RENAME/SKIP), sequential fallback when server lacks Accept-Ranges
- ParallelHttpUploadWorker — per-file parallel multipart upload with Semaphore concurrency
  cap, per-file retry on 5xx/network errors, structured JSON result (uploadedCount/failedCount)
- IosBackgroundDownloadWorker + IosBackgroundUrlSessionManager — NSURLSession background
  download that survives app termination; completion/failure emitted via TaskEventBus

HttpDownloadWorker enhancements:
- Checksum verification via Okio HashingSource (ChecksumAlgorithm enum: MD5/SHA1/SHA256/SHA512)
- DuplicatePolicy: SKIP short-circuits before any network call; RENAME probes up to 10 000 suffixes
- .partial resume with appendingSink; 416 + partial bytes → finalize as complete
- 5xx responses → WorkerResult.Retry(delayMs=30_000); parse/config errors → Failure (no retry)

iOS chain executor — honor WorkerResult.Retry.delayMs / .attemptCap:
- ChainProgress gains stepRetryCounts: Map<Int,Int>; withStepAttempt()/stepAttempts() helpers
- ChainExecutor accumulates nextBgTaskDeferMs per batch (max of all retry delays requested);
  exposes requestedNextBgTaskDelayMs for Swift BGProcessingTaskRequest.earliestBeginDate
- AttemptCap enforcement: step abandoned and chain failed when cap reached

Android BaseAlarmReceiver — structured coroutine lifecycle (regression-tested):
- Extracted runHandleAlarmScope(onFinish: () -> Unit) so tests can drive it without a real
  BroadcastReceiver.PendingResult (Robolectric adversarial suite: 5 tests)
- withTimeout(workTimeoutMs=8s) cancels stuck work; TimeoutCancellationException logged as warning
- scope.cancel() in finally tears down any unstructured child launches
- pendingResult.finish() guarded in try/catch to tolerate double-finish

Adversarial tests:
- PendingIntentCodesAdversarialTest: birthday-paradox proof that CRC32 ≤ String.hashCode
  collisions across 10 000 UUID-style IDs; "Aa"/"BB" hashCode=2112 collision demonstrated
- BaseAlarmReceiverLifecycleTest (Robolectric): stuck work cancelled, finish called once on
  throw/timeout, overflow file deleted in finally, consecutive runs don't share scope
- ChainProgressRetryTest: per-step attempt counter independent of chain-level retryCount
- HttpDownloadChecksumAndDuplicateTest, ParallelHttpDownloadWorkerTest, ParallelHttpUploadWorkerTest

Demo app updated with 4 new DemoCards; ROADMAP.md v2.5 P1 items marked done; v2.6/v3.0 added
… iOS BGTask hard-limits doc

Android — configurable FGS type (camera-app blocker for Android 14/15):
- `KmpHeavyWorker` is now `open` with `protected open val foregroundServiceType: Int`
  (default `FGS_DATA_SYNC`). Subclasses set `FGS_MEDIA_PROCESSING` (Android 15+
  image/video transcoding), `FGS_CAMERA`, `FGS_LOCATION`, … without copy-pasting
  the whole `createForegroundInfo()` body.
- Companion-object aliases (`FGS_DATA_SYNC`, `FGS_MEDIA_PROCESSING`, `FGS_CAMERA`,
  `FGS_LOCATION`, `FGS_MEDIA_PLAYBACK`, `FGS_CONNECTED_DEVICE`) so host code does
  not need `import android.content.pm.ServiceInfo`.
- `FGS_MEDIA_PROCESSING` is pinned to the documented literal `0x1000` because
  `ServiceInfo.FOREGROUND_SERVICE_TYPE_MEDIA_PROCESSING` is API 35+ and not on
  the compile classpath at compileSdk=34. Adversarial test
  (`KmpHeavyWorkerFgsTypeTest`) pins each alias and asserts no collisions.
- Bumped FGS type API range from API 31+ to API 29+ (Q) — `ForegroundInfo`'s
  type-aware constructor has always existed on API 29; gating at 31 was overly
  conservative and lost coverage on Android 10/11 devices.

Docs:
- `docs/ANDROID_FGS_GUIDE.md` — manifest snippets per FGS type
  (`mediaProcessing`, `camera`, `location`, `dataSync`), Android 15 6-hour
  `dataSync` cap, runtime permission checklist, ADB testing recipes.
- `docs/IOS_BGTASK_LIMITS.md` — pins down the four physical limits no library
  can break: opportunistic scheduling (`dasd` decides, not us), 30 s App
  Refresh budget + `0x8badf00d` SIGKILL semantics, headless DI cold-start
  drain on BGTask budget, ZIP codec absence on Kotlin/Native. Each section
  spells out what the library can guarantee vs. what host apps must design
  around.
- README docs index links both guides + the iOS background URLSession guide.

Roadmap:
- v2.5 §1 FGS guidance — marked ✅ shipped (was v2.6 ⏳).
- v2.5 P1.5 — iOS chain retry honoring marked ✅ shipped (was v2.5 stretch ⏳).
- v2.6 §7 — iOS ZIP via `libz.dylib` cinterop added (closes the camera-app
  blocker called out in IOS_BGTASK_LIMITS §4).

iOS internal: `IosBackgroundUrlSessionManager` delegate extracted from nested
`object Delegate` to top-level class for testability — internals made
`internal` so the delegate can access them from the same module.
…2.5.4 guide

QA/QC review (Senior Dev lens) flagged four production risks the existing test
suite did not pin down. This commit addresses each one:

1. File-size compaction trigger for AppendOnlyQueue
   - The pre-v2.5 trigger fired only at ≥ 80 % processed-item ratio. An
     enqueue-heavy workload (e.g. 5 000 items queued, 50 dequeued) never
     compacts and the file grows unbounded.
   - New trigger: file > 5 MB AND ≥ 20 % processed AND ≥ 50 processed items.
     Either trigger reclaims slack; compaction itself only removes processed
     items so the file-size path is a no-op when head=0.
   - Covered by `QueueScaleStressTest.fileSizeCompaction_reclaimsSpaceAfterDequeue`.

2. BackwardCompatibilityTest — pins on-disk forward-compat contract
   - Loads a v2.4.3-shaped `ChainProgress` JSON (no `stepRetryCounts` field)
     and asserts v2.5.0 reader handles it without data loss; `stepRetryCounts`
     defaults to `emptyMap()` per the additive schema rule.
   - Pins `ignoreUnknownKeys = true` against accidental flip via a future
     refactor that would silently brick pending tasks on app upgrade.
   - Pins null preservation on `lastFailedStep` / `lastError`.
   - Pins corrupt-JSON self-heal to null (no exception propagation).

3. QueueScaleStressTest — guards against O(N²) regressions
   - 2 000 enqueue + 2 000 dequeue cycle with a 120 s soft ceiling. Sized to
     stay CI-friendly while still surfacing a "load whole file into RAM"
     regression (which would push runtime well past 5 min).
   - Companion test exercises the new file-size compaction path with a 6 MB
     queue.

4. docs/APPLE_APP_STORE_REVIEW_GUIDELINES.md
   - Spells out App Store §2.5.4 risks for dynamic task dispatch under a
     single `BGTaskScheduler` identifier. Provides safe vs. rejection-bait
     patterns with concrete examples.
   - Compliance checklist (Info.plist audit, worker→identifier mapping,
     description / privacy-policy alignment, cellular constraint).
   - What the library can guarantee vs. what host apps are responsible for.

Roadmap section P1.6 lists each item and its regression test owner.
README docs index links the new App Store guide.
…urvive (P0)

QA review (Senior Dev lens) caught a critical bug in
IosBackgroundUrlSessionManager: savePaths and taskNames were stored only in
process memory. The whole point of URLSessionConfiguration.background is that
downloads keep running inside nsurlsessiond after the app is killed; iOS then
cold-launches the app to deliver completion events.

Pre-v2.5 sequence (broken):
  1. enqueueDownload → savePaths[id] = config.savePath in MutableMap
  2. User swipes the app away → process dies, MutableMap wiped
  3. Download finishes → iOS cold-launches the app, delegate fires
  4. delegate looks up savePaths[id] → null
  5. file orphaned in NSTemporaryDirectory; no completion event emitted
  6. WorkManager chain waiting on the download hangs forever

Fix:
- New BackgroundDownloadStateStore — disk-backed Map<key, Entry> where
  key = "sessionId:taskId" (composite key disambiguates iOS's per-session task
  identifier recycling).
- Entries persisted to <AppSupport>/dev.brewkits.kmpworkmanager/bg_url_session_state.json
  via writeToURL(atomically: true) so power loss mid-write keeps the previous
  version intact.
- enqueueDownload writes the entry BEFORE task.resume() so the daemon never
  owns a download we have no record of.
- Delegate uses synchronous getSync() because iOS deletes the temp file the
  moment didFinishDownloadingToURL returns — there's no time to suspend on a
  coroutine Mutex. In-memory @volatile cache mirrors the disk state and is
  loaded once on cold-launch, then served from memory for the rest of the
  process lifetime. Cache invalidated on every write.
- New sweepStaleStateOnLaunch() called from host AppDelegate cleans entries
  older than 7 days so the file does not grow unboundedly across years of
  installs.
- Schema version field on the state file for future migrations; current
  additive changes survive via ignoreUnknownKeys = true.

Adversarial coverage (BackgroundDownloadStateStoreTest):
  - Round-trip put → get.
  - Cold-launch simulation via invalidateInMemoryCacheForTest — proves
    getSync() hits disk, not the wiped in-memory cache.
  - (sessionId, taskId) collision: two sessions with task identifier=1 each
    keep their own entries.
  - Idempotent remove + survives cache invalidation.
  - sweepStale drops old entries, keeps fresh.

Doc: IOS_BACKGROUND_URL_SESSION.md gains a "How cold-launch survival actually
works" section with the timeline diagram and the rationale for the
synchronous disk read in the delegate.
A second-pass review of the v2.5.0 branch by a Senior Dev / QC Lead found
four critical bugs the first pass missed. All four are real (independently
verified against source), and all four are now fixed with regression tests.

## P0 — Data loss

1. `AppendOnlyQueue.dequeue` wiped the entire queue on CRC corruption.
   `readSingleRecordWithValidation` correctly set `corruptionOffset` to the
   precise byte where the corrupt record started; but the `dequeue`
   null-handling branch then unconditionally overwrote `corruptionOffset =
   0UL`, forcing `truncateAtCorruptionPoint` to take the full-reset path.
   Every pending task ahead of the corrupt one was lost.
   Fix: added `else if (isQueueCorrupt)` guard that preserves the precise
   offset set by the validator. The original "external truncate/replace"
   full-reset path remains for the case where read returned null WITHOUT
   `isQueueCorrupt` being set (file actually shrank under us).
   Coverage: `AppendOnlyQueueCrcCorruptionTest`.

2. `BackgroundDownloadStateStore.getSync` cache stale-write race re-introduced
   the v2.5.0 cold-launch orphan bug. The pre-fix pattern
   `cache ?: readUnlocked().also { cache = it }` had a window between the
   volatile null-check and the `.also` write-back; a concurrent put/remove
   landing in that window would publish a fresh cache, then the also-block
   would clobber it with the stale disk snapshot.
   Fix: compare-and-set publish — only write to `cache` if it's still null
   when we're ready to publish. Return whichever map is freshest (the new
   cache if a concurrent put landed, else our disk snapshot).
   Coverage:
   `BackgroundDownloadStateStoreTest.getSync_doesNotClobberCacheFromConcurrentPut`.

## P1 — Host harm / retry storm

3. `ChainExecutor` clobbered host app's battery monitoring state.
   `UIDevice.batteryMonitoringEnabled = true; …; = false` ran on every chain
   execution, unconditionally turning the flag off. Host apps that enable it
   for their own UI (settings screen showing battery level) lost their
   updates.
   Fix: capture `hostHadMonitoringEnabled` before flipping it; only restore
   to `false` if WE were the ones who turned it on. Host's prior state is
   preserved.

4. `ParallelHttpDownloadWorker` infinite retry storm on disk-full.
   `mergeChunks` failure (e.g. disk full) deleted only `.merged` but kept
   `.partN` files. On retry, `downloadOneChunk` saw all parts at expected
   size and skipped the network, then `mergeChunks` failed again with the
   same error → infinite loop, no forward progress, even after the user
   freed disk space (because parts still occupied it).
   Fix (two layers):
   - Merge-failure catch now purges `.partN` too. Retry must re-download,
     guaranteeing forward progress.
   - Outer Retry now carries `attemptCap = 4`. Even if cleanup somehow
     fails, the chain abandons after ~1 minute instead of looping forever.
   Coverage:
   `ParallelHttpDownloadWorkerTest.parallel_mergeFailure_cleansUpAllParts_breaksRetryStorm`,
   `parallel_outerRetryHasAttemptCap_toBreakInfiniteLoop`.

## Release docs

Also in this commit (pre-publish housekeeping):
- `CHANGELOG.md` — full v2.5.0 entry with Added / Fixed / Changed / Docs.
- `docs/MIGRATION_V2.5.0.md` — upgrade guide from v2.4.x.
- `docs/release-notes/v2.5.0-RELEASE-NOTES.md` — narrative release notes.
- Version bumped from 2.4.3 → 2.5.0 in install snippets across README,
  TROUBLESHOOTING, platform-setup, quickstart, ARCHITECTURE.
- README "What's new in v2.5.0" section added; v2.4.3 section preserved
  as history.

## Verification

- `./gradlew :kmpworker:allTests` — BUILD SUCCESSFUL (Android + iOS).
- `./gradlew :composeApp:assembleDebug` — APK built.
- `./gradlew :composeApp:compileKotlinIosSimulatorArm64` — iOS demo compiles.
…lish)

A third-pass review (QC Lead / Senior Architect) found four issues that
are not data-loss / crash-on-happy-path but degrade developer experience
and could create "false sense of security" or opaque failure modes:

5. **iOS Simulator fallback false sense of security** —
   `NativeTaskScheduler.submitTaskRequest` uses `delay()` to simulate
   BGTaskScheduler on the simulator. Coroutine `delay()` freezes when iOS
   suspends the app, so a developer's simulator test of "10-minute task"
   only ticks while the app is foreground. Production iOS uses wall-clock
   opportunistic scheduling decided by `dasd`. The mismatch is invisible
   to the developer until they ship and find the schedule doesn't fire.
   Fix: prominent warning log at every fallback invocation + KDoc spelling
   out the divergence.

6. **Fragile test-mode detection** —
   `NSProcessInfo.processInfo.processName.endsWith("test.kexe")` was the
   sole detection heuristic in 4 places (`IosFileCoordinator`,
   `IosFileStorage`, `NativeTaskScheduler`). A future Kotlin/Native binary
   naming change would silently break all of them with no compile error.
   Fix: extracted `IosTestEnvironment` as the single source of truth.
   Cascade of signals:
     1. `KMPWORKMANAGER_TEST_MODE=1` env var (explicit, recommended).
     2. `XCTestConfigurationFilePath` (Xcode + XCTest harness).
     3. `XCInjectBundleInto` (XCTest, alt signal).
     4. `processName.endsWith("test.kexe")` (legacy K/N suffix).
   Any one signal trips test mode; cached via `by lazy`. Four call sites
   collapsed to one detector.

7. **Android FGS crash trap** — `KmpHeavyWorker.doWork()` called
   `setForeground()` unguarded. On Android 14+, an FGS type mismatch
   between `foregroundServiceType` and the manifest throws
   `ForegroundServiceStartNotAllowedException` (subclass of
   `IllegalStateException`) which crashed the worker with an opaque
   WorkManager stack trace.
   Fix: try-catch for `SecurityException` (permission missing) and
   `IllegalStateException` (type not declared). Both log an actionable
   diagnostic pointing to docs/ANDROID_FGS_GUIDE.md and return
   `Result.failure()` instead of crashing.

8. **Exact alarm "user-ignores-notification = task waits indefinitely"
   contract** — `checkAndExecuteMissedExactAlarms` KDoc previously mentioned
   notification permission and ignored-notification cases in passing, but
   did not lead with the all-caps warning that iOS has NO equivalent of
   `AlarmManager.setExactAndAllowWhileIdle`. A developer building a
   medication reminder / alarm clock / financial-transaction trigger
   needed to surface this risk at the top of the KDoc, not midway through.
   Fix: KDoc now leads with the warning + explicit do-NOT-use-for list.
   `docs/IOS_BGTASK_LIMITS.md` §5 added with the same content.

Test runs after fixes:
- `:kmpworker:testDebugUnitTest` — BUILD SUCCESSFUL.
- `:kmpworker:iosSimulatorArm64Test` — BUILD SUCCESSFUL (4m 39s).
Fourth-pass audit (Senior Tech Lead) found three more issues. All verified
against source and fixed with regression tests. (One claimed issue —
"WorkManager < 15 min periodic crash" — was already guarded by
`TaskTrigger.Periodic.init { require(intervalMs >= MIN_INTERVAL_MS) }`
on commonMain since v2.4, so no fix needed there.)

## P0 — iOS BGTask budget exhaustion

9. `AppendOnlyQueue.compactQueue` OOM on large queues. The compaction step
   loaded all unprocessed records into a `mutableListOf<String>()` before
   writing them to the compacted file. With `MAX_QUEUE_SIZE = 10 000` and
   typical 1–4 KB JSON payloads, peak RAM = 10–40 MB — above the iOS
   background-task budget (Watchdog kills with `EXC_RESOURCE /
   RESOURCE_TYPE_MEMORY` at ~30 MB on older devices). This broke the O(1)
   RAM contract that the rest of the queue carefully preserved.

   Fix: new `streamCompactionToFile` helper. Opens both src and dst file
   handles, seeks past the head pointer (via cache when available, else
   sequential skip-read), then reads → writes one record at a time. Peak
   RAM is bounded by the largest single record, not by N records.

## P1 — Thread starvation + file leak

10. `compactionScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)`
    used the CPU-bounded `Default` dispatcher for blocking file I/O
    (`NSFileHandle`, `NSFileManager.replaceItemAtURL`). On a saturated
    thread pool this starved every other `Default`-bound coroutine. The
    library has `IosDispatchers.IO` (GCD global queue, unbounded worker
    spawn) specifically for this purpose — used everywhere else in the
    iOS code except here.
    Fix: switched `compactionScope` to `IosDispatchers.IO`.

11. `NativeTaskScheduler.cancel(id)` on Android orphaned the overflow
    `cacheDir/kmp_input_<uuid>.json` file when the input JSON exceeded
    the 8 KB inline-data threshold. The file lived in cacheDir until the
    24 h `cleanupStaleAlarms` sweep ran. Apps that frequently schedule
    + cancel large-input tasks (camera draft "save → cancel → save → cancel"
    workflows) accumulated megabytes of orphaned files in the meantime.

    Fix: new `OverflowFileRegistry` — small `SharedPreferences`-backed
    `taskId → absolutePath` map. Populated on schedule (sync `commit()`
    so the path survives a process kill); consumed + file-deleted on
    cancel. Wired into both call sites:
    - `scheduleExactAlarm` (AlarmManager path, line 309)
    - `buildWorkData` (WorkManager path, line 384, now takes `taskId`)

    Coverage: `OverflowFileRegistryTest` — 6 tests including a 100-cycle
    "no-leak" stress assertion that pins zero residue after schedule +
    cancel loops.

## Out of scope (claimed but already correct)

- "WorkManager silently coerces < 15 min periodic intervals" — already
  enforced in `commonMain` `TaskTrigger.Periodic.init { require(intervalMs
  >= MIN_INTERVAL_MS) }` with a clear error message ("intervalMs must be
  >= 15 minutes. Android WorkManager silently ignores shorter intervals;
  this validation surfaces the error at the call site instead"). No
  change needed.

## Verification

- `./gradlew :kmpworker:testDebugUnitTest` — BUILD SUCCESSFUL.
- `./gradlew :kmpworker:iosSimulatorArm64Test` — BUILD SUCCESSFUL (4m 46s).
Four targeted audit passes (initial review + 3 proactive 4-lens passes) on
the cancellation/lock-ordering/error-classification/data-integrity axes
surfaced 16 production bugs that the existing 1300+ test suite missed —
plus 1 latent bug (deleteChainProgress buffer eviction) surfaced by a
regression test. Each fix is pinned by a regression test verified to
fail under a temporary revert.

## P0 (data loss / crash)

- **BUG 1**: BaseKmpWorker.doWorkInternal's `finally` deleted the overflow
  input file on CancellationException → WorkManager rerun lost user
  payload. The `wasCancelled` flag was set but never read by the delete
  guard. Pin: V250CancellationPreservesOverflowFileTest (Android).

- **BUG 8**: iOS DynamicTaskDispatcher silently dropped WorkerResult.Retry
  and Failure(shouldRetry=true) → tasks vanished on flaky-network
  failures. Mirror WorkManager's contract: re-enqueue with kmpAttemptCount
  capped at DEFAULT_ATTEMPT_CAP=5. Pin: V250DynamicRetryTest (6 cases).

- **BUG 9**: IosFileStorage.replaceChainAtomic bypassed enqueueMutex →
  concurrent enqueueChain could race through the size check, exceeding
  MAX_QUEUE_SIZE. Acquire enqueueMutex INSIDE queueMutex per the
  documented lock order. Pin: V250ReplaceChainMutexRaceTest reproduces
  1001 > 1000 under revert.

- **BUG 12**: IosBackgroundTaskHandler.handleSingleTask (entry point for
  dedicated-identifier BGTasks — the common iOS user path) silently
  dropped Retry / shouldRetry. Same bug class as BUG 8, different code
  path. Pin: V250HandleSingleTaskRetryTest (6 cases).

- **BUG 14**: AppendOnlyQueue.migrateFromTextToBinary loaded the entire
  legacy file via NSString.stringWithContentsOfFile + split → 3× RAM
  spike → EXC_RESOURCE silent kill → users permanently stuck on pre-v2.5
  version. Streaming line-by-line via NSFileHandle. Pin:
  V250LegacyMigrationStreamingTest.

- **L3b-1** (cascade from BUG 13): HttpUpload, HttpSync, FileCompression,
  ParallelHttpUpload caught generic Exception and returned Failure() with
  default shouldRetry=false. Combined with BUG 13's permanent-failure
  abandonment, a single transient network blip would abandon any chain
  containing these workers. Add explicit shouldRetry=true. Pin:
  V250BuiltinRetryContractTest (common).

## P1 (correctness / silent state corruption)

- **BUG 2**: iOS NSFileCoordinator bypassed the lock on timeout (returned
  `block(url)` directly) → App↔Extension interleaved writes corrupt
  queue.jsonl (CRC32 fail → silent record loss). Throw
  FileCoordinationTimeoutException instead so callers retry. Pin:
  V250FileCoordinatorTimeoutTest.

- **BUG 10**: ChainExecutor's inner `catch (TimeoutCancellationException)`
  mis-attributed outer-scope cancellations as chain-timeout AND swallowed
  them, defeating the outer's cancellation. Disambiguate by `elapsedMs <
  chainTimeout` and rethrow if outer-caused. Pin:
  V250NestedTimeoutMisattributionTest.

- **BUG 11**: Same outer-cancel misattribution at executeTask's deeper
  `withTimeout(taskTimeout)` layer — emitted spurious "Timeout after Xms"
  telemetry to TaskEventBus when an outer scope cancelled a worker
  mid-delay. Pin: V250TaskTimeoutMisattributionTest.

- **BUG 15**: ChainExecutor used wall-clock (NSDate.timeIntervalSince1970)
  for elapsed-time math. NTP sync mid-execution would corrupt the BUG
  10/11 outer-cancel disambiguation, the batch budget check, and the
  cleanup-duration adaptive-budget feedback. Use kotlin.time.TimeSource.
  Monotonic for elapsed math; keep wall-clock only for public timestamps
  (telemetry startedAtMs, ExecutionMetrics endTime, iOS BGTask deadlines).

- **L4b-1**: StorageMigration.migrate catch-all swallowed
  CancellationException, breaking structured concurrency. Rethrow CE
  cleanly; partial-state semantics unchanged (migration flag stays unset,
  next launch retries idempotently).

## P2 (quota waste / API contract)

- **BUG 3**: DynamicTaskDispatcher declared an unused SupervisorJob +
  CoroutineScope. requestShutdownSync (iOS BGTask expirationHandler — a
  non-suspend callback) called job.cancel() on that unused scope, so the
  cancel never reached the in-flight worker. Capture parent Job via
  AtomicReference<Job?> on entry to executePendingTasks. Pin:
  V250DispatcherShutdownTest.

- **BUG 13**: ChainExecutor.executeStep dropped Failure.shouldRetry —
  workers returning Failure(shouldRetry=false) (terminal contract) still
  consumed the chain-level retry budget. Plumb isPermanentFailure through
  TaskOutcome → StepOutcome → executeChain's abandon path. Surfaced a
  latent IosFileStorage.deleteChainProgress buffer-eviction bug (deleted
  disk file but left in-memory buffer to be re-flushed) — also fixed.
  Pin: V250PermanentFailureAbandonTest.

- **BUG 16**: iOS ChainExecutor battery guard toggled
  UIDevice.isBatteryMonitoringEnabled non-atomically + from non-main-
  thread, racing with the host app's battery UI. Now read-only; if the
  host hasn't enabled monitoring, skip the guard. Documented as opt-in.

## BAD (code smell / future risk)

- **BUG 6**: BaseKmpWorker classified NPE / IllegalArgumentException /
  NumberFormatException as permanent → workers dropped on transient
  null/empty server responses from third-party SDKs. Narrow the permanent
  list to truly-programming-error types (SerializationException,
  ClassNotFoundException, InvocationTargetException,
  InstantiationException). Pin: V250ExceptionClassificationTest.

- **BUG 7**: IosFileStorage.close() wrapped flushNow + cancel + join in a
  single try/catch — if flushNow threw (disk full, EACCES,
  NSFileCoordinator error), cancel + join were skipped → coroutine leak.
  Move cancel + join to `finally`. Pin: V250CloseFinallyTest.

## Verification

Each fix was verified twice: pass with fix; fail with exact regression
message under TEMP-REVERT; restore. Full suite: 858 iOS + 493 Android
tests, 0 failures.

Test hooks added to production code (all `internal`, default off, not
exposed in public API):
- IosFileStorage.testForceFlushFailure (for V250CloseFinallyTest)
- IosFileStorage.testEnqueueInternalDelayMs (for V250ReplaceChainMutexRaceTest)
- IosFileStorage.isBackgroundScopeActive (read-only inspector)
- IosFileCoordinator.coordinateBlocking (visibility internal, was private)
- IosBackgroundTaskHandler.handleOneTimeTaskResult (visibility internal)

Documentation:
- Migration sections §5–§12 added to docs/MIGRATION_V2.5.0.md
- v2.5.0 release notes expanded with two "Late-cycle audit pass" tables
Two technically-improvable structural choices surfaced in the late-cycle
audit but don't rise to v2.5.0 blockers — track them as the v2.5.1 patch
scope so they don't get lost.

1. `elapsedMs < timeout` heuristic → `withTimeoutOrNull` refactor.
   The BUG 10 / BUG 11 fixes use an elapsed-time heuristic to disambiguate
   inner-vs-outer cancellation. Correct under normal load but defeatable
   by CPU starvation / iOS process throttling. `withTimeoutOrNull` is
   provably correct: null = inner fired, TCE = outer fired.

2. File-backed `OverflowFileRegistry` for multi-process Android apps.
   SharedPreferences `MODE_PRIVATE` doesn't sync across processes
   real-time; cross-process register/consume can leak overflow files in
   cacheDir until the 24h janitor sweeps. File-per-entry with atomic
   rename eliminates the cache-coherency dependency.

Both deferred because (a) no observed correctness regression in 1351-test
suite, (b) the existing safety nets (heuristic works in normal load;
24h janitor catches leaked files) bound the worst case, and (c) both
refactors carry non-trivial risk that's better landed in isolation
post-dogfood.
…ANGELOG

Pre-publish risk-matrix review surfaced a sibling site for BUG 16:
`IosWorkerDiagnostics.getSystemHealth` had the same auto-toggle pattern as
`ChainExecutor.executeTask` — unconditionally setting
`UIDevice.isBatteryMonitoringEnabled = true` then `= false` in finally. Same
race + threading concerns; apply the same opt-in fix here.

If the host has not enabled battery monitoring, `getSystemHealth` reports
`batteryLevel = -1f` (rendered as 0% after the existing `coerceIn(0, 100)`
on line 72) and `isCharging = false`. Callers should treat 0 as "unavailable".

Also: CHANGELOG updated with the 16-bug Fixed section pointing at the
release notes and migration §5–§12 for full details.
@vietnguyentuan2019 vietnguyentuan2019 merged commit c3c6054 into main May 18, 2026
6 of 13 checks passed
@vietnguyentuan2019 vietnguyentuan2019 deleted the refactor/p0-p1-production-hardening branch May 18, 2026 06:02
@github-advanced-security
Copy link
Copy Markdown

You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool.

What Enabling Code Scanning Means:

  • The 'Security' tab will display more code scanning analysis results (e.g., for the default branch).
  • Depending on your configuration and choice of analysis tool, future pull requests will be annotated with code scanning analysis results.
  • You will be able to see the analysis results for the pull request's branch on this overview once the scans have completed and the checks have passed.

For more information about GitHub Code Scanning, check out the documentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants