v2.5.0 production hardening — 16 P0/P1/P2 bugs from 4 audit passes#23
Merged
Conversation
…tentCodes String.hashCode() produced collisions for UUID-style task IDs — birthday paradox gives ~50% collision at ~65k tasks on a 32-bit hash. Both NativeTaskScheduler and AlarmBootReceiver now call PendingIntentCodes.forTaskId() so FLAG_UPDATE_CURRENT resolves to the same slot after reboot instead of double-arming or silently missing. fix(android): BaseAlarmReceiver — structured scope + withTimeout(8s) lifecycle Replace bare CoroutineScope(IO).launch (leaked work past BroadcastReceiver lifetime) with a per-call SupervisorJob scope cancelled in finally. Add withTimeout(workTimeoutMs=8s) so a hung worker cannot exhaust the ~10s receiver budget. pendingResult.finish() and overflow-file cleanup are both guarded against secondary exceptions. fix(ios): FileCompressionWorker fail-fast by default; opt-in fallback The iOS implementation was silently copying the input file uncompressed. Default behavior now returns WorkerResult.Failure with an actionable message. Set FileCompressionConfig.allowIosUncompressedFallback = true to restore the copy behavior (for demo chains / dev builds). Adds FileCompressionWorkerIosTest (3 cases). feat(common): WorkerResult.Retry(reason, delayMs, attemptCap) New sealed variant alongside the legacy Failure(shouldRetry=true). Android maps to Result.retry() with attempt-cap ceiling; iOS ChainExecutor and SingleTaskExecutor handle the new branch (telemetry only for now — per-step delay/cap on iOS tracked in ROADMAP.md). Fail-fast for programming errors (SerializationException, IllegalArgumentException) separated from transient network errors in HttpDownloadWorker. feat(http): HttpDownloadWorker resumable downloads via Range/partial file Partial in-progress download persisted to <savePath>.partial across attempts. Next invocation sends Range: bytes=N- and handles 206/200/416/5xx distinctly: - 206 → append to partial - 200 → server ignored Range, overwrite (log warning) - 416 → partial already complete, finalize - 5xx → WorkerResult.Retry(delayMs=30s), partial preserved HttpDownloadConfig gains resumable=true (default) and optional maxBytes cap. Adds 5 test cases to BuiltinWorkersTest. fix(security): block CGNAT 100.64.0.0/10 and 0.0.0.0/8 in SSRF validator 100.64.0.0/10 (RFC 6598) covers Tailscale and ISP carrier-grade NAT — reachable from a device on mobile networks. 0.0.0.0/8 closes the "this network" gap. Boundary tests confirm 100.63.x.x and 100.128.x.x remain unblocked.
…ackground URLSession (v2.5.0) New built-in workers: - ParallelHttpDownloadWorker — N-chunk concurrent RFC 7233 byte-range downloads with per-chunk .partN resume, SHA-256/MD5/SHA-1/SHA-512 checksum verification, DuplicatePolicy (OVERWRITE/RENAME/SKIP), sequential fallback when server lacks Accept-Ranges - ParallelHttpUploadWorker — per-file parallel multipart upload with Semaphore concurrency cap, per-file retry on 5xx/network errors, structured JSON result (uploadedCount/failedCount) - IosBackgroundDownloadWorker + IosBackgroundUrlSessionManager — NSURLSession background download that survives app termination; completion/failure emitted via TaskEventBus HttpDownloadWorker enhancements: - Checksum verification via Okio HashingSource (ChecksumAlgorithm enum: MD5/SHA1/SHA256/SHA512) - DuplicatePolicy: SKIP short-circuits before any network call; RENAME probes up to 10 000 suffixes - .partial resume with appendingSink; 416 + partial bytes → finalize as complete - 5xx responses → WorkerResult.Retry(delayMs=30_000); parse/config errors → Failure (no retry) iOS chain executor — honor WorkerResult.Retry.delayMs / .attemptCap: - ChainProgress gains stepRetryCounts: Map<Int,Int>; withStepAttempt()/stepAttempts() helpers - ChainExecutor accumulates nextBgTaskDeferMs per batch (max of all retry delays requested); exposes requestedNextBgTaskDelayMs for Swift BGProcessingTaskRequest.earliestBeginDate - AttemptCap enforcement: step abandoned and chain failed when cap reached Android BaseAlarmReceiver — structured coroutine lifecycle (regression-tested): - Extracted runHandleAlarmScope(onFinish: () -> Unit) so tests can drive it without a real BroadcastReceiver.PendingResult (Robolectric adversarial suite: 5 tests) - withTimeout(workTimeoutMs=8s) cancels stuck work; TimeoutCancellationException logged as warning - scope.cancel() in finally tears down any unstructured child launches - pendingResult.finish() guarded in try/catch to tolerate double-finish Adversarial tests: - PendingIntentCodesAdversarialTest: birthday-paradox proof that CRC32 ≤ String.hashCode collisions across 10 000 UUID-style IDs; "Aa"/"BB" hashCode=2112 collision demonstrated - BaseAlarmReceiverLifecycleTest (Robolectric): stuck work cancelled, finish called once on throw/timeout, overflow file deleted in finally, consecutive runs don't share scope - ChainProgressRetryTest: per-step attempt counter independent of chain-level retryCount - HttpDownloadChecksumAndDuplicateTest, ParallelHttpDownloadWorkerTest, ParallelHttpUploadWorkerTest Demo app updated with 4 new DemoCards; ROADMAP.md v2.5 P1 items marked done; v2.6/v3.0 added
… iOS BGTask hard-limits doc Android — configurable FGS type (camera-app blocker for Android 14/15): - `KmpHeavyWorker` is now `open` with `protected open val foregroundServiceType: Int` (default `FGS_DATA_SYNC`). Subclasses set `FGS_MEDIA_PROCESSING` (Android 15+ image/video transcoding), `FGS_CAMERA`, `FGS_LOCATION`, … without copy-pasting the whole `createForegroundInfo()` body. - Companion-object aliases (`FGS_DATA_SYNC`, `FGS_MEDIA_PROCESSING`, `FGS_CAMERA`, `FGS_LOCATION`, `FGS_MEDIA_PLAYBACK`, `FGS_CONNECTED_DEVICE`) so host code does not need `import android.content.pm.ServiceInfo`. - `FGS_MEDIA_PROCESSING` is pinned to the documented literal `0x1000` because `ServiceInfo.FOREGROUND_SERVICE_TYPE_MEDIA_PROCESSING` is API 35+ and not on the compile classpath at compileSdk=34. Adversarial test (`KmpHeavyWorkerFgsTypeTest`) pins each alias and asserts no collisions. - Bumped FGS type API range from API 31+ to API 29+ (Q) — `ForegroundInfo`'s type-aware constructor has always existed on API 29; gating at 31 was overly conservative and lost coverage on Android 10/11 devices. Docs: - `docs/ANDROID_FGS_GUIDE.md` — manifest snippets per FGS type (`mediaProcessing`, `camera`, `location`, `dataSync`), Android 15 6-hour `dataSync` cap, runtime permission checklist, ADB testing recipes. - `docs/IOS_BGTASK_LIMITS.md` — pins down the four physical limits no library can break: opportunistic scheduling (`dasd` decides, not us), 30 s App Refresh budget + `0x8badf00d` SIGKILL semantics, headless DI cold-start drain on BGTask budget, ZIP codec absence on Kotlin/Native. Each section spells out what the library can guarantee vs. what host apps must design around. - README docs index links both guides + the iOS background URLSession guide. Roadmap: - v2.5 §1 FGS guidance — marked ✅ shipped (was v2.6 ⏳). - v2.5 P1.5 — iOS chain retry honoring marked ✅ shipped (was v2.5 stretch ⏳). - v2.6 §7 — iOS ZIP via `libz.dylib` cinterop added (closes the camera-app blocker called out in IOS_BGTASK_LIMITS §4). iOS internal: `IosBackgroundUrlSessionManager` delegate extracted from nested `object Delegate` to top-level class for testability — internals made `internal` so the delegate can access them from the same module.
…2.5.4 guide
QA/QC review (Senior Dev lens) flagged four production risks the existing test
suite did not pin down. This commit addresses each one:
1. File-size compaction trigger for AppendOnlyQueue
- The pre-v2.5 trigger fired only at ≥ 80 % processed-item ratio. An
enqueue-heavy workload (e.g. 5 000 items queued, 50 dequeued) never
compacts and the file grows unbounded.
- New trigger: file > 5 MB AND ≥ 20 % processed AND ≥ 50 processed items.
Either trigger reclaims slack; compaction itself only removes processed
items so the file-size path is a no-op when head=0.
- Covered by `QueueScaleStressTest.fileSizeCompaction_reclaimsSpaceAfterDequeue`.
2. BackwardCompatibilityTest — pins on-disk forward-compat contract
- Loads a v2.4.3-shaped `ChainProgress` JSON (no `stepRetryCounts` field)
and asserts v2.5.0 reader handles it without data loss; `stepRetryCounts`
defaults to `emptyMap()` per the additive schema rule.
- Pins `ignoreUnknownKeys = true` against accidental flip via a future
refactor that would silently brick pending tasks on app upgrade.
- Pins null preservation on `lastFailedStep` / `lastError`.
- Pins corrupt-JSON self-heal to null (no exception propagation).
3. QueueScaleStressTest — guards against O(N²) regressions
- 2 000 enqueue + 2 000 dequeue cycle with a 120 s soft ceiling. Sized to
stay CI-friendly while still surfacing a "load whole file into RAM"
regression (which would push runtime well past 5 min).
- Companion test exercises the new file-size compaction path with a 6 MB
queue.
4. docs/APPLE_APP_STORE_REVIEW_GUIDELINES.md
- Spells out App Store §2.5.4 risks for dynamic task dispatch under a
single `BGTaskScheduler` identifier. Provides safe vs. rejection-bait
patterns with concrete examples.
- Compliance checklist (Info.plist audit, worker→identifier mapping,
description / privacy-policy alignment, cellular constraint).
- What the library can guarantee vs. what host apps are responsible for.
Roadmap section P1.6 lists each item and its regression test owner.
README docs index links the new App Store guide.
…urvive (P0) QA review (Senior Dev lens) caught a critical bug in IosBackgroundUrlSessionManager: savePaths and taskNames were stored only in process memory. The whole point of URLSessionConfiguration.background is that downloads keep running inside nsurlsessiond after the app is killed; iOS then cold-launches the app to deliver completion events. Pre-v2.5 sequence (broken): 1. enqueueDownload → savePaths[id] = config.savePath in MutableMap 2. User swipes the app away → process dies, MutableMap wiped 3. Download finishes → iOS cold-launches the app, delegate fires 4. delegate looks up savePaths[id] → null 5. file orphaned in NSTemporaryDirectory; no completion event emitted 6. WorkManager chain waiting on the download hangs forever Fix: - New BackgroundDownloadStateStore — disk-backed Map<key, Entry> where key = "sessionId:taskId" (composite key disambiguates iOS's per-session task identifier recycling). - Entries persisted to <AppSupport>/dev.brewkits.kmpworkmanager/bg_url_session_state.json via writeToURL(atomically: true) so power loss mid-write keeps the previous version intact. - enqueueDownload writes the entry BEFORE task.resume() so the daemon never owns a download we have no record of. - Delegate uses synchronous getSync() because iOS deletes the temp file the moment didFinishDownloadingToURL returns — there's no time to suspend on a coroutine Mutex. In-memory @volatile cache mirrors the disk state and is loaded once on cold-launch, then served from memory for the rest of the process lifetime. Cache invalidated on every write. - New sweepStaleStateOnLaunch() called from host AppDelegate cleans entries older than 7 days so the file does not grow unboundedly across years of installs. - Schema version field on the state file for future migrations; current additive changes survive via ignoreUnknownKeys = true. Adversarial coverage (BackgroundDownloadStateStoreTest): - Round-trip put → get. - Cold-launch simulation via invalidateInMemoryCacheForTest — proves getSync() hits disk, not the wiped in-memory cache. - (sessionId, taskId) collision: two sessions with task identifier=1 each keep their own entries. - Idempotent remove + survives cache invalidation. - sweepStale drops old entries, keeps fresh. Doc: IOS_BACKGROUND_URL_SESSION.md gains a "How cold-launch survival actually works" section with the timeline diagram and the rationale for the synchronous disk read in the delegate.
A second-pass review of the v2.5.0 branch by a Senior Dev / QC Lead found
four critical bugs the first pass missed. All four are real (independently
verified against source), and all four are now fixed with regression tests.
## P0 — Data loss
1. `AppendOnlyQueue.dequeue` wiped the entire queue on CRC corruption.
`readSingleRecordWithValidation` correctly set `corruptionOffset` to the
precise byte where the corrupt record started; but the `dequeue`
null-handling branch then unconditionally overwrote `corruptionOffset =
0UL`, forcing `truncateAtCorruptionPoint` to take the full-reset path.
Every pending task ahead of the corrupt one was lost.
Fix: added `else if (isQueueCorrupt)` guard that preserves the precise
offset set by the validator. The original "external truncate/replace"
full-reset path remains for the case where read returned null WITHOUT
`isQueueCorrupt` being set (file actually shrank under us).
Coverage: `AppendOnlyQueueCrcCorruptionTest`.
2. `BackgroundDownloadStateStore.getSync` cache stale-write race re-introduced
the v2.5.0 cold-launch orphan bug. The pre-fix pattern
`cache ?: readUnlocked().also { cache = it }` had a window between the
volatile null-check and the `.also` write-back; a concurrent put/remove
landing in that window would publish a fresh cache, then the also-block
would clobber it with the stale disk snapshot.
Fix: compare-and-set publish — only write to `cache` if it's still null
when we're ready to publish. Return whichever map is freshest (the new
cache if a concurrent put landed, else our disk snapshot).
Coverage:
`BackgroundDownloadStateStoreTest.getSync_doesNotClobberCacheFromConcurrentPut`.
## P1 — Host harm / retry storm
3. `ChainExecutor` clobbered host app's battery monitoring state.
`UIDevice.batteryMonitoringEnabled = true; …; = false` ran on every chain
execution, unconditionally turning the flag off. Host apps that enable it
for their own UI (settings screen showing battery level) lost their
updates.
Fix: capture `hostHadMonitoringEnabled` before flipping it; only restore
to `false` if WE were the ones who turned it on. Host's prior state is
preserved.
4. `ParallelHttpDownloadWorker` infinite retry storm on disk-full.
`mergeChunks` failure (e.g. disk full) deleted only `.merged` but kept
`.partN` files. On retry, `downloadOneChunk` saw all parts at expected
size and skipped the network, then `mergeChunks` failed again with the
same error → infinite loop, no forward progress, even after the user
freed disk space (because parts still occupied it).
Fix (two layers):
- Merge-failure catch now purges `.partN` too. Retry must re-download,
guaranteeing forward progress.
- Outer Retry now carries `attemptCap = 4`. Even if cleanup somehow
fails, the chain abandons after ~1 minute instead of looping forever.
Coverage:
`ParallelHttpDownloadWorkerTest.parallel_mergeFailure_cleansUpAllParts_breaksRetryStorm`,
`parallel_outerRetryHasAttemptCap_toBreakInfiniteLoop`.
## Release docs
Also in this commit (pre-publish housekeeping):
- `CHANGELOG.md` — full v2.5.0 entry with Added / Fixed / Changed / Docs.
- `docs/MIGRATION_V2.5.0.md` — upgrade guide from v2.4.x.
- `docs/release-notes/v2.5.0-RELEASE-NOTES.md` — narrative release notes.
- Version bumped from 2.4.3 → 2.5.0 in install snippets across README,
TROUBLESHOOTING, platform-setup, quickstart, ARCHITECTURE.
- README "What's new in v2.5.0" section added; v2.4.3 section preserved
as history.
## Verification
- `./gradlew :kmpworker:allTests` — BUILD SUCCESSFUL (Android + iOS).
- `./gradlew :composeApp:assembleDebug` — APK built.
- `./gradlew :composeApp:compileKotlinIosSimulatorArm64` — iOS demo compiles.
…lish)
A third-pass review (QC Lead / Senior Architect) found four issues that
are not data-loss / crash-on-happy-path but degrade developer experience
and could create "false sense of security" or opaque failure modes:
5. **iOS Simulator fallback false sense of security** —
`NativeTaskScheduler.submitTaskRequest` uses `delay()` to simulate
BGTaskScheduler on the simulator. Coroutine `delay()` freezes when iOS
suspends the app, so a developer's simulator test of "10-minute task"
only ticks while the app is foreground. Production iOS uses wall-clock
opportunistic scheduling decided by `dasd`. The mismatch is invisible
to the developer until they ship and find the schedule doesn't fire.
Fix: prominent warning log at every fallback invocation + KDoc spelling
out the divergence.
6. **Fragile test-mode detection** —
`NSProcessInfo.processInfo.processName.endsWith("test.kexe")` was the
sole detection heuristic in 4 places (`IosFileCoordinator`,
`IosFileStorage`, `NativeTaskScheduler`). A future Kotlin/Native binary
naming change would silently break all of them with no compile error.
Fix: extracted `IosTestEnvironment` as the single source of truth.
Cascade of signals:
1. `KMPWORKMANAGER_TEST_MODE=1` env var (explicit, recommended).
2. `XCTestConfigurationFilePath` (Xcode + XCTest harness).
3. `XCInjectBundleInto` (XCTest, alt signal).
4. `processName.endsWith("test.kexe")` (legacy K/N suffix).
Any one signal trips test mode; cached via `by lazy`. Four call sites
collapsed to one detector.
7. **Android FGS crash trap** — `KmpHeavyWorker.doWork()` called
`setForeground()` unguarded. On Android 14+, an FGS type mismatch
between `foregroundServiceType` and the manifest throws
`ForegroundServiceStartNotAllowedException` (subclass of
`IllegalStateException`) which crashed the worker with an opaque
WorkManager stack trace.
Fix: try-catch for `SecurityException` (permission missing) and
`IllegalStateException` (type not declared). Both log an actionable
diagnostic pointing to docs/ANDROID_FGS_GUIDE.md and return
`Result.failure()` instead of crashing.
8. **Exact alarm "user-ignores-notification = task waits indefinitely"
contract** — `checkAndExecuteMissedExactAlarms` KDoc previously mentioned
notification permission and ignored-notification cases in passing, but
did not lead with the all-caps warning that iOS has NO equivalent of
`AlarmManager.setExactAndAllowWhileIdle`. A developer building a
medication reminder / alarm clock / financial-transaction trigger
needed to surface this risk at the top of the KDoc, not midway through.
Fix: KDoc now leads with the warning + explicit do-NOT-use-for list.
`docs/IOS_BGTASK_LIMITS.md` §5 added with the same content.
Test runs after fixes:
- `:kmpworker:testDebugUnitTest` — BUILD SUCCESSFUL.
- `:kmpworker:iosSimulatorArm64Test` — BUILD SUCCESSFUL (4m 39s).
Fourth-pass audit (Senior Tech Lead) found three more issues. All verified
against source and fixed with regression tests. (One claimed issue —
"WorkManager < 15 min periodic crash" — was already guarded by
`TaskTrigger.Periodic.init { require(intervalMs >= MIN_INTERVAL_MS) }`
on commonMain since v2.4, so no fix needed there.)
## P0 — iOS BGTask budget exhaustion
9. `AppendOnlyQueue.compactQueue` OOM on large queues. The compaction step
loaded all unprocessed records into a `mutableListOf<String>()` before
writing them to the compacted file. With `MAX_QUEUE_SIZE = 10 000` and
typical 1–4 KB JSON payloads, peak RAM = 10–40 MB — above the iOS
background-task budget (Watchdog kills with `EXC_RESOURCE /
RESOURCE_TYPE_MEMORY` at ~30 MB on older devices). This broke the O(1)
RAM contract that the rest of the queue carefully preserved.
Fix: new `streamCompactionToFile` helper. Opens both src and dst file
handles, seeks past the head pointer (via cache when available, else
sequential skip-read), then reads → writes one record at a time. Peak
RAM is bounded by the largest single record, not by N records.
## P1 — Thread starvation + file leak
10. `compactionScope = CoroutineScope(SupervisorJob() + Dispatchers.Default)`
used the CPU-bounded `Default` dispatcher for blocking file I/O
(`NSFileHandle`, `NSFileManager.replaceItemAtURL`). On a saturated
thread pool this starved every other `Default`-bound coroutine. The
library has `IosDispatchers.IO` (GCD global queue, unbounded worker
spawn) specifically for this purpose — used everywhere else in the
iOS code except here.
Fix: switched `compactionScope` to `IosDispatchers.IO`.
11. `NativeTaskScheduler.cancel(id)` on Android orphaned the overflow
`cacheDir/kmp_input_<uuid>.json` file when the input JSON exceeded
the 8 KB inline-data threshold. The file lived in cacheDir until the
24 h `cleanupStaleAlarms` sweep ran. Apps that frequently schedule
+ cancel large-input tasks (camera draft "save → cancel → save → cancel"
workflows) accumulated megabytes of orphaned files in the meantime.
Fix: new `OverflowFileRegistry` — small `SharedPreferences`-backed
`taskId → absolutePath` map. Populated on schedule (sync `commit()`
so the path survives a process kill); consumed + file-deleted on
cancel. Wired into both call sites:
- `scheduleExactAlarm` (AlarmManager path, line 309)
- `buildWorkData` (WorkManager path, line 384, now takes `taskId`)
Coverage: `OverflowFileRegistryTest` — 6 tests including a 100-cycle
"no-leak" stress assertion that pins zero residue after schedule +
cancel loops.
## Out of scope (claimed but already correct)
- "WorkManager silently coerces < 15 min periodic intervals" — already
enforced in `commonMain` `TaskTrigger.Periodic.init { require(intervalMs
>= MIN_INTERVAL_MS) }` with a clear error message ("intervalMs must be
>= 15 minutes. Android WorkManager silently ignores shorter intervals;
this validation surfaces the error at the call site instead"). No
change needed.
## Verification
- `./gradlew :kmpworker:testDebugUnitTest` — BUILD SUCCESSFUL.
- `./gradlew :kmpworker:iosSimulatorArm64Test` — BUILD SUCCESSFUL (4m 46s).
Four targeted audit passes (initial review + 3 proactive 4-lens passes) on the cancellation/lock-ordering/error-classification/data-integrity axes surfaced 16 production bugs that the existing 1300+ test suite missed — plus 1 latent bug (deleteChainProgress buffer eviction) surfaced by a regression test. Each fix is pinned by a regression test verified to fail under a temporary revert. ## P0 (data loss / crash) - **BUG 1**: BaseKmpWorker.doWorkInternal's `finally` deleted the overflow input file on CancellationException → WorkManager rerun lost user payload. The `wasCancelled` flag was set but never read by the delete guard. Pin: V250CancellationPreservesOverflowFileTest (Android). - **BUG 8**: iOS DynamicTaskDispatcher silently dropped WorkerResult.Retry and Failure(shouldRetry=true) → tasks vanished on flaky-network failures. Mirror WorkManager's contract: re-enqueue with kmpAttemptCount capped at DEFAULT_ATTEMPT_CAP=5. Pin: V250DynamicRetryTest (6 cases). - **BUG 9**: IosFileStorage.replaceChainAtomic bypassed enqueueMutex → concurrent enqueueChain could race through the size check, exceeding MAX_QUEUE_SIZE. Acquire enqueueMutex INSIDE queueMutex per the documented lock order. Pin: V250ReplaceChainMutexRaceTest reproduces 1001 > 1000 under revert. - **BUG 12**: IosBackgroundTaskHandler.handleSingleTask (entry point for dedicated-identifier BGTasks — the common iOS user path) silently dropped Retry / shouldRetry. Same bug class as BUG 8, different code path. Pin: V250HandleSingleTaskRetryTest (6 cases). - **BUG 14**: AppendOnlyQueue.migrateFromTextToBinary loaded the entire legacy file via NSString.stringWithContentsOfFile + split → 3× RAM spike → EXC_RESOURCE silent kill → users permanently stuck on pre-v2.5 version. Streaming line-by-line via NSFileHandle. Pin: V250LegacyMigrationStreamingTest. - **L3b-1** (cascade from BUG 13): HttpUpload, HttpSync, FileCompression, ParallelHttpUpload caught generic Exception and returned Failure() with default shouldRetry=false. Combined with BUG 13's permanent-failure abandonment, a single transient network blip would abandon any chain containing these workers. Add explicit shouldRetry=true. Pin: V250BuiltinRetryContractTest (common). ## P1 (correctness / silent state corruption) - **BUG 2**: iOS NSFileCoordinator bypassed the lock on timeout (returned `block(url)` directly) → App↔Extension interleaved writes corrupt queue.jsonl (CRC32 fail → silent record loss). Throw FileCoordinationTimeoutException instead so callers retry. Pin: V250FileCoordinatorTimeoutTest. - **BUG 10**: ChainExecutor's inner `catch (TimeoutCancellationException)` mis-attributed outer-scope cancellations as chain-timeout AND swallowed them, defeating the outer's cancellation. Disambiguate by `elapsedMs < chainTimeout` and rethrow if outer-caused. Pin: V250NestedTimeoutMisattributionTest. - **BUG 11**: Same outer-cancel misattribution at executeTask's deeper `withTimeout(taskTimeout)` layer — emitted spurious "Timeout after Xms" telemetry to TaskEventBus when an outer scope cancelled a worker mid-delay. Pin: V250TaskTimeoutMisattributionTest. - **BUG 15**: ChainExecutor used wall-clock (NSDate.timeIntervalSince1970) for elapsed-time math. NTP sync mid-execution would corrupt the BUG 10/11 outer-cancel disambiguation, the batch budget check, and the cleanup-duration adaptive-budget feedback. Use kotlin.time.TimeSource. Monotonic for elapsed math; keep wall-clock only for public timestamps (telemetry startedAtMs, ExecutionMetrics endTime, iOS BGTask deadlines). - **L4b-1**: StorageMigration.migrate catch-all swallowed CancellationException, breaking structured concurrency. Rethrow CE cleanly; partial-state semantics unchanged (migration flag stays unset, next launch retries idempotently). ## P2 (quota waste / API contract) - **BUG 3**: DynamicTaskDispatcher declared an unused SupervisorJob + CoroutineScope. requestShutdownSync (iOS BGTask expirationHandler — a non-suspend callback) called job.cancel() on that unused scope, so the cancel never reached the in-flight worker. Capture parent Job via AtomicReference<Job?> on entry to executePendingTasks. Pin: V250DispatcherShutdownTest. - **BUG 13**: ChainExecutor.executeStep dropped Failure.shouldRetry — workers returning Failure(shouldRetry=false) (terminal contract) still consumed the chain-level retry budget. Plumb isPermanentFailure through TaskOutcome → StepOutcome → executeChain's abandon path. Surfaced a latent IosFileStorage.deleteChainProgress buffer-eviction bug (deleted disk file but left in-memory buffer to be re-flushed) — also fixed. Pin: V250PermanentFailureAbandonTest. - **BUG 16**: iOS ChainExecutor battery guard toggled UIDevice.isBatteryMonitoringEnabled non-atomically + from non-main- thread, racing with the host app's battery UI. Now read-only; if the host hasn't enabled monitoring, skip the guard. Documented as opt-in. ## BAD (code smell / future risk) - **BUG 6**: BaseKmpWorker classified NPE / IllegalArgumentException / NumberFormatException as permanent → workers dropped on transient null/empty server responses from third-party SDKs. Narrow the permanent list to truly-programming-error types (SerializationException, ClassNotFoundException, InvocationTargetException, InstantiationException). Pin: V250ExceptionClassificationTest. - **BUG 7**: IosFileStorage.close() wrapped flushNow + cancel + join in a single try/catch — if flushNow threw (disk full, EACCES, NSFileCoordinator error), cancel + join were skipped → coroutine leak. Move cancel + join to `finally`. Pin: V250CloseFinallyTest. ## Verification Each fix was verified twice: pass with fix; fail with exact regression message under TEMP-REVERT; restore. Full suite: 858 iOS + 493 Android tests, 0 failures. Test hooks added to production code (all `internal`, default off, not exposed in public API): - IosFileStorage.testForceFlushFailure (for V250CloseFinallyTest) - IosFileStorage.testEnqueueInternalDelayMs (for V250ReplaceChainMutexRaceTest) - IosFileStorage.isBackgroundScopeActive (read-only inspector) - IosFileCoordinator.coordinateBlocking (visibility internal, was private) - IosBackgroundTaskHandler.handleOneTimeTaskResult (visibility internal) Documentation: - Migration sections §5–§12 added to docs/MIGRATION_V2.5.0.md - v2.5.0 release notes expanded with two "Late-cycle audit pass" tables
Two technically-improvable structural choices surfaced in the late-cycle audit but don't rise to v2.5.0 blockers — track them as the v2.5.1 patch scope so they don't get lost. 1. `elapsedMs < timeout` heuristic → `withTimeoutOrNull` refactor. The BUG 10 / BUG 11 fixes use an elapsed-time heuristic to disambiguate inner-vs-outer cancellation. Correct under normal load but defeatable by CPU starvation / iOS process throttling. `withTimeoutOrNull` is provably correct: null = inner fired, TCE = outer fired. 2. File-backed `OverflowFileRegistry` for multi-process Android apps. SharedPreferences `MODE_PRIVATE` doesn't sync across processes real-time; cross-process register/consume can leak overflow files in cacheDir until the 24h janitor sweeps. File-per-entry with atomic rename eliminates the cache-coherency dependency. Both deferred because (a) no observed correctness regression in 1351-test suite, (b) the existing safety nets (heuristic works in normal load; 24h janitor catches leaked files) bound the worst case, and (c) both refactors carry non-trivial risk that's better landed in isolation post-dogfood.
…ANGELOG Pre-publish risk-matrix review surfaced a sibling site for BUG 16: `IosWorkerDiagnostics.getSystemHealth` had the same auto-toggle pattern as `ChainExecutor.executeTask` — unconditionally setting `UIDevice.isBatteryMonitoringEnabled = true` then `= false` in finally. Same race + threading concerns; apply the same opt-in fix here. If the host has not enabled battery monitoring, `getSystemHealth` reports `batteryLevel = -1f` (rendered as 0% after the existing `coerceIn(0, 100)` on line 72) and `isCharging = false`. Callers should treat 0 as "unavailable". Also: CHANGELOG updated with the 16-bug Fixed section pointing at the release notes and migration §5–§12 for full details.
|
You are seeing this message because GitHub Code Scanning has recently been set up for this repository, or this pull request contains the workflow file for the Code Scanning tool. What Enabling Code Scanning Means:
For more information about GitHub Code Scanning, check out the documentation. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Late-cycle proactive review on the cancellation × lock-order × retry-contract × data-integrity axes surfaced 16 production bugs that the existing 1300+ test suite missed. Each fix is pinned by a
V250…Test.ktregression test verified to fail under a temporary revert../gradlew checkclean across all 225 tasks (lint + tests + verification)Bug breakdown
P0 — data loss / crash (6) ·
BaseKmpWorkeroverflow-file delete on cancel · iOS dynamic+dedicated dispatcher droppedRetry·replaceChainAtomicmutex bypass race · legacy queue migration OOM · built-in workershouldRetrycascadeP1 — correctness / silent state corruption (5) ·
NSFileCoordinatorbypass on timeout · nestedwithTimeoutouter-cancel misattribution (chain + task layer) · wall-clock → monotonic for elapsed math ·StorageMigrationswallowedCancellationExceptionP2 — quota waste / API contract (3) ·
DynamicTaskDispatcherplacebo scope ·Failure(shouldRetry=false)chain-abandon (+ latentdeleteChainProgressbuffer-eviction) · iOS battery guard race with host UIBAD — code smell / future risk (2) · NPE/IAE classified as permanent → silent data loss ·
IosFileStorage.close()skipped cancel+join on flush throwFull table in release notes. Migration guidance for behavioral changes in docs/MIGRATION_V2.5.0.md §5–§12.
Verification protocol
Each fix verified TWICE:
One cascade caught and fixed during the audit itself: BUG 13's
Failure(shouldRetry=false)semantic broke 4 built-in workers that used the default — caught and fixed before merge as L3b-1.Deferred to v2.5.1 patch (in roadmap)
Two structural improvements documented in docs/ROADMAP.md — neither blocks v2.5.0:
withTimeoutOrNullover theelapsedMs < timeoutheuristic (provably correct vs CPU-starvation-defeatable)OverflowFileRegistryfor multi-process Android hostsTest plan
./gradlew :kmpworker:allTests— 1351 tests pass./gradlew check— full project verification, 225 tasks, no lint warnings./gradlew :kmpworker:publishAllPublicationsToMavenCentralLocalRepository— Maven artifacts build + signIosWorkerDiagnostics, also fixed)