Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434)#3377
Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434)#3377dbernstein wants to merge 2 commits into
Conversation
|
Claude finished @dbernstein's task in 5m 4s —— View job Code Review
SummaryThe Redis-lock approach correctly serialises folder creation across same-instance Celery workers, and the The jitter sleep runs inside the DetailsMinor:
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3377 +/- ##
==========================================
- Coverage 93.34% 93.34% -0.01%
==========================================
Files 507 507
Lines 46434 46448 +14
Branches 6336 6338 +2
==========================================
+ Hits 43345 43357 +12
- Misses 1999 2000 +1
- Partials 1090 1091 +1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Reopening — the Redis lock is still needed for within-group serialization. Will layer pick-oldest and jitter on top to handle the cross-shard case. |
…(PP-XXXX) Multiple Celery workers running generate_playtime_report concurrently all call create_nested_folders_if_not_exist at the same time. Google Drive allows same-named folders and its list-API has indexing latency, so all workers see "folder not found" and each creates a duplicate. A distributed Redis lock scoped per (root_folder_id, data_source_name) serialises the check-and-create step. If a worker cannot acquire the lock within FOLDER_LOCK_ACQUIRE_TIMEOUT seconds it logs a warning and proceeds without the lock (graceful degradation). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aces Three independent layers now guard against duplicate folder creation: 1. Redis lock (existing) — serialises workers within the same Redis instance (i.e. the same CM group), eliminating intra-group races entirely. 2. Random jitter (new) — each worker sleeps random.uniform(0, 30) seconds before acquiring the lock, spreading independent CM groups in time and reducing the probability that two groups race to create the same folder. 3. Pick-oldest (new) — get_file now passes orderBy=createdTime to files.list, so it always returns the oldest matching folder first. If two groups do race and each creates a copy, every worker converges on the same canonical (oldest) folder for all subsequent levels and the file upload. Orphaned empty duplicates may exist but no files are misplaced. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
59955fd to
67ed73a
Compare
Description
Three independent layers guard against duplicate Google Drive folder creation in
generate_playtime_report:Redis lock — a
RedisLockscoped per(root_folder_id, data_source_name)serialises workers within the same Redis instance (i.e. the same shard). If the lock cannot be acquired withinFOLDER_LOCK_ACQUIRE_TIMEOUT(60 s), the worker logs a warning and proceeds without it (graceful degradation — no task stalling).Random jitter — each worker sleeps
random.uniform(0, FOLDER_CREATION_JITTER_MAX)(up to 30 s) before acquiring the lock. Independent shards share no Redis instance, so the lock alone cannot coordinate them; spreading their start times in time reduces the probability that two groups race to create the same folder at the same moment.Pick-oldest —
get_filenow passesorderBy=createdTimetofiles.list, so it always returns the oldest matching folder first. If two groups do race and each creates a copy, every subsequent worker converges on the same canonical (oldest) folder for all levels of the hierarchy and the file upload. Orphaned empty duplicates may exist but no files are misplaced.Residual risk: The three layers together make the collision window very narrow but do not provide a 100% guarantee. A fully watertight fix would require a coordination mechanism shared across all shards (e.g. storing canonical folder IDs in a shared database). We have chosen to accept this low-stakes residual risk to keep the code simple: the task runs monthly, any orphaned folders are cosmetic (no data loss), and the previous behaviour created duplicates on every concurrent run.
Motivation and Context
Multiple CMs grouped in a shard all write reports to the same Google Drive root. When
generate_playtime_reportfires at the same scheduled time, all workers check for the folder hierarchy simultaneously. Because Google Drive allows same-named folders and itsfiles.listAPI is eventually consistent, every worker sees "folder not found" and creates its own copy — producing duplicate same-named directories (e.g. multiple "BiblioBoard" folders). Different CM groups share no Redis instance, so a lock alone is insufficient.JIRA: PP-4434
How Has This Been Tested?
redis_fixtureadded totest_generate_playtime_reportsso the task's Redis call resolves to a real client;FOLDER_CREATION_JITTER_MAXmonkeypatched to 0 to keep the test fast.test_generate_playtime_report_folder_lock_contention: pre-acquires the folder lock, patchesFOLDER_LOCK_ACQUIRE_TIMEOUTto 0.1 s and jitter to 0, asserts the warning is logged and the file upload still completes (graceful degradation).test_get_file_orders_by_created_time: verifiesorderBy=createdTimeis present in the Drive API request URI and that the first (oldest) result is returned when multiple exist.Checklist