Skip to content

Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434)#3377

Open
dbernstein wants to merge 2 commits into
mainfrom
bugfix/playtime-report-folder-race-condition
Open

Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434)#3377
dbernstein wants to merge 2 commits into
mainfrom
bugfix/playtime-report-folder-race-condition

Conversation

@dbernstein
Copy link
Copy Markdown
Contributor

@dbernstein dbernstein commented May 22, 2026

Description

Three independent layers guard against duplicate Google Drive folder creation in generate_playtime_report:

  1. Redis lock — a RedisLock scoped per (root_folder_id, data_source_name) serialises workers within the same Redis instance (i.e. the same shard). If the lock cannot be acquired within FOLDER_LOCK_ACQUIRE_TIMEOUT (60 s), the worker logs a warning and proceeds without it (graceful degradation — no task stalling).

  2. Random jitter — each worker sleeps random.uniform(0, FOLDER_CREATION_JITTER_MAX) (up to 30 s) before acquiring the lock. Independent shards share no Redis instance, so the lock alone cannot coordinate them; spreading their start times in time reduces the probability that two groups race to create the same folder at the same moment.

  3. Pick-oldestget_file now passes orderBy=createdTime to files.list, so it always returns the oldest matching folder first. If two groups do race and each creates a copy, every subsequent worker converges on the same canonical (oldest) folder for all levels of the hierarchy and the file upload. Orphaned empty duplicates may exist but no files are misplaced.

Residual risk: The three layers together make the collision window very narrow but do not provide a 100% guarantee. A fully watertight fix would require a coordination mechanism shared across all shards (e.g. storing canonical folder IDs in a shared database). We have chosen to accept this low-stakes residual risk to keep the code simple: the task runs monthly, any orphaned folders are cosmetic (no data loss), and the previous behaviour created duplicates on every concurrent run.

Motivation and Context

Multiple CMs grouped in a shard all write reports to the same Google Drive root. When generate_playtime_report fires at the same scheduled time, all workers check for the folder hierarchy simultaneously. Because Google Drive allows same-named folders and its files.list API is eventually consistent, every worker sees "folder not found" and creates its own copy — producing duplicate same-named directories (e.g. multiple "BiblioBoard" folders). Different CM groups share no Redis instance, so a lock alone is insufficient.

JIRA: PP-4434

How Has This Been Tested?

  • redis_fixture added to test_generate_playtime_reports so the task's Redis call resolves to a real client; FOLDER_CREATION_JITTER_MAX monkeypatched to 0 to keep the test fast.
  • test_generate_playtime_report_folder_lock_contention: pre-acquires the folder lock, patches FOLDER_LOCK_ACQUIRE_TIMEOUT to 0.1 s and jitter to 0, asserts the warning is logged and the file upload still completes (graceful degradation).
  • test_get_file_orders_by_created_time: verifies orderBy=createdTime is present in the Drive API request URI and that the first (oldest) result is returned when multiple exist.
  • All 20 affected tests pass.

Checklist

  • I have updated the documentation accordingly.
  • All new and existing tests passed.

@dbernstein dbernstein added the bug Something isn't working label May 22, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 22, 2026

Claude finished @dbernstein's task in 5m 4s —— View job


Code Review

  • Gather context and read changed files
  • Review source changes (playtime_entries.py, google_drive.py)
  • Review test changes (test_playtime_entries.py, test_google_drive.py)
  • Post findings

Summary

The Redis-lock approach correctly serialises folder creation across same-instance Celery workers, and the orderBy="createdTime" ("pick-oldest") change to get_file is a clean complementary fix that ensures all workers converge on the same canonical folder when a race does slip through. Tests are well-structured and cover the contention/graceful-degradation path. Two observations worth a look:

The jitter sleep runs inside the for data_source_name in data_source_names: loop, so a task with N data sources can sleep up to N × 30 s (average N × 15 s) plus block up to N × 60 s waiting for locks — potentially many minutes for larger deployments. Separately, the PR description says a trashed = false filter was included from the now-closed PR #3376, but neither the query string nor the tests reflect this; the Google Drive files.list API excludes trashed items by default so no code change is needed, but the description should be updated to avoid confusion.

Details

Minor: src/palace/manager/celery/tasks/playtime_entries.py:245

time.sleep(random.uniform(0, FOLDER_CREATION_JITTER_MAX)) is called on every loop iteration. With FOLDER_CREATION_JITTER_MAX = 30 and N data sources, the task accumulates up to N × 30 s of jitter sleep in addition to N × FOLDER_LOCK_ACQUIRE_TIMEOUT (60 s) of potential lock-wait time. Deployments with ten or more data sources could add 10+ minutes to a monthly task run. The cross-group spreading goal is met equally well by a single sleep taken once before the loop — two independent CM groups will still see different random offsets for every folder creation because their time-of-arrival at each iteration differs by that fixed offset.

time.sleep(random.uniform(0, FOLDER_CREATION_JITTER_MAX))

Nit: src/palace/manager/celery/tasks/playtime_entries.py:153

FOLDER_LOCK_ACQUIRE_TIMEOUT is annotated int, but acquire_blocking accepts float | int and the contention test monkeypatches the constant to 0.1 (a float). Annotating it float | int would match the full range of values it can take.

FOLDER_LOCK_ACQUIRE_TIMEOUT: int = 60

@dbernstein dbernstein changed the title Serialise Google Drive folder creation with a distributed Redis lock Serialize Google Drive folder creation with a distributed Redis lock May 22, 2026
@dbernstein dbernstein closed this May 22, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 22, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.34%. Comparing base (5315bfb) to head (67ed73a).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3377      +/-   ##
==========================================
- Coverage   93.34%   93.34%   -0.01%     
==========================================
  Files         507      507              
  Lines       46434    46448      +14     
  Branches     6336     6338       +2     
==========================================
+ Hits        43345    43357      +12     
- Misses       1999     2000       +1     
- Partials     1090     1091       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@dbernstein
Copy link
Copy Markdown
Contributor Author

dbernstein commented May 22, 2026

Reopening — the Redis lock is still needed for within-group serialization. Will layer pick-oldest and jitter on top to handle the cross-shard case.

@dbernstein dbernstein reopened this May 22, 2026
dbernstein and others added 2 commits May 22, 2026 09:53
…(PP-XXXX)

Multiple Celery workers running generate_playtime_report concurrently all call
create_nested_folders_if_not_exist at the same time. Google Drive allows
same-named folders and its list-API has indexing latency, so all workers see
"folder not found" and each creates a duplicate. A distributed Redis lock
scoped per (root_folder_id, data_source_name) serialises the check-and-create
step. If a worker cannot acquire the lock within FOLDER_LOCK_ACQUIRE_TIMEOUT
seconds it logs a warning and proceeds without the lock (graceful degradation).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…aces

Three independent layers now guard against duplicate folder creation:

1. Redis lock (existing) — serialises workers within the same Redis instance
   (i.e. the same CM group), eliminating intra-group races entirely.

2. Random jitter (new) — each worker sleeps random.uniform(0, 30) seconds
   before acquiring the lock, spreading independent CM groups in time and
   reducing the probability that two groups race to create the same folder.

3. Pick-oldest (new) — get_file now passes orderBy=createdTime to files.list,
   so it always returns the oldest matching folder first.  If two groups do
   race and each creates a copy, every worker converges on the same canonical
   (oldest) folder for all subsequent levels and the file upload.  Orphaned
   empty duplicates may exist but no files are misplaced.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@dbernstein dbernstein force-pushed the bugfix/playtime-report-folder-race-condition branch from 59955fd to 67ed73a Compare May 22, 2026 16:53
@dbernstein dbernstein changed the title Serialize Google Drive folder creation with a distributed Redis lock Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest May 22, 2026
@dbernstein dbernstein changed the title Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest Reduce Google Drive folder race condition with Redis lock, jitter, and pick-oldest (PP-4434) May 22, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant