Skip to content

fix: Moco Fixes#203

Open
jthakkar04 wants to merge 4 commits into
v2from
jthakkar/moco-replica-pvc-fixes
Open

fix: Moco Fixes#203
jthakkar04 wants to merge 4 commits into
v2from
jthakkar/moco-replica-pvc-fixes

Conversation

@jthakkar04
Copy link
Copy Markdown

@jthakkar04 jthakkar04 commented Jun 3, 2026

In this PR, we address 3 bugs from the MySQL→Moco migration. Images below are from Moco documentation: LINK

  1. Moco only accepts a positive odd number of replicas, but the operator fills the count in from the manifest sizing after its own validation runs — so an even value (e.g. micro = 2) or an unset/0 value reaches Moco unchecked and is rejected, leaving the install stuck. We now refuse to forward an invalid count to Moco, surfacing a clear InvalidReplicaCount condition instead of letting the install fail.
    Note We also have another PR to change the replica sizes from 2 → 3: https://github.com/wandb/core/pull/45000
Screenshot 2026-06-03 at 10 51 46 AM
  1. The operator looked for the data disks by an old name (datadir-) instead of the one Moco uses (mysql-data-moco-…), so it never tagged them — and purge, which finds disks by that tag, skipped them. We now build the name from Moco's own constant and naming method, so it stays in sync.
Screenshot 2026-06-03 at 10 53 49 AM
  1. Moco forbids reducing a running cluster's replica count via an in-place update. The operator kept retrying and failing each reconcile with an opaque admission error. It now rejects an explicit decrease at admission (webhook) so it fails immediately, and backstops the cases the webhook can't see at reconcile with a clear Reconciled=False / ScaleDownUnsupported condition.
    Note: Reducing replicas is still possible, but the operator won't do it automatically — it requires Moco's manual stop-clustering procedure.
Screenshot 2026-06-03 at 10 52 48 AM

Summary by CodeRabbit

  • Bug Fixes

    • MySQL replica counts now validated to be positive odd numbers, preventing invalid cluster configurations
    • Prevents accidental scale-down of existing MySQL clusters once deployed
    • Added validation to prevent reducing MySQL replica counts during cluster updates
  • Chores

    • Updated default MySQL micro sizing configuration to use 3 replicas instead of 2

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR enforces MySQL replica count validation (positive odd numbers) across the operator stack. It adds a new validator function, enforces it at webhook admission and controller reconciliation boundaries, prevents replica scale-down, updates PVC naming for consistency, and bumps manifest defaults to 3 replicas.

Changes

MySQL replica count validation and constraints

Layer / File(s) Summary
Replica count validator contract
api/v2/weightsandbiases_types.go
New exported ValidMysqlReplicaCount(r int32) bool function validates that replica count is positive and odd.
Webhook admission validation
internal/webhook/v2/weightsandbiases_webhook.go, internal/webhook/v2/weightsandbiases_webhook_test.go
Update-admission checks prevent replica count decreases via new validateMySQLChanges helper; replica validation switches to use the new ValidMysqlReplicaCount validator. Tests cover replica decrease rejection and increase allowance.
Moco controller write state validation
internal/controller/infra/managed/mysql/moco/write.go
WriteState validates replica counts and prevents cluster scale-down using new validator and condition reason constants (InvalidReplicaCountReason, ScaleDownUnsupportedReason). Returns appropriate reconciliation conditions for constraint violations.
Moco PVC naming constant and labeling
internal/controller/infra/managed/mysql/moco/spec.go, internal/controller/infra/managed/mysql/moco/write.go
Introduces dataVolumeName constant; updates VolumeClaimTemplates to use constant instead of hardcoded value; refactors ensurePVCLabels prefix derivation to use MySQLCluster.PrefixedName().
Manifest default replica count updates
hack/testing-manifests/server-manifest/*/sizing.yaml, hack/testing-manifests/server-manifest/*/manifest.yaml, pkg/wandb/manifest/load_manifest_from_files_test.go
Updates MySQL micro sizing replica count from 2 to 3 across all test manifest variants and updates test assertions.
Spec controller test coverage
internal/controller/infra/managed/mysql/moco/spec_test.go
New test cases verify invalid replica count rejection, PVC labeling behavior, and unsupported scale-down prevention; test scheme registration updated to include corev1 types.

Sequence Diagram(s)

sequenceDiagram
  participant WeightsAndBiases
  participant validateMySQLChanges
  participant ValidMysqlReplicaCount
  participant ValidationError
  WeightsAndBiases->>validateMySQLChanges: ValidateUpdate(new, old)
  validateMySQLChanges->>validateMySQLChanges: Check old and new replicas
  alt decrease detected
    validateMySQLChanges->>ValidationError: field error "replicas cannot be decreased"
  else increase or unset
    validateMySQLChanges->>WeightsAndBiases: allowed
  end
  WeightsAndBiases->>ValidMysqlReplicaCount: validate spec.mysql.managedMysql.replicas
  alt invalid (not positive odd)
    ValidMysqlReplicaCount->>ValidationError: field invalid
  end
Loading
sequenceDiagram
  participant WriteState
  participant ValidMysqlReplicaCount
  participant Reconciliation
  WriteState->>ValidMysqlReplicaCount: check desired.Spec.Replicas
  alt invalid replicas
    ValidMysqlReplicaCount->>Reconciliation: ReconciledType=False, InvalidReplicaCountReason
  end
  WriteState->>WriteState: check if actual exists and scale-down
  alt scale-down detected
    WriteState->>Reconciliation: ReconciledType=False, ScaleDownUnsupportedReason
  else valid change
    WriteState->>Reconciliation: proceed
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A rabbit hops through MySQL lanes,
Three replicas now dance in chains,
No scale-down tricks, no even spree,
Odd counts only—that's the key!
Labels align, defaults now gleam,
Validating every Kubernetes dream. 🗼

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 37.50% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check ❓ Inconclusive The title 'fix: Moco Fixes' is vague and generic, failing to clearly describe the specific changes (replica validation, PVC naming, scale-down handling). Use a more descriptive title that highlights the primary changes, such as 'fix: Enforce Moco replica validation and prevent scale-down' or 'fix: Add Moco replica count validation and PVC naming corrections'.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch jthakkar/moco-replica-pvc-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Comment thread internal/controller/infra/managed/mysql/moco/spec.go Outdated
Comment thread internal/controller/infra/managed/mysql/moco/write.go Outdated
jthakkar04 and others added 3 commits June 3, 2026 14:02
The micro tier specified 2 MySQL replicas, but Moco requires a positive
odd number, so a micro install was rejected and never reached
status.ready. Fix the sizing fixtures (micro: 2 -> 3) so the manifest
emits a value Moco accepts.

Sizing is resolved from the manifest at reconcile time, after the CR
admission webhook, so a bad count still reaches Moco unvalidated. Rather
than silently rewriting it, the operator now refuses to forward an
even or zero count and surfaces a clear Reconciled=False /
InvalidReplicaCount condition, keeping the manifest the source of truth.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensurePVCLabels looked for PVCs prefixed "datadir-<cluster>-", the name
used by the pre-Moco MySQL operator. Moco names its data PVCs
"mysql-data-moco-<cluster>-<n>", so the lookup matched nothing and the
wandb labels were never stamped. Because purgeAssociatedResources selects
PVCs by those labels, a Purge-retention teardown silently leaked the
MySQL data volumes.

Derive the prefix from Moco's own MySQLDataVolumeName constant and
PrefixedName() so it tracks upstream's naming and can't drift again.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moco does not support reducing a running cluster's replica count in
place. Reject an explicit decrease of spec.mysql.managedMysql.replicas
in the validating webhook so `kubectl apply` fails immediately with a
clear message instead of an opaque Moco admission error.

The webhook only sees the CR, so it cannot catch a shrink driven by the
manifest-resolved count (set in ApplyInfraSizing at reconcile) or compare
against the live cluster. A thin reconcile-time guard backstops those
cases, surfacing a clear Reconciled=False / ScaleDownUnsupported condition
and leaving the running cluster untouched. Reducing replicas still
requires Moco's manual stop-clustering procedure.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jthakkar04 jthakkar04 force-pushed the jthakkar/moco-replica-pvc-fixes branch from a89c1bf to bf113a1 Compare June 3, 2026 19:51
@jthakkar04
Copy link
Copy Markdown
Author

@CodeRabbit review

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 3, 2026

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/webhook/v2/weightsandbiases_webhook.go (1)

383-388: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Mention the positive-count requirement in the validation error.

ValidMysqlReplicaCount rejects negative odd values too, but this message only talks about oddness. For example, replicas: -1 will fail here even though the error suggests odd is sufficient.

✏️ Suggested fix
 		if spec.Replicas != 0 && !appsv2.ValidMysqlReplicaCount(spec.Replicas) {
 			errors = append(errors, field.Invalid(
 				mysqlPath.Child("managedMysql").Child("replicas"),
 				spec.Replicas,
-				"replicas must be an odd number (Moco enforces quorum-based replication)",
+				"replicas must be a positive odd number (Moco enforces quorum-based replication)",
 			))
 		}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/webhook/v2/weightsandbiases_webhook.go` around lines 383 - 388, The
validation error for MySQL replica counts is incomplete: when spec.Replicas
fails ValidMysqlReplicaCount it can be negative as well as even, so update the
field.Invalid message in the block that appends to errors (the call using
mysqlPath.Child("managedMysql").Child("replicas") and spec.Replicas) to state
both requirements (must be a positive odd number) or similar wording; ensure the
message mentions "positive" (or "greater than zero") and "odd" to match
ValidMysqlReplicaCount's behavior.
🧹 Nitpick comments (1)
internal/controller/infra/managed/mysql/moco/spec_test.go (1)

118-139: ⚡ Quick win

Build the expected PVC name from the same helpers as production.

This test re-hardcodes mysql-data-moco-mysql-0, so it can drift again even though spec.go/write.go now use dataVolumeName and PrefixedName() specifically to avoid that.

♻️ Suggested refactor
-		// Moco names PVCs "<dataVolumeName>-<PrefixedName()>-<ordinal>".
-		mocoPVC := &corev1.PersistentVolumeClaim{
-			ObjectMeta: metav1.ObjectMeta{Name: "mysql-data-moco-mysql-0", Namespace: "wandb"},
-		}
+		cluster := &mocov1beta2.MySQLCluster{ObjectMeta: metav1.ObjectMeta{Name: "mysql"}}
+		mocoPVCName := dataVolumeName + "-" + cluster.PrefixedName() + "-0"
+		mocoPVC := &corev1.PersistentVolumeClaim{
+			ObjectMeta: metav1.ObjectMeta{Name: mocoPVCName, Namespace: "wandb"},
+		}
@@
-		Expect(cl.Get(ctx, types.NamespacedName{Name: "mysql-data-moco-mysql-0", Namespace: "wandb"}, got)).To(Succeed())
+		Expect(cl.Get(ctx, types.NamespacedName{Name: mocoPVCName, Namespace: "wandb"}, got)).To(Succeed())
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/controller/infra/managed/mysql/moco/spec_test.go` around lines 118 -
139, The test hardcodes the PVC name "mysql-data-moco-mysql-0" which can drift;
update the spec_test to construct the expected PVC name using the same helpers
as production (use dataVolumeName and PrefixedName() as used in
spec.go/write.go) and pass that computed name into the fake client and the
subsequent Get calls and assertions around ensurePVCLabels so the test stays in
sync with implementation changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@internal/webhook/v2/weightsandbiases_webhook.go`:
- Around line 383-388: The validation error for MySQL replica counts is
incomplete: when spec.Replicas fails ValidMysqlReplicaCount it can be negative
as well as even, so update the field.Invalid message in the block that appends
to errors (the call using mysqlPath.Child("managedMysql").Child("replicas") and
spec.Replicas) to state both requirements (must be a positive odd number) or
similar wording; ensure the message mentions "positive" (or "greater than zero")
and "odd" to match ValidMysqlReplicaCount's behavior.

---

Nitpick comments:
In `@internal/controller/infra/managed/mysql/moco/spec_test.go`:
- Around line 118-139: The test hardcodes the PVC name "mysql-data-moco-mysql-0"
which can drift; update the spec_test to construct the expected PVC name using
the same helpers as production (use dataVolumeName and PrefixedName() as used in
spec.go/write.go) and pass that computed name into the fake client and the
subsequent Get calls and assertions around ensurePVCLabels so the test stays in
sync with implementation changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4ccc6f63-bd87-4d78-bf48-21527366b6e3

📥 Commits

Reviewing files that changed from the base of the PR and between 47e3cfd and 9c2854e.

📒 Files selected for processing (11)
  • api/v2/weightsandbiases_types.go
  • hack/testing-manifests/server-manifest/0.78.0-pre/sizing.yaml
  • hack/testing-manifests/server-manifest/0.78.0-single-file.yaml
  • hack/testing-manifests/server-manifest/0.78.0/manifest.yaml
  • hack/testing-manifests/server-manifest/0.79.0/manifest.yaml
  • internal/controller/infra/managed/mysql/moco/spec.go
  • internal/controller/infra/managed/mysql/moco/spec_test.go
  • internal/controller/infra/managed/mysql/moco/write.go
  • internal/webhook/v2/weightsandbiases_webhook.go
  • internal/webhook/v2/weightsandbiases_webhook_test.go
  • pkg/wandb/manifest/load_manifest_from_files_test.go

@jthakkar04 jthakkar04 requested a review from danielpanzella June 3, 2026 21:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants