fix: Moco Fixes#203
Conversation
📝 WalkthroughWalkthroughThis PR enforces MySQL replica count validation (positive odd numbers) across the operator stack. It adds a new validator function, enforces it at webhook admission and controller reconciliation boundaries, prevents replica scale-down, updates PVC naming for consistency, and bumps manifest defaults to 3 replicas. ChangesMySQL replica count validation and constraints
Sequence Diagram(s)sequenceDiagram
participant WeightsAndBiases
participant validateMySQLChanges
participant ValidMysqlReplicaCount
participant ValidationError
WeightsAndBiases->>validateMySQLChanges: ValidateUpdate(new, old)
validateMySQLChanges->>validateMySQLChanges: Check old and new replicas
alt decrease detected
validateMySQLChanges->>ValidationError: field error "replicas cannot be decreased"
else increase or unset
validateMySQLChanges->>WeightsAndBiases: allowed
end
WeightsAndBiases->>ValidMysqlReplicaCount: validate spec.mysql.managedMysql.replicas
alt invalid (not positive odd)
ValidMysqlReplicaCount->>ValidationError: field invalid
end
sequenceDiagram
participant WriteState
participant ValidMysqlReplicaCount
participant Reconciliation
WriteState->>ValidMysqlReplicaCount: check desired.Spec.Replicas
alt invalid replicas
ValidMysqlReplicaCount->>Reconciliation: ReconciledType=False, InvalidReplicaCountReason
end
WriteState->>WriteState: check if actual exists and scale-down
alt scale-down detected
WriteState->>Reconciliation: ReconciledType=False, ScaleDownUnsupportedReason
else valid change
WriteState->>Reconciliation: proceed
end
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
The micro tier specified 2 MySQL replicas, but Moco requires a positive odd number, so a micro install was rejected and never reached status.ready. Fix the sizing fixtures (micro: 2 -> 3) so the manifest emits a value Moco accepts. Sizing is resolved from the manifest at reconcile time, after the CR admission webhook, so a bad count still reaches Moco unvalidated. Rather than silently rewriting it, the operator now refuses to forward an even or zero count and surfaces a clear Reconciled=False / InvalidReplicaCount condition, keeping the manifest the source of truth. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ensurePVCLabels looked for PVCs prefixed "datadir-<cluster>-", the name used by the pre-Moco MySQL operator. Moco names its data PVCs "mysql-data-moco-<cluster>-<n>", so the lookup matched nothing and the wandb labels were never stamped. Because purgeAssociatedResources selects PVCs by those labels, a Purge-retention teardown silently leaked the MySQL data volumes. Derive the prefix from Moco's own MySQLDataVolumeName constant and PrefixedName() so it tracks upstream's naming and can't drift again. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Moco does not support reducing a running cluster's replica count in place. Reject an explicit decrease of spec.mysql.managedMysql.replicas in the validating webhook so `kubectl apply` fails immediately with a clear message instead of an opaque Moco admission error. The webhook only sees the CR, so it cannot catch a shrink driven by the manifest-resolved count (set in ApplyInfraSizing at reconcile) or compare against the live cluster. A thin reconcile-time guard backstops those cases, surfacing a clear Reconciled=False / ScaleDownUnsupported condition and leaving the running cluster untouched. Reducing replicas still requires Moco's manual stop-clustering procedure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a89c1bf to
bf113a1
Compare
|
@CodeRabbit review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/webhook/v2/weightsandbiases_webhook.go (1)
383-388:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winMention the positive-count requirement in the validation error.
ValidMysqlReplicaCountrejects negative odd values too, but this message only talks about oddness. For example,replicas: -1will fail here even though the error suggests odd is sufficient.✏️ Suggested fix
if spec.Replicas != 0 && !appsv2.ValidMysqlReplicaCount(spec.Replicas) { errors = append(errors, field.Invalid( mysqlPath.Child("managedMysql").Child("replicas"), spec.Replicas, - "replicas must be an odd number (Moco enforces quorum-based replication)", + "replicas must be a positive odd number (Moco enforces quorum-based replication)", )) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/webhook/v2/weightsandbiases_webhook.go` around lines 383 - 388, The validation error for MySQL replica counts is incomplete: when spec.Replicas fails ValidMysqlReplicaCount it can be negative as well as even, so update the field.Invalid message in the block that appends to errors (the call using mysqlPath.Child("managedMysql").Child("replicas") and spec.Replicas) to state both requirements (must be a positive odd number) or similar wording; ensure the message mentions "positive" (or "greater than zero") and "odd" to match ValidMysqlReplicaCount's behavior.
🧹 Nitpick comments (1)
internal/controller/infra/managed/mysql/moco/spec_test.go (1)
118-139: ⚡ Quick winBuild the expected PVC name from the same helpers as production.
This test re-hardcodes
mysql-data-moco-mysql-0, so it can drift again even thoughspec.go/write.gonow usedataVolumeNameandPrefixedName()specifically to avoid that.♻️ Suggested refactor
- // Moco names PVCs "<dataVolumeName>-<PrefixedName()>-<ordinal>". - mocoPVC := &corev1.PersistentVolumeClaim{ - ObjectMeta: metav1.ObjectMeta{Name: "mysql-data-moco-mysql-0", Namespace: "wandb"}, - } + cluster := &mocov1beta2.MySQLCluster{ObjectMeta: metav1.ObjectMeta{Name: "mysql"}} + mocoPVCName := dataVolumeName + "-" + cluster.PrefixedName() + "-0" + mocoPVC := &corev1.PersistentVolumeClaim{ + ObjectMeta: metav1.ObjectMeta{Name: mocoPVCName, Namespace: "wandb"}, + } @@ - Expect(cl.Get(ctx, types.NamespacedName{Name: "mysql-data-moco-mysql-0", Namespace: "wandb"}, got)).To(Succeed()) + Expect(cl.Get(ctx, types.NamespacedName{Name: mocoPVCName, Namespace: "wandb"}, got)).To(Succeed())🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/controller/infra/managed/mysql/moco/spec_test.go` around lines 118 - 139, The test hardcodes the PVC name "mysql-data-moco-mysql-0" which can drift; update the spec_test to construct the expected PVC name using the same helpers as production (use dataVolumeName and PrefixedName() as used in spec.go/write.go) and pass that computed name into the fake client and the subsequent Get calls and assertions around ensurePVCLabels so the test stays in sync with implementation changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@internal/webhook/v2/weightsandbiases_webhook.go`:
- Around line 383-388: The validation error for MySQL replica counts is
incomplete: when spec.Replicas fails ValidMysqlReplicaCount it can be negative
as well as even, so update the field.Invalid message in the block that appends
to errors (the call using mysqlPath.Child("managedMysql").Child("replicas") and
spec.Replicas) to state both requirements (must be a positive odd number) or
similar wording; ensure the message mentions "positive" (or "greater than zero")
and "odd" to match ValidMysqlReplicaCount's behavior.
---
Nitpick comments:
In `@internal/controller/infra/managed/mysql/moco/spec_test.go`:
- Around line 118-139: The test hardcodes the PVC name "mysql-data-moco-mysql-0"
which can drift; update the spec_test to construct the expected PVC name using
the same helpers as production (use dataVolumeName and PrefixedName() as used in
spec.go/write.go) and pass that computed name into the fake client and the
subsequent Get calls and assertions around ensurePVCLabels so the test stays in
sync with implementation changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4ccc6f63-bd87-4d78-bf48-21527366b6e3
📒 Files selected for processing (11)
api/v2/weightsandbiases_types.gohack/testing-manifests/server-manifest/0.78.0-pre/sizing.yamlhack/testing-manifests/server-manifest/0.78.0-single-file.yamlhack/testing-manifests/server-manifest/0.78.0/manifest.yamlhack/testing-manifests/server-manifest/0.79.0/manifest.yamlinternal/controller/infra/managed/mysql/moco/spec.gointernal/controller/infra/managed/mysql/moco/spec_test.gointernal/controller/infra/managed/mysql/moco/write.gointernal/webhook/v2/weightsandbiases_webhook.gointernal/webhook/v2/weightsandbiases_webhook_test.gopkg/wandb/manifest/load_manifest_from_files_test.go
In this PR, we address 3 bugs from the MySQL→Moco migration. Images below are from Moco documentation: LINK
micro = 2) or an unset/0 value reaches Moco unchecked and is rejected, leaving the install stuck. We now refuse to forward an invalid count to Moco, surfacing a clearInvalidReplicaCountcondition instead of letting the install fail.Note We also have another PR to change the replica sizes from 2 → 3: https://github.com/wandb/core/pull/45000
datadir-) instead of the one Moco uses (mysql-data-moco-…), so it never tagged them — and purge, which finds disks by that tag, skipped them. We now build the name from Moco's own constant and naming method, so it stays in sync.Reconciled=False/ScaleDownUnsupportedcondition.Note: Reducing replicas is still possible, but the operator won't do it automatically — it requires Moco's manual stop-clustering procedure.
Summary by CodeRabbit
Bug Fixes
Chores