chart: add optional scheduler/worker split (v0.4.0) and HPA support#6

Merged
Bierchermuesli merged 4 commits into main from feat/scheduler-worker-split
Jun 21, 2026
Bierchermuesli commented Jun 19, 2026

Contributor
  • Adds an optional scheduler Deployment that runs in scheduler-only mode (NETDISCO_WORKERS_TASKS=0, no pollers)
  • When scheduler.enabled=true, the backend Deployment automatically receives NETDISCO_NO_SCHEDULER=1, making it safe to run multiple backend replicas without duplicate job submissions into the PostgreSQL queue
  • Adds optional CPU+memory HPA for the backend Deployment (backend.hpa.enabled, disabled by default) with configurable scale-down stabilization to avoid killing pods mid-job
  • Omits replicas from the backend Deployment when HPA is enabled so ArgoCD/Flux don't fight the HPA
  • Adds Vault agent resource limits/requests annotations to avoid consuming namespace CPU quota unexpectedly
  • Adds revisionHistoryLimit: 3 to all Deployments to cap stale ReplicaSets
  • Both the scheduler split and the HPA are disabled by default, so existing deployments see no behaviour change
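
As a rough illustration of the options listed above, a values override that turns on only the scheduler split might look like the sketch below. The keys `scheduler.enabled` and `backend.hpa.enabled` are the ones named in this PR; the file name is hypothetical.

```yaml
# values-split.yaml (hypothetical file name)
scheduler:
  enabled: true    # creates the scheduler-only Deployment; backend then gets NETDISCO_NO_SCHEDULER=1
backend:
  hpa:
    enabled: false # the default; replicas stays on the backend Deployment
```

With `scheduler.enabled: false` (the default) the chart is expected to render exactly as it did before this change.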

How?

The netdisco backend runs three MCE roles in one process: Scheduler (submits jobs on cron), Manager (pulls from PG queue), and Poller (executes jobs). Running multiple replicas today causes the Scheduler to submit duplicate jobs every minute.

With this split:

  • 1× scheduler pod: Scheduler only, zero pollers
  • N× backend pods: Manager + Pollers, no Scheduler

Requires NETDISCO_NO_SCHEDULER and NETDISCO_WORKERS_TASKS env var support in the netdisco backend binary.
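
To make the split concrete, the fragment below sketches the environment each Deployment's container would carry when `scheduler.enabled=true`. The variable names and values come from this PR; the container names are hypothetical and the fragments are abridged, not the chart's actual rendered output.

```yaml
# netdisco-scheduler Deployment (single pod): Scheduler only
containers:
  - name: scheduler              # hypothetical container name
    env:
      - name: NETDISCO_WORKERS_TASKS
        value: "0"               # zero pollers
---
# netdisco-backend Deployment (N pods): Manager + Pollers
containers:
  - name: backend                # hypothetical container name
    env:
      - name: NETDISCO_NO_SCHEDULER
        value: "1"               # do not submit scheduled jobs
```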

HPA example (CPU+memory)

backend:
  hpa:
    enabled: true
    minReplicas: 2
    maxReplicas: 4
    targetCPUUtilization: 80
    targetMemoryUtilization: 80
    scaleDownStabilizationSeconds: 300  # avoid killing pods mid-job
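
For orientation, values like these would plausibly map onto an `autoscaling/v2` HorizontalPodAutoscaler along the following lines. This is a sketch of the expected shape, not the chart's verified output; the HPA name is assumed to match the backend Deployment.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: netdisco-backend           # assumed name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: netdisco-backend
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # targetCPUUtilization
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80   # targetMemoryUtilization
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # scaleDownStabilizationSeconds
```

Utilization targets are computed relative to container resource requests, so the backend containers need CPU and memory requests set for this HPA to produce a metric.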

HPA example (KEDA / queue depth) — untested, requires KEDA on the cluster

netdisco exposes netdisco_jobs{status="queued"} on the web /metrics endpoint which makes it a natural KEDA trigger.

Replicas are calculated as ceil(queueDepth / threshold). Tune the threshold to match workers.tasks and your expected queue depth — a full discovery run can queue hundreds of jobs, so ~50 gives a gradual ramp without immediately pegging at maxReplicas:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: netdisco-backend
spec:
  scaleTargetRef:
    name: netdisco-backend
  minReplicaCount: 1
  maxReplicaCount: 4
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus:9090
        metricName: netdisco_queued_jobs
        query: netdisco_jobs{status="queued",tenant="netdisco"}
        threshold: "50"   # ceil(queueDepth/threshold) replicas — 200 jobs → 4 replicas
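
The replica arithmetic in the comment above can be worked through for a few queue depths. This is a sketch assuming the threshold of 50 and the min/max counts from the example, and it ignores the HPA's tolerance band and stabilization behaviour; q is the queued-job count and t the threshold.

```latex
\[
\text{replicas} = \min\bigl(\text{maxReplicaCount},\ \max\bigl(\text{minReplicaCount},\ \lceil q / t \rceil\bigr)\bigr)
\]
\[
\begin{aligned}
q = 30  &: \lceil 30/50 \rceil = 1 \\
q = 120 &: \lceil 120/50 \rceil = 3 \\
q = 200 &: \lceil 200/50 \rceil = 4 \\
q = 500 &: \lceil 500/50 \rceil = 10 \rightarrow 4 \ (\text{capped at maxReplicaCount})
\end{aligned}
\]
```
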
Todo
  • helm template with scheduler.enabled: false produces identical output to before
  • helm template with scheduler.enabled: true produces both netdisco-backend and netdisco-scheduler Deployments
  • Scheduler pod has NETDISCO_WORKERS_TASKS=0, backend pods have NETDISCO_NO_SCHEDULER=1
  • HPA applies cleanly with autoscaling/v2

When scheduler.enabled=true, a dedicated scheduler-only Deployment is
created (NETDISCO_WORKERS_TASKS=0) and the backend Deployment receives
NETDISCO_NO_SCHEDULER=1, allowing multiple backend replicas to run
safely without duplicate job submissions into the PostgreSQL queue.
Bierchermuesli changed the title from "chart: add optional scheduler/worker split (v0.4.0)" to "chart: add optional scheduler/worker split (v0.4.0) and HPA support" on Jun 20, 2026
ollyg commented Jun 21, 2026

Member

I've no comments - happy for you to merge @Bierchermuesli when you feel it's ready!

ollyg commented Jun 21, 2026

Member

BTW yes, this is very cool to see, thank you! :-D

…storyLimit

- Add optional autoscaling/v2 HPA for backend (backend.hpa.enabled, off by default)
  with CPU and memory metrics and configurable scale-down stabilization window
- Omit replicas from backend Deployment when HPA is enabled to avoid ArgoCD fighting HPA
- Add vault agent resource limits/requests annotations to avoid consuming default quota
- Add revisionHistoryLimit: 3 to all Deployments to cap stale ReplicaSets
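
The Deployment-side changes in this commit could be sketched in the backend template roughly as follows. This is an abridged illustration rather than the chart's actual template: the file path, the `backend.replicas` values key, and the annotation values are hypothetical, and the Vault annotations assume the Vault Agent Injector is what the chart targets.

```yaml
# templates/backend-deployment.yaml (hypothetical path), abridged
apiVersion: apps/v1
kind: Deployment
metadata:
  name: netdisco-backend
spec:
  revisionHistoryLimit: 3          # cap stale ReplicaSets
  {{- if not .Values.backend.hpa.enabled }}
  replicas: {{ .Values.backend.replicas }}   # hypothetical values key
  {{- end }}
  template:
    metadata:
      annotations:
        # example values only
        vault.hashicorp.com/agent-requests-cpu: "50m"
        vault.hashicorp.com/agent-requests-mem: "64Mi"
        vault.hashicorp.com/agent-limits-cpu: "100m"
        vault.hashicorp.com/agent-limits-mem: "128Mi"
```

Leaving `replicas` out when the HPA is enabled means a GitOps controller has no declared replica count to reconcile back to, so the HPA remains the only writer of that field.
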
Bierchermuesli merged commit 2bcd15e into main on Jun 21, 2026
1 check passed