Description
What happened:
When provisioning an on-demand GPU node, if ZONE_RESOURCE_POOL_EXHAUSTED occurs, the karpenter-gcp-provider logs show "Created instance", but the node never boots up because the zone's resources are exhausted.
{"level":"INFO","time":"2025-12-04T12:00:51.845Z","logger":"controller","message":"Created instance","commit":"195a383-dirty","controller":"nodeclaim.lifecycle","controllerGroup":"karpenter.sh","controllerKind":"NodeClaim","NodeClaim":{"name":"ws-l4-gpu1-test-jqxp7"},"namespace":"","name":"ws-l4-gpu1-test-jqxp7","reconcileID":"733aa975-e75f-42e3-a806-aa0fa4bcc025","instanceName":"karpenter-ws-l4-gpu1-test-jqxp7","instanceType":"g2-standard-16","zone":"europe-west6-b","projectID":"iprally-ai-dev","region":"europe-west6","providerID":"karpenter-ws-l4-gpu1-test-jqxp7","providerID":"karpenter-ws-l4-gpu1-test-jqxp7","Labels":{"env":"dev","goog-k8s-cluster-name":"mlops-west6-dev","karpenter-k8s-gcp-gcenodeclass":"ws-nodeclass-test","karpenter-sh-nodepool":"ws-l4-gpu1-test"},"Tags":{"items":["gke-mlops-west6-dev-745419f3-node"]},"Status":""}
It looks like the code already handles this case at https://github.com/cloudpilot-ai/karpenter-provider-gcp/blob/main/pkg/providers/instance/instance.go#L125, but for some reason the error is not caught.
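One possible explanation, sketched below: GCE's instances.insert call can succeed while the capacity failure only surfaces later in the resulting operation's error list, so code that checks only the insert call's returned error would miss ZONE_RESOURCE_POOL_EXHAUSTED. The struct here is a hypothetical stand-in for the relevant fields of the compute/v1 Operation type, not the provider's actual code:

```go
package main

import "fmt"

// operationError and operation are simplified stand-ins for the
// Operation.Error.Errors shape in google.golang.org/api/compute/v1.
type operationError struct {
	Code string
}

type operation struct {
	Errors []operationError
}

// isZoneExhausted reports whether a finished insert operation failed
// with ZONE_RESOURCE_POOL_EXHAUSTED. Capacity errors appear in the
// operation's error list, not as an error from the insert call itself.
func isZoneExhausted(op *operation) bool {
	if op == nil {
		return false
	}
	for _, e := range op.Errors {
		if e.Code == "ZONE_RESOURCE_POOL_EXHAUSTED" {
			return true
		}
	}
	return false
}

func main() {
	exhausted := &operation{Errors: []operationError{{Code: "ZONE_RESOURCE_POOL_EXHAUSTED"}}}
	fmt.Println(isZoneExhausted(exhausted)) // true
	fmt.Println(isZoneExhausted(&operation{})) // false
}
```

If the provider's detection runs before the operation completes, or inspects a different field, that could explain why the existing handling is bypassed.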
What you expected to happen:
The error should appear in the karpenter logs, and kubectl describe pod should show an event like the one below.
Warning FailedScheduling 35s karpenter Failed to schedule pod, nodepool requirements filtered out all available instance types
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Karpenter-provider-gcp version (use git describe --tags --dirty --always):
- GKE version:
- Others: