nvidia-k8s-device-plugin: allow disabling plugin through API #914

Draft
arnaldo2792 wants to merge 1 commit into bottlerocket-os:develop from arnaldo2792:k8s-device-plugin-disable/core-kit

@arnaldo2792
Contributor

Description of changes:

Add an API setting to disable the NVIDIA k8s device plugin. The new setting defaults to enabled, preserving existing behavior. Template rendering gracefully handles the setting being absent for downstreams that don't define it.
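
To make the "defaults to enabled, tolerates absence" behavior concrete, here is a minimal Python model of the decision. It is only a sketch: the real change lives in the core kit's templates, and the function names, the dictionary layout, and the hard-coded plugin command line (copied from the drop-in in the test output) are assumptions for illustration.

```python
from typing import Any, Mapping

# Command line copied from the drop-in shown in the test output; in the real
# template the flags come from the other nvidia settings, so treat this as a stand-in.
PLUGIN_CMD = (
    "/usr/bin/nvidia-device-plugin --device-list-strategy volume-mounts "
    "--device-id-strategy index --pass-device-specs=true"
)


def plugin_enabled(settings: Mapping[str, Any]) -> bool:
    """Return the effective `enabled` value, treating a missing key as True."""
    nvidia = settings.get("kubelet-device-plugins", {}).get("nvidia", {})
    return nvidia.get("enabled", True)


def render_exec_start(settings: Mapping[str, Any]) -> str:
    """Render a drop-in body resembling the two variants in the test output."""
    # The empty ExecStart= clears whatever the base unit defined.
    lines = ["[Service]", "ExecStart="]
    if plugin_enabled(settings):
        lines += [
            "ExecStartPre=/usr/bin/sleep 0.1",
            "ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock",
            f"ExecStart={PLUGIN_CMD}",
            "Type=simple",
            "RestartSec=2",
            "Restart=always",
        ]
    else:
        # Disabled: the unit still "starts", but only runs a no-op.
        lines += ["Type=oneshot", "ExecStart=/usr/bin/true", "RemainAfterExit=true"]
    return "\n".join(lines) + "\n"
```

Calling `render_exec_start({})` models a downstream whose settings model has no `enabled` key: the plugin command is still rendered, which matches the first test case below.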

Testing done:

  • Confirmed the template renders and the device plugin starts even when the setting is missing:
Details
[root@admin]# apiclient set settings.kubelet-device-plugins.nvidia.enabled=false
Failed to change settings: Failed PATCH request to '/settings/keypair?tx=apiclient-set-cNt2Gszti4V0VSpg': Status 400 when PATCHing /settings/keypair?tx=apiclient-set-cNt2Gszti4V0VSpg: Unable to match your input to the data model.  We may not have enough type information.  Please try the --json input form.  Cause: Error during deserialization: unknown field `enabled`, expected one of `pass-device-specs`, `device-id-strategy`, `device-list-strategy`, `device-sharing-strategy`, `time-slicing`, `mps`, `device-partitioning-strategy`, `mig` at line 1 column 46
[root@admin]# sheltie systemctl status nvidia-k8s-device-plugin.service
● nvidia-k8s-device-plugin.service - Start NVIDIA kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf, 10-requires-tmp.conf
             /etc/systemd/system/nvidia-k8s-device-plugin.service.d
             └─exec-start.conf
             /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service.d
             └─grid-license-file-check.conf
     Active: active (running) since Fri 2026-04-24 01:04:13 UTC; 11min ago
 Invocation: 24a100149424447488c8e7aa70c6b3b9
   Main PID: 3374 (nvidia-device-p)
      Tasks: 10 (limit: 36988)
     Memory: 52.6M (peak: 53.5M)
        CPU: 101ms
     CGroup: /system.slice/nvidia-k8s-device-plugin.service
             └─3374 /usr/bin/nvidia-device-plugin --device-list-strategy volume-mounts --device-id-strategy index --pass-device-specs=true

Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]:   "sharing": {
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]:     "timeSlicing": {}
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]:   },
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]:   "imex": {}
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]: }
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]: I0424 01:04:13.393914    3374 main.go:369] Retrieving plugins.
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]: I0424 01:04:13.421322    3374 server.go:197] Starting GRPC server for 'nvidia.com/gpu'
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]: I0424 01:04:13.422624    3374 server.go:141] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]: I0424 01:04:13.424616    3374 server.go:148] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Apr 24 01:04:13 ip-192-168-5-124.us-west-2.compute.internal nvidia-device-plugin[3374]: I0424 01:04:13.426032    3374 health.go:64] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]
[root@admin]# cat /.bottlerocket/rootfs/etc/systemd/system/nvidia-k8s-device-plugin.service.d/exec-start.conf
[Service]
ExecStart=
# Ensure that the kubelet device plugin socket exists before we start
# A brief sleep is needed to avoid the `test` failing its first check
ExecStartPre=/usr/bin/sleep 0.1
ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock
ExecStart=/usr/bin/nvidia-device-plugin --device-list-strategy volume-mounts --device-id-strategy index --pass-device-specs=true
Type=simple
RestartSec=2
Restart=always
[root@admin]#

In combination with: bottlerocket-os/bottlerocket-settings-sdk#135

  • Confirmed that the service starts by default:
Details
[root@admin]# apiclient get settings.kubelet-device-plugins.nvidia
{
  "settings": {
    "kubelet-device-plugins": {
      "nvidia": {
        "device-id-strategy": "index",
        "device-list-strategy": "cdi-cri",
        "device-partitioning-strategy": "none",
        "device-sharing-strategy": "none",
        "enabled": true,
        "pass-device-specs": true
      }
    }
  }
}
[root@admin]# cat /.bottlerocket/rootfs/etc/systemd/system/nvidia-k8s-device-plugin.service.d/exec-start.conf
[Service]
ExecStart=
# Ensure that the kubelet device plugin socket exists before we start
# A brief sleep is needed to avoid the `test` failing its first check
ExecStartPre=/usr/bin/sleep 0.1
ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock
ExecStart=/usr/bin/nvidia-device-plugin --device-list-strategy volume-mounts --device-id-strategy index --pass-device-specs=true
Type=simple
RestartSec=2
Restart=always
[root@admin]# sheltie systemctl status nvidia-k8s-device-plugin.service
● nvidia-k8s-device-plugin.service - Start NVIDIA kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf, 10-requires-tmp.conf
             /etc/systemd/system/nvidia-k8s-device-plugin.service.d
             └─exec-start.conf
             /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service.d
             └─grid-license-file-check.conf
     Active: active (running) since Fri 2026-04-24 00:53:55 UTC; 26s ago
 Invocation: d90a78c91bcc4d559be7af4e5bf45e63
    Process: 8135 ExecStartPre=/usr/bin/sleep 0.1 (code=exited, status=0/SUCCESS)
    Process: 8141 ExecStartPre=/usr/bin/test -S /var/lib/kubelet/device-plugins/kubelet.sock (code=exited, status=0/SUCCESS)
    Process: 8146 ExecStartPre=/usr/bin/test -f /etc/drivers/.grid-licensed (code=exited, status=0/SUCCESS)
   Main PID: 8149 (nvidia-device-p)
      Tasks: 10 (limit: 36988)
     Memory: 21.4M (peak: 22.4M)
        CPU: 65ms
     CGroup: /system.slice/nvidia-k8s-device-plugin.service
             └─8149 /usr/bin/nvidia-device-plugin --device-list-strategy volume-mounts --device-id-strategy index --pass-device-specs=true

Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]:   "sharing": {
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]:     "timeSlicing": {}
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]:   },
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]:   "imex": {}
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]: }
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]: I0424 00:53:55.589175    8149 main.go:369] Retrieving plugins.
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]: I0424 00:53:55.621061    8149 server.go:197] Starting GRPC server for 'nvidia.com/gpu'
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]: I0424 00:53:55.621608    8149 server.go:141] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]: I0424 00:53:55.624470    8149 server.go:148] Registered device plugin for 'nvidia.com/gpu' with Kubelet
Apr 24 00:53:55 ip-192-168-36-98.us-west-2.compute.internal nvidia-device-plugin[8149]: I0424 00:53:55.625674    8149 health.go:64] Ignoring the following XIDs for health checks: map[13:true 31:true 43:true 45:true 68:true 109:true]
  • Confirmed that the service doesn't run when enabled = false:
Details
[root@admin]# apiclient set settings.kubelet-device-plugins.nvidia.enabled=false
[root@admin]# cat /.bottlerocket/rootfs/etc/systemd/system/nvidia-k8s-device-plugin.service.d/exec-start.conf
[Service]
ExecStart=
Type=oneshot
ExecStart=/usr/bin/true
RemainAfterExit=true
[root@admin]# sheltie systemctl status nvidia-k8s-device-plugin.service
● nvidia-k8s-device-plugin.service - Start NVIDIA kubernetes device plugin
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service; enabled; preset: enabled)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf, 10-requires-tmp.conf
             /etc/systemd/system/nvidia-k8s-device-plugin.service.d
             └─exec-start.conf
             /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/nvidia-k8s-device-plugin.service.d
             └─grid-license-file-check.conf
     Active: active (exited) since Fri 2026-04-24 00:54:36 UTC; 7s ago
 Invocation: 4fe207d9febe4cd7bad75191150967bb
    Process: 8494 ExecStartPre=/usr/bin/test -f /etc/drivers/.grid-licensed (code=exited, status=0/SUCCESS)
    Process: 8498 ExecStart=/usr/bin/true (code=exited, status=0/SUCCESS)
   Main PID: 8498 (code=exited, status=0/SUCCESS)
   Mem peak: 1.2M
        CPU: 9ms

Apr 24 00:54:36 ip-192-168-36-98.us-west-2.compute.internal systemd[1]: Starting Start NVIDIA kubernetes device plugin...
Apr 24 00:54:36 ip-192-168-36-98.us-west-2.compute.internal systemd[1]: Finished Start NVIDIA kubernetes device plugin.
[root@admin]#
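
Both drop-ins above begin with an empty `ExecStart=` line. In systemd, assigning an empty string to `ExecStart=` resets the list of commands collected so far, which is what lets the drop-in replace the base unit's command instead of appending to it. The sketch below models only that rule in Python; the base unit content used in the example is hypothetical (the PR does not show it), and the parser ignores everything except `ExecStart=` lines.

```python
from typing import Iterable, List


def effective_exec_start(unit_lines: Iterable[str]) -> List[str]:
    """Collect ExecStart= commands, honoring systemd's empty-assignment reset."""
    commands: List[str] = []
    for raw in unit_lines:
        line = raw.strip()
        # Skip blanks, comments, and section headers such as [Service].
        if not line or line.startswith("#") or line.startswith("["):
            continue
        key, sep, value = line.partition("=")
        # Only exact ExecStart keys matter here; ExecStartPre is a different key.
        if not sep or key.strip() != "ExecStart":
            continue
        value = value.strip()
        if value == "":
            commands.clear()  # empty assignment resets the list
        else:
            commands.append(value)
    return commands


# Hypothetical base unit, followed by the disabled drop-in from the output above.
base_unit = ["[Service]", "ExecStart=/usr/bin/nvidia-device-plugin"]
disabled_dropin = [
    "[Service]",
    "ExecStart=",
    "Type=oneshot",
    "ExecStart=/usr/bin/true",
    "RemainAfterExit=true",
]
print(effective_exec_start(base_unit + disabled_dropin))  # ['/usr/bin/true']
```

With `Type=oneshot` and `RemainAfterExit=true`, the no-op command leaves the unit in the `active (exited)` state seen in the status output, so the unit is still reported as active even though the plugin never runs.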

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

@arnaldo2792
Contributor Author

Pending commit to bump the Settings SDK, as I have to do a release for that first.

@arnaldo2792 arnaldo2792 force-pushed the k8s-device-plugin-disable/core-kit branch from 2e3d89f to 9dfd873 Compare May 2, 2026 00:07
@arnaldo2792
Contributor Author

Forced push includes:

  • Simplify the drop-ins so they are more streamlined
  • Prevent the MPS daemon from starting if the device plugin is disabled

Add an API setting to disable the NVIDIA k8s device plugin. The new
setting defaults to enabled, preserving existing behavior. Template
rendering gracefully handles the setting being absent for downstreams
that don't define it.

Signed-off-by: Arnaldo Garcia Rincon <agarrcia@amazon.com>
@arnaldo2792 arnaldo2792 force-pushed the k8s-device-plugin-disable/core-kit branch from 9dfd873 to 0f85aba Compare May 2, 2026 00:09
@arnaldo2792
Contributor Author

(Forced push to rebase)
