Skip to content

feat: update quickstart script & gpu docs#1261

Open
ndrpp wants to merge 2 commits intomainfrom
feat/1258-extend-gpu-resource-config
Open

feat: update quickstart script & gpu docs#1261
ndrpp wants to merge 2 commits intomainfrom
feat/1258-extend-gpu-resource-config

Conversation

@ndrpp
Copy link
Member

@ndrpp ndrpp commented Mar 10, 2026

Fixes # .

Changes proposed in this PR:

  • add driverVersion, memoryTotal and platform as top level fields on resources
  • auto configure these fields in node quickstart
  • update quickstart script properties for amd & intel
  • update gpu docs

add `driverVersion`, `memoryTotal` and `platform` as top level fields on
resources
@ndrpp ndrpp linked an issue Mar 10, 2026 that may be closed by this pull request
4 tasks
@ndrpp ndrpp marked this pull request as ready for review March 10, 2026 14:50
@bogdanfazakas
Copy link
Member

/run-security-scan

Copy link
Member

@alexcos20 alexcos20 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AI automated code review (Gemini 3).

Overall risk: medium

Summary:
This pull request extends the GPU resource configuration by adding driverVersion, memoryTotal, and platform fields. It updates the scripts/list_gpus.sh and scripts/ocean-node-quickstart.sh to automatically detect and include this information for NVIDIA, AMD, and Intel GPUs. The changes involve new shell functions for driver and memory detection using various system tools (nvidia-smi, modinfo, clinfo, dpkg, rpm, lspci, sysfs). The docs/GPU.md is updated with examples reflecting these new fields and the ComputeResource interface and schema are extended accordingly.

Comments:
• [WARNING][style] The GPU detection logic (functions get_nvidia_gpus, get_driver_version, get_intel_driver_version, get_amd_driver_version, get_amd_vram, get_intel_vram, map_pci_to_primary, process_pci_line, get_generic_gpus, get_all_gpus_json) appears to be almost entirely duplicated between scripts/list_gpus.sh and scripts/ocean-node-quickstart.sh. This is a significant maintenance burden. Consider extracting this common logic into a shared script or a function that can be sourced by both scripts to avoid redundancy and potential inconsistencies when future updates are made.
• [INFO][other] The jq expression within get_all_gpus_json has become quite complex, especially with the conditional init block structure for NVIDIA vs. other GPUs. While it achieves the desired outcome, its readability and maintainability are reduced. Adding more inline comments to explain specific parts of the jq logic, especially the conditional init block and the null coalescing for driverVersion/memoryTotal, would be beneficial for future debugging and understanding.
• [INFO][other] The functions like get_amd_vram and get_intel_vram attempt to retrieve memory information from various sysfs paths and lspci output. While this provides good coverage, sysfs paths can occasionally vary across different kernel versions or distributions. It would be beneficial to add comments explaining the rationale behind checking multiple paths and any known limitations or fallback behaviors if detection fails silently (2>/dev/null).
• [INFO][documentation] The ShmSize value 8589934592 (8 GiB) is a very specific magic number in the example DOCKER_COMPUTE_ENVIRONMENTS configuration for AMD. It would be helpful to add a small note or comment explaining why this particular value is chosen, or if it's a common recommendation for AMD GPU compute environments to prevent shared memory-related issues within containers.
• [INFO][documentation] The addition of driverVersion, memoryTotal, and platform fields to the ComputeResource interface, along with the JSDoc for platform, is clear and correctly reflects the new data. This is a good improvement.

Devices: $dev,
Capabilities: [["gpu"]]
Capabilities: [["gpu"]],
DriverVersion: $driver_version,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DriverVersion & MemoryTotal should be at root level for device, not under deviceRequests

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extend resources config for GPU

3 participants