Conversation
add `driverVersion`, `memoryTotal` and `platform` as top level fields on resources
/run-security-scan
alexcos20
left a comment
AI automated code review (Gemini 3).
Overall risk: medium
Summary:
This pull request extends the GPU resource configuration by adding driverVersion, memoryTotal, and platform fields. It updates scripts/list_gpus.sh and scripts/ocean-node-quickstart.sh to automatically detect and include this information for NVIDIA, AMD, and Intel GPUs. The changes add new shell functions for driver and memory detection using various system tools (nvidia-smi, modinfo, clinfo, dpkg, rpm, lspci, sysfs). docs/GPU.md is updated with examples reflecting the new fields, and the ComputeResource interface and schema are extended accordingly.
Comments:
• [WARNING][style] The GPU detection logic (functions get_nvidia_gpus, get_driver_version, get_intel_driver_version, get_amd_driver_version, get_amd_vram, get_intel_vram, map_pci_to_primary, process_pci_line, get_generic_gpus, get_all_gpus_json) appears to be almost entirely duplicated between scripts/list_gpus.sh and scripts/ocean-node-quickstart.sh. This is a significant maintenance burden. Consider extracting this common logic into a shared script or a function that can be sourced by both scripts to avoid redundancy and potential inconsistencies when future updates are made.
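The extraction the reviewer suggests can be sketched as follows. This is a minimal, hypothetical example: the shared file name (`gpu_detect_lib.sh`) and the stub body of `get_driver_version` are assumptions, not the PR's actual code; the point is only that both scripts source one copy of the functions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical shared library holding the duplicated detection functions once.
tmpdir="$(mktemp -d)"
cat > "$tmpdir/gpu_detect_lib.sh" <<'EOF'
# Single source of truth for GPU detection helpers.
get_driver_version() {
  # Real logic (nvidia-smi / modinfo / dpkg ...) would live here, once.
  echo "535.104.05"
}
EOF

# Each consumer script (list_gpus.sh, ocean-node-quickstart.sh) then only
# needs to source the shared file instead of carrying its own copy:
# shellcheck disable=SC1091
. "$tmpdir/gpu_detect_lib.sh"

ver="$(get_driver_version)"
echo "driver: $ver"
rm -rf "$tmpdir"
```

In the real scripts the `source` path would typically be resolved relative to the script's own directory so both entry points find the library regardless of the caller's working directory.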
• [INFO][other] The jq expression within get_all_gpus_json has become quite complex, especially with the conditional init block structure for NVIDIA vs. other GPUs. While it achieves the desired outcome, its readability and maintainability are reduced. Adding more inline comments to explain specific parts of the jq logic, especially the conditional init block and the null coalescing for driverVersion/memoryTotal, would be beneficial for future debugging and understanding.
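The null-coalescing the comment refers to is jq's `//` alternative operator. The snippet below is only a sketch of that pattern with placeholder input; the actual expression in get_all_gpus_json is more involved (conditional init block, per-vendor branches).

```shell
#!/usr/bin/env bash
set -euo pipefail
command -v jq >/dev/null 2>&1 || exit 0  # skip the sketch if jq is absent

# Placeholder device record with a missing/null field:
json='{"id":0,"driverVersion":null}'

# "a // b" yields b when a is null or false -- this is how absent
# driverVersion/memoryTotal values can be given defaults:
out="$(printf '%s' "$json" | jq -c \
  '{id, driverVersion: (.driverVersion // "unknown"), memoryTotal: (.memoryTotal // 0)}')"
echo "$out"
```

A one-line comment like this next to each `//` in the real jq program would go a long way toward the readability the reviewer asks for.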
• [INFO][other] The functions like get_amd_vram and get_intel_vram attempt to retrieve memory information from various sysfs paths and lspci output. While this provides good coverage, sysfs paths can occasionally vary across different kernel versions or distributions. It would be beneficial to add comments explaining the rationale behind checking multiple paths and any known limitations or fallback behaviors if detection fails silently (2>/dev/null).
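The fallback pattern in question can be sketched like this. Only the amdgpu sysfs path shown is taken from known kernel behavior; the function name and the `echo 0` sentinel are illustrative assumptions, and the real script additionally falls back to lspci parsing, which is omitted here.

```shell
#!/usr/bin/env bash
set -euo pipefail

get_vram_bytes() {
  local card="$1" path
  # amdgpu exposes total VRAM here on recent kernels; the path is not
  # guaranteed on every kernel/distro, hence the guarded read.
  path="/sys/class/drm/$card/device/mem_info_vram_total"
  if [ -r "$path" ]; then
    cat "$path" 2>/dev/null && return 0
  fi
  # Explicit sentinel instead of silently producing no output:
  echo 0
}

vram="$(get_vram_bytes card0)"
echo "memoryTotal: $vram"
```

Documenting each probed path with a comment like the one above (which kernel/driver exposes it, and what the sentinel means downstream) addresses the "fails silently" concern without changing behavior.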
• [INFO][documentation] The ShmSize value 8589934592 (8 GiB) is a very specific magic number in the example DOCKER_COMPUTE_ENVIRONMENTS configuration for AMD. It would be helpful to add a small note or comment explaining why this particular value is chosen, or if it's a common recommendation for AMD GPU compute environments to prevent shared memory-related issues within containers.
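For reference, the "magic number" decomposes cleanly: 8589934592 is exactly 8 GiB in bytes, which a comment in the example config could state directly.

```shell
#!/usr/bin/env bash
set -euo pipefail

# ShmSize is specified in bytes: 8 * 1024^3 = 8 GiB.
shm_size=$(( 8 * 1024 * 1024 * 1024 ))
echo "$shm_size"   # 8589934592
```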
• [INFO][documentation] The addition of driverVersion, memoryTotal, and platform fields to the ComputeResource interface, along with the JSDoc for platform, is clear and correctly reflects the new data. This is a good improvement.
scripts/list_gpus.sh
Outdated
```diff
         Devices: $dev,
-        Capabilities: [["gpu"]]
+        Capabilities: [["gpu"]],
+        DriverVersion: $driver_version,
```
DriverVersion & MemoryTotal should be at root level for device, not under deviceRequests
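The shape the reviewer is asking for might look like the sketch below: DriverVersion and MemoryTotal as top-level keys of the device entry rather than nested under deviceRequests. All field values here are placeholders, and the surrounding structure is an assumption, not the PR's actual output.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical device entry with the new fields at root level:
device_json="$(cat <<'EOF'
{
  "Devices": "0",
  "Capabilities": [["gpu"]],
  "DriverVersion": "535.104.05",
  "MemoryTotal": 8589934592
}
EOF
)"
printf '%s\n' "$device_json"
```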
Fixes # .
Changes proposed in this PR:
`driverVersion`, `memoryTotal` and `platform` as top level fields on resources