Collect GPU index assignments from SLURM gres_detail and filter Job Analyzer GPU charts (#129)
Conversation
CI Commands
The following CI workflows run automatically on every push and pull request.
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
luccabb left a comment:
Since the squeue CLI has no GRES_DETAIL, this field would always be null; we'd need to add support for Slurm REST (SlurmRestClient) in the collection pipeline.
@claude did we miss anything?
Claude finished @luccabb's task in 1m 24s. Review: Did we miss anything?
Beyond the items you already flagged, here are additional issues I found:
…nalyzer GPU charts (facebookresearch#129)

Summary: Adds GPU index collection from the SLURM REST API's gres_detail field to the GCM pipeline, and uses it in the FAIR Job Analyzer to show only the GPUs assigned to a job (instead of all 8 GPUs on the node).

## Background:

When a job uses fewer GPUs than are available on a node (e.g., --gpus-per-task=1 on an 8-GPU node), the Job Analyzer previously showed metrics for all 8 GPUs. The existing GPUS_REQUESTED field comes from TRES-PER-NODE (always 8 for the full node), not the per-task allocation. TRES_GPUS_ALLOCATED correctly reports the count (e.g., 1) but not which specific GPU indices are assigned.

The SLURM REST API provides gres_detail, an array of strings with exact GPU index assignments per node (e.g., "gpu:ampere:1(IDX:7)"). Verified on AVA RSC: scontrol show job <id> -d | grep GRES → GRES=gpu:ampere:1(IDX:7)

## Pipeline change (Python):

- parsing.py: Added parse_gres_gpu_indices(), which parses gres_detail strings into GPU index lists. Returns a comma-separated string of indices for single-node partial-GPU jobs (e.g., "7" or "0,3,5"), and None for full-node (8-GPU) or multi-node jobs. This avoids storing unnecessary data.
- squeue.py: Added the GRES_GPU_INDICES field (nullable, defaults to None) and the "gres_detail" → "GRES_DETAIL" REST API mapping. Adds one string column to existing fair_job_data rows; no extra entries.
- test_parsers.py: Added 12 test cases covering single GPU, multiple GPUs, range notation, full-node, multi-node, and edge cases (empty, null, N/A).

## Job Analyzer change (Hack):

- FairJob.php: Added the $gresGpuIndices property.
- FAIRJobAnalyzerLatestJobInfoModule.php: Queries GRES_GPU_INDICES from the fair_job_data Scuba table.
- FAIRJobAnalyzerPerfAnalyzerModule.php: When gresGpuIndices is available (e.g., "7"), filters all 5 GPU ODS charts (utilization, temperature, SM util, SM occupancy, memory) to gpu=(7) with a per-GPU reduceTerm. When it is null (full-node or multi-node), shows all GPUs with the original averaged reduceTerm.

### Scope:

- Single-node partial-GPU jobs: shows only the assigned GPUs (100% accurate).
- Single-node full-GPU jobs: shows all GPUs (unchanged; no filtering needed).
- Multi-node jobs: shows all GPUs (unchanged; gres_detail has per-node values that can't be stored in fair_job_data's 1-row-per-job format).

Differential Revision: D99787988
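The parsing rules above (single-node partial allocations become a comma-separated index string, everything else becomes None) could look roughly like this. This is a hedged sketch, not the actual parsing.py code: the function name and behavior come from the summary, but the regex, range handling, and the GPUS_PER_NODE=8 default are illustrative assumptions based on the 8-GPU nodes described.

```python
import re
from typing import List, Optional

GPUS_PER_NODE = 8  # assumption: full node on this cluster has 8 GPUs

# Matches the index suffix in entries like "gpu:ampere:1(IDX:7)"
# or "gpu:ampere:4(IDX:0-3)".
_IDX_RE = re.compile(r"\(IDX:([0-9,\-]+)\)")


def parse_gres_gpu_indices(
    gres_detail: Optional[List[str]], gpus_per_node: int = GPUS_PER_NODE
) -> Optional[str]:
    """Parse SLURM gres_detail entries into a comma-separated index string.

    Returns e.g. "7" or "0,3,5" for a single-node partial-GPU job, and
    None for full-node, multi-node, empty, or unparseable input.
    """
    # Null/empty input, or more than one per-node entry (multi-node job).
    if not gres_detail or len(gres_detail) != 1:
        return None
    match = _IDX_RE.search(gres_detail[0])
    if not match:  # e.g. "N/A" or an unexpected format
        return None
    indices: List[int] = []
    for part in match.group(1).split(","):
        if "-" in part:  # range notation, e.g. "0-3"
            lo, hi = part.split("-")
            indices.extend(range(int(lo), int(hi) + 1))
        else:
            indices.append(int(part))
    if len(indices) >= gpus_per_node:  # full node: no filtering needed
        return None
    return ",".join(str(i) for i in sorted(indices))
```

The single-entry check is what gives the multi-node → None behavior described in the Scope section: gres_detail carries one string per node, so any list longer than one is left unfiltered.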
Force-pushed a744964 to 828670e.
@lushengt-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99787988.
Force-pushed 828670e to 7eafc8d.
Correct: GRES_DETAIL is a REST-API-only field and is already mapped via "gres_detail": "GRES_DETAIL" in REST_TO_SQUEUE_FIELD_MAP. The _map_job_fields method in SlurmRestClient populates it from the REST API response. For CLI-based collection, slurm_field is set to False, so GRES_DETAIL is excluded from JOB_DATA_SLURM_FIELDS and won't be injected into the squeue format spec. The field defaults to None when not populated.
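The CLI/REST gating described in that reply could be sketched as follows. Only the field names (GRES_DETAIL, JOB_DATA_SLURM_FIELDS, REST_TO_SQUEUE_FIELD_MAP, slurm_field) come from the thread; the JobField dataclass and the map_rest_job helper are illustrative assumptions standing in for the real squeue.py/SlurmRestClient definitions.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class JobField:
    name: str
    # False: REST-only field, excluded from the squeue --Format spec.
    slurm_field: bool = True


FIELDS: List[JobField] = [
    JobField("JOB_ID"),
    JobField("TRES_GPUS_ALLOCATED"),
    JobField("GRES_DETAIL", slurm_field=False),  # REST API only
]

# Only CLI-backed fields are injected into the squeue format spec.
JOB_DATA_SLURM_FIELDS: List[str] = [f.name for f in FIELDS if f.slurm_field]

REST_TO_SQUEUE_FIELD_MAP: Dict[str, str] = {"gres_detail": "GRES_DETAIL"}


def map_rest_job(rest_job: Dict[str, Any]) -> Dict[str, Any]:
    """Map a REST API job record onto squeue-style field names.

    Fields start as None, so GRES_DETAIL stays None on the CLI path,
    where this mapping never runs.
    """
    row: Dict[str, Any] = {f.name: None for f in FIELDS}
    for rest_key, squeue_name in REST_TO_SQUEUE_FIELD_MAP.items():
        if rest_key in rest_job:
            row[squeue_name] = rest_job[rest_key]
    return row
```

The point of the gate is that the same field list drives both collectors: the CLI path filters on slurm_field when building its format spec, while the REST path fills in whatever the map names, and everything else defaults to None.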
Addressed the issues reported by Claude.
Force-pushed 7eafc8d to 6b7e07f.
Force-pushed 6b7e07f to f2bafdd.
Force-pushed f2bafdd to da069ae.
Force-pushed da069ae to 82c9f41.