Add PAPI metrics collection #248

Open
Dr. Michele Guidolin (mo-mguidolin) wants to merge 63 commits into MetOffice:main from mo-mguidolin:201-papi

Conversation

@mo-mguidolin Dr. Michele Guidolin (mo-mguidolin) commented Apr 24, 2026


Description

These changes add the PAPI (Performance Application Programming Interface) library as an optional feature, allowing Vernier to collect hardware performance metrics (e.g. PAPI_FP_OPS, PAPI_TOT_INS, PAPI_TOT_CYC) alongside existing wall-time measurements.

Summary

Extra:

  • The maximum number of PAPI metrics that can be collected is set at compile time.
  • It is currently set to 5, which is typically how many hardware counters can be read at once.
  • Only total cumulative metrics are stored. Self metrics (excluding child) are not computed.
  • Added a new env variable VERNIER_PAPI_EVENTS that takes a comma-separated list of events to collect.
  • PAPI metrics collection works with OpenMP threads.
  • Metrics are also collected when a Vernier region starts and stops on a single thread but spawns a parallel region, without callipers, inside it.
  • The new code has not been tested (and probably will not work) with other threading systems.
  • OMP nested parallelism has not been tested and it might not work.
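
As an illustration of the VERNIER_PAPI_EVENTS handling described above, here is a minimal sketch of splitting a comma-separated event list while enforcing a compile-time cap. The names `kMaxPapiEvents` and `parse_event_list` are hypothetical, and throwing when the cap is exceeded is an assumption; the PR's actual parsing code may behave differently.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical compile-time cap, mirroring the limit of 5 described above.
constexpr std::size_t kMaxPapiEvents = 5;

// Split a comma-separated list such as "PAPI_TOT_INS,PAPI_TOT_CYC" into
// event names. Surrounding blanks are trimmed and empty fields are skipped.
// Assumption: asking for more events than the cap is treated as an error.
std::vector<std::string> parse_event_list(const std::string &raw) {
  std::vector<std::string> names;
  std::istringstream stream(raw);
  std::string field;
  while (std::getline(stream, field, ',')) {
    const std::size_t first = field.find_first_not_of(" \t");
    if (first == std::string::npos) continue;  // blank or empty field
    const std::size_t last = field.find_last_not_of(" \t");
    names.push_back(field.substr(first, last - first + 1));
  }
  if (names.size() > kMaxPapiEvents)
    throw std::length_error("too many PAPI events requested");
  return names;
}
```

In a real program the argument would come from something like `std::getenv("VERNIER_PAPI_EVENTS")`, with a null result treated as an empty list.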

Build:

  • Added ENABLE_PAPI (default OFF).
  • Added PAPI_DEBUG option for verbose PAPI call logging.
  • Added FindPAPI.cmake module to locate PAPI headers and library.

Output:

  • The threads-format output was modified to append PAPI metric columns when events are active.
  • The columns of the threads-format were realigned.
  • The DRHOOK output was not modified, thus it will not print any PAPI metrics.

Source:

  • Added new vernier_papi.h and vernier_papi.cpp.
  • The PAPI events are read from the VERNIER_PAPI_EVENTS env variable and loaded into an events_code vector.
  • Most of the code added to vernier.cpp is guarded by an events_code.empty() check.
  • The RegionRecords object has a new metrics_array_t element to store the total PAPI metrics.
  • The TraceBackEntry object has a new metrics_vector_t element (vector of metrics_array_t).
  • The metrics_vector_t is used to store the metrics if a parallel region is spawned inside the Vernier region.
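
The relationship between the two container types above can be sketched as follows. This is only an illustration: the element type `long long` is assumed because PAPI reports counter values as `long long`, and `kMaxPapiEvents` and `sum_metrics` are hypothetical names. Whether Vernier reduces per-thread values in exactly this way is also an assumption.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical name for the compile-time cap on collected events.
constexpr std::size_t kMaxPapiEvents = 5;

// One fixed-size slot per event; slots for unused events stay at zero.
using metrics_array_t = std::array<long long, kMaxPapiEvents>;
// One metrics_array_t per thread of a parallel region spawned inside a
// Vernier region.
using metrics_vector_t = std::vector<metrics_array_t>;

// Add every per-thread array into a single cumulative total.
metrics_array_t sum_metrics(const metrics_vector_t &per_thread) {
  metrics_array_t total{};  // value-initialised, so every slot starts at 0
  for (const metrics_array_t &thread_metrics : per_thread) {
    for (std::size_t i = 0; i < total.size(); ++i) {
      total[i] += thread_metrics[i];
    }
  }
  return total;
}
```

A fixed-size array avoids a heap allocation per region, while the outer vector can grow to match the number of threads actually spawned.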

Performance:

  • The metrics_vector_t data is not copied into TraceBackEntry but moved using std::move.
  • The new code can be used with and without PAPI being available on the system.
  • If PAPI is not available:
    • events_code.empty() is hardcoded to true.
    • The other objects and functions are empty.
    • Thus, the compiler should be able to remove the unused code.
    • The impact of these changes when PAPI is not compiled in should be minimal.
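
The "hardcoded to true" idea can be sketched like this. The macro name `ENABLE_PAPI` is borrowed from the CMake option and may not match the preprocessor symbol the PR actually defines; `EventCodes` and `counters_to_read` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <vector>

#ifdef ENABLE_PAPI
// Real container: holds the PAPI event codes chosen at run time.
struct EventCodes {
  std::vector<int> codes;
  bool empty() const { return codes.empty(); }
  std::size_t size() const { return codes.size(); }
};
#else
// Stub for builds without PAPI: empty() is a compile-time constant, so an
// optimising compiler can discard any branch guarded by it.
struct EventCodes {
  static constexpr bool empty() { return true; }
  static constexpr std::size_t size() { return 0; }
};
#endif

// Returns how many counters would be read when a region stops.
std::size_t counters_to_read(const EventCodes &events_code) {
  if (events_code.empty()) {
    return 0;  // the only reachable path in the stub build
  }
  return events_code.size();
}
```

Because both variants expose the same interface, calling code needs no preprocessor conditionals of its own.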

Tests:

  • Added test_papi_fp_ops.cpp for single-event PAPI_FP_OPS unit test.
  • Added test_papi_cyc_ins.cpp for two-event PAPI_TOT_CYC + PAPI_TOT_INS unit test.
  • Added test_papi_tot_ins.cpp for multi-thread, multi-call PAPI_TOT_INS unit test.
  • Added test-papi-omp-runs.cpp as a system test for verifying three OpenMP/profiling patterns (region inside parallel, region wrapping parallel, nested regions).
  • Modified test_proftests.cpp by adding GTEST_FLAG(death_test_style, "threadsafe") to three death tests (required when PAPI spawns threads). Without this, the death tests were hanging on the HPC.

Extra:

  • The tests and some of the new code have been produced with the assistance of Met Office GitHub Copilot Enterprise.

Test Results:

  • Tests were performed on HPC (without pFUnit) and on VDI.
  • On HPC
    • 8 configurations were used.
    • 2 compilers: GNU and CCE
    • With and without PAPI
    • Debug and Release.
    • The directory is exc:/data/users/michele.guidolin/Vernier/cmake-build-full-tests/./20260430095713
    ./20260430095713/NOPAPI-CCE-DEBUG/ctest.o8348755:100% tests passed, 0 tests failed out of 21
    ./20260430095713/NOPAPI-GNU-DEBUG/ctest.o8348933:100% tests passed, 0 tests failed out of 23
    ./20260430095713/NOPAPI-CCE-RELEASE/ctest.o8349969:100% tests passed, 0 tests failed out of 21
    ./20260430095713/PAPI-CCE-DEBUG/ctest.o8349281:100% tests passed, 0 tests failed out of 25
    ./20260430095713/NOPAPI-GNU-RELEASE/ctest.o8350268:100% tests passed, 0 tests failed out of 23
    ./20260430095713/PAPI-GNU-DEBUG/ctest.o8349640:100% tests passed, 0 tests failed out of 27
    ./20260430095713/PAPI-CCE-RELEASE/ctest.o8350531:100% tests passed, 0 tests failed out of 25
    ./20260430095713/PAPI-GNU-RELEASE/ctest.o8350819:100% tests passed, 0 tests failed out of 27
    
  • On VDI
    • 4 configurations were used
    • GNU compiler
    • With and without PAPI
      • PAPI reports no available events, so the PAPI tests are skipped.
    • Debug and Release
    • The directory is VDI:/data/users/michele.guidolin/VSCODE/Vernier/cmake-build-full-tests/20260430101736
    ./20260430101736/NOPAPI-GNU-DEBUG/output.txt:100% tests passed, 0 tests failed out of 23
    ./20260430101736/PAPI-GNU-RELEASE/output.txt:100% tests passed, 0 tests failed out of 27
    ./20260430101736/PAPI-GNU-DEBUG/output.txt:100% tests passed, 0 tests failed out of 27
    ./20260430101736/NOPAPI-GNU-RELEASE/output.txt:100% tests passed, 0 tests failed out of 23
    

Linked issues

Closes #201

Type of change

  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How has this been tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

  • New tests have been added
  • Tests have been modified to accommodate this change

Checklist:

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes, for both debug and optimised builds

…eadlock. Added the use of threadsafe for these tests
@mo-mguidolin Dr. Michele Guidolin (mo-mguidolin) marked this pull request as ready for review April 30, 2026 12:14
@mo-mguidolin

Author

A new overhead analysis (I deleted the previous one).

The first test measures deep call-stack overhead:

void do_simple_work(volatile int *j) {
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) x += i;
    if(x>10000000) (*j)+=x;
}

void deep_call(int depth, int *j) {
    size_t hash = vernier.start("deep_base"+std::to_string(depth));
    do_simple_work(j);
    if (depth > 0)
      deep_call(depth - 1,j);
    vernier.stop(hash);
}

The app is built in three configurations:

  • cmake-build-main: built with the main version of Vernier
  • cmake-build-nopapi: built with the PR version of Vernier, without PAPI
  • cmake-build-papi: built with the PR version of Vernier, with PAPI

Each build is run twice on a Genoa node: once without PAPI metrics and once with VERNIER_PAPI_EVENTS set. Only the PAPI-enabled build collects metrics on the second run.

export VERNIER_OUTPUT_FORMAT=threads
export VERNIER_OUTPUT_MODE=single

export VERNIER_OUTPUT_FILENAME="vernier-output-no-metrics-genoa"
export OMP_NUM_THREADS=1
mpiexec --cpu-bind=depth -d $OMP_NUM_THREADS -n 192 ./src/deep_callstack_overhead 1000000 10

export VERNIER_PAPI_EVENTS=PAPI_TOT_INS,PAPI_TOT_CYC,ANY_DATA_CACHE_FILLS_FROM_SYSTEM:LCL_L2,ANY_DATA_CACHE_FILLS_FROM_SYSTEM:LOCAL_CCX,ANY_DATA_CACHE_FILLS_FROM_SYSTEM:DRAM_IO_NEAR
export VERNIER_OUTPUT_FILENAME="vernier-output-5-metrics-genoa"
export OMP_NUM_THREADS=1
mpiexec --cpu-bind=depth -d $OMP_NUM_THREADS -n 192 ./src/deep_callstack_overhead 1000000 10

The results

Main Vernier:

michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-main/vernier-output-no-metrics-genoa-deep_callstack_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0          1.1119     1.1215     1.1314     1.1324     1.2238    192
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task      1.0161     1.0256     1.0344     1.0349     1.1269    192
michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-main/vernier-output-5-metrics-genoa-deep_callstack_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0          1.1179     1.1214     1.1295     1.1308     1.2096    192
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task      1.0221     1.0254     1.0325     1.0342     1.1132    192

PR without PAPI

michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-nopapi/vernier-output-no-metrics-genoa-deep_callstack_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0           1.084     1.0933     1.1011     1.1033     1.1786    192
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task     0.99119     1.0002     1.0072     1.0094     1.0866    192
michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-nopapi/vernier-output-5-metrics-genoa-deep_callstack_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0          1.0852     1.0935     1.1017      1.104     1.1899    192
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task     0.99243     1.0006     1.0081       1.01     1.0921    192

PR with PAPI

michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-papi/vernier-output-no-metrics-genoa-deep_callstack_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0           1.108      1.114     1.1234     1.1241     1.1888    192
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task      1.0122     1.0173      1.026     1.0262     1.0874    192
michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-papi/vernier-output-5-metrics-genoa-deep_callstack_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0          14.346     14.631     14.941     15.152     16.874    192
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task      12.979      13.24     13.529     13.716     15.309    192

@mo-mguidolin

Author

The second test uses OpenMP:

void do_simple_work(volatile int *j) {
    volatile int x = 0;
    for (int i = 0; i < 1000; ++i) x += i;
    if(x>10000000) (*j)+=x;
}
...
    #pragma omp parallel for
    for (int i = 0; i < iterations; ++i) {
        size_t hash = vernier.start("simple_work");
        do_simple_work(&j);
        vernier.stop(hash);
    }
...

This app is also run on a Genoa node, but with 4 threads and 48 MPI tasks.

The results

Main Vernier:

michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-main/vernier-output-no-metrics-genoa-openmp_loop_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0         0.23098    0.23207    0.23377    0.23406    0.24662     48
__vernier__@1         0.23049     0.2321    0.23368    0.23419    0.24451     48
__vernier__@2         0.23029    0.23235    0.23399    0.23388     0.2597     48
__vernier__@3         0.23083    0.23204    0.23446    0.23389    0.28066     48
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task           0          0          0          0          0     48
michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-main/vernier-output-5-metrics-genoa-openmp_loop_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0         0.23074    0.23136    0.23349    0.23365      0.253     48
__vernier__@1         0.23124    0.23225     0.2345    0.23449     0.2594     48
__vernier__@2         0.23019    0.23205    0.23351    0.23391    0.25689     48
__vernier__@3         0.23106     0.2319    0.23525    0.23368    0.27955     48
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task           0          0          0          0          0     48

PR without PAPI

michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-nopapi/vernier-output-no-metrics-genoa-openmp_loop_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0         0.22745    0.22849    0.23001    0.23116    0.23527     48
__vernier__@1         0.22769    0.22896    0.23051    0.23118    0.24057     48
__vernier__@2         0.22789    0.22904    0.23034    0.23125    0.23444     48
__vernier__@3         0.22758    0.22933    0.23095    0.23143    0.24011     48
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task           0          0          0          0          0     48
michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-nopapi/vernier-output-5-metrics-genoa-openmp_loop_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0         0.22753    0.22819    0.22989    0.23125    0.23586     48
__vernier__@1         0.22781    0.22933    0.23137    0.23193    0.25956     48
__vernier__@2         0.22856    0.22976    0.23157    0.23195     0.2584     48
__vernier__@3         0.22859     0.2299    0.23215    0.23214    0.26184     48
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task           0          0          0          0          0     48

PR with PAPI

michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-papi/vernier-output-no-metrics-genoa-openmp_loop_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0         0.23356    0.23458    0.23655    0.23693    0.24812     48
__vernier__@1         0.23927    0.24209    0.24293    0.24401     0.2457     48
__vernier__@2         0.23635    0.23827    0.23974    0.24063    0.24518     48
__vernier__@3         0.23727    0.23835    0.23955     0.2405    0.24349     48
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task           0          0          0          0          0     48
michele.guidolin@exc-login06:/data/users/michele.guidolin/OverheadVernier $ ~/bin/vernier_stats.sh cmake-build-papi/vernier-output-5-metrics-genoa-openmp_loop_overhead-collated 
Region                Min (s)   Q1 (25%)    Avg (s)   Q3 (75%)    Max (s)  Count
------------------ ---------- ---------- ---------- ---------- ---------- ------
__vernier__@0          2.2411     2.2704     2.3072     2.3132     2.5526     48
__vernier__@1          2.2358     2.2519     2.2823     2.2845      2.587     48
__vernier__@2           2.235     2.2569     2.2916     2.3048     2.5117     48
__vernier__@3          2.2351     2.2544     2.2914     2.3082     2.5631     48
------------------ ---------- ---------- ---------- ---------- ---------- ------
Overhead per task           0          0          0          0          0     48
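
To put the "PR with PAPI" numbers in proportion, the short sketch below divides the average `__vernier__@0` time with five metrics by the same build's time without metrics. The constants are copied from the tables above; the helper `slowdown` and the constant names are introduced here for illustration only. With these values the deep call-stack test comes out at roughly 13x and the OpenMP loop test at just under 10x.

```cpp
#include <cassert>

// Ratio of a run with metrics enabled to the same build without them.
double slowdown(double with_metrics_s, double without_metrics_s) {
  assert(without_metrics_s > 0.0);  // guard against division by zero
  return with_metrics_s / without_metrics_s;
}

// Average __vernier__@0 times in seconds, taken from the cmake-build-papi
// tables for the deep call-stack test and the OpenMP loop test.
constexpr double kDeepNoMetrics = 1.1234;
constexpr double kDeepFiveMetrics = 14.941;
constexpr double kOmpNoMetrics = 0.23655;
constexpr double kOmpFiveMetrics = 2.3072;
```

Calling `slowdown(kDeepFiveMetrics, kDeepNoMetrics)` and `slowdown(kOmpFiveMetrics, kOmpNoMetrics)` gives the two ratios. Both benchmarks do very little work per region, so these ratios reflect the cost of starting and stopping regions rather than the behaviour of a typical application.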

Labels

cla-signed The CLA has been signed as part of this PR - added by GA

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add PAPI counter calls

1 participant