Skip to content

Add new extension yagp hooks collector#1629

Open
leborchuk wants to merge 143 commits intoapache:mainfrom
open-gpdb:mergeYagpHooks
Open

Add new extension yagp hooks collector#1629
leborchuk wants to merge 143 commits intoapache:mainfrom
open-gpdb:mergeYagpHooks

Conversation

@leborchuk
Copy link
Copy Markdown
Contributor

@leborchuk leborchuk commented Mar 18, 2026

A diagnostic and monitoring extension for Cloudberry clusters

Briefly, the interaction scheme is:

Cloudberry (Run Executor hooks) -> hooks collector extension (Create protobuf messages and send them using UDS) -> local yagpcc agent

This is the part that sends data from the PG instance to an external agent.

External agent receives telemetry, store, aggregate and expose to the consumers. The agent source code is https://github.com/open-gpdb/yagpcc We're going to propose add it to Apache Cloudberry infrastructure later, after fixing all the tests.

This is quite similar to the #1085 idea. However, it has a different implementation. It is completely separate from database application and written in Go. Could be separately maintained. No influence to the main database, and so different development requirements.

Extension has pg_regress tests, they were copied from pg_stat_statements (PG18) and adjusted to the extension behaviour.

Original extension authors are Smyatkin-Maxim and NJrslv I left their commits as-is to have blame history commited source.

Just create the overal project structure with a Makefile to
generate protobufs, compile it into a shared library extension
and install it
as a fundation to build on
- borrow GlowByte code to generate plan text and SessionInfo
- borrow code from our in-house pg_stat_statements to generate
  query id and plan id
- refactor code to follow common name conventions and identations
- do some minor refactoring to follow common naming convention
- add additional message right after ExecutorStart hook
1) Query instrumentation
2) /proc/self/* stats
It allows finer granularity than executor hooks. Also removed some code
duplication and data duplicaton
1. Initialize query instrumentation to NULL so that it can be properly
checked later (temporary solution, need to find a proper fix)
2. Don't collect spillinfo on query end. Reason: a) it will always be
zero and b) it could crash if we failed to enlarge a spillfile. Seems
like we need some cummulative statistics for spillinfo. Need to check
what explain analyze use.
1. Sync with protobuf changes to collect segment info
2. Remove noisy logging
3. Fix some missing node types in pg_stat_statements
Reason: when query info hook is called with status 'DONE' planstate
is already deallocated by ExecutorEnd
1) Give higher gRPC timeouts to query dispatcher as losing messages
there is more critical
2) If we've failed to send a message via gRPC we notify a background
thread about it and refuse sending any new message until this thread
re-establishes the lost connection
Don't collect system queries with empty query text and ccnt == 0
Rethrowing them might break other extensions and even query execution
pipeline itself
@tuhaihe
Copy link
Copy Markdown
Member

tuhaihe commented Mar 30, 2026

Hi, I did a test on a local machine. After building the extension successfully, it showed that some files are deleted when running make distclean:

[gpadmin@cdw cloudberry]$ git status
On branch mergeYagpHooks
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)

        deleted:    gpcontrib/gp_stats_collector/results/gpsc_cursors.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_dist.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_guc_cache.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_locale.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_select.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_uds.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_utf8_trim.out
        deleted:    gpcontrib/gp_stats_collector/results/gpsc_utility.out

Do we need to remove or keep them in the source? FYI.

NJrslv added 5 commits March 31, 2026 08:34
Build gp_stats_collector by default and disable it.

Add missing if in verify_query() to make sure the
extension does not execute main code while being
disabled. Plus, always verify protobuf version
once shared library is preloaded - no matter if
extension is disabled/enabled via GUCs.

Change CI tests accordingly.
Copy link
Copy Markdown
Contributor

@my-ship-it my-ship-it left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Great work on this extension — I can see a lot of effort went into it, and the earlier
review feedback has been well addressed.

Before merging, it might be nice to squash or tidy up the commit history a bit. The current log includes quite a few incremental fix-up commits from the review process.

Squashing these into a small number of logical commits — or even a single clean commit — would keep git log tidy and consistent with the project's commit conventions. Just a suggestion!

Other than these, everything looks good from my side. Thanks again for the great contribution!

@NJrslv
Copy link
Copy Markdown
Contributor

NJrslv commented Apr 1, 2026

@tuhaihe
Dianjin, Hi!

I restored the config file here (the one I was editing manually): f26c287. We'll regenerate it cleanly from configure.ac in the next PRs, I guess.

I think we're now good to merge.

The only question is: squash or merge? This PR has quite a history of commits, so it might be useful to preserve it. But throughout our work on this extension, we've been making fix-up commits, and the history should be cleaned up - grouped before the merge.

I don't think preserving the commit history is an extremely important option. The cleanest and easiest approach would be to squash all the commits into one and merge. What do you think?

*Also, give me some time to write a complete commit message, if we squash it.

@tuhaihe
Copy link
Copy Markdown
Member

tuhaihe commented Apr 1, 2026

@tuhaihe Dianjin, Hi!

I restored the config file here (the one I was editing manually): f26c287. We'll regenerate it cleanly from configure.ac in the next PRs, I guess.

Got it, thanks. Yes, we can regenerate it later, and welcome to create a new PR to update it then. BTW, my PR on generating the new configure file is still in review: #1651.

The only question is: squash or merge? This PR has quite a history of commits, so it might be useful to preserve it. But throughout our work on this extension, we've been making fix-up commits, and the history should be cleaned up - grouped before the merge.

I don't think preserving the commit history is an extremely important option. The cleanest and easiest approach would be to squash all the commits into one and merge. What do you think?

I prefer to keep the history commits, which will be important for the potential contributors to learn more about the context. Here just need some time to tidy them up ( my common way is squashing some simple and related continuous commits into one and rewording the commit title/commit message body ) and make the commits look more logical, as suggested by @my-ship-it. But it looks good to me for now. FYI. @NJrslv

@NJrslv
Copy link
Copy Markdown
Contributor

NJrslv commented Apr 1, 2026

I reset this commit (f26c287) in the force push above because autoconf is not run during the configure-cloudberry.sh script. As a result, the extension was not built, so the tests failed.

The ideal case would be to make the change in configure.ac, regenerate configure, and push both files. However, since we agreed to regenerate configure later, the only thing that matters for now is configure.ac.

I left configure file as it was to ensure the tests pass.

Now I'm tidying up the commit history.

*I will approve the review once I am ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants