@smoreinis smoreinis commented Jan 9, 2026

Summary

  • Add comprehensive PostgreSQL connection pool and query metrics
  • Dual export: OpenTelemetry SDK (Grafana/Mimir) + StatsD (Datadog)
  • OTel metrics only exported when OTEL_EXPORTER_OTLP_ENDPOINT is configured
  • StatsD metrics always exported when DD_AGENT_HOST is available

Metrics Exported

Connection Pool (periodic, every 30s)

| Metric | Type | Description |
| --- | --- | --- |
| db.client.connection.count | Gauge | Connections by state (idle/used) |
| db.client.connection.max | Gauge | Max allowed connections |
| db.client.connection.overflow | Gauge | Current overflow connections |

Connection Events (on event)

| Metric | Type | Description |
| --- | --- | --- |
| db.client.connection.created | Counter | New connections created |
| db.client.connection.use_time | Histogram | Time connection was checked out |
| db.client.connection.invalidated | Counter | Connections invalidated |

Query Performance (on every query)

| Metric | Type | Description |
| --- | --- | --- |
| db.client.operation.duration | Histogram | Query execution time |
| db.client.operation.slow | Counter | Queries exceeding threshold |
| db.client.operation.errors | Counter | Query errors by type |
| db.client.response.returned_rows | Histogram | Rows returned by SELECT |
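
To illustrate how the slow-query counter relates to the duration histogram, here is a minimal sketch. The `QueryStats` class and its fields are hypothetical in-memory stand-ins, not code from this PR, and treating "exceeding" as strictly greater than the threshold is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class QueryStats:
    """Hypothetical in-memory stand-in for the duration histogram and slow counter."""

    slow_threshold_s: float = 0.5  # mirrors the POSTGRES_SLOW_QUERY_THRESHOLD default
    durations_s: list[float] = field(default_factory=list)
    slow_count: int = 0

    def record(self, duration_s: float) -> None:
        # Every query contributes a sample to the duration histogram.
        self.durations_s.append(duration_s)
        # Assumption: only durations strictly above the threshold count as slow.
        if duration_s > self.slow_threshold_s:
            self.slow_count += 1


stats = QueryStats()
for duration in (0.02, 0.5, 0.75):
    stats.record(duration)
```

With the default threshold, only the 0.75 s sample increments the slow counter, while all three samples land in the histogram.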

Health (periodic, every 30s)

| Metric | Type | Description |
| --- | --- | --- |
| db.client.connection.health | Gauge | Pool health (1=healthy, 0=unhealthy) |
| db.client.connection.health_check_failures | Counter | Health check failures |

Tags/Attributes

  • service / service.name: agentex
  • db_system / db.system.name: postgresql
  • pool / db.client.connection.pool.name: main, middleware, readonly
  • server / server.address: database hostname
  • db_name / db.namespace: database name
  • env / deployment.environment: staging, production
  • operation / db.operation.name: SELECT, INSERT, UPDATE, DELETE
  • table / db.collection.name: table name
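
The tag and attribute pairs above can be read as a simple lookup table. The sketch below is one possible way to keep both naming schemes in sync; the helper names `to_otel_attributes` and `to_statsd_tags` are hypothetical and not taken from this PR. The `key:value` tag format is the usual Datadog convention.

```python
# StatsD tag name -> OTel attribute name, copied from the list above.
TAG_TO_ATTRIBUTE = {
    "service": "service.name",
    "db_system": "db.system.name",
    "pool": "db.client.connection.pool.name",
    "server": "server.address",
    "db_name": "db.namespace",
    "env": "deployment.environment",
    "operation": "db.operation.name",
    "table": "db.collection.name",
}


def to_otel_attributes(tags: dict[str, str]) -> dict[str, str]:
    """Rename StatsD-style tag keys to their OTel attribute names."""
    return {TAG_TO_ATTRIBUTE[key]: value for key, value in tags.items()}


def to_statsd_tags(tags: dict[str, str]) -> list[str]:
    """Render tags as 'key:value' strings, sorted for stable output."""
    return [f"{key}:{value}" for key, value in sorted(tags.items())]


tags = {"pool": "main", "env": "staging"}
otel_attrs = to_otel_attributes(tags)
statsd_tags = to_statsd_tags(tags)
```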

Configuration

| Variable | Description | Default |
| --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | OTLP endpoint URL (enables OTel export) | None (disabled) |
| DD_AGENT_HOST | Datadog agent host (enables StatsD export) | localhost |
| POSTGRES_SLOW_QUERY_THRESHOLD | Slow query threshold in seconds | 0.5 |
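
A minimal sketch of reading these variables with their documented defaults follows. `MetricsConfig` and `load_metrics_config` are hypothetical names used only for illustration; how the PR itself parses the environment is not shown here.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class MetricsConfig:
    """Hypothetical container for the three settings in the table above."""

    otlp_endpoint: Optional[str]
    dd_agent_host: str
    slow_query_threshold_s: float

    @property
    def otel_enabled(self) -> bool:
        # OTel export is enabled only when an endpoint is configured.
        return self.otlp_endpoint is not None


def load_metrics_config(env: Mapping[str, str] = os.environ) -> MetricsConfig:
    return MetricsConfig(
        # Treat an empty string like an unset variable (an assumption).
        otlp_endpoint=env.get("OTEL_EXPORTER_OTLP_ENDPOINT") or None,
        dd_agent_host=env.get("DD_AGENT_HOST", "localhost"),
        slow_query_threshold_s=float(env.get("POSTGRES_SLOW_QUERY_THRESHOLD", "0.5")),
    )


defaults = load_metrics_config({})
with_otel = load_metrics_config({"OTEL_EXPORTER_OTLP_ENDPOINT": "http://collector:4317"})
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`. The collector URL in the example is a placeholder.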

Test plan

  • Unit tests pass
  • Verified OTel metrics in Grafana/Mimir (sgp-dev cluster)
  • Verified StatsD metrics flowing to Datadog agent
  • Tested all three pools: main, middleware, readonly

Add comprehensive PostgreSQL connection pool and query metrics using
OpenTelemetry SDK with OTLP export. Metrics are only collected when
OTEL_EXPORTER_OTLP_ENDPOINT is configured.

Metrics added:
- db.client.connection.count (by state: idle/used)
- db.client.connection.max, overflow.current, idle.max
- db.client.connection.created_total, invalidated_total
- db.client.connection.use_time (histogram)
- db.client.operation.duration (histogram)
- db.client.operation.slow_total (>500ms threshold)
- db.client.operation.errors_total
- db.client.response.returned_rows (histogram)
- db.client.connection.health (1=healthy, 0=unhealthy)
- db.client.replica.lag (for read replicas)

Configuration:
- OTEL_EXPORTER_OTLP_ENDPOINT: OTLP endpoint (required to enable)
- POSTGRES_SLOW_QUERY_THRESHOLD: Slow query threshold in seconds (default: 0.5)

- Add OTel Collector service to docker-compose.yml with Prometheus exporter
- Create otel/otel-collector-config.yaml with OTLP receiver and debug/prometheus exporters
- Configure health_check extension for collector health monitoring

- Fix pool.overflow() handling in db_metrics.py - use max(0, overflow) since
  SQLAlchemy returns negative values relative to max_overflow
- Remove overflow from used count calculation to avoid double-counting

- Add warning-level logging to database error handler to capture
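
The two overflow fixes above can be sketched as follows. This is not the code in db_metrics.py: `FakePool` and `pool_snapshot` are hypothetical, and the fake only mirrors the accessor names that SQLAlchemy's `QueuePool` exposes (`checkedin()`, `checkedout()`, `overflow()`), so the example runs without a database.

```python
class FakePool:
    """Hypothetical stand-in with the same accessor names as SQLAlchemy's QueuePool."""

    def __init__(self, checked_in: int, checked_out: int, overflow: int) -> None:
        self._checked_in = checked_in
        self._checked_out = checked_out
        self._overflow = overflow

    def checkedin(self) -> int:
        return self._checked_in

    def checkedout(self) -> int:
        return self._checked_out

    def overflow(self) -> int:
        return self._overflow


def pool_snapshot(pool) -> dict[str, int]:
    """Collect gauge values for one pool."""
    return {
        "idle": pool.checkedin(),
        # Used connections come from checkedout() alone; adding overflow
        # on top would count overflow connections twice.
        "used": pool.checkedout(),
        # overflow() can be negative, so clamp it before reporting.
        "overflow": max(0, pool.overflow()),
    }


quiet = pool_snapshot(FakePool(checked_in=2, checked_out=1, overflow=-2))
busy = pool_snapshot(FakePool(checked_in=0, checked_out=7, overflow=2))
```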
  actual exception messages, types, and truncated SQL statements
- Add service.name attribute to all PostgreSQL metrics for better
  filtering in Grafana dashboards
- Pass OTEL_SERVICE_NAME environment variable to metrics collector

This change helps diagnose database errors that were previously only
captured as metrics without detailed logging.

Aurora PostgreSQL doesn't support pg_last_xact_replay_timestamp()
as it uses a different replication mechanism. This was causing
continuous errors on the readonly pool.

Removed:
- db.client.replica.lag metric and histogram
- check_replica_lag() method
- is_replica parameter from metrics registration

Add dual export of PostgreSQL metrics to both OTel (Grafana/Mimir) and
Datadog (StatsD). This provides visibility in both observability platforms.

New StatsD metrics exported:
- db.client.connection.count (gauge, by state)
- db.client.connection.max (gauge)
- db.client.connection.overflow (gauge)
- db.client.connection.created (counter)
- db.client.connection.use_time (histogram, ms)
- db.client.connection.invalidated (counter)
- db.client.operation.duration (histogram, ms)
- db.client.operation.slow (counter)
- db.client.operation.errors (counter)
- db.client.response.returned_rows (histogram)
- db.client.connection.health (gauge)
- db.client.connection.health_check_failures (counter)

Tags: service, db_system, pool, server, db_name, env, operation, table
@smoreinis smoreinis marked this pull request as ready for review January 12, 2026 18:14
@smoreinis smoreinis requested a review from a team as a code owner January 12, 2026 18:14
environment = self.environment_variables.ENVIRONMENT
service_name = os.environ.get("OTEL_SERVICE_NAME", "agentex")

if self.database_async_read_write_engine:
Collaborator

why do we need each of these engines?

logger = make_logger(__name__)


def configure_statsd():
Collaborator

Wait, sorry, why are we removing Datadog here? Don't we want the option to have it?

Collaborator Author

yes, whoops. this was not supposed to be getting removed

@RoxyFarhad RoxyFarhad left a comment

approving if you just add the DD metrics

Fixes AttributeError when OTel is disabled but StatsD metrics are enabled.
The base_attributes dict is now set before the early return, matching the
pattern used by PostgresQueryMetrics and PostgresHealthMetrics.
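
The ordering issue described in this fix can be sketched as below. `PoolMetricsSketch` is a hypothetical class written for illustration, not the class changed in this PR; it only shows why shared attributes need to be assigned before the OTel-disabled early return.

```python
from typing import Optional


class PoolMetricsSketch:
    """Hypothetical collector illustrating the ordering fix."""

    def __init__(self, pool_name: str, otel_meter: Optional[object] = None) -> None:
        # Assign shared attributes first so the StatsD path can always read them.
        self.base_attributes = {"pool": pool_name, "db_system": "postgresql"}
        self.otel_enabled = otel_meter is not None
        if not self.otel_enabled:
            # Early return skips only the OTel instrument setup.
            return
        self.meter = otel_meter

    def statsd_tags(self) -> list[str]:
        # If base_attributes were assigned after the early return, this
        # would raise AttributeError whenever OTel is disabled.
        return [f"{key}:{value}" for key, value in sorted(self.base_attributes.items())]


statsd_only = PoolMetricsSketch("main")
tags_without_otel = statsd_only.statsd_tags()
```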