fix: improve CI resilience for gradle wrapper download and DataNode by MisterRaindrop · Pull Request #88 · apache/cloudberry-pxf

MisterRaindrop · 2026-03-26T14:30:33Z

fix ci build

ostinru · 2026-03-27T07:26:59Z

ci/singlecluster/Dockerfile

-ENV     HBASE_URL="https://$APACHE_MIRROR/hbase/$HBASE_VERSION/hbase-$HBASE_VERSION-bin.tar.gz"
-ENV       TEZ_URL="https://$APACHE_MIRROR/tez/$TEZ_VERSION/apache-tez-$TEZ_VERSION-bin.tar.gz"
+# Mirror list: try fast mirrors first, fall back to official archive
+ENV APACHE_MIRRORS="repo.huaweicloud.com/apache archive.apache.org/dist"
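
One way to consume a space-separated mirror list such as `APACHE_MIRRORS` is a loop that stops at the first mirror that answers. The sketch below is an assumption about how that could look, not the code in this PR: `download` and `fetch_from_mirrors` are hypothetical helper names, and the curl flags are just one reasonable choice.

```shell
#!/usr/bin/env bash
# Mirror list in the same format as the Dockerfile ENV: fast mirror first, archive last.
APACHE_MIRRORS="${APACHE_MIRRORS:-repo.huaweicloud.com/apache archive.apache.org/dist}"

# Hypothetical wrapper around curl: download <url> <output-file>.
# -f makes curl return non-zero on HTTP errors so the caller can fall through.
download() {
    curl -fsSL --retry 2 -o "$2" "$1"
}

# fetch_from_mirrors <path-relative-to-mirror-root> <output-file>
# Tries each mirror in order and returns 0 on the first successful download.
fetch_from_mirrors() {
    local rel_path="$1" out="$2" mirror url
    for mirror in $APACHE_MIRRORS; do   # intentionally unquoted: split on spaces
        url="https://${mirror}/${rel_path}"
        echo "trying ${url}" >&2
        if download "$url" "$out"; then
            return 0
        fi
        rm -f "$out"                    # drop any partial file before the next attempt
    done
    echo "all mirrors failed for ${rel_path}" >&2
    return 1
}
```

A call would look like `fetch_from_mirrors "hbase/$HBASE_VERSION/hbase-$HBASE_VERSION-bin.tar.gz" /tmp/hbase.tar.gz`, reusing the path layout from the removed `HBASE_URL` line.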


I think we can cache the singlecluster image in the GitHub Actions cache for 7 days, so we would hit this issue much less often.

- Set TZ=UTC and -Duser.timezone=UTC for PXF JVM to ensure consistent Parquet INT96 timestamp conversion (ZoneId.systemDefault() in ParquetTypeConverter.java returns OS timezone which differs on Rocky 9)
- Pre-cleanup stale Hadoop processes before start-gphd.sh to prevent DataNode BindException on port 50020
- Improve wait_for_hbase() with port 16020 check and 5s stabilization wait instead of simple pgrep (RegionServer can crash after startup)
- Add retry logic to HBase RegionServer check in health_check()
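
The timezone item above could be applied in a CI entrypoint roughly as follows. This is a sketch under assumptions: `PXF_JVM_OPTS` is assumed to be the variable that carries JVM flags for PXF, and `ensure_utc_jvm_opt` is a hypothetical helper name, not something taken from this PR.

```shell
#!/usr/bin/env bash
# Pin the OS-level timezone for every child process started from this shell.
export TZ=UTC

# Append -Duser.timezone=UTC to PXF_JVM_OPTS unless a user.timezone flag is already present.
ensure_utc_jvm_opt() {
    case " ${PXF_JVM_OPTS:-} " in
        *" -Duser.timezone="*)
            ;;  # a timezone flag is already set; leave it untouched
        *)
            PXF_JVM_OPTS="${PXF_JVM_OPTS:+$PXF_JVM_OPTS }-Duser.timezone=UTC"
            ;;
    esac
    export PXF_JVM_OPTS
}
```

Setting both values is deliberate: `TZ` covers native tools and any JVM that reads the OS default, while the `-D` flag fixes what `ZoneId.systemDefault()` returns even if the environment variable is lost along the way.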

The /dev/tcp/localhost/60020 check failed in Docker containers because HBase RegionServer binds to the container IP, not localhost. Revert to simple pgrep + 10s stabilization sleep. Make HBase startup non-fatal so test groups that don't need HBase can still run. Also simplify DataNode pre-cleanup: only kill if stale processes exist.
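
The "pgrep plus stabilization sleep, non-fatal" approach described above might be sketched like this. The function bodies are illustrative assumptions, not the PR's actual script: `is_regionserver_running` is a hypothetical probe, and matching on `HRegionServer` assumes the RegionServer main class name appears in the process command line.

```shell
#!/usr/bin/env bash
# Hypothetical probe: true if a RegionServer JVM is visible in the process table.
is_regionserver_running() {
    pgrep -f HRegionServer >/dev/null 2>&1
}

# wait_for_hbase [timeout-seconds] [settle-seconds]
# Waits for the process to appear, sleeps for the settle period, then checks
# again, because a RegionServer can crash shortly after startup.
wait_for_hbase() {
    local timeout="${1:-60}" settle="${2:-10}" waited=0
    until is_regionserver_running; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "WARN: HBase RegionServer did not start within ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    sleep "$settle"
    if ! is_regionserver_running; then
        echo "WARN: HBase RegionServer exited during the ${settle}s stabilization wait" >&2
        return 1
    fi
}
```

To keep startup non-fatal, a caller can write `wait_for_hbase 60 10 || echo "WARN: continuing without HBase" >&2`, which lets test groups that do not need HBase proceed.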

…e restart

- Use fuser -k to force-release DataNode ports (50010/50020/50075/50080) before start-gphd.sh, preventing BindException on CI runners
- Fix wait_for_datanode() restart: replace ss|grep pipeline (crashed by set -euo pipefail when grep found no match) with fuser -k
- Remove duplicate DataNode start call in restart path
- Make HBase/DataNode health checks non-fatal (warn instead of die) so test groups that don't need HBase are not blocked
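
The `set -euo pipefail` failure mentioned above comes from grep exiting with status 1 when nothing matches, which pipefail then reports as the status of the whole pipeline. The sketch below shows one way to absorb that status, plus an assumed shape for the fuser-based port release. Both `count_listeners` and `release_ports` are hypothetical names, not functions from this PR.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads `ss`-style lines on stdin and prints how many mention the given port.
# grep -c prints 0 and exits 1 when there is no match; `|| true` keeps that
# exit status from aborting the script under errexit + pipefail.
count_listeners() {
    local port="$1"
    grep -c ":${port} " || true
}

# Kill whatever holds each TCP port. fuser exits non-zero when no process is
# using the port, which is the normal case on a clean runner, so it is ignored.
release_ports() {
    local port
    for port in "$@"; do
        fuser -k "${port}/tcp" >/dev/null 2>&1 || true
    done
}
```

With these in place, `release_ports 50010 50020 50075 50080` is safe to call whether or not a stale DataNode exists.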

fix ci

e5bf595

MisterRaindrop force-pushed the fix_build_ci branch from 4e4abda to e5bf595 on March 27, 2026 01:37

MisterRaindrop changed the title ~~fix: improve CI resilience for gradle wrapper download and DataNode s…~~ fix: improve CI resilience for gradle wrapper download and DataNode Mar 27, 2026

MisterRaindrop added 2 commits March 27, 2026 09:47

fix

8acabec

fix

01d5b26

ostinru reviewed Mar 27, 2026

MisterRaindrop added 10 commits March 27, 2026 15:28

fix

52e04d2

fix

d53ca88

fix: use correct HBase RegionServer port 60020 in wait_for_hbase

eb69809

The singlecluster configures hbase.regionserver.port=6002<nodeid>, so node 0 listens on port 60020, not the HBase default 16020. Also increase the port wait timeout from 30s to 60s.
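
The `6002<nodeid>` convention and a timed port wait can be sketched as below. All three function names are hypothetical, the probe relies on bash's `/dev/tcp` pseudo-device, and the host is a parameter because a service may bind to an address other than localhost.

```shell
#!/usr/bin/env bash
# Derive the RegionServer port from the node id: 6002<nodeid>, so node 0 -> 60020.
regionserver_port() {
    local node_id="$1"
    printf '6002%s\n' "$node_id"
}

# Hypothetical probe: succeeds if a TCP connection to host:port can be opened.
# The subshell ensures the file descriptor is closed again immediately.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# wait_for_port <host> <port> [timeout-seconds]
# Polls once per second and returns 1 if the port never opens in time.
wait_for_port() {
    local host="$1" port="$2" timeout="${3:-60}" waited=0
    until port_open "$host" "$port"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

For node 0 with the 60s timeout from the commit message, a call would be `wait_for_port "$(hostname -i)" "$(regionserver_port 0)" 60`, where using `hostname -i` for the host is itself an assumption about the container setup.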

feat: add TestNG retry analyzer for transient CI test failures

60c28e3

Add RetryAnalyzer (1 retry) + RetryListener (IAnnotationTransformer) to automatically retry failed tests once. Handles transient failures like HDFS multi-block write timeouts on resource-constrained CI runners.

fix: use TestNG 6.x API getRetryAnalyzer() instead of 7.x getRetryAnalyzerClass()

9d19168

fix: remove type assignment - TestNG 6.x getRetryAnalyzer() returns IRetryAnalyzer not Class

3ec0a0b

fix: register RetryListener via surefire config instead of @Listeners

fc74036

IAnnotationTransformer cannot be registered via the @Listeners annotation (TestNG limitation - it must be applied before annotations are read). Move registration to the maven-surefire-plugin <listener> property.

MisterRaindrop wants to merge 13 commits into apache:main from MisterRaindrop:fix_build_ci
