Skip to content

NIFI-15918 Stabilize flaky system tests around processor lifecycle#11223

Open
markap14 wants to merge 1 commit intoapache:mainfrom
markap14:NIFI-15918
Open

NIFI-15918 Stabilize flaky system tests around processor lifecycle#11223
markap14 wants to merge 1 commit intoapache:mainfrom
markap14:NIFI-15918

Conversation

@markap14
Copy link
Copy Markdown
Contributor

@markap14 markap14 commented May 8, 2026

Framework:

  • LifecycleState.getFutures returns a defensive copy.
  • StandardProcessorNode stop runnable releases its future and decrements active threads if an exception escapes.
  • StandardProcessorNode logs INFO breadcrumbs on silent start-reject and start-abort paths and surfaces validation errors on first retry.
  • KubernetesConfigMapStateProvider.shutdown is null-safe.

System tests:

  • waitForProcessorState throws IOException on timeout.
  • runProcessorOnce waits for validation and fails fast on INVALID.
  • SpawnedStandaloneNiFiInstanceFactory retries process spawn for one minute, polling Management Server health, to absorb TIME_WAIT.
  • ContentClaimTruncationIT pins the shared resource claim and identifies the replayed FlowFile by UUID.

Workspace rules covering single-element-array lambdas and the system-test troubleshooting-logs artifact.

Summary

NIFI-00000

Tracking

Please complete the following tracking steps prior to pull request creation.

Issue Tracking

Pull Request Tracking

  • Pull Request title starts with Apache NiFi Jira issue number, such as NIFI-00000
  • Pull Request commit message starts with Apache NiFi Jira issue number, as such NIFI-00000
  • Pull request contains commits signed with a registered key indicating Verified status

Pull Request Formatting

  • Pull Request based on current revision of the main branch
  • Pull Request refers to a feature branch with one commit containing changes

Verification

Please indicate the verification steps performed prior to pull request creation.

Build

  • Build completed using ./mvnw clean install -P contrib-check
    • JDK 21
    • JDK 25

Licensing

  • New dependencies are compatible with the Apache License 2.0 according to the License Policy
  • New dependencies are documented in applicable LICENSE and NOTICE files

Documentation

  • Documentation formatting appears as expected in rendered files

Copy link
Copy Markdown
Contributor

@exceptionfactory exceptionfactory left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on improving test stability @markap14.

On initial review, it seems like no changes to the KubernetesConfigMapStateProvider should be included, because it is not used for any system tests. Was that resulting in unit test failures somewhere?

@markap14
Copy link
Copy Markdown
Contributor Author

markap14 commented May 8, 2026

[claude-opus-4.7] Thanks for the review @exceptionfactory. The Kubernetes change addresses a unit-test flake, not a system test. KubernetesConfigMapStateProviderTest's @AfterEach calls provider.shutdown(), and when a @BeforeEach throws (or initialization is skipped) shutdown() NPEs because kubernetesClient is null. The NPE then masks the real failure cause. Observed on PR #11211 in actions run 25513810851:

```
java.lang.NullPointerException: Cannot invoke "io.fabric8.kubernetes.client.KubernetesClient.close()"
because "this.kubernetesClient" is null
at KubernetesConfigMapStateProvider.shutdown(KubernetesConfigMapStateProvider.java:144)
at KubernetesConfigMapStateProviderTest.shutdownProvider(KubernetesConfigMapStateProviderTest.java:113)
```

The fix is just a null check in shutdown() plus a unit test verifying it.

Happy to split it out into a separate PR if you'd prefer the lifecycle PR stays narrowly scoped — let me know.

@exceptionfactory
Copy link
Copy Markdown
Contributor

[claude-opus-4.7] Thanks for the review @exceptionfactory. The Kubernetes change addresses a unit-test flake, not a system test. KubernetesConfigMapStateProviderTest's @AfterEach calls provider.shutdown(), and when a @BeforeEach throws (or initialization is skipped) shutdown() NPEs because kubernetesClient is null. The NPE then masks the real failure cause. Observed on PR #11211 in actions run 25513810851:

java.lang.NullPointerException: Cannot invoke "io.fabric8.kubernetes.client.KubernetesClient.close()" because "this.kubernetesClient" is null at KubernetesConfigMapStateProvider.shutdown(KubernetesConfigMapStateProvider.java:144) at KubernetesConfigMapStateProviderTest.shutdownProvider(KubernetesConfigMapStateProviderTest.java:113)

The fix is just a null check in shutdown() plus a unit test verifying it.

Happy to split it out into a separate PR if you'd prefer the lifecycle PR stays narrowly scoped — let me know.

Thanks for the background reference, keeping it in this PR sounds good.

- Defensive copy in LifecycleState.getFutures() to prevent ConcurrentModificationException.
- Ensure StandardProcessorNode.stop() decrements active threads and completes the
  stop action even if the stop task throws.
- Promote silent processor start abort paths to INFO logs to aid diagnosis.
- Null-safe shutdown in KubernetesConfigMapStateProvider when initialization was skipped.
- SpawnedStandaloneNiFiInstanceFactory: retry NiFi startup until the Management Server
  is reachable (handles HttpServer port reuse race), but only when start() is asked to
  wait for completion. start(false) now spawns the process and returns immediately.
- NiFiClientUtil.runProcessorOnce waits for validation completion and fails fast on
  INVALID; system tests updated to use it where appropriate.
- ContentClaimTruncationIT: keep one small FlowFile queued so its resource claim is
  pinned during replay, preventing archive-cleanup-induced truncation flakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants