
Conversation

@mpelekh
Contributor

@mpelekh mpelekh commented Oct 4, 2024

Problem statement is in argoproj/argo-cd#8172 (comment)

IterateHierarchyV2 significantly improved performance, getting us ~90% of the way there. But on huge clusters, we still have significant lock contention.

The fix in this pull request approaches the problem differently - it avoids lock contention by utilizing a channel to process events from the cluster.

More details are in the comments.

@mpelekh
Contributor Author

mpelekh commented Oct 4, 2024

The issue

In large clusters where Argo CD monitors numerous resources, the processing of watches becomes significantly slow—in our case (total k8s resources in cluster: ~400k, Pods: ~76k, ReplicaSets: ~52k), taking around 10 minutes. As a result, the Argo CD UI displays outdated information, impacting several features reliant on sync waves, like PruneLast. Eventually, the sheer volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To address this, we disabled the tracking of Pods and ReplicaSets, although this compromises one of the main benefits of the Argo CD UI. We also filtered out irrelevant events and tried to optimize various settings in the application controller. However, vertical scaling of the application controller had no effect, and horizontal scaling is not an option for a single cluster due to sharding limitations.

Issue causes

During the issue investigation, it was found that the problem lies in the following:

  • resource lock contention
  • slow performance of IterateHierarchy

Patched v2.10.9

v2.10.9 was patched with the following commits.

Though the patches significantly improve performance, Argo CD still cannot handle the load from large clusters.

The screenshots below show one of the largest clusters, running a v2.10.9 build patched with the commits above.

  • until 12:50, Pods and ReplicaSets are excluded from tracking
  • from 12:50 to 13:34, Pods and ReplicaSets are tracked
  • after 13:34, Pods and ReplicaSets are excluded from tracking again

As can be seen, once Pods and ReplicaSets are enabled for tracking, the cluster event count falls close to zero and the reconciliation time increases drastically.

Screenshot 2024-08-09 at 20 40 44

Screenshot 2024-08-09 at 20 51 00

Number of Pods in cluster: ~76k
Number of ReplicaSets in cluster: ~52k

A more detailed comparison of different patched versions is added to this comment - argoproj/argo-cd#8172 (comment)

The potential reason is lock contention.

Here, a few more metrics were added, and it was found that when the number of events is significant, sometimes it takes ~5 minutes to acquire a lock, which leads to a delay in reconciliation.
mpelekh@560ef00#diff-9c9e197d543705f08c9b1bc2dc404a55506cfc2935a988e6007d248257aadb1aR1372

Screenshot 2024-08-09 at 21 11 33

The fix suggested in #602 to optimize lock usage did not improve the situation in large clusters.

Avoiding resource lock contention by utilizing a channel

Since we still have significant lock contention in massive clusters and the approaches above didn't resolve the issue, another approach was considered. It is the one implemented in this PR.

When each goroutine must acquire a write lock, we can't handle more than one event at a time. What if we introduce a channel to which all received events are sent, and a single goroutine is responsible for processing events from that channel in batches? That way, the locking moves from each watcher goroutine into the one goroutine that processes events from the channel. The write lock is then acquired in only one place, which eliminates the lock contention. A minimal sketch of the idea is below.
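A minimal, self-contained sketch of the design (illustrative only; the names loosely mirror the PR, but the bodies are simplified stand-ins rather than the actual implementation):

```go
package main

import (
	"fmt"
	"sync"
)

// eventMeta is a stand-in for a watch event plus its decoded object.
type eventMeta struct {
	kind, name string
}

// clusterCache is a stand-in for the real cache: many watcher goroutines,
// one consumer goroutine that owns all writes.
type clusterCache struct {
	lock        sync.RWMutex
	resources   map[string]string
	eventMetaCh chan eventMeta
}

// watchEvents simulates a per-resource-type watcher: instead of taking the
// write lock itself, it only forwards events to the channel.
func (c *clusterCache) watchEvents(kind string, n int, wg *sync.WaitGroup) {
	defer wg.Done()
	for i := 0; i < n; i++ {
		c.eventMetaCh <- eventMeta{kind: kind, name: fmt.Sprintf("%s-%d", kind, i)}
	}
}

// processEvents is the single consumer: the write lock is acquired in exactly
// one place, and a burst of queued events is applied under one acquisition.
func (c *clusterCache) processEvents() {
	for em := range c.eventMetaCh {
		batch := []eventMeta{em}
	drain:
		for {
			select {
			case next, ok := <-c.eventMetaCh:
				if !ok {
					break drain
				}
				batch = append(batch, next)
			default:
				break drain
			}
		}
		c.lock.Lock()
		for _, e := range batch {
			c.resources[e.kind+"/"+e.name] = "updated"
		}
		c.lock.Unlock()
	}
}

func main() {
	c := &clusterCache{resources: map[string]string{}, eventMetaCh: make(chan eventMeta, 1024)}
	done := make(chan struct{})
	go func() { c.processEvents(); close(done) }()

	var wg sync.WaitGroup
	for _, kind := range []string{"Pod", "ReplicaSet"} {
		wg.Add(1)
		go c.watchEvents(kind, 1000, &wg)
	}
	wg.Wait()
	close(c.eventMetaCh)
	<-done
	fmt.Println("resources cached:", len(c.resources))
}
```

The key property is that the write lock is taken in exactly one goroutine, so the watchers only ever contend on a buffered channel send.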

The fix results


As can be seen from the metrics, once the fixed version was deployed and Nodes, ReplicaSets, and Pods were enabled for tracking, the number of cluster events remained stable and didn't drop.

Conclusions

  • The fix shows significant performance improvements. We left Nodes, ReplicaSets, and Pods enabled on large clusters.
  • The Argo CD UI is working smoothly.
  • The original issue has been resolved: users can manage Pods and ReplicaSets on large clusters.

@mpelekh mpelekh marked this pull request as ready for review October 4, 2024 08:11
@crenshaw-dev
Member

Your analysis is excruciatingly thorough, I love it! I've posted it to SIG Scalability, and we'll start analyzing ASAP. Please be patient, it'll take us a while to give it a really thorough review.

return err
}

go c.processEvents()
Member

I see this in the sync function comment:

// When this function exits, the cluster cache is up to date, and the appropriate resources are being watched for
// changes.

If I understand this change correctly (and the associated test changes), by processing these events in a goroutine, we're breaking the guarantee that sync will completely update the cluster cache. Is that correct?

Contributor Author

These changes do not break the guarantee that sync will completely update the cluster cache.
The sync function populates the cluster cache when it runs.

c.setNode(c.newResource(un))

The processEvents goroutine processes the future events that are received by the watchEvents goroutines.

go c.watchEvents(ctx, api, resClient, ns, resourceVersion)

Each watchEvents goroutine watches for events for a given Kubernetes resource type. Once an event is received, it is processed.

case event, ok := <-w.ResultChan():

The event is sent to the channel that is read in the processEvents goroutine, where the processing is done in bulk.

c.eventMetaCh <- eventMeta{event, un}

@crenshaw-dev
Member

@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?

@crenshaw-dev
Member

Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?

@mpelekh mpelekh force-pushed the event-processing-performance-improvement branch 2 times, most recently from db0f61d to 2f5160b Compare October 10, 2024 14:02
@mpelekh
Contributor Author

mpelekh commented Oct 10, 2024

@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?

@crenshaw-dev Yes, I’d be happy to join the SIG Scalability meeting to discuss the changes. Please let me know the time and details or if there’s anything specific I should prepare in advance.

@crenshaw-dev
Member

Great! The event is on the Argoproj calendar, and we coordinate in CNCF Slack. The next meeting is two Wednesdays from now at 8am eastern time.

No need to prepare anything really, just be prepared to answer questions about the PR. :-)

@mpelekh
Contributor Author

mpelekh commented Oct 10, 2024

Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?

@crenshaw-dev Sure. Here it is - argoproj/argo-cd#20329.

@mpelekh mpelekh force-pushed the event-processing-performance-improvement branch from 2f5160b to c26baf2 Compare October 24, 2024 09:16
@crenshaw-dev
Member

A couple things from the contributors meeting last week:

  1. we should probably make this configurable via a flag from Argo CD; the more I think about it, the more I think we should have a quick opt-out option
  2. if feasible, we should have batches processed on a ticker or with some maximum batch size; that'll help manage particularly high-churn spikes (see the sketch below)
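For example, the batching loop could flush on a ticker or when a maximum batch size is reached, roughly like this (a sketch only; it reuses the eventMeta stand-in from the earlier sketch, assumes time is imported, and the names are illustrative):

```go
// processEventsBatched collects events from ch and flushes them either when
// the ticker fires or when the batch reaches maxBatchSize, whichever comes
// first. In the real cache, apply would take the write lock once per batch;
// apply must not retain the slice, since its backing array is reused.
func processEventsBatched(ch <-chan eventMeta, interval time.Duration, maxBatchSize int, apply func([]eventMeta)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	batch := make([]eventMeta, 0, maxBatchSize)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		apply(batch)
		batch = batch[:0] // reuse the backing array for the next batch
	}

	for {
		select {
		case em, ok := <-ch:
			if !ok {
				flush() // apply whatever is left, then stop
				return
			}
			batch = append(batch, em)
			if len(batch) >= maxBatchSize {
				flush()
			}
		case <-ticker.C:
			flush()
		}
	}
}
```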

@mpelekh
Contributor Author

mpelekh commented Nov 6, 2024

Do we have a definitive answer yet for whether sync status/operation status are currently atomically updated vs. just very-quickly updated? Because if we're losing atomicity, that could be a big problem. If we're just slowing something down that used to be fast, I think that's relatively okay.

I provided the details in this comment - argoproj/argo-cd#20329 (comment)

tl;dr
The sync and operation statuses are not updated atomically; they are just updated very quickly.

@mpelekh mpelekh force-pushed the event-processing-performance-improvement branch from c26baf2 to 809ba43 Compare November 7, 2024 13:37
Contributor

@andrii-korotkov-verkada andrii-korotkov-verkada left a comment


Overall, looks great, have some minor comments.

}

func (c *clusterCache) processEvents() {
log := c.log.WithValues("fn", "processItems")
Contributor

Is "fn" a standard thing? I'd rather use a more verbose name.

Contributor Author

I completely agree with using a more verbose name. Fixed - 7a53eca

}

func (c *clusterCache) processEvents() {
log := c.log.WithValues("fn", "processItems")
Contributor

Should we extend the lock to do this assignment, or is there nothing that can override c.log?

Contributor Author

No need to extend the lock here, as WithValues returns a new Logger instance without modifying c.log directly. The assignment to log is local and safe from concurrent modification.
https://github.com/go-logr/logr/blob/master/logr.go#L328
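For illustration, a tiny self-contained example of that behavior (using funcr only to get a concrete logr.Logger; any sink would do):

```go
package main

import (
	"fmt"

	"github.com/go-logr/logr/funcr"
)

func main() {
	base := funcr.New(func(prefix, args string) { fmt.Println(args) }, funcr.Options{})
	derived := base.WithValues("functionName", "processEvents")

	derived.Info("starting") // carries "functionName"="processEvents"
	base.Info("starting")    // the original logger is unchanged
}
```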

select {
case em, ok := <-ch:
if !ok {
log.V(1).Info("Event processing channel closed, finish processing")
Contributor

I think this can be debug level

Contributor Author

It completely makes sense. Fixed - 7a53eca


for {
select {
case em, ok := <-ch:
Contributor

Use a more verbose name like eventMeta. em is not a standard shortening.


case <-ticker.C:
if len(eventMetas) > 0 {
c.processEventsBatch(eventMetas)
eventMetas = eventMetas[:0]
Contributor

Why assign eventMetas[:0] instead of nil or empty slice?

Contributor Author

Thank you for review @andrii-korotkov-verkada

Setting eventMetas to eventMetas[:0] retains the original slice’s underlying array, which can help reduce memory allocations if the slice is reused frequently. Assigning nil would release the underlying array, and creating a new empty slice would lead to additional allocations. This approach allows us to efficiently clear the slice while keeping the capacity intact for future use.
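A tiny illustration of the difference:

```go
package main

import "fmt"

func main() {
	batch := make([]int, 0, 1024)
	batch = append(batch, 1, 2, 3)

	batch = batch[:0]                   // length reset, capacity and backing array kept
	fmt.Println(len(batch), cap(batch)) // 0 1024 -- appends up to 1024 won't reallocate

	batch = nil                         // dropping the backing array instead
	fmt.Println(len(batch), cap(batch)) // 0 0 -- the next append must allocate again
}
```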

@mpelekh mpelekh force-pushed the event-processing-performance-improvement branch from d310961 to 7a53eca Compare November 12, 2024 13:05
@mpelekh
Contributor Author

mpelekh commented Nov 12, 2024

Thanks for the review @andrii-korotkov-verkada. I addressed the comments in this commit - 7a53eca

I am going to rebase -i --autosquash it before merge.

@mpelekh mpelekh force-pushed the event-processing-performance-improvement branch 3 times, most recently from 569aa9c to 66d645e Compare December 2, 2024 12:33
@mpelekh
Contributor Author

mpelekh commented Dec 2, 2024

we should probably make this configurable via a flag from Argo CD; the more I think about it, the more I think we should have a quick opt-out option

@crenshaw-dev As agreed, I've set this up using a flag from Argo CD.

The PRs have been updated:

return nil
}

// invalidateEventMeta closes the eventMeta channel if it is open

nit: It looks like some functions, like this one, have an assumption that c.lock is held when they are called. In my experience that can lead to bugs with future maintenance when someone doesn't realize a lock is needed to call the function. I personally name functions in a way that this is indicated, but that is certainly a personal style choice. At the very least, it might be worth indicating the lock requirement in the go doc.
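For example, the requirement could be stated in the doc comment (or encoded in the name); the body below is only a rough sketch, not necessarily what the PR does:

```go
// invalidateEventMeta closes the eventMeta channel if it is open.
//
// The caller must hold c.lock for writing; this function does not acquire it
// itself. (An alternative convention is to encode the requirement in the name,
// e.g. invalidateEventMetaLocked.)
func (c *clusterCache) invalidateEventMeta() {
	if c.eventMetaCh != nil {
		close(c.eventMetaCh)
		c.eventMetaCh = nil
	}
}
```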

Member

@mpelekh can you update the go doc?

@mpelekh mpelekh force-pushed the event-processing-performance-improvement branch from 8683b25 to d36c78a Compare December 12, 2024 10:00
}

// SetEventProcessingInterval allows to set the interval for processing events
func SetEventProcessingInterval(interval time.Duration) UpdateSettingsFunc {
Member

super nit but should we add a test asserting the default gets set without an override and vice versa?

Contributor Author

Thanks for spotting that, @rumstead. I added a unit test for that.
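A sketch of what such a test could look like (the eventProcessingInterval field and defaultEventProcessingInterval constant are assumed names, not necessarily those used in the PR):

```go
func TestSetEventProcessingInterval(t *testing.T) {
	// Default applies when no override is supplied.
	cache := NewClusterCache(&rest.Config{})
	assert.Equal(t, defaultEventProcessingInterval, cache.eventProcessingInterval)

	// Explicit override via the settings function.
	interval := 5 * time.Second
	cache = NewClusterCache(&rest.Config{}, SetEventProcessingInterval(interval))
	assert.Equal(t, interval, cache.eventProcessingInterval)
}
```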


@codecov

codecov bot commented Dec 12, 2024

Codecov Report

Attention: Patch coverage is 65.17857% with 39 lines in your changes missing coverage. Please review.

Project coverage is 54.43%. Comparing base (8849c3f) to head (af50e33).
Report is 5 commits behind head on master.

Files with missing lines Patch % Lines
pkg/cache/cluster.go 71.42% 22 Missing and 4 partials ⚠️
pkg/cache/mocks/ClusterCache.go 0.00% 13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #629      +/-   ##
==========================================
+ Coverage   54.26%   54.43%   +0.16%     
==========================================
  Files          64       64              
  Lines        6164     6270     +106     
==========================================
+ Hits         3345     3413      +68     
- Misses       2549     2583      +34     
- Partials      270      274       +4     


@crenshaw-dev crenshaw-dev merged commit 54992bf into argoproj:master Dec 16, 2024
5 checks passed
Comment on lines +204 to +205
batchEventsProcessing bool
eventMetaCh chan eventMeta
Member

These need docstrings. I'd also rename to batchEventsProcessingEnabled to make clear that it's a feature flag.
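Perhaps something along these lines (a sketch of the requested docstrings, including the suggested rename):

```go
// batchEventsProcessingEnabled is a feature flag: when true, watch events are
// queued on eventMetaCh and applied to the cache in batches by a single
// goroutine instead of each watcher applying them under the lock directly.
batchEventsProcessingEnabled bool
// eventMetaCh carries events, together with their decoded objects, from the
// watcher goroutines to the batch-processing goroutine.
eventMetaCh chan eventMeta
```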

Comment on lines +335 to +343
func (c *clusterCache) getProcessEventsHandlers() []OnProcessEventsHandler {
c.handlersLock.Lock()
defer c.handlersLock.Unlock()
handlers := make([]OnProcessEventsHandler, 0, len(c.processEventsHandlers))
for _, h := range c.processEventsHandlers {
handlers = append(handlers, h)
}
return handlers
}
Member

Why not have the caller just use the map directly?

return nil
}

// invalidateEventMeta closes the eventMeta channel if it is open
Member

@mpelekh can you update the go doc?

