fix: avoid resources lock contention utilizing channel #629
Conversation
The issue

In large clusters where Argo CD monitors numerous resources, the processing of watches becomes significantly slow. In our case (total k8s resources in cluster: ~400k, Pods: ~76k, ReplicaSets: ~52k), it takes around 10 minutes. As a result, the Argo CD UI displays outdated information, which impacts several features that rely on sync waves, like PruneLast. Eventually, the sheer volume of events from the cluster overwhelmed the system, causing Argo CD to stall completely.

To address this, we disabled the tracking of Pods and ReplicaSets, although this compromises one of the main benefits of the Argo CD UI. We also filtered out irrelevant events and tried to optimize various settings in the application controller. However, vertical scaling of the application controller had no effect, and horizontal scaling is not an option for a single cluster due to sharding limitations.

Issue causes

During the investigation, it was found that the problem lies in the following:

Patched v2.10.9

v2.10.9 was patched with the following commits. Though the patches significantly improve performance, Argo CD still cannot handle the load from large clusters. The screenshot shows one of the largest clusters, running the v2.10.9 build patched with the commits above.

As can be seen, once Pods and ReplicaSets are enabled for tracking, the cluster event count falls close to zero and reconciliation time increases drastically. Number of pods in cluster: ~76k. A more detailed comparison of different patched versions is added to this comment - argoproj/argo-cd#8172 (comment)

The potential reason is lock contention. A few more metrics were added, and it was found that when the number of events is significant, acquiring the lock sometimes takes ~5 minutes, which delays reconciliation. The suggested fix #602 to optimize the lock usage did not improve the issue in large clusters.

Avoid resources lock contention utilizing channel

Since we still have significant lock contention in massive clusters and the approaches above didn't resolve the issue, another approach has been considered; it is part of this PR. When every goroutine must acquire the write lock itself, we can't handle more than one event at a time. What if we introduce a channel to which all received events are sent, and make one goroutine responsible for processing events from that channel in batches? That way, the lock acquisitions move out of the individual watch goroutines and into the single goroutine that processes events from the channel. With only one place where the write lock is acquired, the lock contention disappears.

The fix results

As can be seen from the metrics, once the fixed version was deployed and Nodes, ReplicaSets, and Pods were enabled for tracking, the number of cluster events remained stable and didn't drop.

Conclusions
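To make the suggested approach concrete, here is a minimal sketch of the pattern, not the PR's actual code: the struct is trimmed to two fields, the cache-update logic is elided, and the 100ms flush interval is an assumption (the real interval is configurable, as discussed later in the thread).

```go
package cache

import (
	"sync"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/watch"
)

// eventMeta pairs a watch event with its decoded object.
type eventMeta struct {
	event watch.Event
	un    *unstructured.Unstructured
}

type clusterCache struct {
	lock        sync.RWMutex
	eventMetaCh chan eventMeta
}

// processEvents is the single consumer: it drains eventMetaCh and applies
// events in batches, so the write lock is acquired once per batch instead
// of once per event.
func (c *clusterCache) processEvents() {
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	var batch []eventMeta
	flush := func() {
		if len(batch) == 0 {
			return
		}
		c.lock.Lock() // the only place the write lock is taken for events
		for _, em := range batch {
			_ = em // apply the event to the cluster cache here
		}
		c.lock.Unlock()
		batch = batch[:0] // keep the backing array for the next batch
	}

	for {
		select {
		case em, ok := <-c.eventMetaCh:
			if !ok {
				flush() // channel closed: flush the remainder and stop
				return
			}
			batch = append(batch, em)
		case <-ticker.C:
			flush()
		}
	}
}
```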
Your analysis is excruciatingly thorough, I love it! I've posted it to SIG Scalability, and we'll start analyzing ASAP. Please be patient, it'll take us a while to give it a really thorough review.
pkg/cache/cluster.go
Outdated
	return err
}

go c.processEvents()
I see this in the sync function comment:

// When this function exits, the cluster cache is up to date, and the appropriate resources are being watched for
// changes.

If I understand this change correctly (and the associated test changes), by processing these events in a goroutine, we're breaking the guarantee that sync will completely update the cluster cache. Is that correct?
These changes do not break the guarantee that sync will completely update the cluster cache.

The sync function populates the cluster cache when it is run:

gitops-engine/pkg/cache/cluster.go
Line 934 in a16e663

c.setNode(c.newResource(un))

The processEvents goroutine processes the future events that are received in the watchEvents goroutine:

gitops-engine/pkg/cache/cluster.go
Line 962 in a16e663

go c.watchEvents(ctx, api, resClient, ns, resourceVersion)

The watchEvents goroutine watches for events from the k8s resource types. Once an event is received, it is processed:

gitops-engine/pkg/cache/cluster.go
Line 706 in a16e663

case event, ok := <-w.ResultChan():

The event is sent to the channel that is read in the processEvents goroutine, where the processing is done in bulk:

gitops-engine/pkg/cache/cluster.go
Line 1299 in a16e663

c.eventMetaCh <- eventMeta{event, un}
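For completeness, a sketch of the producer side of that flow, continuing the types from the sketch in the description above (context and watch imports assumed; the real watchEvents also handles errors, resyncs, and resource-version bookkeeping, and has a different signature):

```go
// watchEvents forwards each received event into eventMetaCh instead of
// applying it under the write lock itself (simplified signature).
func (c *clusterCache) watchEvents(ctx context.Context, w watch.Interface) {
	defer w.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case event, ok := <-w.ResultChan():
			if !ok {
				return // watch closed; the caller re-establishes it
			}
			un, ok := event.Object.(*unstructured.Unstructured)
			if !ok {
				continue // skip objects we cannot treat as unstructured
			}
			c.eventMetaCh <- eventMeta{event, un}
		}
	}
}
```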
@mpelekh would you be interested in joining a SIG Scalability meeting to talk through the changes?
Could you open an Argo CD PR pointing to this commit so that we can run all Argo's tests?
@crenshaw-dev Yes, I'd be happy to join the SIG Scalability meeting to discuss the changes. Please let me know the time and details or if there's anything specific I should prepare in advance.
Great! The event is on the Argoproj calendar, and we coordinate in CNCF Slack. The next meeting is two Wednesdays from now at 8am Eastern time. No need to prepare anything really, just be prepared to answer questions about the PR. :-)
@crenshaw-dev Sure. Here it is - argoproj/argo-cd#20329.
A couple things from the contributors meeting last week:
I provided the details in this comment - argoproj/argo-cd#20329 (comment) tl;dr |
andrii-korotkov-verkada
left a comment
Overall, looks great, have some minor comments.
pkg/cache/cluster.go
Outdated
}

func (c *clusterCache) processEvents() {
	log := c.log.WithValues("fn", "processItems")
Is "fn" a standard thing? I'd rather use a more verbose name.
I completely agree with using a more verbose name. Fixed - 7a53eca
pkg/cache/cluster.go
Outdated
}

func (c *clusterCache) processEvents() {
	log := c.log.WithValues("fn", "processItems")
Should we extend the lock to do this assignment, or is there nothing that can override c.log?
No need to extend the lock here, as WithValues returns a new Logger instance without modifying c.log directly. The assignment to log is local and safe from concurrent modification.
https://github.com/go-logr/logr/blob/master/logr.go#L328
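A tiny standalone example of that logr behavior (logr.Discard returns a no-op logger, used here only for demonstration):

```go
package main

import "github.com/go-logr/logr"

func main() {
	base := logr.Discard()

	// WithValues returns a new Logger that carries the extra key/value
	// pair; base itself is never modified, so reading it concurrently
	// from other goroutines stays safe without a lock.
	scoped := base.WithValues("functionName", "processEvents")

	scoped.V(1).Info("processing events") // no output with Discard
}
```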
pkg/cache/cluster.go
Outdated
select {
case em, ok := <-ch:
	if !ok {
		log.V(1).Info("Event processing channel closed, finish processing")
I think this can be debug level
That completely makes sense. Fixed - 7a53eca
pkg/cache/cluster.go
Outdated
for {
	select {
	case em, ok := <-ch:
Use a more verbose name like eventMeta. em is not a standard shortening.
A variable named eventMeta would shadow the eventMeta type declared in cluster.go - https://github.com/argoproj/gitops-engine/pull/629/files#diff-9c9e197d543705f08c9b1bc2dc404a55506cfc2935a988e6007d248257aadb1aR80.
Renamed to evMeta instead.
case <-ticker.C:
	if len(eventMetas) > 0 {
		c.processEventsBatch(eventMetas)
		eventMetas = eventMetas[:0]
Why assign eventMetas[:0] instead of nil or empty slice?
Thank you for the review, @andrii-korotkov-verkada.
Setting eventMetas to eventMetas[:0] retains the original slice’s underlying array, which can help reduce memory allocations if the slice is reused frequently. Assigning nil would release the underlying array, and creating a new empty slice would lead to additional allocations. This approach allows us to efficiently clear the slice while keeping the capacity intact for future use.
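A small demonstration of the difference in plain Go (not PR code):

```go
package main

import "fmt"

func main() {
	s := make([]int, 0, 4)
	s = append(s, 1, 2, 3)
	fmt.Println(len(s), cap(s)) // 3 4

	s = s[:0] // length reset, capacity kept: next appends reuse the array
	fmt.Println(len(s), cap(s)) // 0 4

	s = nil // backing array becomes garbage; next append reallocates
	fmt.Println(len(s), cap(s)) // 0 0
}
```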
Thanks for the review, @andrii-korotkov-verkada. I addressed the comments in this commit - 7a53eca. I am going to
@crenshaw-dev As agreed, I've set this up using a flag from Argo CD. The PRs have been updated:
	return nil
}

// invalidateEventMeta closes the eventMeta channel if it is open
nit: It looks like some functions, like this one, have an assumption that c.lock is held when they are called. In my experience that can lead to bugs with future maintenance when someone doesn't realize a lock is needed to call the function. I personally name functions in a way that this is indicated, but that is certainly a personal style choice. At the very least, it might be worth indicating the lock requirement in the go doc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mpelekh can you update the go doc?
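For illustration, the two styles mentioned in this thread might look like the sketch below; the function body is a guess at what "closes the channel if it is open" does, and the Locked suffix is one common convention rather than the PR's actual choice.

```go
// invalidateEventMeta closes the eventMeta channel if it is open.
// Precondition: the caller must hold c.lock for writing.
func (c *clusterCache) invalidateEventMeta() {
	if c.eventMetaCh != nil {
		close(c.eventMetaCh)
		c.eventMetaCh = nil
	}
}

// Alternative style: encode the precondition in the name, e.g.
// invalidateEventMetaLocked, so call sites signal the lock requirement.
```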
}

// SetEventProcessingInterval allows to set the interval for processing events
func SetEventProcessingInterval(interval time.Duration) UpdateSettingsFunc {
super nit but should we add a test asserting the default gets set without an override and vice versa?
Thanks for spotting that, @rumstead. I added a unit test for that.
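A sketch of what such a test could look like; defaultEventProcessingInterval and the eventProcessingInterval field are assumed names, imports (testing, time, client-go's rest, testify's assert) are omitted, and NewClusterCache is the package's existing constructor:

```go
func TestSetEventProcessingInterval(t *testing.T) {
	// Without an override, the default should apply.
	cache := NewClusterCache(&rest.Config{})
	assert.Equal(t, defaultEventProcessingInterval, cache.eventProcessingInterval)

	// With the settings function, the override should win.
	cache = NewClusterCache(&rest.Config{}, SetEventProcessingInterval(time.Second))
	assert.Equal(t, time.Second, cache.eventProcessingInterval)
}
```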
Codecov Report

Attention: Patch coverage is

Additional details and impacted files:

@@            Coverage Diff             @@
##           master     #629      +/-   ##
==========================================
+ Coverage   54.26%   54.43%   +0.16%
==========================================
  Files          64       64
  Lines        6164     6270     +106
==========================================
+ Hits         3345     3413      +68
- Misses       2549     2583      +34
- Partials      270      274       +4

☔ View full report in Codecov by Sentry.
batchEventsProcessing bool
eventMetaCh chan eventMeta
These need docstrings. I'd also rename to batchEventsProcessingEnabled to make clear that it's a feature flag.
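Following that suggestion, the documented fields might read something like this (illustrative wording only; the rename is the reviewer's proposal, not necessarily what was merged):

```go
// batchEventsProcessing is a feature flag: when true, watch events are
// processed in batches through eventMetaCh instead of locking per event.
batchEventsProcessing bool

// eventMetaCh carries events from the watchEvents goroutines to the
// single processEvents consumer; unused when batching is disabled.
eventMetaCh chan eventMeta
```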
func (c *clusterCache) getProcessEventsHandlers() []OnProcessEventsHandler {
	c.handlersLock.Lock()
	defer c.handlersLock.Unlock()
	handlers := make([]OnProcessEventsHandler, 0, len(c.processEventsHandlers))
	for _, h := range c.processEventsHandlers {
		handlers = append(handlers, h)
	}
	return handlers
}
Why not have the caller just use the map directly?







Problem statement is in argoproj/argo-cd#8172 (comment)
IterateHierarchyV2 significantly improved performance, getting us ~90% of the way there. But on huge clusters, we still have significant lock contention.
The fix in this pull request approaches the problem differently - it avoids lock contention by utilizing a channel to process events from the cluster.
More details are in the comments.