Resubscribe to project changes after websocket reconnects#2174
Resubscribe to project changes after websocket reconnects#2174
Conversation
When the websocket times out and reconnects, such as after a laptop is put to sleep or a phone changes networks, HubConnection reconnects with a new connection ID. But the connection ID is what is used on the server side to determine who should receive project-change notifications. So whenever the HubConnection reconnects, it needs to resubscribe to the project-change notification events, which will send the server its new connection ID.
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds a cached per-server+project SignalR Reconnected handler that re-sends "ListenForProjectChanges" on reconnect, stores the handler with a 1-hour sliding expiration, replaces any prior handler, and adds a helper to derive the reconnection cache key. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
The latest updates on your projects. Learn more about Argos notifications ↗︎
|
|
Ideally we would be able to tell the Lexbox server "Hey, connection ID 123abc has gone away" and have the server disconnect that connection ID from the project-change notification group. Otherwise connection IDs will just pile up and pile up in the group, which could potentially lead to slower performance depending on how SignalR handles things. (I didn't do a deep dive into SignalR to find out). But in HubConnection's Reconnected event handler, I could see no way to retrieve the old connection ID in order to tell the server "Hey, I used to be ID thus-and-such". So we register with the new connection ID and hope that SignalR is doing the sensible thing with old connection IDs. |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)
221-224: Add logging and error handling inside theReconnectedhandler.If
SendAsyncthrows (e.g., due to a transient network error or cancellation before reconnect completes), the exception silently propagates to the SignalR event-dispatch loop. Adding atry/catchwith structured logging makes reconnect failures visible without destabilising the connection.♻️ Proposed improvement — log and swallow recoverable errors
- lexboxConnection.Reconnected += (newId) => - { - return lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken); - }; + lexboxConnection.Reconnected += async (newId) => + { + try + { + logger.LogInformation("SignalR reconnected (connectionId={NewId}), re-subscribing to project {ProjectId}", newId, projectData.Id); + await lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken); + } + catch (Exception ex) when (ex is not OperationCanceledException) + { + logger.LogError(ex, "Failed to re-subscribe to project changes for {ProjectId} after reconnect", projectData.Id); + } + };🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines 221 - 224, Wrap the Reconnected handler body in an async lambda and add try/catch around the SendAsync call: change lexboxConnection.Reconnected += (newId) => { ... } to an async (newId) => { try { await lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken); } catch (OperationCanceledException) { /* optional debug log if needed and swallow */ } catch (Exception ex) { /* structured log including projectData.Id, newId and the exception */ } }, so failures are logged with context (projectData.Id, newId) and do not propagate to the SignalR event loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-224: Register the lexboxConnection.Reconnected handler before
calling lexboxConnection.SendAsync to avoid the race where a reconnect occurs
with no handler attached; move the subscription of the Reconnected event (the
lambda that calls lexboxConnection.SendAsync("ListenForProjectChanges",
projectData.Id, stoppingToken)) so it runs prior to the initial
lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id,
stoppingToken), keeping the same arguments (projectData.Id, stoppingToken) in
the handler.
- Around line 221-224: StartLexboxProjectChangeListener currently appends a new
lexboxConnection.Reconnected handler per call which causes duplicate
ListenForProjectChanges sends on reconnect; fix by registering a single
Reconnected handler per cached HubConnection and maintain a thread-safe set of
registered project IDs for that connection (e.g., ConcurrentDictionary/HashSet)
so that when adding a project you only add its ID to the set (not another
Reconnected delegate), and the single Reconnected handler iterates the set and
calls lexboxConnection.SendAsync("ListenForProjectChanges", projectId,
stoppingToken) once per unique projectId; ensure thread-safety when updating the
set and remove project IDs / dispose handlers when listeners stop.
---
Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 221-224: Wrap the Reconnected handler body in an async lambda and
add try/catch around the SendAsync call: change lexboxConnection.Reconnected +=
(newId) => { ... } to an async (newId) => { try { await
lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id,
stoppingToken); } catch (OperationCanceledException) { /* optional debug log if
needed and swallow */ } catch (Exception ex) { /* structured log including
projectData.Id, newId and the exception */ } }, so failures are logged with
context (projectData.Id, newId) and do not propagate to the SignalR event loop.
|
@coderabbitai help |
ChatThere are 3 ways to chat with CodeRabbit:
CodeRabbit commands
Other keywords and placeholders
CodeRabbit configuration file (
|
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)
220-222: Lambda parametercacheshadows the class fieldthis.cache.The
GetOrCreatefactory parameter is namedcache, which shadows theIMemoryCachefield. This doesn't cause a bug today since the field isn't accessed inside the factory, but it's a readability trap — especially if someone later adds logic inside the factory that intends to reference theIMemoryCache. Consider naming itentryto match theICacheEntrytype.Naming fix
- var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), cache => + var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), entry => { - cache.SlidingExpiration = TimeSpan.FromHours(1); + entry.SlidingExpiration = TimeSpan.FromHours(1);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines 220 - 222, The lambda parameter named "cache" in the GetOrCreate call (where reconnectedHandler is created using GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), cache => { ... })) shadows the class field this.cache; rename that lambda parameter to a nonconflicting name such as "entry" (or "cacheEntry") and update any internal references (e.g., cache.SlidingExpiration) to entry.SlidingExpiration so the factory uses the ICacheEntry parameter and avoids hiding the IMemoryCache field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-231: The cached reconnection handler currently closes over the
local lexboxConnection, causing it to call SendAsync on a disposed
HubConnection; change the factory used by
cache.GetOrCreate(HubReconnectionHandlerCacheKey(...)) so the returned
Func<string?, Task> does not capture lexboxConnection but instead looks up the
current/active HubConnection at invocation time (e.g., resolve the connection
from the same connection cache by server/project inside the delegate) and calls
SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken) on that
resolved connection; also update the Closed event handler to evict any cached
reconnection handlers for that server (use HubReconnectionHandlerCacheKey or
whatever key-generation helper) when a connection is permanently closed so stale
handlers are removed.
---
Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-222: The lambda parameter named "cache" in the GetOrCreate
call (where reconnectedHandler is created using GetOrCreate<Func<string?,
Task>>(HubReconnectionHandlerCacheKey(server, projectData), cache => { ... }))
shadows the class field this.cache; rename that lambda parameter to a
nonconflicting name such as "entry" (or "cacheEntry") and update any internal
references (e.g., cache.SlidingExpiration) to entry.SlidingExpiration so the
factory uses the ICacheEntry parameter and avoids hiding the IMemoryCache field.
CodeRabbit AI review pointed out a possible race condition I would have missed: when the HubConnection is closed, it is removed from the cache. If ListenForProjectChanges is called after that point, it will return a new HubConnection object. The handler, though, has a cache duration of 1 hour (and I should probably bump that up to 1 or more day(s)) so that we will be certain to unregister the same event handler that was previously registered. Which means that closing over lexboxConnection would be sure to catch the stale, now-closed connection. By fetching the connection from the cache, we can ensure that we get the fresh connection rather than the stale one.
|
@CodeRabbit review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)
234-234:stoppingTokencaptured in handler silently drops resubscription if cancelled at reconnect time.
stoppingTokenis captured at factory-creation time. If the token is cancelled before theReconnectedevent fires (e.g., the service/scope that originally calledListenForProjectChangesshuts down),SendAsyncthrowsOperationCanceledException. SinceHubConnection.Reconnectedis aFunc<string?, Task>event and SignalR silently swallows faulted handler tasks, the resubscription fails with no log entry. Consider usingCancellationToken.Noneinside the handler (reconnect itself isn't bounded by the caller's lifetime) or at least logging the failure:♻️ Proposed refactor
- return connection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken); + return connection.SendAsync("ListenForProjectChanges", projectData.Id, CancellationToken.None);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` at line 234, The Reconnected handler captures the original stoppingToken causing SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken) to throw OperationCanceledException and silently drop resubscription; update the Reconnected handler in LexboxProjectService so it does not use the captured stoppingToken (use CancellationToken.None for the SendAsync call) or at minimum catch and log exceptions from SendAsync inside the HubConnection.Reconnected delegate (reference: stoppingToken, ListenForProjectChanges call, HubConnection.Reconnected, and SendAsync) to ensure failures are visible and resubscribe attempts aren't cancelled by the caller's lifetime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 225-233: The fallback logic in the handler that uses
lexboxConnection captured at factory-creation is unsafe and the comment about a
five-minute guard is incorrect; update the handler inside the cache retrieval
(the block that checks cache.TryGetValue(HubConnectionCacheKey(server), out
HubConnection? connection) || connection is null) to not call SendAsync on the
captured lexboxConnection fallback. Instead return Task.CompletedTask when no
valid cached connection is found; also remove or correct the misleading comment
and reference StartLexboxProjectChangeListener, cache.GetOrCreate and
lexboxConnection so the code clearly uses the cached connection when available
and otherwise safely no-ops.
---
Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Line 234: The Reconnected handler captures the original stoppingToken causing
SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken) to throw
OperationCanceledException and silently drop resubscription; update the
Reconnected handler in LexboxProjectService so it does not use the captured
stoppingToken (use CancellationToken.None for the SendAsync call) or at minimum
catch and log exceptions from SendAsync inside the HubConnection.Reconnected
delegate (reference: stoppingToken, ListenForProjectChanges call,
HubConnection.Reconnected, and SendAsync) to ensure failures are visible and
resubscribe attempts aren't cancelled by the caller's lifetime.
|
@CodeRabbit full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)
227-232: The "five minutes" in the comment is inaccurate — correct it.The
HubConnectionCacheKeycache entry (created at line 297) has no sliding or absolute expiration set on it; it stays in the cache indefinitely until theClosedevent explicitly callscache.Remove(HubConnectionCacheKey(server)). There is no 5-minute timer, so the described race would not require "five minutes to pass" — the real reason it is vanishingly unlikely is simply thatStartLexboxProjectChangeListenerand theGetOrCreateon the next line execute sequentially within the same synchronous call chain.✏️ Suggested comment correction
- // In theory this could have a race condition where it ends up closing over the old lexboxConnection value, - // which might have been disposed and ListenForProjectChanges is being called with a new connection. But if that - // has happened, it is vanishingly unlikely for the new connection to NOT be in the cache by the time we reach - // this point, because that would require five minutes to pass between the StartLexboxProjectChangeListener call - // and the cache.GetOrCreate call two lines later. + // In theory this could have a race condition where it ends up closing over the old lexboxConnection value, + // which might have been disposed and ListenForProjectChanges is being called with a new connection. But if that + // has happened, it is vanishingly unlikely for the new connection to NOT be in the cache by the time we reach + // this point, because StartLexboxProjectChangeListener and this GetOrCreate call execute sequentially in the + // same call chain, so the new connection is already in the cache by the time we get here.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines 227 - 232, The inline comment incorrectly refers to a "five minutes" delay; update the comment near the assignment to connection = lexboxConnection to remove the five‑minute claim and instead state the actual reason the race is vanishingly unlikely: the HubConnectionCacheKey entry has no expiration (it is only removed by cache.Remove(HubConnectionCacheKey(server)) in the Closed handler), and StartLexboxProjectChangeListener / ListenForProjectChanges and the subsequent cache.GetOrCreate call run synchronously in the same call chain so the new connection will already be in the cache; reference the symbols StartLexboxProjectChangeListener, ListenForProjectChanges, HubConnectionCacheKey, cache.GetOrCreate, lexboxConnection and connection when editing the comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-238: Get a null-check after calling
cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server,
projectData)) to ensure reconnectedHandler is not null before
unsubscribing/subscribing to lexboxConnection.Reconnected; if the cached value
is null, call the factory (or create a non-null handler) and store it so that
reconnectedHandler is always a valid Func<string?, Task>, then use that non-null
reconnectedHandler for lexboxConnection.Reconnected -= and += so
ListenForProjectChanges is reliably re-sent on reconnect.
---
Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 227-232: The inline comment incorrectly refers to a "five minutes"
delay; update the comment near the assignment to connection = lexboxConnection
to remove the five‑minute claim and instead state the actual reason the race is
vanishingly unlikely: the HubConnectionCacheKey entry has no expiration (it is
only removed by cache.Remove(HubConnectionCacheKey(server)) in the Closed
handler), and StartLexboxProjectChangeListener / ListenForProjectChanges and the
subsequent cache.GetOrCreate call run synchronously in the same call chain so
the new connection will already be in the cache; reference the symbols
StartLexboxProjectChangeListener, ListenForProjectChanges,
HubConnectionCacheKey, cache.GetOrCreate, lexboxConnection and connection when
editing the comment.
| var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), cacheEntry => | ||
| { | ||
| cacheEntry.SlidingExpiration = TimeSpan.FromHours(1); | ||
| return (newId) => | ||
| { | ||
| if (!cache.TryGetValue(HubConnectionCacheKey(server), out HubConnection? connection) || connection is null) | ||
| { | ||
| // In theory this could have a race condition where it ends up closing over the old lexboxConnection value, | ||
| // which might have been disposed and ListenForProjectChanges is being called with a new connection. But if that | ||
| // has happened, it is vanishingly unlikely for the new connection to NOT be in the cache by the time we reach | ||
| // this point, because that would require five minutes to pass between the StartLexboxProjectChangeListener call | ||
| // and the cache.GetOrCreate call two lines later. | ||
| connection = lexboxConnection; | ||
| } | ||
| return connection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken); | ||
| }; | ||
| }); | ||
| lexboxConnection.Reconnected -= reconnectedHandler; | ||
| lexboxConnection.Reconnected += reconnectedHandler; |
There was a problem hiding this comment.
Add a null guard after cache.GetOrCreate<Func<string?, Task>>.
GetOrCreate() returns TItem? in .NET 7+, even when the factory delegate itself will never return null. This is by design — the cache supports storing null values, so if a prior call stored null under HubReconnectionHandlerCacheKey(server, projectData), GetOrCreate would skip the factory entirely and return that null.
When reconnectedHandler is null, the -= and += operations on lexboxConnection.Reconnected are both no-ops in C#, meaning the handler is silently never registered. The connection reconnects but ListenForProjectChanges is never re-sent — identical to the original bug this PR is fixing.
🛡️ Proposed fix — guard against null handler
var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), cacheEntry =>
{
cacheEntry.SlidingExpiration = TimeSpan.FromHours(1);
return (newId) =>
{
if (!cache.TryGetValue(HubConnectionCacheKey(server), out HubConnection? connection) || connection is null)
{
connection = lexboxConnection;
}
return connection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
};
});
+ if (reconnectedHandler is null) return;
lexboxConnection.Reconnected -= reconnectedHandler;
lexboxConnection.Reconnected += reconnectedHandler;
await lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines
220 - 238, Get a null-check after calling cache.GetOrCreate<Func<string?,
Task>>(HubReconnectionHandlerCacheKey(server, projectData)) to ensure
reconnectedHandler is not null before unsubscribing/subscribing to
lexboxConnection.Reconnected; if the cached value is null, call the factory (or
create a non-null handler) and store it so that reconnectedHandler is always a
valid Func<string?, Task>, then use that non-null reconnectedHandler for
lexboxConnection.Reconnected -= and += so ListenForProjectChanges is reliably
re-sent on reconnect.
There was a problem hiding this comment.
A null handler will never be stored in the cache, so this is not worth the extra complexity of the null check on the handler. Also, the proposed fix doesn't actually fix the issue, as it would still silently never register the handler in this case. In fact it makes it worse: the current code, if a null could possibly end up in reconnectedHandler, would at least send the "ListenForProjectChanges" message once. But the proposed fix would not even send the message a single time, which would be worse than the current behavior.
But the main reason not to do this is that there is no possibility of null being stored under the HubReconnectionHandlerCacheKey, because no other code uses that key and no other code will ever use that key. Fixing an issue that will never arise is not worth the cost of even a single extra line of code.
There was a problem hiding this comment.
@rmunn, you're absolutely right on both points. I appreciate the detailed explanation:
-
Cache key scope: Since
HubReconnectionHandlerCacheKey(server, projectData)is project-specific and only used within this method, and the factory delegate always returns a non-null handler, there's indeed no path for null to be stored in the cache. -
Flaw in my proposed fix: You caught a critical issue—my suggestion to return early if
reconnectedHandlerwere null would skip the initialSendAsync("ListenForProjectChanges", ...)call entirely, making things worse rather than better.
Thank you for the clarification. The current implementation is correct as-is.
(\/)
( ••)
/ >📝 noted
✏️ Learnings added
Learnt from: rmunn
Repo: sillsdev/languageforge-lexbox PR: 2174
File: backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs:220-238
Timestamp: 2026-02-20T03:59:32.650Z
Learning: In `backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`, the `HubReconnectionHandlerCacheKey(server, projectData)` cache key is project-specific and only used within the `ListenForProjectChanges` method. The factory delegate for this cache entry always returns a non-null `Func<string?, Task>`, so there's no possibility of null being cached and no null guard is needed.
<!-- </add_learning>
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
@CodeRabbit You're right; I was getting confused with the five-minute absolute expiration in GetLexboxProjects. The comment no longer applies at all, so I have removed it in commit 932ef4c. |
|
Tip For best results, initiate chat on the files or code changes.
🧠 Learnings usedIf you found this review helpful, would you consider giving us a shout-out on X? Thank you for using CodeRabbit! |
|
Tip For best results, initiate chat on the files or code changes.
🧠 Learnings usedIf you found this review helpful, would you consider giving us a shout-out on X? Thank you for using CodeRabbit! |
|
@myieye - I have finished going through this code with CodeRabbit, and don't expect to make any more changes to it. Go ahead and take a look at it, it's in its final form (unless you request changes) so I won't be changing it while you review. |
There was a problem hiding this comment.
I think this is looking good. I expect this will add a lot more stability to our auto-syncing 👍
I went over this a bunch with Devin AI. Here's what I came up with:
1) There's an old bug: cache.CreateEntry(...) only commits the cache entry if you dispose the returned ICacheEntry (e.g. with using I suppose). So we actually never cache connections. That might be one reason you were seeing different connection IDs. But I suspect we still need the Reconnected handler.
2) As it is we're wiring up 1 reconnected handler per project and we never know when to remove them from the cache. What if we use a ConditionalWeakTable keyed on the HubConnection instead? One handler per connection, automatic cleanup when the connection is GC'd, no expiration to worry about. Code something like this:
private static readonly ConditionalWeakTable<HubConnection, HashSet<Guid>> _reconnectProjects = new();
public async Task ListenForProjectChanges(ProjectData projectData, CancellationToken stoppingToken)
{
if (!options.Value.TryGetServer(projectData, out var server)) return;
var lexboxConnection = await StartLexboxProjectChangeListener(server, stoppingToken);
if (lexboxConnection is null) return;
var subscribedProjects = _reconnectProjects.GetValue(lexboxConnection, static conn =>
{
var projects = new HashSet<Guid>();
conn.Reconnected += async _ =>
{
Guid[] projectIds;
lock (projects)
{
projectIds = [.. projects];
}
foreach (var projectId in projectIds)
{
await conn.SendAsync("ListenForProjectChanges", projectId);
}
};
return projects;
});
lock (subscribedProjects)
{
subscribedProjects.Add(projectData.Id);
}
await lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
}This also means HubReconnectionHandlerCacheKey can be deleted.
3) How about we add a more persistent retry policy? The default WithAutomaticReconnect() only retries 4 times over ~42 seconds. If a user closes their laptop for a few minutes, all retries are exhausted, Closed fires, the connection is disposed, and nothing re-establishes it. That might been the more common cause of #2149 than a brief reconnect. Something like this:
.WithAutomaticReconnect(new InfiniteRetryPolicy())private class InfiniteRetryPolicy : IRetryPolicy
{
public TimeSpan? GetNextRetryDelay(RetryContext retryContext)
{
return retryContext.PreviousRetryCount switch
{
0 => TimeSpan.Zero,
1 => TimeSpan.FromSeconds(2),
2 => TimeSpan.FromSeconds(10),
3 => TimeSpan.FromSeconds(30),
_ => TimeSpan.FromSeconds(60),
};
}
}
That is a strange API: I was actually skeptical of whether the AI had hallucinated that, because it seems a bizarre way to implement this. Disposing the object commits it? But sure enough, https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.caching.memory.icacheentry?view=net-10.0-pp says that ICacheEntry "Represents an entry in the IMemoryCache implementation. When Disposed, is committed to the cache." I suppose the idea is that you take the just-created cache entry and do things to it (setting the expiration time, for example), and then when you're done you exit the
Sounds like a good plan. I never remember the existence of ConditionalWeakTable and similar. I'll do that too.
Yes, closing the laptop lid for hours at a time is going to be a common scenario. And when the laptop lid opens, it's likely to be on a different WiFi network with a different IP address, so that's something we should test too. I'll do an infinite retry policy as suggested, especially since as per 2) we'll have a persistent Reconnect handler. |
That depends very much on how the retry system is set up, actually. If the laptop is suspended, then the retry code isn't running. So on second thought, I don't think this is the issue we thought it was. When the laptop gets opened up, that's the moment when the system will notice the connection dropped, try to reconnect, and start the timer on the retry policy. I'd have to test this, of course, but I suspect the laptop could be closed for an hour and the default policy would still work: try again after 2 seconds, then after 10, then after 30, then give up, all of it with timers started at the moment of unsuspending the laptop — as long as the user's laptop connects to WiFi within 42 seconds, the policy would succeed. |
|
I'm rethinking whether a ConditionalWeakTable keyed on the HubConnection is the right approach. I'm slightly hampered by not having a deep understanding of when we might end up getting a new HubConnnection object back instead of the existing one. But if we get a new object back and don't have the existing one anymore, wouldn't that mean that we would lose the list of connected projects entirely? The Reconnected handler closes over the HashSet, but I think it might be closing over the wrong HashSet. Let me think it through. Scenario: Start up FW Lite, connect to two projects, A and B. Then something happens to the HubConnection and it gets torn down and a new one is established. (E.g. if, when it tries to reconnect, it fails to get through for 45 seconds, enough to exhaust the default retry policy). I'll walk through this step by step. Step 1. FW Lite starts up. HubConnection 1 established. Connect to project A. State of the weak table: HubConnection 1 has a reference to HashSet 0x123, which contains Guid A. HubConnection 1 now gets a Reconnect handler, which closes over HashSet 0x123 in its Step 2. Connect to project B. ListenForProjectChanges gets called again, retrieves HubConnection 1 from the memory cache. The Step 3. HubConnection 1 disconnected, tries to reconnect, successfully reconnects after 20 seconds. The Reconnect handler is called. It closed over HashSet 0x123, so it will retrieve A and B and successfully ask the server to receive project events again on the new connection ID. All good so far. Step 4. HubConnection 1 disconnected again, tries to reconnect, can't get through at all. Eventually gives up, Closed fires, and HubConnection 1 is booted out of the memory cache. Its Reconnected handler is never fired, because it never succeeded in reconnecting. Now HubConnection 1 gets disposed, and soon no strong references to it exist, so it is booted out of the ConditionalWeakTable. Its Reconnected handler, which had closed over HashSet 0x123, is presumably disposed along with HubConnection 1. As far as I can see, nothing reconnects projects A and B here. Step 5. User tries to connect to a new project, C. ListenForProjectChanges is called. It doesn't find a HubConnection in the cache, so it creates new HubConnection 2. It then registers HubConnection 2 in the ConditionalWeakTable. The CreateValue callback (GetValue's second param) is called. It creates a new HashSet, 0x456, and registers it with HubConnection 2. It has not found HashSet 0x123, and there's no way for it to find that hash set because HubConnection 1 is gone. The list of project GUIDs in HashSet 0x456 now contains only project C, and projects A and B are nowhere to be found. |
|
As far as I can tell, thinking through this scenario, we want a list of projects keyed to the server, not the HubConnection object, if we want to be able to reconnect them when a new HubConnection ends up being created. |
IMemoryCache.CreateEntry() returns an IDisposable that must be disposed for the cache entry to actually be created.
This allows us to resubscribe to all projects in a single handler, so we never need to do the "Event -= Handler; Event += Handler" dance.
@rmunn I understand better now what you mean. We're not so interested in the connection<->project relationship, but rather the question: "What projects does the user currently want to listen to and what server are they on?" As it is, our ideas kind of side-step that question. Our options:
I'd say (4) is out of reach. That leaves us with (1) and (2). I like the idea of an infinite retry policy. I've had days when my network adapter dropped out for at least a minute several times. I expect some of our users lose internet for much longer periods of time. I think it would be great if we could expect our app to recover from longer interruptions. The idea that "it just keeps trying" is both less confusing and more satisfying than "it'll retry for approx X minutes". |
Fixes #2149.
When the websocket times out and reconnects, such as after a laptop is put to sleep or a phone changes networks, HubConnection reconnects with a new connection ID. But the connection ID is what is used on the server side to determine who should receive project-change notifications. So whenever the HubConnection reconnects, it needs to resubscribe to the project-change notification events, which will send the server its new connection ID.