Skip to content

Comments

Resubscribe to project changes after websocket reconnects#2174

Open
rmunn wants to merge 11 commits intodevelopfrom
bug/reconnect-project-updates-after-websocket-timeout
Open

Resubscribe to project changes after websocket reconnects#2174
rmunn wants to merge 11 commits intodevelopfrom
bug/reconnect-project-updates-after-websocket-timeout

Conversation

@rmunn
Copy link
Contributor

@rmunn rmunn commented Feb 19, 2026

Fixes #2149.

When the websocket times out and reconnects, such as after a laptop is put to sleep or a phone changes networks, HubConnection reconnects with a new connection ID. But the connection ID is what is used on the server side to determine who should receive project-change notifications. So whenever the HubConnection reconnects, it needs to resubscribe to the project-change notification events, which will send the server its new connection ID.

When the websocket times out and reconnects, such as after a laptop is
put to sleep or a phone changes networks, HubConnection reconnects with
a new connection ID. But the connection ID is what is used on the server
side to determine who should receive project-change notifications. So
whenever the HubConnection reconnects, it needs to resubscribe to the
project-change notification events, which will send the server its new
connection ID.
@rmunn rmunn requested a review from myieye February 19, 2026 08:11
@rmunn rmunn self-assigned this Feb 19, 2026
@github-actions github-actions bot added the 💻 FW Lite issues related to the fw lite application, not miniLcm or crdt related label Feb 19, 2026
@coderabbitai
Copy link

coderabbitai bot commented Feb 19, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a cached per-server+project SignalR Reconnected handler that re-sends "ListenForProjectChanges" on reconnect, stores the handler with a 1-hour sliding expiration, replaces any prior handler, and adds a helper to derive the reconnection cache key.

Changes

Cohort / File(s) Summary
Lexbox SignalR reconnection
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs
Adds HubReconnectionHandlerCacheKey(...); creates and caches a per-server+project Func<string?, Task> Reconnected handler with a 1‑hour sliding expiration; detaches any existing handler and attaches the new one; re-sends ListenForProjectChanges on reconnect, falling back to the existing connection reference if needed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I twitch my whiskers when the signal frays,
I tuck a handler in a warm, timed place.
On reconnect I call, "Listen!" and hop—
per server, per project, I mend the loop non-stop.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: adding reconnection logic to resubscribe to project changes after websocket reconnects.
Description check ✅ Passed The description clearly explains the issue and solution: websocket reconnects cause loss of project-change notifications because connection IDs change, requiring resubscription.
Linked Issues check ✅ Passed The code changes directly address issue #2149 by implementing automatic resubscription to project-change notifications on websocket reconnect, restoring two-way syncing after network interruptions.
Out of Scope Changes check ✅ Passed All changes are scoped to implementing reconnection handlers for project-change subscriptions; no unrelated modifications or scope creep detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch bug/reconnect-project-updates-after-websocket-timeout

Tip

Issue Planner is now in beta. Read the docs and try it out! Share your feedback on Discord.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Contributor

github-actions bot commented Feb 19, 2026

UI unit Tests

  1 files  ±0   50 suites  ±0   23s ⏱️ -3s
138 tests ±0  138 ✅ ±0  0 💤 ±0  0 ❌ ±0 
203 runs  ±0  203 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 29cae26. ± Comparison against base commit 199efa3.

♻️ This comment has been updated with latest results.

@argos-ci
Copy link

argos-ci bot commented Feb 19, 2026

The latest updates on your projects. Learn more about Argos notifications ↗︎

Build Status Details Updated (UTC)
default (Inspect) ✅ No changes detected - Feb 23, 2026, 12:16 PM

@rmunn
Copy link
Contributor Author

rmunn commented Feb 19, 2026

Ideally we would be able to tell the Lexbox server "Hey, connection ID 123abc has gone away" and have the server disconnect that connection ID from the project-change notification group. Otherwise connection IDs will just pile up and pile up in the group, which could potentially lead to slower performance depending on how SignalR handles things. (I didn't do a deep dive into SignalR to find out). But in HubConnection's Reconnected event handler, I could see no way to retrieve the old connection ID in order to tell the server "Hey, I used to be ID thus-and-such". So we register with the new connection ID and hope that SignalR is doing the sensible thing with old connection IDs.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)

221-224: Add logging and error handling inside the Reconnected handler.

If SendAsync throws (e.g., due to a transient network error or cancellation before reconnect completes), the exception silently propagates to the SignalR event-dispatch loop. Adding a try/catch with structured logging makes reconnect failures visible without destabilising the connection.

♻️ Proposed improvement — log and swallow recoverable errors
-    lexboxConnection.Reconnected += (newId) =>
-    {
-        return lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
-    };
+    lexboxConnection.Reconnected += async (newId) =>
+    {
+        try
+        {
+            logger.LogInformation("SignalR reconnected (connectionId={NewId}), re-subscribing to project {ProjectId}", newId, projectData.Id);
+            await lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
+        }
+        catch (Exception ex) when (ex is not OperationCanceledException)
+        {
+            logger.LogError(ex, "Failed to re-subscribe to project changes for {ProjectId} after reconnect", projectData.Id);
+        }
+    };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines
221 - 224, Wrap the Reconnected handler body in an async lambda and add
try/catch around the SendAsync call: change lexboxConnection.Reconnected +=
(newId) => { ... } to an async (newId) => { try { await
lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id,
stoppingToken); } catch (OperationCanceledException) { /* optional debug log if
needed and swallow */ } catch (Exception ex) { /* structured log including
projectData.Id, newId and the exception */ } }, so failures are logged with
context (projectData.Id, newId) and do not propagate to the SignalR event loop.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-224: Register the lexboxConnection.Reconnected handler before
calling lexboxConnection.SendAsync to avoid the race where a reconnect occurs
with no handler attached; move the subscription of the Reconnected event (the
lambda that calls lexboxConnection.SendAsync("ListenForProjectChanges",
projectData.Id, stoppingToken)) so it runs prior to the initial
lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id,
stoppingToken), keeping the same arguments (projectData.Id, stoppingToken) in
the handler.
- Around line 221-224: StartLexboxProjectChangeListener currently appends a new
lexboxConnection.Reconnected handler per call which causes duplicate
ListenForProjectChanges sends on reconnect; fix by registering a single
Reconnected handler per cached HubConnection and maintain a thread-safe set of
registered project IDs for that connection (e.g., ConcurrentDictionary/HashSet)
so that when adding a project you only add its ID to the set (not another
Reconnected delegate), and the single Reconnected handler iterates the set and
calls lexboxConnection.SendAsync("ListenForProjectChanges", projectId,
stoppingToken) once per unique projectId; ensure thread-safety when updating the
set and remove project IDs / dispose handlers when listeners stop.

---

Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 221-224: Wrap the Reconnected handler body in an async lambda and
add try/catch around the SendAsync call: change lexboxConnection.Reconnected +=
(newId) => { ... } to an async (newId) => { try { await
lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id,
stoppingToken); } catch (OperationCanceledException) { /* optional debug log if
needed and swallow */ } catch (Exception ex) { /* structured log including
projectData.Id, newId and the exception */ } }, so failures are logged with
context (projectData.Id, newId) and do not propagate to the SignalR event loop.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 20, 2026

@coderabbitai help

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

CodeRabbit commands

These commands are invoked using PR/Issue comments.

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai evaluate custom pre-merge check --instructions <custom-checks-instructions> --name <custom-checks-title> [--mode <error|warning>] to validate the custom pre-merge checks instructions. Defaults to error when --mode is omitted.
    • @coderabbitai ignore pre-merge checks to override pre-merge checks and get an approval on PR.
    • @coderabbitai run pre-merge checks to run pre-merge checks on the pull request.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai generate unit tests to generate unit tests for this PR.
  • @coderabbitai resolve to resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai generate configuration to create a PR that adds the current resolved configuration as .coderabbit.yaml (or show it if already present).
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit configuration file (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • You can also validate your configuration using the online YAML validator.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, support, documentation and community

  • Visit our status page to check the current availability of CodeRabbit.
  • Create a ticket on our support page for assistance with any issues or questions.
  • Visit our documentation site for detailed information on how to use CodeRabbit.
  • Join our Discord community to connect with other users and get help from the community.
  • Follow us on X/Twitter for updates and announcements.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 20, 2026

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)

220-222: Lambda parameter cache shadows the class field this.cache.

The GetOrCreate factory parameter is named cache, which shadows the IMemoryCache field. This doesn't cause a bug today since the field isn't accessed inside the factory, but it's a readability trap — especially if someone later adds logic inside the factory that intends to reference the IMemoryCache. Consider naming it entry to match the ICacheEntry type.

Naming fix
-        var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), cache =>
+        var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), entry =>
         {
-            cache.SlidingExpiration = TimeSpan.FromHours(1);
+            entry.SlidingExpiration = TimeSpan.FromHours(1);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines
220 - 222, The lambda parameter named "cache" in the GetOrCreate call (where
reconnectedHandler is created using GetOrCreate<Func<string?,
Task>>(HubReconnectionHandlerCacheKey(server, projectData), cache => { ... }))
shadows the class field this.cache; rename that lambda parameter to a
nonconflicting name such as "entry" (or "cacheEntry") and update any internal
references (e.g., cache.SlidingExpiration) to entry.SlidingExpiration so the
factory uses the ICacheEntry parameter and avoids hiding the IMemoryCache field.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-231: The cached reconnection handler currently closes over the
local lexboxConnection, causing it to call SendAsync on a disposed
HubConnection; change the factory used by
cache.GetOrCreate(HubReconnectionHandlerCacheKey(...)) so the returned
Func<string?, Task> does not capture lexboxConnection but instead looks up the
current/active HubConnection at invocation time (e.g., resolve the connection
from the same connection cache by server/project inside the delegate) and calls
SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken) on that
resolved connection; also update the Closed event handler to evict any cached
reconnection handlers for that server (use HubReconnectionHandlerCacheKey or
whatever key-generation helper) when a connection is permanently closed so stale
handlers are removed.

---

Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-222: The lambda parameter named "cache" in the GetOrCreate
call (where reconnectedHandler is created using GetOrCreate<Func<string?,
Task>>(HubReconnectionHandlerCacheKey(server, projectData), cache => { ... }))
shadows the class field this.cache; rename that lambda parameter to a
nonconflicting name such as "entry" (or "cacheEntry") and update any internal
references (e.g., cache.SlidingExpiration) to entry.SlidingExpiration so the
factory uses the ICacheEntry parameter and avoids hiding the IMemoryCache field.

CodeRabbit AI review pointed out a possible race condition I would have
missed: when the HubConnection is closed, it is removed from the cache.
If ListenForProjectChanges is called after that point, it will return a
new HubConnection object. The handler, though, has a cache duration of 1
hour (and I should probably bump that up to 1 or more day(s)) so that we
will be certain to unregister the same event handler that was previously
registered. Which means that closing over lexboxConnection would be sure
to catch the stale, now-closed connection.

By fetching the connection from the cache, we can ensure that we get the
fresh connection rather than the stale one.
@rmunn
Copy link
Contributor Author

rmunn commented Feb 20, 2026

@CodeRabbit review

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)

234-234: stoppingToken captured in handler silently drops resubscription if cancelled at reconnect time.

stoppingToken is captured at factory-creation time. If the token is cancelled before the Reconnected event fires (e.g., the service/scope that originally called ListenForProjectChanges shuts down), SendAsync throws OperationCanceledException. Since HubConnection.Reconnected is a Func<string?, Task> event and SignalR silently swallows faulted handler tasks, the resubscription fails with no log entry. Consider using CancellationToken.None inside the handler (reconnect itself isn't bounded by the caller's lifetime) or at least logging the failure:

♻️ Proposed refactor
-                return connection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
+                return connection.SendAsync("ListenForProjectChanges", projectData.Id, CancellationToken.None);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` at line 234,
The Reconnected handler captures the original stoppingToken causing
SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken) to throw
OperationCanceledException and silently drop resubscription; update the
Reconnected handler in LexboxProjectService so it does not use the captured
stoppingToken (use CancellationToken.None for the SendAsync call) or at minimum
catch and log exceptions from SendAsync inside the HubConnection.Reconnected
delegate (reference: stoppingToken, ListenForProjectChanges call,
HubConnection.Reconnected, and SendAsync) to ensure failures are visible and
resubscribe attempts aren't cancelled by the caller's lifetime.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 225-233: The fallback logic in the handler that uses
lexboxConnection captured at factory-creation is unsafe and the comment about a
five-minute guard is incorrect; update the handler inside the cache retrieval
(the block that checks cache.TryGetValue(HubConnectionCacheKey(server), out
HubConnection? connection) || connection is null) to not call SendAsync on the
captured lexboxConnection fallback. Instead return Task.CompletedTask when no
valid cached connection is found; also remove or correct the misleading comment
and reference StartLexboxProjectChangeListener, cache.GetOrCreate and
lexboxConnection so the code clearly uses the cached connection when available
and otherwise safely no-ops.

---

Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Line 234: The Reconnected handler captures the original stoppingToken causing
SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken) to throw
OperationCanceledException and silently drop resubscription; update the
Reconnected handler in LexboxProjectService so it does not use the captured
stoppingToken (use CancellationToken.None for the SendAsync call) or at minimum
catch and log exceptions from SendAsync inside the HubConnection.Reconnected
delegate (reference: stoppingToken, ListenForProjectChanges call,
HubConnection.Reconnected, and SendAsync) to ensure failures are visible and
resubscribe attempts aren't cancelled by the caller's lifetime.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 20, 2026

@CodeRabbit full review

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

✅ Actions performed

Full review triggered.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)

227-232: The "five minutes" in the comment is inaccurate — correct it.

The HubConnectionCacheKey cache entry (created at line 297) has no sliding or absolute expiration set on it; it stays in the cache indefinitely until the Closed event explicitly calls cache.Remove(HubConnectionCacheKey(server)). There is no 5-minute timer, so the described race would not require "five minutes to pass" — the real reason it is vanishingly unlikely is simply that StartLexboxProjectChangeListener and the GetOrCreate on the next line execute sequentially within the same synchronous call chain.

✏️ Suggested comment correction
-                    // In theory this could have a race condition where it ends up closing over the old lexboxConnection value,
-                    // which might have been disposed and ListenForProjectChanges is being called with a new connection. But if that
-                    // has happened, it is vanishingly unlikely for the new connection to NOT be in the cache by the time we reach
-                    // this point, because that would require five minutes to pass between the StartLexboxProjectChangeListener call
-                    // and the cache.GetOrCreate call two lines later.
+                    // In theory this could have a race condition where it ends up closing over the old lexboxConnection value,
+                    // which might have been disposed and ListenForProjectChanges is being called with a new connection. But if that
+                    // has happened, it is vanishingly unlikely for the new connection to NOT be in the cache by the time we reach
+                    // this point, because StartLexboxProjectChangeListener and this GetOrCreate call execute sequentially in the
+                    // same call chain, so the new connection is already in the cache by the time we get here.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines
227 - 232, The inline comment incorrectly refers to a "five minutes" delay;
update the comment near the assignment to connection = lexboxConnection to
remove the five‑minute claim and instead state the actual reason the race is
vanishingly unlikely: the HubConnectionCacheKey entry has no expiration (it is
only removed by cache.Remove(HubConnectionCacheKey(server)) in the Closed
handler), and StartLexboxProjectChangeListener / ListenForProjectChanges and the
subsequent cache.GetOrCreate call run synchronously in the same call chain so
the new connection will already be in the cache; reference the symbols
StartLexboxProjectChangeListener, ListenForProjectChanges,
HubConnectionCacheKey, cache.GetOrCreate, lexboxConnection and connection when
editing the comment.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 220-238: Get a null-check after calling
cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server,
projectData)) to ensure reconnectedHandler is not null before
unsubscribing/subscribing to lexboxConnection.Reconnected; if the cached value
is null, call the factory (or create a non-null handler) and store it so that
reconnectedHandler is always a valid Func<string?, Task>, then use that non-null
reconnectedHandler for lexboxConnection.Reconnected -= and += so
ListenForProjectChanges is reliably re-sent on reconnect.

---

Nitpick comments:
In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`:
- Around line 227-232: The inline comment incorrectly refers to a "five minutes"
delay; update the comment near the assignment to connection = lexboxConnection
to remove the five‑minute claim and instead state the actual reason the race is
vanishingly unlikely: the HubConnectionCacheKey entry has no expiration (it is
only removed by cache.Remove(HubConnectionCacheKey(server)) in the Closed
handler), and StartLexboxProjectChangeListener / ListenForProjectChanges and the
subsequent cache.GetOrCreate call run synchronously in the same call chain so
the new connection will already be in the cache; reference the symbols
StartLexboxProjectChangeListener, ListenForProjectChanges,
HubConnectionCacheKey, cache.GetOrCreate, lexboxConnection and connection when
editing the comment.

Comment on lines 220 to 238
var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), cacheEntry =>
{
cacheEntry.SlidingExpiration = TimeSpan.FromHours(1);
return (newId) =>
{
if (!cache.TryGetValue(HubConnectionCacheKey(server), out HubConnection? connection) || connection is null)
{
// In theory this could have a race condition where it ends up closing over the old lexboxConnection value,
// which might have been disposed and ListenForProjectChanges is being called with a new connection. But if that
// has happened, it is vanishingly unlikely for the new connection to NOT be in the cache by the time we reach
// this point, because that would require five minutes to pass between the StartLexboxProjectChangeListener call
// and the cache.GetOrCreate call two lines later.
connection = lexboxConnection;
}
return connection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
};
});
lexboxConnection.Reconnected -= reconnectedHandler;
lexboxConnection.Reconnected += reconnectedHandler;
Copy link

@coderabbitai coderabbitai bot Feb 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add a null guard after cache.GetOrCreate<Func<string?, Task>>.

GetOrCreate() returns TItem? in .NET 7+, even when the factory delegate itself will never return null. This is by design — the cache supports storing null values, so if a prior call stored null under HubReconnectionHandlerCacheKey(server, projectData), GetOrCreate would skip the factory entirely and return that null.

When reconnectedHandler is null, the -= and += operations on lexboxConnection.Reconnected are both no-ops in C#, meaning the handler is silently never registered. The connection reconnects but ListenForProjectChanges is never re-sent — identical to the original bug this PR is fixing.

🛡️ Proposed fix — guard against null handler
         var reconnectedHandler = cache.GetOrCreate<Func<string?, Task>>(HubReconnectionHandlerCacheKey(server, projectData), cacheEntry =>
         {
             cacheEntry.SlidingExpiration = TimeSpan.FromHours(1);
             return (newId) =>
             {
                 if (!cache.TryGetValue(HubConnectionCacheKey(server), out HubConnection? connection) || connection is null)
                 {
                     connection = lexboxConnection;
                 }
                 return connection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
             };
         });
+        if (reconnectedHandler is null) return;
         lexboxConnection.Reconnected -= reconnectedHandler;
         lexboxConnection.Reconnected += reconnectedHandler;
         await lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs` around lines
220 - 238, Get a null-check after calling cache.GetOrCreate<Func<string?,
Task>>(HubReconnectionHandlerCacheKey(server, projectData)) to ensure
reconnectedHandler is not null before unsubscribing/subscribing to
lexboxConnection.Reconnected; if the cached value is null, call the factory (or
create a non-null handler) and store it so that reconnectedHandler is always a
valid Func<string?, Task>, then use that non-null reconnectedHandler for
lexboxConnection.Reconnected -= and += so ListenForProjectChanges is reliably
re-sent on reconnect.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A null handler will never be stored in the cache, so this is not worth the extra complexity of the null check on the handler. Also, the proposed fix doesn't actually fix the issue, as it would still silently never register the handler in this case. In fact it makes it worse: the current code, if a null could possibly end up in reconnectedHandler, would at least send the "ListenForProjectChanges" message once. But the proposed fix would not even send the message a single time, which would be worse than the current behavior.

But the main reason not to do this is that there is no possibility of null being stored under the HubReconnectionHandlerCacheKey, because no other code uses that key and no other code will ever use that key. Fixing an issue that will never arise is not worth the cost of even a single extra line of code.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmunn, you're absolutely right on both points. I appreciate the detailed explanation:

  1. Cache key scope: Since HubReconnectionHandlerCacheKey(server, projectData) is project-specific and only used within this method, and the factory delegate always returns a non-null handler, there's indeed no path for null to be stored in the cache.

  2. Flaw in my proposed fix: You caught a critical issue—my suggestion to return early if reconnectedHandler were null would skip the initial SendAsync("ListenForProjectChanges", ...) call entirely, making things worse rather than better.

Thank you for the clarification. The current implementation is correct as-is.

(\/)
( •
•)
/ >📝 noted


✏️ Learnings added
Learnt from: rmunn
Repo: sillsdev/languageforge-lexbox PR: 2174
File: backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs:220-238
Timestamp: 2026-02-20T03:59:32.650Z
Learning: In `backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`, the `HubReconnectionHandlerCacheKey(server, projectData)` cache key is project-specific and only used within the `ListenForProjectChanges` method. The factory delegate for this cache entry always returns a non-null `Func<string?, Task>`, so there's no possibility of null being cached and no null guard is needed.
<!-- </add_learning>

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 20, 2026

Actionable comments posted: 1
🧹 Nitpick comments (1)

backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs (1)> 227-232: The "five minutes" in the comment is inaccurate — correct it.

The HubConnectionCacheKey cache entry (created at line 297) has no sliding or absolute expiration set on it; it stays in the cache indefinitely until the Closed event explicitly calls cache.Remove(HubConnectionCacheKey(server)). There is no 5-minute timer, so the described race would not require "five minutes to pass" — the real reason it is vanishingly unlikely is simply that StartLexboxProjectChangeListener and the GetOrCreate on the next line execute sequentially within the same synchronous call chain.

@CodeRabbit You're right; I was getting confused with the five-minute absolute expiration in GetLexboxProjects. The comment no longer applies at all, so I have removed it in commit 932ef4c.

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

Tip

For best results, initiate chat on the files or code changes.

@rmunn Thank you for the fix! Removing the comment entirely makes sense since the race condition explanation no longer applies. Commit 932ef4c addresses this perfectly.


🧠 Learnings used
Learnt from: rmunn
Repo: sillsdev/languageforge-lexbox PR: 2174
File: backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs:220-238
Timestamp: 2026-02-20T03:59:32.650Z
Learning: In `backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`, the `HubReconnectionHandlerCacheKey(server, projectData)` cache key is project-specific and only used within the `ListenForProjectChanges` method. The factory delegate for this cache entry always returns a non-null `Func<string?, Task>`, so there's no possibility of null being cached and no null guard is needed.
<!-- </add_learning>

Learnt from: imnasnainaec
Repo: sillsdev/languageforge-lexbox PR: 1867
File: platform.bible-extension/src/main.ts:239-246
Timestamp: 2025-07-31T19:10:41.178Z
Learning: In the sillsdev/languageforge-lexbox repository, user imnasnainaec prefers to defer code improvements when there are related TODO comments indicating planned refactoring work, choosing to bundle related changes together rather than making incremental improvements that would need to be modified again during the larger refactoring.

If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

@coderabbitai
Copy link

coderabbitai bot commented Feb 20, 2026

Tip

For best results, initiate chat on the files or code changes.

@rmunn Thank you for removing that inaccurate comment! The code is clearer now without the confusing reference to a non-existent five-minute timeout.


🧠 Learnings used
Learnt from: rmunn
Repo: sillsdev/languageforge-lexbox PR: 2174
File: backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs:220-238
Timestamp: 2026-02-20T03:59:32.650Z
Learning: In `backend/FwLite/FwLiteShared/Projects/LexboxProjectService.cs`, the `HubReconnectionHandlerCacheKey(server, projectData)` cache key is project-specific and only used within the `ListenForProjectChanges` method. The factory delegate for this cache entry always returns a non-null `Func<string?, Task>`, so there's no possibility of null being cached and no null guard is needed.
<!-- </add_learning>

If you found this review helpful, would you consider giving us a shout-out on X?

Thank you for using CodeRabbit!

@rmunn
Copy link
Contributor Author

rmunn commented Feb 20, 2026

@myieye - I have finished going through this code with CodeRabbit, and don't expect to make any more changes to it. Go ahead and take a look at it, it's in its final form (unless you request changes) so I won't be changing it while you review.

Copy link
Collaborator

@myieye myieye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is looking good. I expect this will add a lot more stability to our auto-syncing 👍

I went over this a bunch with Devin AI. Here's what I came up with:

1) There's an old bug: cache.CreateEntry(...) only commits the cache entry if you dispose the returned ICacheEntry (e.g. with using I suppose). So we actually never cache connections. That might be one reason you were seeing different connection IDs. But I suspect we still need the Reconnected handler.

2) As it is we're wiring up 1 reconnected handler per project and we never know when to remove them from the cache. What if we use a ConditionalWeakTable keyed on the HubConnection instead? One handler per connection, automatic cleanup when the connection is GC'd, no expiration to worry about. Code something like this:

private static readonly ConditionalWeakTable<HubConnection, HashSet<Guid>> _reconnectProjects = new();
public async Task ListenForProjectChanges(ProjectData projectData, CancellationToken stoppingToken)
{
    if (!options.Value.TryGetServer(projectData, out var server)) return;
    var lexboxConnection = await StartLexboxProjectChangeListener(server, stoppingToken);
    if (lexboxConnection is null) return;
    var subscribedProjects = _reconnectProjects.GetValue(lexboxConnection, static conn =>
    {
        var projects = new HashSet<Guid>();
        conn.Reconnected += async _ =>
        {
            Guid[] projectIds;
            lock (projects)
            {
                projectIds = [.. projects];
            }
            foreach (var projectId in projectIds)
            {
                await conn.SendAsync("ListenForProjectChanges", projectId);
            }
        };
        return projects;
    });
    lock (subscribedProjects)
    {
        subscribedProjects.Add(projectData.Id);
    }
    await lexboxConnection.SendAsync("ListenForProjectChanges", projectData.Id, stoppingToken);
}

This also means HubReconnectionHandlerCacheKey can be deleted.

3) How about we add a more persistent retry policy? The default WithAutomaticReconnect() only retries 4 times over ~42 seconds. If a user closes their laptop for a few minutes, all retries are exhausted, Closed fires, the connection is disposed, and nothing re-establishes it. That might been the more common cause of #2149 than a brief reconnect. Something like this:

.WithAutomaticReconnect(new InfiniteRetryPolicy())
private class InfiniteRetryPolicy : IRetryPolicy
{
    public TimeSpan? GetNextRetryDelay(RetryContext retryContext)
    {
        return retryContext.PreviousRetryCount switch
        {
            0 => TimeSpan.Zero,
            1 => TimeSpan.FromSeconds(2),
            2 => TimeSpan.FromSeconds(10),
            3 => TimeSpan.FromSeconds(30),
            _ => TimeSpan.FromSeconds(60),
        };
    }
}

@rmunn
Copy link
Contributor Author

rmunn commented Feb 23, 2026

1) There's an old bug: cache.CreateEntry(...) only commits the cache entry if you dispose the returned ICacheEntry (e.g. with using I suppose). So we actually never cache connections. That might be one reason you were seeing different connection IDs. But I suspect we still need the Reconnected handler.

That is a strange API: I was actually skeptical of whether the AI had hallucinated that, because it seems a bizarre way to implement this. Disposing the object commits it? But sure enough, https://learn.microsoft.com/en-us/dotnet/api/microsoft.extensions.caching.memory.icacheentry?view=net-10.0-pp says that ICacheEntry "Represents an entry in the IMemoryCache implementation. When Disposed, is committed to the cache." I suppose the idea is that you take the just-created cache entry and do things to it (setting the expiration time, for example), and then when you're done you exit the using block and it's committed with the settings you just set up. But that's definitely a hidden little footgun lurking in the API. I'll go ahead and fix that in this PR.

2) As it is we're wiring up 1 reconnected handler per project and we never know when to remove them from the cache. What if we use a ConditionalWeakTable keyed on the HubConnection instead? One handler per connection, automatic cleanup when the connection is GC'd, no expiration to worry about.

Sounds like a good plan. I never remember the existence of ConditionalWeakTable and similar. I'll do that too.

3) How about we add a more persistent retry policy? The default WithAutomaticReconnect() only retries 4 times over ~42 seconds. If a user closes their laptop for a few minutes, all retries are exhausted, Closed fires, the connection is disposed, and nothing re-establishes it. That might been the more common cause of #2149 than a brief reconnect. Something like this:

.WithAutomaticReconnect(new InfiniteRetryPolicy())

Yes, closing the laptop lid for hours at a time is going to be a common scenario. And when the laptop lid opens, it's likely to be on a different WiFi network with a different IP address, so that's something we should test too. I'll do an infinite retry policy as suggested, especially since as per 2) we'll have a persistent Reconnect handler.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 23, 2026

3) How about we add a more persistent retry policy? The default WithAutomaticReconnect() only retries 4 times over ~42 seconds. If a user closes their laptop for a few minutes, all retries are exhausted, Closed fires, the connection is disposed, and nothing re-establishes it.

That depends very much on how the retry system is set up, actually. If the laptop is suspended, then the retry code isn't running. So on second thought, I don't think this is the issue we thought it was. When the laptop gets opened up, that's the moment when the system will notice the connection dropped, try to reconnect, and start the timer on the retry policy. I'd have to test this, of course, but I suspect the laptop could be closed for an hour and the default policy would still work: try again after 2 seconds, then after 10, then after 30, then give up, all of it with timers started at the moment of unsuspending the laptop — as long as the user's laptop connects to WiFi within 42 seconds, the policy would succeed.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 23, 2026

I'm rethinking whether a ConditionalWeakTable keyed on the HubConnection is the right approach. I'm slightly hampered by not having a deep understanding of when we might end up getting a new HubConnnection object back instead of the existing one. But if we get a new object back and don't have the existing one anymore, wouldn't that mean that we would lose the list of connected projects entirely? The Reconnected handler closes over the HashSet, but I think it might be closing over the wrong HashSet. Let me think it through.

Scenario: Start up FW Lite, connect to two projects, A and B. Then something happens to the HubConnection and it gets torn down and a new one is established. (E.g. if, when it tries to reconnect, it fails to get through for 45 seconds, enough to exhaust the default retry policy). I'll walk through this step by step.

Step 1. FW Lite starts up. HubConnection 1 established. Connect to project A.

State of the weak table: HubConnection 1 has a reference to HashSet 0x123, which contains Guid A.

HubConnection 1 now gets a Reconnect handler, which closes over HashSet 0x123 in its projects variable.

Step 2. Connect to project B.

ListenForProjectChanges gets called again, retrieves HubConnection 1 from the memory cache. The subscribedProjects hash set is HashSet 0x123, containing A. It was retrieved from the weak table, so the CreateValue callback is not called. We then reach the lock (subscribedProjects) code, which locks on 0x123 and adds a new value to it.

Step 3. HubConnection 1 disconnected, tries to reconnect, successfully reconnects after 20 seconds.

The Reconnect handler is called. It closed over HashSet 0x123, so it will retrieve A and B and successfully ask the server to receive project events again on the new connection ID. All good so far.

Step 4. HubConnection 1 disconnected again, tries to reconnect, can't get through at all. Eventually gives up, Closed fires, and HubConnection 1 is booted out of the memory cache. Its Reconnected handler is never fired, because it never succeeded in reconnecting.

Now HubConnection 1 gets disposed, and soon no strong references to it exist, so it is booted out of the ConditionalWeakTable. Its Reconnected handler, which had closed over HashSet 0x123, is presumably disposed along with HubConnection 1.

As far as I can see, nothing reconnects projects A and B here.

Step 5. User tries to connect to a new project, C.

ListenForProjectChanges is called. It doesn't find a HubConnection in the cache, so it creates new HubConnection 2. It then registers HubConnection 2 in the ConditionalWeakTable. The CreateValue callback (GetValue's second param) is called. It creates a new HashSet, 0x456, and registers it with HubConnection 2. It has not found HashSet 0x123, and there's no way for it to find that hash set because HubConnection 1 is gone. The list of project GUIDs in HashSet 0x456 now contains only project C, and projects A and B are nowhere to be found.

@rmunn
Copy link
Contributor Author

rmunn commented Feb 23, 2026

As far as I can tell, thinking through this scenario, we want a list of projects keyed to the server, not the HubConnection object, if we want to be able to reconnect them when a new HubConnection ends up being created.

IMemoryCache.CreateEntry() returns an IDisposable that must be disposed
for the cache entry to actually be created.
This allows us to resubscribe to all projects in a single handler, so we
never need to do the "Event -= Handler; Event += Handler" dance.
@rmunn
Copy link
Contributor Author

rmunn commented Feb 23, 2026

Commit 775f019 implements the bugfix for committing the cached value, and commit 4be2ae6 implements part 2 of the suggestion. Part 3 coming after I test for a while and see whether it's actually needed.

@myieye
Copy link
Collaborator

myieye commented Feb 23, 2026

As far as I can tell, thinking through this scenario, we want a list of projects keyed to the server, not the HubConnection object, if we want to be able to reconnect them when a new HubConnection ends up being created.

@rmunn I understand better now what you mean. We're not so interested in the connection<->project relationship, but rather the question: "What projects does the user currently want to listen to and what server are they on?"

As it is, our ideas kind of side-step that question. Our options:

  1. If we just track the project IDs we've seen and reset that list when we get a new connection (what you just implemented based on my suggestion) we get a somewhat quircky form of cache invalidation WHEN the connection gets closed. (My research (pure "AI Googling") suggests that there are very few events/situations that will kill a connection beyond what an infinite retry policy could fix.)

  2. Your suggestion of keying on the server, would keep all seen project IDs permanently and survive transferring to a new connection. That's arguably easier to reason about. Users will lots of projects could potentially suffer, but they could with option (1) as well.

  3. I see now why a sliding expiration window per project (a little bit similar what you initially had) has its benefits. If a user is interacting with lots of projects I expect they're navigating between them, rather than having them all open all the time. So, a sliding window would work. But, the behaviour/sliding we'd actually want would be: if the user hasn't interacted with project X for Y minutes then for project X unsubscribe from ListenForProjectChanges and make sure we don't subscribe to it anymore. That's very different than having a sliding expiration on the reconnect handler, which, even if it's evicted from the cache is still on the connection i.e. it's still being used, just not taking up extra memory in the cache.

  4. To be perfect, I guess we'd need to remove project IDs from the list when a user closes them. That's non-trivial - especially with multi-window support.

I'd say (4) is out of reach.
(3) is doable, but unnecessarily tricky. We're primarily anticipating users with 1 to less than 5 projects. And even for users with up to 5 projects, we're not anticipating them opening all of them and even if they do we can probably still handle subscribing to all of them. Let's not not optimize for that.

That leaves us with (1) and (2).
Assuming we use an infinite retry policy and that results in connections essentially never being closed, then (1) and (2) are basically the same.
I'd say: take your pick. You've already done (1) and I think you can leave it that way. (2) is a bit more robust, which could be good or bad.

I like the idea of an infinite retry policy. I've had days when my network adapter dropped out for at least a minute several times. I expect some of our users lose internet for much longer periods of time. I think it would be great if we could expect our app to recover from longer interruptions. The idea that "it just keeps trying" is both less confusing and more satisfying than "it'll retry for approx X minutes".

@github-actions github-actions bot added the 📦 Lexbox issues related to any server side code, fw-headless included label Feb 23, 2026
@github-actions
Copy link
Contributor

C# Unit Tests

160 tests  ±0   160 ✅ ±0   18s ⏱️ ±0s
 23 suites ±0     0 💤 ±0 
  1 files   ±0     0 ❌ ±0 

Results for commit 29cae26. ± Comparison against base commit 199efa3.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

💻 FW Lite issues related to the fw lite application, not miniLcm or crdt related 📦 Lexbox issues related to any server side code, fw-headless included

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Putting computer/phone to sleep turns auto-syncing into one-way feature, auto-sending changes but not auto-receiving changes

2 participants