Ensure calls to OOP for source-generator purposes see the exact same Solution-snapshot-instances across all calls they make #81529

CyrusNajmabadi · 2025-12-03T12:55:32Z

This one was a bit of a doozy. While investigating, it seemed like the only possible way for us to hit this crash on the OOP side was if the calls within TryComputeNewGeneratorInfoInRemoteProcessAsync to GetSourceGeneratedDocumentInfoAsync and GetContentsAsync were operating on different solution instances.

However, that was supposed to not happen, due to us creating an OOP KeepAliveSession prior to both calls to 'pin' that solution on the OOP side.

However, looking more closely at KeepAliveSession, i realized there were two major problems with how we were using it.

First, the solution we were asking it to pin in OOP wasn't hte same as the solution we were then asking for in those two calls. Specifically, the KeepAliveSession was pinning the full solution, while the individual calls were asking for Project-Cone subsets of hte solution. As nothing was pinning those, it was possible for the individual calls to get different instances, allowing them to be out of sync.

Second, KeepAliveSession was originally written as a 'best effort' helper. It would make a call over to OOP to pin the solution, but then return immediately, prior to the actual solution being fully remoted and actually pinned. This was fine for most features, as this was only intended to help speed things up. But it was not ok for the source-gen case as we are depending on the invariant that the project-cone actually be pinned, so that the next two calls must see that exact instance.

This PR fixes both. First, the KeepAliveSession we now make is specific to the project-cone we need pinned. Second, the KeepAliveSession system has been fixed up so that now it only returns back to the caller once the actual solution is pinned on the OOP side, ensuring that it must be there.

CyrusNajmabadi · 2025-12-03T12:56:29Z

@jasonmalinowski @dibarbet ptal. IN particular, with a strong eye towards raceyness and invariants.

CyrusNajmabadi · 2025-12-03T12:59:21Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+        // cancellation token triggers.  Note: we pass the keepAliveCancellationTokenSource.Token in here.  We want
+        // disposing the returned RemoteKeepAliveSession to be the thing that cancels this work.
+        _ = InvokeKeepAliveAsync(
+            compilationState, projectId, client, nextSessionId, keepAliveCancellationTokenSource.Token);


note: the first thing this helper does is yield, ensuring we do proceed and concurrently kick off the call to WaitForSessionIdAsync.

CyrusNajmabadi · 2025-12-03T13:00:22Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+                compilationState,
+                projectId,
+                (service, _, cancellationToken) => service.WaitForSessionIdAsync(nextSessionId, cancellationToken),
+                callerCancellationToken).ConfigureAwait(false);


note the subtlety of using callerCancellationToken here, but keepAliveCancellationTokenSource.Token in the first OOP call this is concurrent with. The KeepAliveSession (The value we return) controls the lifetime of that first call), whereas the client calling into KeepAliveSession.CreateAsync controls this call to wait for the session to be established.

CyrusNajmabadi · 2025-12-03T13:02:50Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+        catch when (CancelKeepAliveSession(keepAliveCancellationTokenSource))
+        {
+            throw ExceptionUtilities.Unreachable();
+        }


i considered using a NoThrowAwaitable here and returning a new RemoteKeepAliveSession that would dispose teh cancellation token controlling KeepAliveAsync. However, i didn't like that as it meant the caller could cancel, and have this return normally, making it so that it would then execute the next statement after using var session = KeepAliveSession.Create(..., ct);. I don't like apis that return gracefully on cancellation as that means the caller must then immediately check the ct afterwards.

But we had to be delicate here. Because the caller could cancel, we def need to make sure that we cancel the work to sync the solution over. If we don't do that, and we bubbled out the cancellation exception, we would never release that solution on the OOP side, which would be super bad :)

CyrusNajmabadi · 2025-12-03T13:21:59Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+
+        // Now kick off the keep-alive work.  We don't wait on this as this will stick on the OOP side until the
+        // cancellation token triggers.  Note: we pass the keepAliveCancellationTokenSource.Token in here.  We want
+        // disposing the returned RemoteKeepAliveSession to be the thing that cancels this work.


tried to be very clear in docs what is going on.

These docs are fantastic.

CyrusNajmabadi · 2025-12-03T13:22:33Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+        // session itself (which is critical for ensuring that we either stop syncing the solution/project-cone over, or
+        // that we allow it to be released on the oop side), and bubble the exception outwards to the caller to handle
+        // as they see fit.
+        catch when (DisposeKeepAliveSession(session))


intentionally naked catch. Any exception makes us Dispose the session. and since we return 'false' from DisposeKeepAliveSession, we'll just bubble that exception up.

CyrusNajmabadi · 2025-12-04T22:48:34Z

@jasonmalinowski @dibarbet ptal.

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

dibarbet · 2025-12-05T00:30:00Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs


-            // Now kick off the keep-alive work.  We don't wait on this as this will stick on the OOP side until
-            // the cancellation token triggers.
+            // Now kick off the keep-alive work.  We don't wait on this as this will stick on the OOP side until the


could this function be replaced by InvokeKeepAliveAsync?

No. The way "keep alive" works (at a high level" is that we send a call over to oop that effectively "hangs" on the oop side (waiting for cancellation) and does NOT return.

So we can't await it since it truly doesn't return (until the entire keep alive session is disposed).

The change here is to allow a second call to be made (and returned) to know that enough of the first call has gone through to hydrate and pin the solution on the oop side.

dibarbet · 2025-12-05T00:31:42Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

    /// to rebuild any compilations they've already built due to the solution going away and then coming back.
    /// </summary>
+    /// <remarks>
+    /// Subsequent calls to oop made while this session is alive must pass the same pinning data with the remote


is it all OOP calls, or call using the same keep alive session?

It is calls that want to get the solution instance that the keep alive session is pinning.

dibarbet · 2025-12-05T00:41:46Z

src/Workspaces/Remote/ServiceHub/Services/KeepAlive/RemoteKeepAliveService.cs

+        }
+    }
+
+    public async ValueTask WaitForSessionIdAsync(long sessionId, CancellationToken cancellationToken)


What is the scenario where you'd call KeepAliveAsync, but not WaitForSessionIdAsync (could they be combined into one)?

Wouldn't any keep alive session need to wait for the solution to be hydrated?

There's just one (rename), and I'll look into fixing it up. Rename has a synchronous startup. So we can't use async/await.

But rename is only using this to speed things up. Not for correctness. So it's ok if it might see a more than one solution instance in different oop calls. It just won't be as efficient. It won't be wrong.

resolved offline - KeepAliveAsync hangs until the session is disposed of (intentionally), but the caller also needs to observe when the keepalive is 'ready' while the session is in progress - so it needs a separate method to ask this.

Co-authored-by: David Barbet <[email protected]>

CyrusNajmabadi · 2025-12-05T15:40:23Z

@dibarbet another way to think about it. The initial 'keep alive' call (which, again, hangs on the OOP side), effectively takes the snapshot for the requested solution/project-cone, and pins it on the OOP side (effectively it has a ref-count of at least one due to that 'in flight call').

Because of that, the oop snapshot cannot be released, and all subsequent calls will necessarily see that pinned snapshot, increaseing/decreasing its refcount as they proceed (but never going below the ref-count of 1).

Then when the keep-alive-session is disposed, it will cancel the CT on the host side. THis transfers over to the OOP side, which then allows the hung OOP call to finally return. This drops taht ref-count on the OOP side, which means the ref-count can reach zero, and the pinned snapshot can be released in OOP.

jasonmalinowski

I think there is one bug here around the fact we're disposing the CancellationTokenSource we create in the session on the client side -- there's a sequence of events that might touch that token when it's disposed. I believe it's just safe to remove the Dispose() but might have to check with the runtime folks.

Overall the comments here are fantastic and it made it very easy to understand this PR from a perspective of what you're trying to do. The time I had to spend on this was just thinking for races, not understanding the code.

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

jasonmalinowski · 2025-12-05T21:24:18Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+
+        // Now kick off the keep-alive work.  We don't wait on this as this will stick on the OOP side until the
+        // cancellation token triggers.  Note: we pass the keepAliveCancellationTokenSource.Token in here.  We want
+        // disposing the returned RemoteKeepAliveSession to be the thing that cancels this work.


These docs are fantastic.

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

jasonmalinowski · 2025-12-05T21:32:36Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+        _ = InvokeKeepAliveAsync(
+            compilationState, projectId, client, session);


If this fails, what prevents the WaitForSessionIdAsync() from not hanging forever?

This is a very good point. Fixing up.

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

jasonmalinowski · 2025-12-05T21:40:01Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+               compilationState,
+               projectId,
+               (service, solutionInfo, cancellationToken) => service.KeepAliveAsync(solutionInfo, sessionId, cancellationToken),
+               session.KeepAliveTokenSource.Token).ConfigureAwait(false);


I think there's a potential race here. Consider this sequence of events:

Client starts off a keep-alive session, and InvokeKeepAliveAsync runs and does the Yield(). The rest of this method is in the queue, but not running yet.

Client continues and hits the WaitForSessionIdAsync call.

Client triggers it's cancellation token.

The RemoteKeepAliveSession is disposed.

This method finally gets off the queue, and runs. We try to access .Token, but that throws since the CancellationTokenSource is disposed.

Do we need the dispose? Or do we know that at some point the cancellation token will get cancelled in all cases? AFAIK Disposal is only needed if you know you're not going to cancel the token.

AFAIK Disposal is only needed if you know you're not going to cancel the token.

That's good advice. I didn't realize that was a general way to clean things up. Based on that, i'm getting rid of all disposal and documenting accordingly.

jasonmalinowski · 2025-12-05T21:46:03Z

src/Workspaces/Remote/ServiceHub/Services/KeepAlive/RemoteKeepAliveService.cs

            => new RemoteKeepAliveService(arguments);
    }

+    private readonly ConcurrentDictionary<long, TaskCompletionSource<bool>> _sessionIdToCompletionSource = new();


Nit: add a comment the bool isn't being used, this is just to work around the fact this project compiles with netstandard2.0 and we can't use the non-generic form. I'm also not sure if TaskCompletionSource is better for the JIT, since there's always a risk this has to JIT a specialization here rather than using the System.__Canon one.

src/Workspaces/Remote/ServiceHub/Services/KeepAlive/RemoteKeepAliveService.cs

jasonmalinowski · 2025-12-05T21:48:50Z

src/Workspaces/Remote/ServiceHub/Services/KeepAlive/RemoteKeepAliveService.cs

+                // to be collected if not needed anymore.
+                //
+                // This was provided by stoub as an idiomatic way to wait indefinitely until a cancellation token triggers.
+                await Task.Delay(-1, cancellationToken).ConfigureAwait(false);


Isn't there a Timeout.Infinite we can use somewhere for this?

yes there is :)

CyrusNajmabadi · 2025-12-05T23:23:15Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+        // until the actual solution (or project-scope) is actually pinned in OOP.  Note: we pass the caller
+        // cancellation token in here so that if the caller decides they don't need to proceed, we can bail quickly,
+        // without waiting for the solution to sync and for us to wait on that completing.
+        await WaitForSessionIdAsync(compilationState, projectId, client, session, callerCancellationToken).ConfigureAwait(false);


Cleaned this up slightly. Easier to see (imo) what's going on by making this two calls. One fire-and-forget, one-awaited.

CyrusNajmabadi · 2025-12-05T23:24:39Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+                   (service, solutionInfo, cancellationToken) => service.KeepAliveAsync(solutionInfo, sessionId, cancellationToken),
+                   session.KeepAliveTokenSource.Token).ConfigureAwait(false);
+            }
+            catch (Exception ex) when (FatalError.ReportAndCatchUnlessCanceled(ex))


this part was subtle. Ensuring that if we fail for a non-cancellation reason to send the initiak KeepAliveAsync call, that we properly shutdown this session. This ensures cleaning up the WaitForSessionIdAsync call happens and the caller doesn't hang.

CyrusNajmabadi · 2025-12-05T23:25:55Z

src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

+                // inside InvokeKeepAliveAsync), we can bail out of this WaitForSessionIdAsync.
+                using var linkedTokenSource = CancellationTokenSource.CreateLinkedTokenSource(
+                    session.KeepAliveTokenSource.Token,
+                    callerCancellationToken);


this also changed. We want to make sure that cleanup (which can now happen from KeepAliveAsync) ensures that WaitForSessionIdAsync cleans up too. Mixing the two cancellation tokens together ensures that happens.

CyrusNajmabadi added 2 commits December 3, 2025 12:45

In progress

650e003

Ensure more invariants around keep alive sessions

00de932

CyrusNajmabadi requested a review from a team as a code owner December 3, 2025 12:55

github-actions bot added the Area-IDE label Dec 3, 2025

dotnet-policy-service bot added the VSCode label Dec 3, 2025

CyrusNajmabadi requested review from dibarbet and jasonmalinowski December 3, 2025 12:56

Docs

e2bac6d

CyrusNajmabadi commented Dec 3, 2025

View reviewed changes

CyrusNajmabadi added 5 commits December 3, 2025 14:05

Simplify

9b9551c

Simplify

acced7a

Docs

8f2ae53

Docs

415062c

Docs

f57cf7d

CyrusNajmabadi commented Dec 3, 2025

View reviewed changes

CyrusNajmabadi added 2 commits December 3, 2025 14:26

Extract helper

f27009b

Switch to long

8a410b5

dibarbet approved these changes Dec 5, 2025

View reviewed changes

Update src/Workspaces/Core/Portable/Remote/IRemoteKeepAliveService.cs

27fb0ee

Co-authored-by: David Barbet <[email protected]>

dotnet deleted a comment from CyrusNajmabadi Dec 5, 2025

jasonmalinowski approved these changes Dec 5, 2025

View reviewed changes

CyrusNajmabadi added 3 commits December 5, 2025 23:45

Merge remote-tracking branch 'upstream/main' into oopKeepAlive

6ead0ac

Inline

f3b0631

Cleanup

3f88a80

CyrusNajmabadi added 5 commits December 6, 2025 00:11

Docs

6fa0d7d

Docs

48a1ea2

Lint

02445d0

infinite

2653602

Docs

331bc96

CyrusNajmabadi commented Dec 5, 2025

View reviewed changes

CyrusNajmabadi added 2 commits December 6, 2025 00:34

Cleanup

915e85c

Docs

1f200c7

jasonmalinowski approved these changes Dec 5, 2025

View reviewed changes

CyrusNajmabadi added 3 commits December 6, 2025 00:43

Fixed missed disposal

9779def

Simplify

34289e6

Add fatal error

a649e5b

CyrusNajmabadi enabled auto-merge December 6, 2025 00:01

CyrusNajmabadi merged commit 1d8215d into dotnet:main Dec 6, 2025
25 of 26 checks passed

CyrusNajmabadi deleted the oopKeepAlive branch December 6, 2025 14:16

dotnet-policy-service bot added this to the Next milestone Dec 6, 2025

		_ = InvokeKeepAliveAsync(
		compilationState, projectId, client, session);

Ensure calls to OOP for source-generator purposes see the exact same Solution-snapshot-instances across all calls they make #81529

Ensure calls to OOP for source-generator purposes see the exact same Solution-snapshot-instances across all calls they make #81529

Conversation

CyrusNajmabadi commented Dec 3, 2025

Uh oh!

CyrusNajmabadi commented Dec 3, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CyrusNajmabadi commented Dec 4, 2025

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

CyrusNajmabadi commented Dec 5, 2025

Uh oh!

jasonmalinowski left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants