-
Notifications
You must be signed in to change notification settings - Fork 4.2k
Ensure calls to OOP for source-generator purposes see the exact same Solution-snapshot-instances across all calls they make #81529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@jasonmalinowski @dibarbet ptal. IN particular, with a strong eye towards raceyness and invariants. |
| // cancellation token triggers. Note: we pass the keepAliveCancellationTokenSource.Token in here. We want | ||
| // disposing the returned RemoteKeepAliveSession to be the thing that cancels this work. | ||
| _ = InvokeKeepAliveAsync( | ||
| compilationState, projectId, client, nextSessionId, keepAliveCancellationTokenSource.Token); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: the first thing this helper does is yield, ensuring we do proceed and concurrently kick off the call to WaitForSessionIdAsync.
| compilationState, | ||
| projectId, | ||
| (service, _, cancellationToken) => service.WaitForSessionIdAsync(nextSessionId, cancellationToken), | ||
| callerCancellationToken).ConfigureAwait(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note the subtlety of using callerCancellationToken here, but keepAliveCancellationTokenSource.Token in the first OOP call this is concurrent with. The KeepAliveSession (The value we return) controls the lifetime of that first call), whereas the client calling into KeepAliveSession.CreateAsync controls this call to wait for the session to be established.
| catch when (CancelKeepAliveSession(keepAliveCancellationTokenSource)) | ||
| { | ||
| throw ExceptionUtilities.Unreachable(); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i considered using a NoThrowAwaitable here and returning a new RemoteKeepAliveSession that would dispose teh cancellation token controlling KeepAliveAsync. However, i didn't like that as it meant the caller could cancel, and have this return normally, making it so that it would then execute the next statement after using var session = KeepAliveSession.Create(..., ct);. I don't like apis that return gracefully on cancellation as that means the caller must then immediately check the ct afterwards.
But we had to be delicate here. Because the caller could cancel, we def need to make sure that we cancel the work to sync the solution over. If we don't do that, and we bubbled out the cancellation exception, we would never release that solution on the OOP side, which would be super bad :)
|
|
||
| // Now kick off the keep-alive work. We don't wait on this as this will stick on the OOP side until the | ||
| // cancellation token triggers. Note: we pass the keepAliveCancellationTokenSource.Token in here. We want | ||
| // disposing the returned RemoteKeepAliveSession to be the thing that cancels this work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tried to be very clear in docs what is going on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs are fantastic.
| // session itself (which is critical for ensuring that we either stop syncing the solution/project-cone over, or | ||
| // that we allow it to be released on the oop side), and bubble the exception outwards to the caller to handle | ||
| // as they see fit. | ||
| catch when (DisposeKeepAliveSession(session)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intentionally naked catch. Any exception makes us Dispose the session. and since we return 'false' from DisposeKeepAliveSession, we'll just bubble that exception up.
|
@jasonmalinowski @dibarbet ptal. |
|
|
||
| // Now kick off the keep-alive work. We don't wait on this as this will stick on the OOP side until | ||
| // the cancellation token triggers. | ||
| // Now kick off the keep-alive work. We don't wait on this as this will stick on the OOP side until the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could this function be replaced by InvokeKeepAliveAsync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. The way "keep alive" works (at a high level" is that we send a call over to oop that effectively "hangs" on the oop side (waiting for cancellation) and does NOT return.
So we can't await it since it truly doesn't return (until the entire keep alive session is disposed).
The change here is to allow a second call to be made (and returned) to know that enough of the first call has gone through to hydrate and pin the solution on the oop side.
| /// to rebuild any compilations they've already built due to the solution going away and then coming back. | ||
| /// </summary> | ||
| /// <remarks> | ||
| /// Subsequent calls to oop made while this session is alive must pass the same pinning data with the remote |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it all OOP calls, or call using the same keep alive session?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is calls that want to get the solution instance that the keep alive session is pinning.
| } | ||
| } | ||
|
|
||
| public async ValueTask WaitForSessionIdAsync(long sessionId, CancellationToken cancellationToken) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the scenario where you'd call KeepAliveAsync, but not WaitForSessionIdAsync (could they be combined into one)?
Wouldn't any keep alive session need to wait for the solution to be hydrated?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's just one (rename), and I'll look into fixing it up. Rename has a synchronous startup. So we can't use async/await.
But rename is only using this to speed things up. Not for correctness. So it's ok if it might see a more than one solution instance in different oop calls. It just won't be as efficient. It won't be wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved offline - KeepAliveAsync hangs until the session is disposed of (intentionally), but the caller also needs to observe when the keepalive is 'ready' while the session is in progress - so it needs a separate method to ask this.
Co-authored-by: David Barbet <[email protected]>
|
@dibarbet another way to think about it. The initial 'keep alive' call (which, again, hangs on the OOP side), effectively takes the snapshot for the requested solution/project-cone, and pins it on the OOP side (effectively it has a ref-count of at least one due to that 'in flight call'). Because of that, the oop snapshot cannot be released, and all subsequent calls will necessarily see that pinned snapshot, increaseing/decreasing its refcount as they proceed (but never going below the ref-count of 1). Then when the keep-alive-session is disposed, it will cancel the CT on the host side. THis transfers over to the OOP side, which then allows the hung OOP call to finally return. This drops taht ref-count on the OOP side, which means the ref-count can reach zero, and the pinned snapshot can be released in OOP. |
jasonmalinowski
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is one bug here around the fact we're disposing the CancellationTokenSource we create in the session on the client side -- there's a sequence of events that might touch that token when it's disposed. I believe it's just safe to remove the Dispose() but might have to check with the runtime folks.
Overall the comments here are fantastic and it made it very easy to understand this PR from a perspective of what you're trying to do. The time I had to spend on this was just thinking for races, not understanding the code.
|
|
||
| // Now kick off the keep-alive work. We don't wait on this as this will stick on the OOP side until the | ||
| // cancellation token triggers. Note: we pass the keepAliveCancellationTokenSource.Token in here. We want | ||
| // disposing the returned RemoteKeepAliveSession to be the thing that cancels this work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs are fantastic.
| _ = InvokeKeepAliveAsync( | ||
| compilationState, projectId, client, session); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this fails, what prevents the WaitForSessionIdAsync() from not hanging forever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a very good point. Fixing up.
| compilationState, | ||
| projectId, | ||
| (service, solutionInfo, cancellationToken) => service.KeepAliveAsync(solutionInfo, sessionId, cancellationToken), | ||
| session.KeepAliveTokenSource.Token).ConfigureAwait(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a potential race here. Consider this sequence of events:
- Client starts off a keep-alive session, and InvokeKeepAliveAsync runs and does the Yield(). The rest of this method is in the queue, but not running yet.
- Client continues and hits the WaitForSessionIdAsync call.
- Client triggers it's cancellation token.
- The RemoteKeepAliveSession is disposed.
- This method finally gets off the queue, and runs. We try to access .Token, but that throws since the CancellationTokenSource is disposed.
Do we need the dispose? Or do we know that at some point the cancellation token will get cancelled in all cases? AFAIK Disposal is only needed if you know you're not going to cancel the token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK Disposal is only needed if you know you're not going to cancel the token.
That's good advice. I didn't realize that was a general way to clean things up. Based on that, i'm getting rid of all disposal and documenting accordingly.
| => new RemoteKeepAliveService(arguments); | ||
| } | ||
|
|
||
| private readonly ConcurrentDictionary<long, TaskCompletionSource<bool>> _sessionIdToCompletionSource = new(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add a comment the bool isn't being used, this is just to work around the fact this project compiles with netstandard2.0 and we can't use the non-generic form. I'm also not sure if TaskCompletionSource is better for the JIT, since there's always a risk this has to JIT a specialization here rather than using the System.__Canon one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
src/Workspaces/Remote/ServiceHub/Services/KeepAlive/RemoteKeepAliveService.cs
Outdated
Show resolved
Hide resolved
| // to be collected if not needed anymore. | ||
| // | ||
| // This was provided by stoub as an idiomatic way to wait indefinitely until a cancellation token triggers. | ||
| await Task.Delay(-1, cancellationToken).ConfigureAwait(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't there a Timeout.Infinite we can use somewhere for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes there is :)
| // until the actual solution (or project-scope) is actually pinned in OOP. Note: we pass the caller | ||
| // cancellation token in here so that if the caller decides they don't need to proceed, we can bail quickly, | ||
| // without waiting for the solution to sync and for us to wait on that completing. | ||
| await WaitForSessionIdAsync(compilationState, projectId, client, session, callerCancellationToken).ConfigureAwait(false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cleaned this up slightly. Easier to see (imo) what's going on by making this two calls. One fire-and-forget, one-awaited.
| (service, solutionInfo, cancellationToken) => service.KeepAliveAsync(solutionInfo, sessionId, cancellationToken), | ||
| session.KeepAliveTokenSource.Token).ConfigureAwait(false); | ||
| } | ||
| catch (Exception ex) when (FatalError.ReportAndCatchUnlessCanceled(ex)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this part was subtle. Ensuring that if we fail for a non-cancellation reason to send the initiak KeepAliveAsync call, that we properly shutdown this session. This ensures cleaning up the WaitForSessionIdAsync call happens and the caller doesn't hang.
| // inside InvokeKeepAliveAsync), we can bail out of this WaitForSessionIdAsync. | ||
| using var linkedTokenSource = CancellationTokenSource.CreateLinkedTokenSource( | ||
| session.KeepAliveTokenSource.Token, | ||
| callerCancellationToken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this also changed. We want to make sure that cleanup (which can now happen from KeepAliveAsync) ensures that WaitForSessionIdAsync cleans up too. Mixing the two cancellation tokens together ensures that happens.
Fixes #81527
This one was a bit of a doozy. While investigating, it seemed like the only possible way for us to hit this crash on the OOP side was if the calls within
TryComputeNewGeneratorInfoInRemoteProcessAsynctoGetSourceGeneratedDocumentInfoAsyncandGetContentsAsyncwere operating on different solution instances.However, that was supposed to not happen, due to us creating an OOP
KeepAliveSessionprior to both calls to 'pin' that solution on the OOP side.However, looking more closely at
KeepAliveSession, i realized there were two major problems with how we were using it.First, the solution we were asking it to pin in OOP wasn't hte same as the solution we were then asking for in those two calls. Specifically, the KeepAliveSession was pinning the full solution, while the individual calls were asking for Project-Cone subsets of hte solution. As nothing was pinning those, it was possible for the individual calls to get different instances, allowing them to be out of sync.
Second, KeepAliveSession was originally written as a 'best effort' helper. It would make a call over to OOP to pin the solution, but then return immediately, prior to the actual solution being fully remoted and actually pinned. This was fine for most features, as this was only intended to help speed things up. But it was not ok for the source-gen case as we are depending on the invariant that the project-cone actually be pinned, so that the next two calls must see that exact instance.
This PR fixes both. First, the KeepAliveSession we now make is specific to the project-cone we need pinned. Second, the KeepAliveSession system has been fixed up so that now it only returns back to the caller once the actual solution is pinned on the OOP side, ensuring that it must be there.