fix: workflow controller to detect stale workflows #15090
base: main
Signed-off-by: Eduardo Rodrigues <[email protected]>
/retest
Pull request overview
This PR introduces a mechanism to detect and skip processing of stale workflow versions in the workflow controller, addressing multiple issues where the controller processes outdated versions of workflows. The implementation uses a combination of a workflow annotation and an in-memory map to track the last processed resource version for each workflow.
Key Changes:
- Added `last-seen-version` annotation and in-memory tracking to identify stale workflow events
- Integrated stale detection check (`isOutdated`) in the workflow processing pipeline
- Cleanup of tracking data when workflows complete or are deleted
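The tracking summarized in these key changes can be sketched as follows. This is a minimal illustration rather than the PR's code: the type `versionTracker`, its methods, and the annotation key string are hypothetical names, and the comparison rule (an event is stale when its annotation differs from the value recorded in memory) is an assumption drawn from the PR description.

```go
package main

import (
	"fmt"
	"sync"
)

// annotationKey is a placeholder; the real constant is defined in the PR.
const annotationKey = "example.com/last-seen-version"

// versionTracker remembers, per workflow key, the value the controller last
// wrote into the annotation. All names here are hypothetical.
type versionTracker struct {
	mu       sync.RWMutex
	versions map[string]string
}

func newVersionTracker() *versionTracker {
	return &versionTracker{versions: make(map[string]string)}
}

// record stores the value written for key after a successful update.
func (t *versionTracker) record(key, version string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.versions[key] = version
}

// forget drops the entry, e.g. when the workflow completes or is deleted.
func (t *versionTracker) forget(key string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.versions, key)
}

// isOutdated reports whether the annotations on an incoming object disagree
// with what the controller last wrote. Untracked keys are never outdated.
func (t *versionTracker) isOutdated(key string, annotations map[string]string) bool {
	t.mu.RLock()
	defer t.mu.RUnlock()
	expected, ok := t.versions[key]
	if !ok {
		return false
	}
	return annotations[annotationKey] != expected
}

func main() {
	t := newVersionTracker()
	t.record("argo/wf-1", "101")

	stale := map[string]string{annotationKey: "100"}
	fresh := map[string]string{annotationKey: "101"}

	fmt.Println(t.isOutdated("argo/wf-1", stale)) // true
	fmt.Println(t.isOutdated("argo/wf-1", fresh)) // false

	t.forget("argo/wf-1")
	fmt.Println(t.isOutdated("argo/wf-1", stale)) // false: no longer tracked
}
```

Treating untracked keys as current is a design choice in this sketch: it lets new workflows, and workflows seen after a controller restart, be processed without any prior in-memory state.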
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `workflow/common/common.go` | Defines the new AnnotationKeyLastSeenVersion constant for storing the last seen resource version |
| `workflow/controller/controller.go` | Adds lastSeenVersions struct and tracking logic, implements isOutdated check in processing pipeline, and cleanup on workflow completion/deletion |
| `workflow/controller/operator.go` | Updates persistUpdates and persistWorkflowSizeLimitErr to set the annotation and update in-memory tracking after successful workflow updates |
// the strategy whose artifacts are being deleted
AnnotationKeyArtifactGCStrategy = workflow.WorkflowFullName + "/artifact-gc-strategy"

// AnnotationKeyLastSeenVersion is the last seen version for the workflow
Copilot
AI
Dec 1, 2025
[nitpick] The comment for this annotation is vague. Consider updating it to better explain its purpose and semantics. For example:
// AnnotationKeyLastSeenVersion stores the resource version of the workflow when it was last successfully processed and updated by the controller. Used to detect and skip stale workflow events from the informer cache.
Suggested change:
- // AnnotationKeyLastSeenVersion is the last seen version for the workflow
+ // AnnotationKeyLastSeenVersion stores the resource version of the workflow when it was last successfully processed and updated by the controller.
+ // Used to detect and skip stale workflow events from the informer cache.
oldRV := woc.wf.ResourceVersion
_, err = wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
if err != nil {
	woc.log.WithError(err).Warn(ctx, "Error updating workflow with size error")
} else {
	woc.updateLastSeenVersion(oldRV)
Copilot
AI
Dec 1, 2025
The persistWorkflowSizeLimitErr function is missing the call to set the last-seen-version annotation before updating the workflow. For consistency with the main update flow (line 765), woc.updateLastSeenVersionAnnotation(oldRV) should be called before line 866.
Additionally, this function suffers from the same fundamental issue as the main path: using oldRV means the tracking mechanism won't properly detect stale workflows.
Suggested change:
- oldRV := woc.wf.ResourceVersion
- _, err = wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
- if err != nil {
- 	woc.log.WithError(err).Warn(ctx, "Error updating workflow with size error")
- } else {
- 	woc.updateLastSeenVersion(oldRV)
+ woc.updateLastSeenVersionAnnotation(woc.wf.ResourceVersion)
+ _, err = wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
+ if err != nil {
+ 	woc.log.WithError(err).Warn(ctx, "Error updating workflow with size error")
if wfc.isOutdated(un) {
	logger.WithField("key", key).Debug(ctx, "Skipping outdated workflow event")
	wfc.wfQueue.AddRateLimited(key)
	return true
Copilot
AI
Dec 1, 2025
When a workflow is detected as outdated, it's requeued using AddRateLimited. However, if the informer cache still contains the stale version, the next processing attempt will encounter the same outdated workflow and requeue it again, potentially creating a requeue loop until the informer cache is updated.
Consider adding a mechanism to track requeue attempts for outdated workflows, or add exponential backoff specifically for this case to avoid excessive requeueing.
this may be an issue
Motivation
Multiple issues have been created because of unexpected workflow behavior:
#13986
#14833
#12352
#14780
It appears that many of these issues occur because the controller is processing an outdated version of the workflow. The exact cause of these stale reads is still unknown, but there is some suspicion that it may be related to the informer write-back mechanism, which is being disabled by default in #15079.
This PR ensures that stale workflow versions are not reconciled by keeping track of the last processed resource version for each workflow in a last-seen-version annotation. A workflow is only processed when its annotation matches the expected version; otherwise, it is re-queued. The annotation stores the workflow’s resource version, though any unique value would work. I just thought using the RV was enough.
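The write side of this scheme can be sketched as below, under stated assumptions. The `object` and `fakeServer` types, the `persist` and `isStale` helpers, and the annotation key are all hypothetical stand-ins, and the fake server increments integer versions only for illustration (real resource versions are opaque strings). The point it illustrates is that the pre-update resource version works as a unique token: it is written into the annotation before the update and recorded in memory only after the update succeeds.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// annotationKey is a placeholder; the real constant is defined in the PR.
const annotationKey = "example.com/last-seen-version"

// object is a stand-in for a workflow with only the fields this sketch needs.
type object struct {
	Name            string
	ResourceVersion string
	Annotations     map[string]string
}

// fakeServer mimics an API server with optimistic concurrency.
type fakeServer struct{ stored object }

// update rejects writes based on an old resource version and otherwise
// stores the object under a newly assigned version.
func (s *fakeServer) update(o object) (object, error) {
	if o.ResourceVersion != s.stored.ResourceVersion {
		return object{}, errors.New("conflict: resource version mismatch")
	}
	rv, err := strconv.Atoi(o.ResourceVersion)
	if err != nil {
		return object{}, err
	}
	o.ResourceVersion = strconv.Itoa(rv + 1)
	s.stored = o
	return o, nil
}

// persist stamps the token, updates, and records the token only on success.
func persist(s *fakeServer, lastSeen map[string]string, o object) (object, error) {
	token := o.ResourceVersion // pre-update version, used purely as a unique value
	annotations := make(map[string]string, len(o.Annotations)+1)
	for k, v := range o.Annotations {
		annotations[k] = v
	}
	annotations[annotationKey] = token
	o.Annotations = annotations

	updated, err := s.update(o)
	if err != nil {
		return object{}, err // in-memory state is left untouched on failure
	}
	lastSeen[o.Name] = token
	return updated, nil
}

// isStale reports whether o carries a different token than the one recorded.
func isStale(lastSeen map[string]string, o object) bool {
	expected, ok := lastSeen[o.Name]
	return ok && o.Annotations[annotationKey] != expected
}

func main() {
	s := &fakeServer{stored: object{Name: "wf-1", ResourceVersion: "7"}}
	lastSeen := map[string]string{}
	cached := s.stored // snapshot that a lagging cache might still serve

	updated, err := persist(s, lastSeen, s.stored)
	if err != nil {
		panic(err)
	}
	fmt.Println(updated.ResourceVersion)    // 8
	fmt.Println(isStale(lastSeen, cached))  // true
	fmt.Println(isStale(lastSeen, updated)) // false
}
```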
Modifications
- `last-seen-version` annotation, updated with the current resource version on every `Update()` event.
- Tracking data is cleaned up when a `Delete` event is received or when the workflow completes.
Verification
Executed workflows with success.
Documentation