@eduardodbr
Member

@eduardodbr eduardodbr commented Nov 28, 2025

Motivation

Multiple issues have been created because of unexpected workflow behavior:

#13986
#14833
#12352
#14780

Many of these issues appear to occur because the controller is processing an outdated version of the workflow. The exact cause of these stale reads is still unknown, but one suspect is the informer write-back mechanism, which is being disabled by default in #15079.
This PR prevents stale workflow versions from being reconciled by tracking the last processed resource version of each workflow in a last-seen-version annotation. A workflow is processed only when its annotation matches the version the controller expects; otherwise, it is re-queued. The annotation stores the workflow’s resource version, though any unique value would work; the RV seemed sufficient.

Modifications

  • Introduce a new last-seen-version annotation, updated with the current resource version on every Update() event.
  • Store the last-seen-version of each workflow in memory. When a workflow is processed, it proceeds only if the annotation matches the stored version.
  • If no stored version exists (e.g., after a controller restart), the workflow is always processed to allow normal recovery.
  • The in-memory entry is removed as soon as a Delete event is received or when the workflow completes.
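
The tracking described in the list above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in this PR: the `versionTracker` type, its method names, and the annotation key `example.io/last-seen-version` are all hypothetical, and the workflow is reduced to its annotation map.

```go
package main

import (
	"fmt"
	"sync"
)

// lastSeenAnnotation is a hypothetical annotation key used only in this sketch.
const lastSeenAnnotation = "example.io/last-seen-version"

// versionTracker maps a workflow key ("namespace/name") to the last-seen
// version the controller wrote for that workflow.
type versionTracker struct {
	mu       sync.RWMutex
	versions map[string]string
}

func newVersionTracker() *versionTracker {
	return &versionTracker{versions: make(map[string]string)}
}

// Store records the version written on a successful update.
func (vt *versionTracker) Store(key, version string) {
	vt.mu.Lock()
	defer vt.mu.Unlock()
	vt.versions[key] = version
}

// Delete drops the entry, e.g. on a Delete event or workflow completion.
func (vt *versionTracker) Delete(key string) {
	vt.mu.Lock()
	defer vt.mu.Unlock()
	delete(vt.versions, key)
}

// IsOutdated reports whether the observed annotations carry a different
// version than the stored one. With no stored entry (e.g. after a
// controller restart) it returns false so the workflow is processed.
func (vt *versionTracker) IsOutdated(key string, annotations map[string]string) bool {
	vt.mu.RLock()
	defer vt.mu.RUnlock()
	expected, ok := vt.versions[key]
	if !ok {
		return false
	}
	return annotations[lastSeenAnnotation] != expected
}

func main() {
	vt := newVersionTracker()
	key := "argo/hello"
	fresh := map[string]string{lastSeenAnnotation: "101"}
	stale := map[string]string{lastSeenAnnotation: "100"}

	fmt.Println(vt.IsOutdated(key, stale)) // false: nothing stored yet
	vt.Store(key, "101")
	fmt.Println(vt.IsOutdated(key, stale)) // true: annotation differs
	fmt.Println(vt.IsOutdated(key, fresh)) // false: annotation matches
	vt.Delete(key)
	fmt.Println(vt.IsOutdated(key, stale)) // false: entry removed
}
```

Treating a missing entry as "not outdated" means a freshly restarted controller may still reconcile one stale copy, a cost this sketch accepts so that recovery is never blocked.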

Verification

Ran workflows with the change applied; they completed successfully.

Documentation

@eduardodbr
Member Author

/retest

@eduardodbr eduardodbr marked this pull request as ready for review November 30, 2025 19:46
@Joibel Joibel requested a review from Copilot December 1, 2025 11:27
Copilot finished reviewing on behalf of Joibel December 1, 2025 11:33
Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a mechanism to detect and skip processing of stale workflow versions in the workflow controller, addressing multiple issues where the controller processes outdated versions of workflows. The implementation uses a combination of a workflow annotation and an in-memory map to track the last processed resource version for each workflow.

Key Changes:

  • Added last-seen-version annotation and in-memory tracking to identify stale workflow events
  • Integrated stale detection check (isOutdated) in the workflow processing pipeline
  • Cleanup of tracking data when workflows complete or are deleted

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| workflow/common/common.go | Defines the new AnnotationKeyLastSeenVersion constant for storing the last seen resource version |
| workflow/controller/controller.go | Adds lastSeenVersions struct and tracking logic, implements isOutdated check in processing pipeline, and cleanup on workflow completion/deletion |
| workflow/controller/operator.go | Updates persistUpdates and persistWorkflowSizeLimitErr to set the annotation and update in-memory tracking after successful workflow updates |

// the strategy whose artifacts are being deleted
AnnotationKeyArtifactGCStrategy = workflow.WorkflowFullName + "/artifact-gc-strategy"

// AnnotationKeyLastSeenVersion is the last seen version for the workflow

Copilot AI Dec 1, 2025

[nitpick] The comment for this annotation is vague. Consider updating it to better explain its purpose and semantics. For example:

// AnnotationKeyLastSeenVersion stores the resource version of the workflow when it was last successfully processed and updated by the controller. Used to detect and skip stale workflow events from the informer cache.
Suggested change
- // AnnotationKeyLastSeenVersion is the last seen version for the workflow
+ // AnnotationKeyLastSeenVersion stores the resource version of the workflow when it was last successfully processed and updated by the controller.
+ // Used to detect and skip stale workflow events from the informer cache.

Comment on lines +865 to +870
oldRV := woc.wf.ResourceVersion
_, err = wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
if err != nil {
	woc.log.WithError(err).Warn(ctx, "Error updating workflow with size error")
} else {
	woc.updateLastSeenVersion(oldRV)

Copilot AI Dec 1, 2025

The persistWorkflowSizeLimitErr function is missing the call to set the last-seen-version annotation before updating the workflow. For consistency with the main update flow (line 765), woc.updateLastSeenVersionAnnotation(oldRV) should be called before line 866.

Additionally, this function suffers from the same fundamental issue as the main path: using oldRV means the tracking mechanism won't properly detect stale workflows.

Suggested change
- oldRV := woc.wf.ResourceVersion
- _, err = wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
- if err != nil {
- 	woc.log.WithError(err).Warn(ctx, "Error updating workflow with size error")
- } else {
- 	woc.updateLastSeenVersion(oldRV)
+ woc.updateLastSeenVersionAnnotation(woc.wf.ResourceVersion)
+ _, err = wfClient.Update(ctx, woc.wf, metav1.UpdateOptions{})
+ if err != nil {
+ 	woc.log.WithError(err).Warn(ctx, "Error updating workflow with size error")

Comment on lines +738 to +741
if wfc.isOutdated(un) {
	logger.WithField("key", key).Debug(ctx, "Skipping outdated workflow event")
	wfc.wfQueue.AddRateLimited(key)
	return true

Copilot AI Dec 1, 2025

When a workflow is detected as outdated, it's requeued using AddRateLimited. However, if the informer cache still contains the stale version, the next processing attempt will encounter the same outdated workflow and requeue it again, potentially creating a requeue loop until the informer cache is updated.

Consider adding a mechanism to track requeue attempts for outdated workflows, or add exponential backoff specifically for this case to avoid excessive requeueing.

Member Author

This may be an issue.
