Prerequisites
Bug Description
We had a cluster whose circuit breaker stayed in a tripped state for many days. When we closed the breaker, the FQ module started processing very old events, which resulted in nodes being cordoned and uncordoned for old issues. Ideally, the FQ module should only cordon nodes that have a fault at the current time.
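To make the expected behavior concrete, here is a minimal sketch of one possible approach: collapse the backlog to the newest event per node and check, then ignore anything older than a freshness window. This is not NVSentinel code. `HealthEvent`, `latest_state_per_node`, `nodes_to_cordon`, and the 10-minute `max_age` default are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class HealthEvent:
    # Hypothetical event shape; the real FQ module's schema may differ.
    node: str
    check: str
    healthy: bool
    timestamp: datetime


def latest_state_per_node(events, now, max_age):
    """Keep only the newest event per (node, check), then drop stale ones."""
    latest = {}
    for ev in events:
        key = (ev.node, ev.check)
        if key not in latest or ev.timestamp > latest[key].timestamp:
            latest[key] = ev
    # An event older than max_age no longer describes the current state.
    return {k: ev for k, ev in latest.items() if now - ev.timestamp <= max_age}


def nodes_to_cordon(events, now, max_age=timedelta(minutes=10)):
    """Return nodes whose most recent, still-fresh event is unhealthy."""
    current = latest_state_per_node(events, now, max_age)
    return sorted({node for (node, _), ev in current.items() if not ev.healthy})


# Backlog that accumulated while the circuit breaker was tripped.
now = datetime(2025, 1, 10, tzinfo=timezone.utc)
backlog = [
    HealthEvent("node-a", "gpu", False, now - timedelta(days=3)),     # old fault
    HealthEvent("node-a", "gpu", True, now - timedelta(days=1)),      # later recovered
    HealthEvent("node-b", "gpu", False, now - timedelta(minutes=2)),  # fault is current
]
print(nodes_to_cordon(backlog, now))  # ['node-b']
```

With this filter, `node-a` is never cordoned for the fault from three days ago, while `node-b` still is. One caveat of the sketch: if health monitors only report on state changes, a persistent fault would age out of the window, so a real fix would probably need to re-check current node health instead of relying on event age alone.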
Component
Fault Management
Steps to Reproduce
- Trip the circuit breaker
- Send many unhealthy events for X hours/days
- Send many healthy events for X hours/days
- Close the circuit breaker
Environment
- NVSentinel version: 0.2.0
- Kubernetes version: all
- Deployment method: ArgoCD
Logs/Output
N/A