-
Notifications
You must be signed in to change notification settings - Fork 318
Description
Problem
The snapshot acceptance tests are flaky, contributing to the ~50% CI failure rate. Analysis shows the snapshot row streaming goroutine has a 5-second timeout per row, which is too aggressive for CI environments.
Root Cause
In pkg/snapshot/snapshot.go:214, the SnapshotToQueryResult function streams snapshot data rows through a channel with a 5-second timeout:
case <-time.After(5 * time.Second):
// Timeout - consumer likely stopped reading, exit to prevent leak
returnIn slow CI environments or when the system is under load, this timeout can expire before data is fully consumed, resulting in incomplete snapshot output. This manifests as acceptance test failures where:
- Expected: 15 lines (header + 12 data rows)
- Actual: 3 lines (header only, no data)
Evidence
From PR #4865 CI run (snapshot test failure):
- Snapshot URL was generated successfully (upload worked)
- But output contained only header, no data rows
- Test expected 15 lines, got 3 lines
Recent CI runs show ~50% failure rate, with snapshot tests frequently failing.
Proposed Solution
Increase the timeout from 5 seconds to 30 seconds to accommodate slower CI environments while still preventing goroutine leaks.
Impact
This simple change should significantly reduce the flakiness of snapshot acceptance tests without any negative side effects. The timeout is only hit when the consumer has abandoned the result, so a longer timeout doesn't affect normal operation.
References
- Failed test:
tests/acceptance/test_files/snapshot.bats:17-43 - Code location:
pkg/snapshot/snapshot.go:214 - Example failure: PR [EXPERIMENTAL] Tiered CI Testing - Improve Reliability & Speed #4865 CI run