Commit d59bd34

Update robustness and antithesis documentation.
* Explain differences between the two
* Add new issues discovered by Antithesis to track record
* Document Antithesis setup

Signed-off-by: Marek Siarkowicz <[email protected]>
1 parent a8884c7 commit d59bd34

2 files changed

+82
-16
lines changed

tests/antithesis/README.md

Lines changed: 51 additions & 1 deletion
@@ -1,4 +1,54 @@
1-
This directory enables integration of Antithesis with etcd. There are 4 containers running in this system: 3 that make up an etcd cluster (etcd0, etcd1, etcd2) and one that "[makes the system go](https://antithesis.com/docs/getting_started/basic_test_hookup/)" (client).
1+
# etcd Antithesis tests
2+
3+
This document describes the etcd test integration with [Antithesis].
4+
Antithesis provides a testing platform that allows you to explore edge cases, race conditions, and rare
5+
bugs that are difficult or impossible to reproduce in a normal environment.
6+
7+
[Antithesis]: https://antithesis.com/
8+
9+
## Robustness vs Antithesis tests
10+
11+
[Antithesis] runs the robustness tests inside its
12+
[deterministic simulation testing](https://antithesis.com/resources/deterministic_simulation_testing/)
13+
environment, combined with [fault injection](https://antithesis.com/docs/environment/fault_injection/).
14+
15+
For more details on robustness tests, see the [robustness](../robustness) directory.
16+
17+
## Antithesis Setup
18+
19+
The setup consists of a 3-node etcd cluster and a client container, orchestrated
20+
via [Docker Compose](https://antithesis.com/docs/getting_started/setup/).
21+
22+
Antithesis applies the following patches to the etcd server:
23+
24+
* **Critical code locations**: We replace etcd `gofail` comments (which signify
25+
code locations important for failure injection in robustness tests) with
26+
Antithesis `assert.Reachable`. This guides Antithesis to explore the
27+
execution space around these points.
28+
* **Assertions**: We change etcd `verify` package assertions to Antithesis
29+
`assert.Always`, encouraging the platform to try and break those assertions.
30+
* **Instrumentation**: The etcd binary is instrumented using
31+
`antithesis-go-instrumentor` to enable coverage tracking and feedback for
32+
the Antithesis platform.
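
The first patch above can be pictured as a source-to-source rewrite. The following is a minimal sketch, not the actual patching script used by etcd: the function name `patchGofail`, the regular expression, and the sample failpoint name are illustrative assumptions, and the emitted call assumes that `assert.Reachable` takes a message and an optional details map.

```go
package main

import (
	"fmt"
	"regexp"
)

// gofailVar matches a gofail failpoint comment such as
// "// gofail: var raftBeforeSave struct{}" and captures the indentation
// and the failpoint name. The pattern is an assumption for illustration.
var gofailVar = regexp.MustCompile(`(?m)^([ \t]*)// gofail: var (\w+) struct\{\}[ \t]*$`)

// patchGofail rewrites each matching gofail comment into an Antithesis
// reachability assertion, keeping the original indentation.
// Hypothetical helper; not part of etcd or the Antithesis SDK.
func patchGofail(src string) string {
	return gofailVar.ReplaceAllString(src, `${1}assert.Reachable("${2}", nil)`)
}

func main() {
	src := "func save() {\n\t// gofail: var raftBeforeSave struct{}\n\twrite()\n}\n"
	fmt.Print(patchGofail(src))
}
```

Lines that do not contain a gofail comment pass through unchanged, so the rewrite can be applied to whole source files.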
33+
34+
The Antithesis etcd tests configure the
35+
[Test Composer](https://antithesis.com/docs/test_templates/test_composer_reference/)
36+
in the following way:
37+
38+
* **`entrypoint`**:
39+
  * Waits for all etcd nodes to be healthy.
40+
  * Emits the `setup_complete` message to Antithesis to start the testing phase.
41+
* **`singleton_driver_traffic`**:
42+
  * Generates robustness test traffic against the cluster while faults are injected.
43+
  * Runs as a "Singleton Driver", meaning it is the only driver running at any given time.
44+
  * All generated traffic is saved as an operation history and stored on a shared volume.
45+
* **`finally_validation`**:
46+
  * Runs as a "Finally Driver", meaning it is the last driver to run,
47+
    with failure injection disabled.
48+
  * Reads the history of operations and validates them using the robustness test validation logic.
49+
  * Reports the results of robustness tests as Antithesis `assert.Always` assertions.
50+
  * Like the robustness tests, emits a visualization of the operation
51+
    history to an HTML file that is uploaded to the Antithesis platform.
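
The `entrypoint` step described above amounts to polling the cluster and then signalling readiness. Here is a minimal sketch under stated assumptions: `waitHealthy`, the probe callback, the retry limits, and the `etcd0`/`etcd1`/`etcd2` endpoints on port 2379 are hypothetical, and the final print stands in for the real `setup_complete` signal, which the actual entrypoint would send through the Antithesis SDK.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// waitHealthy polls every endpoint until isHealthy reports true for all of
// them in the same round, or gives up after maxAttempts rounds.
// Hypothetical helper for illustration only.
func waitHealthy(endpoints []string, isHealthy func(string) bool, maxAttempts int, delay time.Duration) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		allHealthy := true
		for _, ep := range endpoints {
			if !isHealthy(ep) {
				allHealthy = false
				break
			}
		}
		if allHealthy {
			return nil
		}
		time.Sleep(delay)
	}
	return errors.New("cluster did not become healthy")
}

func main() {
	// Assumed endpoints; the real container names and ports may differ.
	endpoints := []string{"etcd0:2379", "etcd1:2379", "etcd2:2379"}
	// Placeholder probe: a real entrypoint would query each etcd member.
	probe := func(string) bool { return true }
	if err := waitHealthy(endpoints, probe, 10, time.Second); err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	// Stand-in for emitting the setup_complete message to Antithesis.
	fmt.Println("setup_complete")
}
```

Passing the probe as a function keeps the retry loop testable without a running cluster.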
252

353
# Running tests with docker compose
454

tests/robustness/README.md

Lines changed: 31 additions & 15 deletions
@@ -6,23 +6,37 @@ The purpose of these tests is to rigorously validate that etcd maintains its [KV
66
[KV API guarantees]: https://etcd.io/docs/v3.6/learning/api_guarantees/#kv-apis
77
[watch API guarantees]: https://etcd.io/docs/v3.6/learning/api_guarantees/#watch-apis
88

9+
## Robustness vs Antithesis tests
10+
11+
[Antithesis] runs the robustness tests inside its
12+
[deterministic simulation testing](https://antithesis.com/resources/deterministic_simulation_testing/)
13+
environment, combined with [fault injection](https://antithesis.com/docs/environment/fault_injection/).
14+
15+
For more details on Antithesis integration, see the [antithesis](../antithesis) directory.
16+
17+
[Antithesis]: https://antithesis.com/
18+
919
## Robustness track record
1020

11-
| Correctness / Consistency issue | Report | Introduced in | Discovered by | Reproducible by robustness test | Command |
12-
| ----------------------------------------------------------------- | ---------- | ----------------- | --------------- | ------------------------------------------------- | ----------------------------------- |
13-
| Inconsistent revision caused by crash during high load [#13766] | Mar 2022 | v3.5 | User | Yes, report preceded robustness tests | `make test-robustness-issue13766` |
14-
| Single node cluster can lose a write on crash [#14370] | Aug 2022 | v3.4 or earlier | User | Yes, report preceded robustness tests | `make test-robustness-issue14370` |
15-
| Enabling auth can lead to inconsistency [#14571] | Oct 2022 | v3.4 or earlier | User | No, authorization is not covered. | |
16-
| Inconsistent revision caused by crash during defrag [#14685] | Nov 2022 | v3.5 | Robustness | Yes, after covering defragmentation. | `make test-robustness-issue14685` |
17-
| Watch progress notification not synced with stream [#15220] | Jan 2023 | v3.4 or earlier | User | Yes, after covering watch progress notification | `make test-robustness-issue15220` |
18-
| Watch traveling back in time after network partition [#15271] | Feb 2023 | v3.4 or earlier | Robustness | Yes, after covering network partitions | `make test-robustness-issue15271` |
19-
| Duplicated watch event due to bug in TXN caching [#17247] | Jan 2024 | main branch | Robustness | Yes, prevented regression in v3.6 | |
20-
| Watch events lost during stream starvation [#17529] | Mar 2024 | v3.4 or earlier | User | Yes, after covering of slow watch | `make test-robustness-issue17529` |
21-
| Revision decreasing caused by crash during compaction [#17780] | Apr 2024 | v3.4 or earlier | Robustness | Yes, after covering compaction | |
22-
| Watch dropping an event when compacting on delete [#18089] | May 2024 | v3.4 or earlier | Robustness | Yes, after covering of compaction | `make test-robustness-issue18089` |
23-
| Inconsistency when reading compacted revision in TXN [#18667] | Oct 2024 | v3.4 or earlier | User | | |
24-
| Missing delete event on watch opened on same revision as compaction [#19179] | Jan 2025 | v3.4 or earlier | Robustness | Yes, after covering of compaction | `make test-robustness-issue19179` |
25-
| Watch on future revision returns old events or notifications [#20221] | Jun 2025 | v3.4 or earlier | Robustness | Yes, after covering connection to multiple members| |
21+
| Correctness / Consistency / Panic issue | Report | Introduced in | Discovered by | Reproducible by robustness test | Command |
22+
| ---------------------------------------------------------------------------- | ---------- | --------------- | ----------------------- | -------------------------------------------------- | ----------------------------------- |
23+
| Inconsistent revision caused by crash during high load [#13766] | Mar 2022 | v3.5 | User | Yes, report preceded robustness tests | `make test-robustness-issue13766` |
24+
| Single node cluster can lose a write on crash [#14370] | Aug 2022 | v3.4 or earlier | User | Yes, report preceded robustness tests | `make test-robustness-issue14370` |
25+
| Enabling auth can lead to inconsistency [#14571] | Oct 2022 | v3.4 or earlier | User | No, authorization is not covered. | |
26+
| Inconsistent revision caused by crash during defrag [#14685] | Nov 2022 | v3.5 | Robustness | Yes, after covering defragmentation. | `make test-robustness-issue14685` |
27+
| Watch progress notification not synced with stream [#15220] | Jan 2023 | v3.4 or earlier | User | Yes, after covering watch progress notification | `make test-robustness-issue15220` |
28+
| Watch traveling back in time after network partition [#15271] | Feb 2023 | v3.4 or earlier | Robustness | Yes, after covering network partitions | `make test-robustness-issue15271` |
29+
| Duplicated watch event due to bug in TXN caching [#17247] | Jan 2024 | main branch | Robustness | Yes, prevented regression in v3.6 | |
30+
| Watch events lost during stream starvation [#17529] | Mar 2024 | v3.4 or earlier | User | Yes, after covering slow watch | `make test-robustness-issue17529` |
31+
| Revision decreasing caused by crash during compaction [#17780] | Apr 2024 | v3.4 or earlier | Robustness | Yes, after covering compaction | |
32+
| Watch dropping an event when compacting on delete [#18089] | May 2024 | v3.4 or earlier | Robustness | Yes, after covering compaction | `make test-robustness-issue18089` |
33+
| Panic when two snapshots are received in a short period [#18055] | May 2024 | v3.4 or earlier | Robustness | Yes, via Antithesis | |
34+
| Inconsistency when reading compacted revision in TXN [#18667] | Oct 2024 | v3.4 or earlier | User | No, specifying revision in TXN is not implemented | |
35+
| Missing delete event on watch opened on same revision as compaction [#19179] | Jan 2025 | v3.4 or earlier | Robustness | Yes, after covering compaction | `make test-robustness-issue19179` |
36+
| Watch on future revision returns notifications [#20221] | Jun 2025 | v3.4 or earlier | Robustness, Antithesis | Yes, after covering connection to multiple members | |
37+
| Watch on future revision returns old events [#20221] | Jun 2025 | v3.4 or earlier | Antithesis | Yes, after covering connection to multiple members | |
38+
| Panic from db page expected to be 5 [#20271] | Jul 2025 | v3.4 or earlier | Antithesis | Yes, via Antithesis | |
39+
2640

2741
[#13766]: https://github.com/etcd-io/etcd/issues/13766
2842
[#14370]: https://github.com/etcd-io/etcd/issues/14370
@@ -37,6 +51,8 @@ The purpose of these tests is to rigorously validate that etcd maintains its [KV
3751
[#18667]: https://github.com/etcd-io/etcd/issues/18667
3852
[#19179]: https://github.com/etcd-io/etcd/issues/19179
3953
[#20221]: https://github.com/etcd-io/etcd/issues/20221
54+
[#18055]: https://github.com/etcd-io/etcd/issues/18055
55+
[#20271]: https://github.com/etcd-io/etcd/issues/20271
4056

4157
## How Robustness Tests Work
4258
