Conversation

@hexinw-nvidia (Contributor)

Add intelligent node selection and hot spare prioritization for fault-tolerant LLM training with rack topology awareness and SLURM-ordered rank assignment. The feature enables rack-aware node placement, complete segment selection, and granular failure rate-based prioritization of healthy racks.

Key Features:

  • Rack-aware node selection: Groups nodes by rack and ensures balanced distribution across racks based on segment size
  • SLURM topology ordering: Uses infra_rank (SLURM_PROCID) to maintain SLURM_JOB_NODELIST ordering for deterministic rank assignment
  • Complete segment selection: Selects as many complete segments as possible from each rack (e.g., rack_size=12, segment=4 → use 12 nodes = 3 segments; see the sketch after this list)
  • Granular rack prioritization: Racks are prioritized by failure rate (0% first, then low, then high), giving finer-grained selection than a binary healthy/unhealthy classification
  • Failure history tracking: Automatically tracks and deprioritizes nodes with worker failures for one rendezvous round
  • Two operating modes:
    • Fill-Gap mode (segment=None): Existing hot spare support with simple gap filling for hardware failures
    • Segment-aware mode (segment=N): Extended hot spare with rack topology awareness and failure rate-based prioritization
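
The complete-segment rule above boils down to rounding each rack's node count down to a multiple of the segment size. A minimal sketch (the helper name is hypothetical, not code from this PR):

```python
def usable_nodes_in_rack(rack_node_count: int, segment: int) -> int:
    # Only whole segments count toward a rack's usable capacity:
    # rack_node_count=12, segment=4 -> 12 (3 complete segments)
    # rack_node_count=10, segment=4 -> 8  (2 complete segments)
    return (rack_node_count // segment) * segment
```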

Implementation Details:

  • Add segment-aware filtering in _filter_participants_by_segment() that (see the sketch after this list):
    • Sorts participants by infra_rank to maintain SLURM topology order
    • Groups participants by rack number parsed from node hostnames
    • Calculates usable nodes per rack (complete segments only)
    • Prioritizes racks by failure rate with SLURM topology as secondary key
    • Selects exactly min_nodes from prioritized racks in topology order
    • Reassigns contiguous ranks (0,1,2,...) maintaining SLURM order
  • Extend RendezvousParticipantInfo to include has_worker_failure flag
  • Add mark_node_with_worker_failure() to track nodes with worker failures
  • Integrate with LocalElasticAgent to mark failed nodes before restart
  • Add rack number caching to optimize repeated parsing operations
  • Remove redundant sorting from get_all_participants(), eliminating O(N log N) work
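
For reference, a hedged sketch of the selection flow the bullets above describe; the actual _filter_participants_by_segment() signature, participant fields, and helpers in the PR may differ:

```python
from collections import defaultdict

def select_by_segment(participants, segment, min_nodes, rack_of, failure_rate_of):
    """Sketch only: rack_of() and failure_rate_of() are stand-ins for the PR's
    hostname parsing and failure-history tracking."""
    # Keep SLURM topology order by sorting on infra_rank.
    ordered = sorted(participants, key=lambda p: p.infra_rank)

    # Group participants by rack number parsed from the hostname.
    racks = defaultdict(list)
    for p in ordered:
        racks[rack_of(p.hostname)].append(p)

    # Keep only complete segments per rack.
    usable = {r: nodes[: (len(nodes) // segment) * segment] for r, nodes in racks.items()}

    # Prioritize racks by failure rate, SLURM topology (first infra_rank) as tie-break.
    rack_order = sorted(
        (r for r, nodes in usable.items() if nodes),
        key=lambda r: (failure_rate_of(r), usable[r][0].infra_rank),
    )

    # Select exactly min_nodes in topology order, then reassign contiguous ranks.
    selected = []
    for r in rack_order:
        take = min(len(usable[r]), min_nodes - len(selected))
        selected.extend(usable[r][:take])
        if len(selected) == min_nodes:
            break
    return {p: rank for rank, p in enumerate(selected)}
```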

Configuration:

  • segment: Minimum nodes per rack (None disables segment awareness)
  • rack_id_from_node_name: Enable parsing rack ID from node hostname
  • rack_id_prefix: Prefix to strip from rack_id (default: "nvl72")
  • prefer_healthy_racks: Prefer healthy racks with lower failure rates (default: True, renamed from prefer_spare_over_recovered_node)

Node naming convention: <rack_id>-<node_id>, where rack_id = <rack_id_prefix><rack_number>
Example: "nvl72144-T01" → rack_number=144 (after stripping the "nvl72" prefix)

This commit implements segment-aware participant selection for hardware
fault tolerance and simplifies the rank assignment logic by unifying the
handling of both segment-aware and default selection modes.

Key changes:

1. Segment-aware participant selection
   - When segment parameter is set, enforce domain-level constraints:
     * Domains must have >= segment nodes to be valid
     * Select as many complete segments as possible from each domain
     * Select exactly min_nodes participants across valid domains
   - Selection follows SLURM topology order for deterministic behavior
   - Validate min_nodes is divisible by segment size
   - Cache domain number parsing for performance

2. Unified rank assignment for both configurations
   - Simplified _assign_group_ranks_with_infra_rank to handle both (sketched after the examples below):
     * Segment-aware: domain-constrained selection via _select_by_domain
     * Default (segment=None): first min_nodes in SLURM topology order
   - Both produce contiguous active ranks [0..min_nodes-1] plus standby ranks
   - Active participants selected based on configuration
   - Unselected nodes assigned sequential standby ranks (min_nodes, min_nodes+1, ...)
   - Standby nodes complete rendezvous and wait with local_world_size=0

3. Improved code clarity
   - Removed confusing "Fill-gap mode" terminology
   - Refer to non-segment behavior as "Default behavior (segment=None)"
   - Added clear logging distinguishing active vs standby rank assignments
   - Enhanced duplicate infra_rank validation with actionable error messages

Examples:

Default behavior (segment=None):
  Config: min_nodes=5, max_nodes=6
  Arrivals: infra_ranks [0, 1, 3, 4, 10, 11]
  Result:
    Active:  infra_ranks [0,1,3,4,10] → group_ranks [0,1,2,3,4]
    Standby: infra_rank [11] → group_rank [5] (hot spare)

Segment-aware (segment=4):
  Config: min_nodes=8, max_nodes=12, segment=4
  Arrivals by domain:
    Domain 100 (nvl72100-*): infra_ranks [0,1,2,3,4,5] (6 nodes)
    Domain 101 (nvl72101-*): infra_ranks [6,7,8,9] (4 nodes)
    Domain 102 (nvl72102-*): infra_ranks [10,11] (2 nodes)
  Result:
    Active:  infra_ranks [0,1,2,3] from domain 100 (4 nodes = 1 segment)
             infra_ranks [6,7,8,9] from domain 101 (4 nodes = 1 segment)
             → group_ranks [0,1,2,3,4,5,6,7]
    Standby: infra_ranks [4,5] from domain 100 → group_ranks [8,9] (hot spares)
    Excluded: domain 102 (< segment threshold), becomes standby if launched
             infra_ranks [10,11] → group_ranks [10,11] (hot spares)
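
A minimal sketch of the unified active/standby assignment that produces the group_ranks in both examples above (hypothetical helper name, not the actual _assign_group_ranks_with_infra_rank signature):

```python
def assign_group_ranks(ordered_nodes, active_nodes, min_nodes):
    """ordered_nodes is the SLURM-topology-ordered node list; active_nodes is
    whichever subset the configuration selected (segment-aware or first min_nodes)."""
    ranks = {}
    next_active, next_standby = 0, min_nodes
    for node in ordered_nodes:
        if node in active_nodes:
            ranks[node] = next_active      # contiguous active ranks [0 .. min_nodes-1]
            next_active += 1
        else:
            ranks[node] = next_standby     # hot spares: min_nodes, min_nodes+1, ...
            next_standby += 1
    return ranks
```
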
This commit removes the `use_infra_group_rank` configuration parameter and
simplifies rank assignment logic to always use infrastructure-based ordering
when available (SLURM_PROCID or GROUP_RANK), with deterministic fallback
based on sorted node order when environment variables are not set.

Key Changes:

1. **Remove use_infra_group_rank configuration**
   - Removed from FaultToleranceConfig, RendezvousSettings, and all handler APIs
   - Removed CLI argument --ft-use-infra-group-rank from launcher
   - Cleaned up documentation and examples that referenced this parameter

2. **Simplify rank assignment logic**
   - Infrastructure ranks (SLURM_PROCID/GROUP_RANK) are always used when available (sketched after this list)
   - When neither env var is set, ranks are assigned deterministically based on
     sorted node order (infra_rank = -1 triggers deterministic assignment)
   - Removed complex logic for preserving previous rank assignments
   - Simplified _assign_ranks() to focus on infrastructure-based ordering

3. **Add NIC health check support**
   - Added IBLinkStateHealthCheck class for InfiniBand link state monitoring (sketched after this list)
   - New config parameters: enable_nic_healthcheck, link_state_path_template
   - Checks if IB ports transition from ACTIVE to non-ACTIVE state
   - Complementary to existing enable_nic_monitor (link_downed counters)
   - Integrated into both FtRendezvousHandler and FtRendezvousBarrierHandler

4. **Update tests**
   - Added BaseRendezvousTest class that clears SLURM_PROCID/GROUP_RANK env vars (sketched after this list)
   - Most tests now inherit from BaseRendezvousTest for deterministic behavior
   - Infrastructure rank tests explicitly inherit from TestCase to test env vars
   - Removed tests for use_infra_group_rank parameter and invalid infra_rank=-1
   - Updated test expectations to match new simplified rank assignment
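
For reference, a minimal sketch of the ordering source from item 2, assuming only the env var lookup and the -1 sentinel (the function name is hypothetical):

```python
import os

def resolve_infra_rank() -> int:
    # Prefer SLURM_PROCID, then GROUP_RANK; -1 triggers the deterministic
    # fallback based on sorted node order in _assign_ranks().
    for var in ("SLURM_PROCID", "GROUP_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return -1
```

A hedged sketch of the link-state check from item 3; the real IBLinkStateHealthCheck builds the path from link_state_path_template, which on typical systems resembles /sys/class/infiniband/<device>/ports/<port>/state and contains a string such as "4: ACTIVE":

```python
def ib_port_is_active(state_path: str) -> bool:
    try:
        with open(state_path) as f:
            return "ACTIVE" in f.read()
    except OSError:
        return False  # an unreadable port is treated as unhealthy here
```

And a sketch of the test base class from item 4, assuming it does little more than clear the two env vars so rank assignment stays deterministic in tests that do not exercise them:

```python
import os
import unittest

class BaseRendezvousTest(unittest.TestCase):
    def setUp(self):
        super().setUp()
        for var in ("SLURM_PROCID", "GROUP_RANK"):
            os.environ.pop(var, None)
```
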
else:
    rack_to_participants[rack_num].append((node_desc, infra_rank, has_worker_failure))

if nodes_without_rack:
Contributor:

I think a flag to fail here would be smart, I don't want to miss this warning and soldier on.

Contributor Author:

Done. "nodes_without_rack" should not happen when "domain_id_from_node_id" is True. A runtime error is raised if there is an issue parsing the domain_id.

state.participants[self._node] = 0
# Neither env var is set - will be assigned deterministically later
# based on sorted node order in _assign_ranks
infra_rank = -1
Contributor:

Maybe an error here if SLURM_JOB_ID is set and we are in this case? Then it preserves the previous behavior.

Contributor Author:

Done

# All nodes have infrastructure ranks - use them directly
# Validate that all participants have valid infrastructure ranks
for node, rank in participants.items():
    if rank < 0 or rank >= len(participants):
Contributor:

rank cannot be < 0 due to above case

Contributor:

rank can be > len(participants) if there are hot spares

Contributor Author:

Fixed

domain_to_participants[domain_num].append((node_desc, infra_rank))

if nodes_without_domain:
    log.warning(f"Found {len(nodes_without_domain)} nodes without valid domain numbers")
Contributor:

If we are in domain aware mode, and we cannot match it, then parsing is wrong or something is specified incorrectly -- this should be an error / failure?

Contributor Author:

We catch nodes without domain_id. This code is gone.

total_selected_nodes = 0

for domain_num, usable_nodes, first_infra_rank in valid_domains_info:
    if total_selected_nodes >= min_nodes:
Contributor:

s/min_nodes/required_world_size or something like this

Contributor Author:

Changed min_nodes to world_size in the group rank assignment.

        break

    selected_domains.append((domain_num, usable_nodes))
    total_selected_nodes += usable_nodes
Contributor:

+= min(usable_nodes, world_size - total_selected_nodes)

Contributor Author:

Done.

@hexinw-nvidia force-pushed the hot_spare branch 2 times, most recently from 3d3fd0a to a391981 on December 5, 2025 at 13:41
"""See base class."""
return False

def _run_health_check(self, health_checker, check_name: str, failure_message: str) -> None:
Contributor:

Can you make this a separate PR please?
