feat: Add segment-aware hot spare support with SLURM topology ordering #226
base: main
Conversation
Force-pushed 42c3a72 → 340f2e3
This commit implements segment-aware participant selection for hardware
fault tolerance and simplifies the rank assignment logic by unifying the
handling of both segment-aware and default selection modes.
Key changes:
1. Segment-aware participant selection
- When segment parameter is set, enforce domain-level constraints:
* Domains must have >= segment nodes to be valid
* Select as many complete segments as possible from each domain
* Select exactly min_nodes participants across valid domains
- Selection follows SLURM topology order for deterministic behavior
- Validate min_nodes is divisible by segment size
- Cache domain number parsing for performance
2. Unified rank assignment for both configurations
- Simplified _assign_group_ranks_with_infra_rank to handle both:
* Segment-aware: domain-constrained selection via _select_by_domain
* Default (segment=None): first min_nodes in SLURM topology order
- Both produce contiguous active ranks [0..min_nodes-1] plus standby ranks
- Active participants selected based on configuration
- Unselected nodes assigned sequential standby ranks (min_nodes, min_nodes+1, ...)
- Standby nodes complete rendezvous and wait with local_world_size=0
3. Improved code clarity
- Removed confusing "Fill-gap mode" terminology
- Refer to non-segment behavior as "Default behavior (segment=None)"
- Added clear logging distinguishing active vs standby rank assignments
- Enhanced duplicate infra_rank validation with actionable error messages
Examples:
Default behavior (segment=None):
Config: min_nodes=5, max_nodes=6
Arrivals: infra_ranks [0, 1, 3, 4, 10, 11]
Result:
Active: infra_ranks [0,1,3,4,10] → group_ranks [0,1,2,3,4]
Standby: infra_rank [11] → group_rank [5] (hot spare)
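The default example above can be sketched in a few lines (the helper name `assign_default` is hypothetical, not the PR's actual function): sort arrivals into SLURM topology order, number them, and split at min_nodes.

```python
def assign_default(arrived_infra_ranks, min_nodes):
    """Sketch of the default (segment=None) path: the first min_nodes
    arrivals in SLURM topology order become active; the rest become
    sequential standby hot spares."""
    ordered = sorted(arrived_infra_ranks)
    mapping = {r: g for g, r in enumerate(ordered)}
    active = {r: g for r, g in mapping.items() if g < min_nodes}
    standby = {r: g for r, g in mapping.items() if g >= min_nodes}
    return active, standby
```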
Segment-aware (segment=4):
Config: min_nodes=8, max_nodes=12, segment=4
Arrivals by domain:
Domain 100 (nvl72100-*): infra_ranks [0,1,2,3,4,5] (6 nodes)
Domain 101 (nvl72101-*): infra_ranks [6,7,8,9] (4 nodes)
Domain 102 (nvl72102-*): infra_ranks [10,11] (2 nodes)
Result:
Active: infra_ranks [0,1,2,3] from domain 100 (4 nodes = 1 segment)
infra_ranks [6,7,8,9] from domain 101 (4 nodes = 1 segment)
→ group_ranks [0,1,2,3,4,5,6,7]
Standby: infra_ranks [4,5] from domain 100 → group_ranks [8,9] (hot spares)
Excluded: domain 102 (< segment threshold), becomes standby if launched
infra_ranks [10,11] → group_ranks [10,11] (hot spares)
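A minimal sketch of the segment-aware selection described above, reproducing the worked example. The name `select_segment_aware` and the flat `{domain: [infra_ranks]}` input are assumptions for illustration; the PR's actual helper is `_select_by_domain` and operates on richer participant records.

```python
def select_segment_aware(arrivals_by_domain, min_nodes, segment):
    """Pick complete segments from valid domains in SLURM topology order.

    arrivals_by_domain: {domain_num: [infra_rank, ...]} with ranks already
    in topology order. Domains with fewer than `segment` nodes are skipped
    (their nodes become hot spares); selection stops at min_nodes.
    """
    if min_nodes % segment != 0:
        raise ValueError("min_nodes must be divisible by segment size")
    active, standby = [], []
    for domain in sorted(arrivals_by_domain):
        ranks = arrivals_by_domain[domain]
        if len(ranks) < segment:
            standby.extend(ranks)  # domain below segment threshold
            continue
        usable = (len(ranks) // segment) * segment  # complete segments only
        take = min(usable, min_nodes - len(active))
        active.extend(ranks[:take])
        standby.extend(ranks[take:])
    if len(active) < min_nodes:
        raise RuntimeError("not enough complete segments to satisfy min_nodes")
    # Contiguous active group ranks [0..min_nodes-1], then sequential standby ranks
    group_ranks = {r: g for g, r in enumerate(active)}
    group_ranks.update({r: min_nodes + i for i, r in enumerate(standby)})
    return group_ranks
```

Running this with the arrivals from the example (domains 100/101/102, min_nodes=8, segment=4) yields exactly the mapping shown above.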
Force-pushed 340f2e3 → 0fc5f29
This commit removes the `use_infra_group_rank` configuration parameter and
simplifies rank assignment logic to always use infrastructure-based ordering
when available (SLURM_PROCID or GROUP_RANK), with deterministic fallback
based on sorted node order when environment variables are not set.
Key Changes:
1. **Remove use_infra_group_rank configuration**
- Removed from FaultToleranceConfig, RendezvousSettings, and all handler APIs
- Removed CLI argument --ft-use-infra-group-rank from launcher
- Cleaned up documentation and examples that referenced this parameter
2. **Simplify rank assignment logic**
- Infrastructure ranks (SLURM_PROCID/GROUP_RANK) are always used when available
- When neither env var is set, ranks are assigned deterministically based on
sorted node order (infra_rank = -1 triggers deterministic assignment)
- Removed complex logic for preserving previous rank assignments
- Simplified _assign_ranks() to focus on infrastructure-based ordering
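The simplified ordering in item 2 can be sketched as follows (hypothetical standalone helper, assuming — per the commit description — that either every node reports an infrastructure rank or none does):

```python
def assign_ranks(participants):
    """Sketch: participants maps node name -> infra_rank (-1 if unset)."""
    if all(rank == -1 for rank in participants.values()):
        # Deterministic fallback: sorted node order
        ordered = sorted(participants)
    else:
        # Infrastructure-based ordering (SLURM_PROCID / GROUP_RANK)
        ordered = sorted(participants, key=lambda node: participants[node])
    return {node: group_rank for group_rank, node in enumerate(ordered)}
```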
3. **Add NIC health check support**
- Added IBLinkStateHealthCheck class for InfiniBand link state monitoring
- New config parameters: enable_nic_healthcheck, link_state_path_template
- Checks if IB ports transition from ACTIVE to non-ACTIVE state
- Complementary to existing enable_nic_monitor (link_downed counters)
- Integrated into both FtRendezvousHandler and FtRendezvousBarrierHandler
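The PR's `IBLinkStateHealthCheck` implementation is not shown here; as a rough sketch, an IB link-state probe can read the sysfs `state` files (which hold lines like `4: ACTIVE`). This simplification checks the current state only, whereas the actual check tracks ACTIVE → non-ACTIVE transitions; the default glob pattern is an assumption standing in for `link_state_path_template`.

```python
import glob

def check_ib_links_active(path_template="/sys/class/infiniband/*/ports/*/state"):
    """Return (path, state) for every discovered port not reporting ACTIVE."""
    bad = []
    for path in glob.glob(path_template):
        with open(path) as f:
            state = f.read().strip()
        if not state.endswith("ACTIVE"):
            bad.append((path, state))
    return bad  # empty list means all discovered ports look healthy
```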
4. **Update tests**
- Added BaseRendezvousTest class that clears SLURM_PROCID/GROUP_RANK env vars
- Most tests now inherit from BaseRendezvousTest for deterministic behavior
- Infrastructure rank tests explicitly inherit from TestCase to test env vars
- Removed tests for use_infra_group_rank parameter and invalid infra_rank=-1
- Updated test expectations to match new simplified rank assignment
    else:
        rack_to_participants[rack_num].append((node_desc, infra_rank, has_worker_failure))

    if nodes_without_rack:
I think a flag to fail here would be smart, I don't want to miss this warning and soldier on.
Done. "nodes_without_rack" should not happen when "domain_id_from_node_id" is True. A runtime error is raised if there is an issue parsing the domain_id.
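The strict-parsing behavior described in this reply, plus the "cache domain number parsing" note from the commit message, could look roughly like this (the function name and the fixed `nvl72` prefix are assumptions taken from the examples in this PR):

```python
import functools

@functools.lru_cache(maxsize=None)
def parse_domain_num(node_id, prefix="nvl72"):
    """Parse the domain number from a node id like 'nvl72100-T03'.

    Raises RuntimeError (instead of silently skipping the node) when the
    id does not match the expected <prefix><digits>-<node> shape; results
    are cached since node ids are parsed repeatedly."""
    head = node_id.split("-", 1)[0]
    if not head.startswith(prefix) or not head[len(prefix):].isdigit():
        raise RuntimeError(f"cannot parse domain_id from node id {node_id!r}")
    return int(head[len(prefix):])
```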
    state.participants[self._node] = 0
    # Neither env var is set - will be assigned deterministically later
    # based on sorted node order in _assign_ranks
    infra_rank = -1
Maybe an error here if SLURM_JOB_ID is set and we are in this case? Then it preserves the previous behavior.
Done
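The guard agreed in this thread can be sketched as follows (hypothetical helper; the real change lives inside the rendezvous handler): if we are inside a SLURM job but neither rank variable is available, fail loudly rather than fall back silently.

```python
import os

def infra_rank_or_fail(env=None):
    """Resolve the infrastructure rank, erroring out under SLURM when
    neither SLURM_PROCID nor GROUP_RANK is set (misconfiguration)."""
    env = os.environ if env is None else env
    for var in ("SLURM_PROCID", "GROUP_RANK"):
        if var in env:
            return int(env[var])
    if "SLURM_JOB_ID" in env:
        raise RuntimeError("SLURM job detected but SLURM_PROCID/GROUP_RANK unset")
    return -1  # non-SLURM: deterministic assignment later
```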
    # All nodes have infrastructure ranks - use them directly
    # Validate that all participants have valid infrastructure ranks
    for node, rank in participants.items():
        if rank < 0 or rank >= len(participants):
rank cannot be < 0 due to above case
rank can be > len(participants) if there are hot spares
Fixed
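After this fix, the validation presumably drops the upper bound (ranks above `len(participants)` are legal with hot spares) and rejects only duplicates, with the actionable message mentioned in the commit notes. A hypothetical sketch:

```python
def validate_infra_ranks(participants):
    """participants maps node name -> infra_rank. Ranks may exceed
    len(participants) when hot spares are present, so only duplicate
    ranks are rejected, naming both offending nodes."""
    seen = {}
    for node, rank in participants.items():
        if rank in seen:
            raise ValueError(
                f"duplicate infra_rank {rank} on nodes {seen[rank]!r} and "
                f"{node!r}; check SLURM_PROCID/GROUP_RANK on these hosts")
        seen[rank] = node
```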
src/nvidia_resiliency_ext/fault_tolerance/ft_rendezvous_barrier.py (outdated; resolved)
    domain_to_participants[domain_num].append((node_desc, infra_rank))

    if nodes_without_domain:
        log.warning(f"Found {len(nodes_without_domain)} nodes without valid domain numbers")
If we are in domain aware mode, and we cannot match it, then parsing is wrong or something is specified incorrectly -- this should be an error / failure?
We catch nodes without domain_id. This code is gone.
src/nvidia_resiliency_ext/fault_tolerance/ft_rendezvous_barrier.py (outdated; resolved)
    total_selected_nodes = 0

    for domain_num, usable_nodes, first_infra_rank in valid_domains_info:
        if total_selected_nodes >= min_nodes:
s/min_nodes/required_world_size or something like this
Changed min_nodes to world_size in the group rank assignment.
            break

        selected_domains.append((domain_num, usable_nodes))
        total_selected_nodes += usable_nodes
+= min(usable_nodes, world-size-total_selected_nodes)
done
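With the reviewer's fix applied, the selection loop caps the count taken from the final domain so the total never overshoots the required world size. A standalone sketch (2-tuples instead of the code's 3-tuples, for brevity):

```python
def select_domains(valid_domains_info, world_size):
    """valid_domains_info: [(domain_num, usable_nodes), ...] in topology
    order. Returns ([(domain_num, nodes_taken), ...], total_taken)."""
    selected, total = [], 0
    for domain_num, usable_nodes in valid_domains_info:
        if total >= world_size:
            break
        take = min(usable_nodes, world_size - total)  # the review fix
        selected.append((domain_num, take))
        total += take
    return selected, total
```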
Ignore "domain_id_from_node_id" when segment is None.
Force-pushed 01fde2f → 2c09503
Force-pushed 3d3fd0a → a391981
Force-pushed a391981 → 0f520a0
        """See base class."""
        return False

    def _run_health_check(self, health_checker, check_name: str, failure_message: str) -> None:
Can you make this a separate PR please?
Force-pushed 7392a73 → d0a0c55
Force-pushed eeafd18 → ed34da4
Add intelligent node selection and hot spare prioritization for fault-tolerant LLM training, with rack-topology awareness and SLURM-ordered rank assignment. The feature enables rack-aware node placement, complete-segment selection, and failure-rate-based prioritization of healthy racks.
Key Features:
Implementation Details:
Configuration:
Node naming convention: <rack_id>-<node_id>, where <rack_id> embeds the numeric <rack_number>
Example: "nvl72144-T01" → rack_number=144