Conversation

@hexinw-nvidia (Contributor)

Add intelligent node selection and hot spare prioritization for fault-tolerant LLM training with rack topology awareness and SLURM-ordered rank assignment. The feature enables rack-aware node placement, complete segment selection, and granular failure rate-based prioritization of healthy racks.

Key Features:

  • Rack-aware node selection: Groups nodes by rack and ensures balanced distribution across racks based on segment size
  • SLURM topology ordering: Uses infra_rank (SLURM_PROCID) to maintain SLURM_JOB_NODELIST ordering for deterministic rank assignment
  • Complete segment selection: Selects as many complete segments as possible from each rack (e.g., rack_size=12, segment=4 → use 12 nodes = 3 segments; see the sketch after this list)
  • Granular rack prioritization: Racks are prioritized by failure rate (0% first, then low, then high), giving finer-grained selection than a binary healthy/unhealthy classification
  • Failure history tracking: Automatically tracks and deprioritizes nodes with worker failures for one rendezvous round
  • Two operating modes:
    • Fill-Gap mode (segment=None): Existing hot spare support with simple gap filling for hardware failures
    • Segment-aware mode (segment=N): Extended hot spare with rack topology awareness and failure rate-based prioritization
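
The complete-segment rule above boils down to rounding each rack's node count down to a multiple of the segment size. A minimal sketch (the helper name is hypothetical, not code from this PR):

```python
def usable_nodes_in_rack(rack_node_count: int, segment: int) -> int:
    # Only whole segments count toward a rack's usable capacity:
    # rack_node_count=12, segment=4 -> 12 (3 complete segments)
    # rack_node_count=10, segment=4 -> 8  (2 complete segments)
    return (rack_node_count // segment) * segment
```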

Implementation Details:

  • Add segment-aware filtering in _filter_participants_by_segment() that (see the sketch after this list):
    • Sorts participants by infra_rank to maintain SLURM topology order
    • Groups participants by rack number parsed from node hostnames
    • Calculates usable nodes per rack (complete segments only)
    • Prioritizes racks by failure rate with SLURM topology as secondary key
    • Selects exactly min_nodes from prioritized racks in topology order
    • Reassigns contiguous ranks (0,1,2,...) maintaining SLURM order
  • Extend RendezvousParticipantInfo to include has_worker_failure flag
  • Add mark_node_with_worker_failure() to track nodes with worker failures
  • Integrate with LocalElasticAgent to mark failed nodes before restart
  • Add rack number caching to optimize repeated parsing operations
  • Remove redundant sorting from get_all_participants(), eliminating O(N log N) work
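
For reference, a hedged sketch of the selection flow the bullets above describe; the actual _filter_participants_by_segment() signature, participant fields, and helpers in the PR may differ:

```python
from collections import defaultdict

def select_by_segment(participants, segment, min_nodes, rack_of, failure_rate_of):
    """Sketch only: rack_of() and failure_rate_of() are stand-ins for the PR's
    hostname parsing and failure-history tracking."""
    # Keep SLURM topology order by sorting on infra_rank.
    ordered = sorted(participants, key=lambda p: p.infra_rank)

    # Group participants by rack number parsed from the hostname.
    racks = defaultdict(list)
    for p in ordered:
        racks[rack_of(p.hostname)].append(p)

    # Keep only complete segments per rack.
    usable = {r: nodes[: (len(nodes) // segment) * segment] for r, nodes in racks.items()}

    # Prioritize racks by failure rate, SLURM topology (first infra_rank) as tie-break.
    rack_order = sorted(
        (r for r, nodes in usable.items() if nodes),
        key=lambda r: (failure_rate_of(r), usable[r][0].infra_rank),
    )

    # Select exactly min_nodes in topology order, then reassign contiguous ranks.
    selected = []
    for r in rack_order:
        take = min(len(usable[r]), min_nodes - len(selected))
        selected.extend(usable[r][:take])
        if len(selected) == min_nodes:
            break
    return {p: rank for rank, p in enumerate(selected)}
```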

Configuration:

  • segment: Minimum nodes per rack (None disables segment awareness)
  • rack_id_from_node_name: Enable parsing rack ID from node hostname
  • rack_id_prefix: Prefix to strip from rack_id (default: "nvl72")
  • prefer_healthy_racks: Prefer healthy racks with lower failure rates (default: True, renamed from prefer_spare_over_recovered_node)

Node naming convention: <rack_id>-<node_id>, where rack_id = <rack_id_prefix><rack_number>
Example: "nvl72144-T01" → rack_number=144 (after stripping the "nvl72" prefix)

This commit implements segment-aware participant selection for hardware
fault tolerance and simplifies the rank assignment logic by unifying the
handling of both segment-aware and default selection modes.

Key changes:

1. Segment-aware participant selection
   - When segment parameter is set, enforce domain-level constraints:
     * Domains must have >= segment nodes to be valid
     * Select as many complete segments as possible from each domain
     * Select exactly min_nodes participants across valid domains
   - Selection follows SLURM topology order for deterministic behavior
   - Validate min_nodes is divisible by segment size
   - Cache domain number parsing for performance

2. Unified rank assignment for both configurations
   - Simplified _assign_group_ranks_with_infra_rank to handle both (sketched after the examples below):
     * Segment-aware: domain-constrained selection via _select_by_domain
     * Default (segment=None): first min_nodes in SLURM topology order
   - Both produce contiguous active ranks [0..min_nodes-1] plus standby ranks
   - Active participants selected based on configuration
   - Unselected nodes assigned sequential standby ranks (min_nodes, min_nodes+1, ...)
   - Standby nodes complete rendezvous and wait with local_world_size=0

3. Improved code clarity
   - Removed confusing "Fill-gap mode" terminology
   - Refer to non-segment behavior as "Default behavior (segment=None)"
   - Added clear logging distinguishing active vs standby rank assignments
   - Enhanced duplicate infra_rank validation with actionable error messages

Examples:

Default behavior (segment=None):
  Config: min_nodes=5, max_nodes=6
  Arrivals: infra_ranks [0, 1, 3, 4, 10, 11]
  Result:
    Active:  infra_ranks [0,1,3,4,10] → group_ranks [0,1,2,3,4]
    Standby: infra_rank [11] → group_rank [5] (hot spare)

Segment-aware (segment=4):
  Config: min_nodes=8, max_nodes=12, segment=4
  Arrivals by domain:
    Domain 100 (nvl72100-*): infra_ranks [0,1,2,3,4,5] (6 nodes)
    Domain 101 (nvl72101-*): infra_ranks [6,7,8,9] (4 nodes)
    Domain 102 (nvl72102-*): infra_ranks [10,11] (2 nodes)
  Result:
    Active:  infra_ranks [0,1,2,3] from domain 100 (4 nodes = 1 segment)
             infra_ranks [6,7,8,9] from domain 101 (4 nodes = 1 segment)
             → group_ranks [0,1,2,3,4,5,6,7]
    Standby: infra_ranks [4,5] from domain 100 → group_ranks [8,9] (hot spares)
    Excluded: domain 102 (< segment threshold), becomes standby if launched
             infra_ranks [10,11] → group_ranks [10,11] (hot spares)
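
A minimal sketch of the unified active/standby assignment that produces the group_ranks in both examples above (hypothetical helper name, not the actual _assign_group_ranks_with_infra_rank signature):

```python
def assign_group_ranks(ordered_nodes, active_nodes, min_nodes):
    """ordered_nodes is the SLURM-topology-ordered node list; active_nodes is
    whichever subset the configuration selected (segment-aware or first min_nodes)."""
    ranks = {}
    next_active, next_standby = 0, min_nodes
    for node in ordered_nodes:
        if node in active_nodes:
            ranks[node] = next_active      # contiguous active ranks [0 .. min_nodes-1]
            next_active += 1
        else:
            ranks[node] = next_standby     # hot spares: min_nodes, min_nodes+1, ...
            next_standby += 1
    return ranks
```
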
This commit removes the `use_infra_group_rank` configuration parameter and
simplifies rank assignment logic to always use infrastructure-based ordering
when available (SLURM_PROCID or GROUP_RANK), with deterministic fallback
based on sorted node order when environment variables are not set.

Key Changes:

1. **Remove use_infra_group_rank configuration**
   - Removed from FaultToleranceConfig, RendezvousSettings, and all handler APIs
   - Removed CLI argument --ft-use-infra-group-rank from launcher
   - Cleaned up documentation and examples that referenced this parameter

2. **Simplify rank assignment logic**
   - Infrastructure ranks (SLURM_PROCID/GROUP_RANK) are always used when available (sketched after this list)
   - When neither env var is set, ranks are assigned deterministically based on
     sorted node order (infra_rank = -1 triggers deterministic assignment)
   - Removed complex logic for preserving previous rank assignments
   - Simplified _assign_ranks() to focus on infrastructure-based ordering

3. **Add NIC health check support**
   - Added IBLinkStateHealthCheck class for InfiniBand link state monitoring (sketched after this list)
   - New config parameters: enable_nic_healthcheck, link_state_path_template
   - Checks if IB ports transition from ACTIVE to non-ACTIVE state
   - Complementary to existing enable_nic_monitor (link_downed counters)
   - Integrated into both FtRendezvousHandler and FtRendezvousBarrierHandler

4. **Update tests**
   - Added BaseRendezvousTest class that clears SLURM_PROCID/GROUP_RANK env vars (sketched after this list)
   - Most tests now inherit from BaseRendezvousTest for deterministic behavior
   - Infrastructure rank tests explicitly inherit from TestCase to test env vars
   - Removed tests for use_infra_group_rank parameter and invalid infra_rank=-1
   - Updated test expectations to match new simplified rank assignment
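
For reference, a minimal sketch of the ordering source from item 2, assuming only the env var lookup and the -1 sentinel (the function name is hypothetical):

```python
import os

def resolve_infra_rank() -> int:
    # Prefer SLURM_PROCID, then GROUP_RANK; -1 triggers the deterministic
    # fallback based on sorted node order in _assign_ranks().
    for var in ("SLURM_PROCID", "GROUP_RANK"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return -1
```

A hedged sketch of the link-state check from item 3; the real IBLinkStateHealthCheck builds the path from link_state_path_template, which on typical systems resembles /sys/class/infiniband/<device>/ports/<port>/state and contains a string such as "4: ACTIVE":

```python
def ib_port_is_active(state_path: str) -> bool:
    try:
        with open(state_path) as f:
            return "ACTIVE" in f.read()
    except OSError:
        return False  # an unreadable port is treated as unhealthy here
```

And a sketch of the test base class from item 4, assuming it does little more than clear the two env vars so rank assignment stays deterministic in tests that do not exercise them:

```python
import os
import unittest

class BaseRendezvousTest(unittest.TestCase):
    def setUp(self):
        super().setUp()
        for var in ("SLURM_PROCID", "GROUP_RANK"):
            os.environ.pop(var, None)
```
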
else:
    rack_to_participants[rack_num].append((node_desc, infra_rank, has_worker_failure))

if nodes_without_rack:
Contributor:

I think a flag to fail here would be smart, I don't want to miss this warning and soldier on.

Contributor Author:

Done. "nodes_without_rack" should not happen when "domain_id_from_node_id" is True. A runtime error is raised if there is an issue parsing the domain_id.

state.participants[self._node] = 0
# Neither env var is set - will be assigned deterministically later
# based on sorted node order in _assign_ranks
infra_rank = -1
Contributor:

Maybe an error here if SLURM_JOB_ID is set and we are in this case? Then it preserves the previous behavior.

Contributor Author:

Done

# All nodes have infrastructure ranks - use them directly
# Validate that all participants have valid infrastructure ranks
for node, rank in participants.items():
    if rank < 0 or rank >= len(participants):
Contributor:

rank cannot be < 0 due to above case

Contributor:

rank can be > len(participants) if there are hot spares

Contributor Author:

Fixed

domain_to_participants[domain_num].append((node_desc, infra_rank))

if nodes_without_domain:
    log.warning(f"Found {len(nodes_without_domain)} nodes without valid domain numbers")
Contributor:

If we are in domain aware mode, and we cannot match it, then parsing is wrong or something is specified incorrectly -- this should be an error / failure?

Contributor Author:

We catch nodes without domain_id. This code is gone.

total_selected_nodes = 0

for domain_num, usable_nodes, first_infra_rank in valid_domains_info:
    if total_selected_nodes >= min_nodes:
Contributor:

s/min_nodes/required_world_size or something like this

Contributor Author:

Changed min_nodes to world_size in the group rank assignment.

        break

    selected_domains.append((domain_num, usable_nodes))
    total_selected_nodes += usable_nodes
Contributor:

+= min(usable_nodes, world_size - total_selected_nodes)

Contributor Author:

Done.

@hexinw-nvidia force-pushed the hot_spare branch 2 times, most recently from 3d3fd0a to a391981 on December 5, 2025 at 13:41
"""See base class."""
return False

def _run_health_check(self, health_checker, check_name: str, failure_message: str) -> None:
Contributor:

Can you make this a separate PR please?
