@hexinw-nvidia
Implement a comprehensive cycle management system to track fault tolerance cycles and enable external failure attribution modules to report failure reasons via REST API. The launcher uses this information to make early exit decisions for non-recoverable hardware failures.

Key changes:

  • Add built-in HTTP server (default port 2025):

    • Enabled by default; disable with --no-http-server
    • Implemented in launcher_server.py using werkzeug
  • Cycle Management System:

    • Add CycleManager singleton to maintain LRU cache of recent cycles
    • Add Cycle class to store profiling events, failure reasons, and metadata
    • Store all profiling events (WORKER_START_STARTED, WORKER_START_COMPLETED, FAILURE_DETECTED, etc.) within cycle objects with timestamps
    • Support negative indexing for cycle queries (e.g., -1 for last cycle)
    • Implement check_recent_cycles_for_exit() to detect non-recoverable failures (GPU_HW_FAILURE, MEMORY_HW_FAILURE, etc.)
  • REST API Endpoints:

    • Add GET / endpoint returning launcher_start_time
    • Add GET /cycles endpoint to query all or specific cycles with events
      • Support query parameter: /cycles?cycle_number=3
      • Support negative indexing: /cycles?cycle_number=-1
    • Add POST /cycles endpoint for external modules to update cycle failure_reason and metadata (requires existing cycle)
  • Profiler Refactoring:

    • Remove global profiler singleton and record_profiling_event() function
    • Make FaultToleranceProfiler an instance variable of LocalElasticAgent
    • Pass profiler to rendezvous handlers via set_profiler() method
    • All profiling events now stored in cycle objects via cycle.add_event()
    • Remove explicit cycle_start_time tracking (derived from events)
  • Launcher Integration:

    • Integrate the cycle check into _monitor_workers(), throttled to run at most once every 5 seconds
    • Set _remaining_restarts=0 when a non-recoverable failure is detected
    • Prevent restart attempts for hardware failures that won't recover
    • Exit the job early when an external module reports a non-recoverable failure
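
The cycle tracking and the throttled exit check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `max_cycles` and `lookback` parameters, the `ThrottledExitCheck` helper, and the exact contents of `NON_RECOVERABLE` are assumptions, and eviction here simply drops the oldest cycle by insertion order.

```python
import time
from collections import OrderedDict
from dataclasses import dataclass, field
from typing import Optional

# Assumed set of non-recoverable reasons; the PR names these two as examples.
NON_RECOVERABLE = {"GPU_HW_FAILURE", "MEMORY_HW_FAILURE"}


@dataclass
class Cycle:
    cycle_number: int
    events: list = field(default_factory=list)
    failure_reason: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    def add_event(self, name, timestamp=None):
        # Store each profiling event together with its timestamp.
        self.events.append((name, time.time() if timestamp is None else timestamp))


class CycleManager:
    def __init__(self, max_cycles=4):
        self._cycles = OrderedDict()
        self._max_cycles = max_cycles

    def start_cycle(self, cycle_number):
        cycle = Cycle(cycle_number)
        self._cycles[cycle_number] = cycle
        # Evict the oldest cycles once the cache is over capacity.
        while len(self._cycles) > self._max_cycles:
            self._cycles.popitem(last=False)
        return cycle

    def get(self, cycle_number):
        # Negative numbers index from the most recent cycle (-1 is the last).
        if cycle_number < 0:
            numbers = list(self._cycles)
            if -cycle_number > len(numbers):
                return None
            return self._cycles[numbers[cycle_number]]
        return self._cycles.get(cycle_number)

    def check_recent_cycles_for_exit(self, lookback=1):
        recent = list(self._cycles.values())[-lookback:]
        return any(c.failure_reason in NON_RECOVERABLE for c in recent)


class ThrottledExitCheck:
    """Hypothetical helper: runs the cycle check at most once per interval."""

    def __init__(self, manager, interval=5.0, clock=time.monotonic):
        self._manager = manager
        self._interval = interval
        self._clock = clock
        self._last_check = None

    def should_exit(self):
        now = self._clock()
        if self._last_check is not None and now - self._last_check < self._interval:
            return False  # throttled: skip the check this time
        self._last_check = now
        return self._manager.check_recent_cycles_for_exit()


manager = CycleManager(max_cycles=2)
for number in range(3):
    manager.start_cycle(number).add_event("WORKER_START_STARTED")
manager.get(-1).failure_reason = "GPU_HW_FAILURE"
print(manager.get(0))                          # None: cycle 0 was evicted
print(manager.check_recent_cycles_for_exit())  # True
```

In a monitoring loop, a `True` result from `should_exit()` would be the point at which the launcher sets `_remaining_restarts=0` instead of attempting another restart.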

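For an external failure attribution module, reporting a failure through the `POST /cycles` endpoint might look like the sketch below. The JSON body and its field names (`cycle_number`, `failure_reason`, `metadata`) are assumptions modeled on the description above, not a confirmed request schema, and the helper names are hypothetical. The host and port reflect the default port 2025 on a local launcher.

```python
import json
import urllib.request


def build_failure_report(base_url, cycle_number, failure_reason, metadata=None):
    """Build (but do not send) a POST /cycles request.

    The payload layout is an assumption; check the launcher's actual
    handler for the fields it expects.
    """
    payload = {
        "cycle_number": cycle_number,
        "failure_reason": failure_reason,
        "metadata": metadata or {},
    }
    return urllib.request.Request(
        url=f"{base_url}/cycles",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def latest_cycle_url(base_url):
    # Negative indexing as described in the PR: -1 selects the last cycle.
    return f"{base_url}/cycles?cycle_number=-1"


request = build_failure_report(
    "http://localhost:2025", 3, "GPU_HW_FAILURE", {"node": "node-0"}
)
print(request.get_method(), request.full_url)
print(latest_cycle_url("http://localhost:2025"))

# Sending requires a running launcher with the HTTP server enabled:
# with urllib.request.urlopen(request, timeout=5) as response:
#     print(response.status)
```

Since the endpoint requires an existing cycle, a client would typically query `GET /cycles?cycle_number=-1` first and report against the cycle number it gets back.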