Conversation

@namitdhameja commented Dec 6, 2025

The FT launcher accepts an optional argument pointing to the Infra health check service (InfraHCD) Unix domain socket. When provided, the launcher exports the socket path to child processes, and the rendezvous handlers use it in their node health checks.
NVRX requests a complete Slurm epilog health-check run before deciding to restart the workload.

  • --infrahc-socket (alias: --infrahc_socket) sets the InfraHCD Unix socket path.
  • The launcher propagates this value via the INFRAHCD_SOCKET environment variable.
  • The rendezvous implementations call InfraNodeHealthCheck, which connects to this socket.
  • Connectivity errors are treated as non-fatal (health passes); explicit RPC failures reported by the service mark the node unhealthy (see the sketch below).
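A minimal sketch of that client-side behavior, assuming the launcher has exported INFRAHCD_SOCKET; the class shape, method name, and timeout are assumptions, only the names InfraNodeHealthCheck and INFRAHCD_SOCKET come from this PR:

```python
# Minimal sketch; the real InfraNodeHealthCheck lives in the rendezvous
# code, and this signature/timeout are assumptions.
import os
import grpc

class InfraNodeHealthCheck:
    def __init__(self, socket_path=None):
        # The launcher exports the socket path via INFRAHCD_SOCKET.
        self.socket_path = socket_path or os.environ.get("INFRAHCD_SOCKET")

    def is_healthy(self) -> bool:
        if not self.socket_path:
            return True  # service not configured: optional check passes
        try:
            # gRPC UDS target form: unix:///abs/path
            with grpc.insecure_channel(f"unix://{self.socket_path}") as channel:
                grpc.channel_ready_future(channel).result(timeout=5)
                # ...issue the health-check RPC here; an explicit failure
                # reported by the service returns False (node unhealthy)...
        except grpc.FutureTimeoutError:
            return True  # connectivity error: non-fatal, health passes
        return True
```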

@namitdhameja self-assigned this Dec 6, 2025
@namitdhameja changed the title infra HC service over UDS → Infra HC service over UDS Dec 6, 2025

Details:

* ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path.
Contributor

In the implementation, the health check is doing gRPC over UDS. We should just focus on conveying that it is a health check service that InJob uses. So we can change this config parameter to "--ft-node-health-check-endpoint" and automatically parse it as UDS, IP, or DNS. The naming would also be consistent with the existing health-check flag "--ft-node-health-check-interval" in the current code.

Details:

* ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path.
* The launcher propagates this value via the ``INFRAHCD_SOCKET`` environment variable.
Contributor

Let's avoid using the ENV. It is not necessary.
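If the environment variable is dropped, the value could be threaded through explicitly instead; a sketch with hypothetical names:

```python
# Hypothetical plumbing: the launcher passes the CLI value straight to the
# health check object instead of exporting INFRAHCD_SOCKET.
class NodeHealthCheck:
    def __init__(self, socket_path: str | None):
        self.socket_path = socket_path  # may be None: check becomes a no-op

def build_health_check(args) -> NodeHealthCheck:
    # args.infrahc_socket is the parsed --infrahc-socket value
    return NodeHealthCheck(socket_path=args.infrahc_socket)
```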


* ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path.
* The launcher propagates this value via the ``INFRAHCD_SOCKET`` environment variable.
* The rendezvous implementations call ``InfraNodeHealthCheck`` which will connect to this socket.
Contributor

Do we need to emphasize "Infra"? We should just make it generic, a NodeHealthCheck service.

without restarting remaining workers, e.g., with the :doc:`../inprocess/index`.
For details on how ``min-healthy`` policy interacts with :doc:`../inprocess/index` see :doc:`integration/inprocess`.

Infra health check service (InfraHCD)
Contributor

I prefer just "Node health check service". In the future, the service could live somewhere in the cloud; it would take a "node_id" or "node_name" and a "time_range" argument for you to query the node's health status.

Contributor Author

Connecting to an external service introduces significant security implications, particularly around authentication and secure credential injection within containers. Given these complexities, I don't think we should try to future-proof for this scenario now. It's different enough that we should address it when that work becomes concrete.

Contributor Author

time_range itself is a good one to have and I would have liked to support it. Currently, the underlying HC service does have a time_range for parsing, say, syslog, but those values are hard-coded; if we really want to change that, we would have to go through a cycle with that team for justification, etc. Note that I don't think we have a strong use case yet.

msg = f"Checking health status of {self._this_node}."
self._record(message=msg)
# Perform GPU health check
# Perform GPU and Infra node health checks
Contributor

Just a heads up: I am also touching the "ensure_node_is_healthy" code in my NIC link state health check. We might need to resolve the conflict during merge.

self._grpc = _grpc_mod
self._pb2 = _pb2_mod
self._pb2_grpc = _pb2_grpc_mod
except Exception as e:
Contributor

We don't need to carry the whole grpc module here. We should fully leverage the object-oriented design provided by gRPC. In this particular scenario, we should subclass HealthCheckServiceServicer and use add_HealthCheckServiceServicer_to_server; refer to nvhcd_pb2_grpc.py. At the end of the day, all we care about is that there is a health check gRPC server and that we can invoke its gRPC APIs.

You can refer to how VACE integrates with the gRPC service framework as a working example:
https://gitlab-master.nvidia.com/ngcc/vace/-/blob/main/src/vace/server/rpc_service.py?ref_type=heads#L36
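A sketch of that pattern, assuming the generated nvhcd_pb2 / nvhcd_pb2_grpc modules; only HealthCheckServiceServicer and add_HealthCheckServiceServicer_to_server are named in the comment, while the RPC method RunHealthCheck and the response shape below are assumptions:

```python
# Sketch of the servicer pattern the comment describes; RunHealthCheck,
# HealthCheckResponse(healthy=...), and run_check are assumptions.
from concurrent import futures
import grpc
import nvhcd_pb2
import nvhcd_pb2_grpc

class HealthCheckService(nvhcd_pb2_grpc.HealthCheckServiceServicer):
    def RunHealthCheck(self, request, context):
        # request.args carries the check-category strings
        # (per HealthCheckRequest in nvhcd.proto)
        ok = all(run_check(arg) for arg in request.args)
        return nvhcd_pb2.HealthCheckResponse(healthy=ok)

def run_check(category: str) -> bool:
    return True  # placeholder for invoking the actual health check script

def serve(socket_path: str) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
    nvhcd_pb2_grpc.add_HealthCheckServiceServicer_to_server(HealthCheckService(), server)
    server.add_insecure_port(f"unix://{socket_path}")
    server.start()
    server.wait_for_termination()
```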

return True

# If socket does not exist, assume service not deployed and return True (non-fatal/optional check)
if not os.path.exists(self.socket_path):
Contributor

This check is not necessary. If the path doesn't exist, the gRPC handle will be None, which we have already checked.

- On gRPC connectivity errors, return False.
"""
# If gRPC client cannot be constructed, return True (non-fatal/optional check)
if self._grpc is None or self._pb2 is None or self._pb2_grpc is None:
Contributor

We can just check if the health check gRPC service module is instantiated correctly or not.
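In other words (a sketch; the class shape and the _stub attribute are illustrative, not the PR's actual code):

```python
# Sketch of the simplification: gate on a single client handle instead of
# three separate module references.
class NodeHealthCheck:
    def __init__(self, stub=None):
        self._stub = stub  # generated gRPC stub, or None if setup failed

    def is_healthy(self) -> bool:
        if self._stub is None:
            return True  # client unavailable: optional check passes
        return self._call_service()

    def _call_service(self) -> bool:
        return True  # placeholder for the actual RPC
```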

// HealthCheckRequest contains the arguments to pass to the health check script
message HealthCheckRequest {
// Arguments to pass to the health check script
repeated string args = 1;
Contributor

It is a category of health checks that we can request the health check service to run. Should we make it an enum instead of a string?

Contributor Author

It buys us a slightly better abstraction at the cost of another round on the server side, with regeneration of the .deb and all. I don't think it's worth the delay. We should document the valid strings that indicate the check category; I will update that.
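For example, the documented strings could be mirrored client-side; the category names below are placeholders, not the actual documented values:

```python
import enum

# Hypothetical client-side catalog of the documented category strings;
# the real values must come from the health check service documentation.
class HealthCheckCategory(str, enum.Enum):
    GPU = "gpu"
    NETWORK = "network"
    STORAGE = "storage"

# Sent as the repeated `args` field of HealthCheckRequest, e.g.:
# nvhcd_pb2.HealthCheckRequest(args=[HealthCheckCategory.GPU.value])
```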


return True

except Exception as e:
Contributor

The gRPC service itself can also raise exceptions for errors in its own gRPC layer. We should catch those and have more explicit gRPC error handling.
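For example, connectivity-style status codes could keep the non-fatal semantics while other codes fail the node; a sketch using the standard grpc.RpcError / StatusCode API, where the exact policy split is an assumption:

```python
import grpc

# Sketch: map gRPC status codes onto the PR's pass/fail policy.
def health_result_from_rpc_error(err: grpc.RpcError) -> bool:
    """True = treat node as healthy (non-fatal), False = unhealthy."""
    if err.code() in (grpc.StatusCode.UNAVAILABLE,
                      grpc.StatusCode.DEADLINE_EXCEEDED):
        return True   # service down or slow: optional check passes
    return False      # INTERNAL, UNKNOWN, etc.: explicit failure, node unhealthy
```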

Contributor Author

What would the more explicit gRPC error handling be? What can be done if the UDS is specified but unavailable, unresponsive, or failing in any of a myriad of other ways? nvrx can potentially have notification support; is that what you are alluding to? Is this related to the Slack channel notification work?
