Conversation

@namitdhameja commented Dec 6, 2025

The FT launcher accepts an optional argument pointing to the Infra health check service (InfraHCD) Unix domain socket. When provided, the launcher exports the socket path to child processes, and the rendezvous handlers use it in their node health checks.
NVRX requests a complete Slurm epilog health-check run before deciding to restart the workload.

  • --infrahc-socket (alias: --infrahc_socket) sets the InfraHCD Unix socket path.
  • The launcher propagates this value via the INFRAHCD_SOCKET environment variable.
  • The rendezvous implementations call InfraNodeHealthCheck, which connects to this socket.
  • Connectivity errors are treated as non-fatal (health passes); explicit RPC failures reported by the service mark the node unhealthy (see the sketch below).
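A minimal sketch of that client-side behavior, assuming the launcher has exported INFRAHCD_SOCKET; the class shape, method name, and timeout are assumptions, only the names InfraNodeHealthCheck and INFRAHCD_SOCKET come from this PR:

```python
# Minimal sketch; the real InfraNodeHealthCheck lives in the rendezvous
# code, and this signature/timeout are assumptions.
import os
import grpc

class InfraNodeHealthCheck:
    def __init__(self, socket_path=None):
        # The launcher exports the socket path via INFRAHCD_SOCKET.
        self.socket_path = socket_path or os.environ.get("INFRAHCD_SOCKET")

    def is_healthy(self) -> bool:
        if not self.socket_path:
            return True  # service not configured: optional check passes
        try:
            # gRPC UDS target form: unix:///abs/path
            with grpc.insecure_channel(f"unix://{self.socket_path}") as channel:
                grpc.channel_ready_future(channel).result(timeout=5)
                # ...issue the health-check RPC here; an explicit failure
                # reported by the service returns False (node unhealthy)...
        except grpc.FutureTimeoutError:
            return True  # connectivity error: non-fatal, health passes
        return True
```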

@namitdhameja self-assigned this Dec 6, 2025
@namitdhameja changed the title infra HC service over UDS → Infra HC service over UDS Dec 6, 2025

Details:

* ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path.
Contributor

In the implementation, the health check is doing gRPC over UDS. We should just focus on conveying that it is a health check service that InJob uses. So we can change this config parameter to "--ft-node-health-check-endpoint" and automatically parse it as UDS, IP, or DNS. The naming would also be consistent with the existing health-check flag "--ft-node-health-check-interval" in the current code.

Details:

* ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path.
* The launcher propagates this value via the ``INFRAHCD_SOCKET`` environment variable.
Contributor

Let's avoid using the ENV. It is not necessary.
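If the environment variable is dropped, the value could be threaded through explicitly instead; a sketch with hypothetical names:

```python
# Hypothetical plumbing: the launcher passes the CLI value straight to the
# health check object instead of exporting INFRAHCD_SOCKET.
class NodeHealthCheck:
    def __init__(self, socket_path: str | None):
        self.socket_path = socket_path  # may be None: check becomes a no-op

def build_health_check(args) -> NodeHealthCheck:
    # args.infrahc_socket is the parsed --infrahc-socket value
    return NodeHealthCheck(socket_path=args.infrahc_socket)
```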


* ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path.
* The launcher propagates this value via the ``INFRAHCD_SOCKET`` environment variable.
* The rendezvous implementations call ``InfraNodeHealthCheck`` which will connect to this socket.
Contributor

Do we need to emphasize "Infra"? We should just make it generic, a NodeHealthCheck service.

without restarting remaining workers, e.g., with the :doc:`../inprocess/index`.
For details on how ``min-healthy`` policy interacts with :doc:`../inprocess/index` see :doc:`integration/inprocess`.

Infra health check service (InfraHCD)
Contributor

I prefer just "Node health check service". In the future, the service could live somewhere in the cloud; it would take a "node_id" or "node_name" and a "time_range" argument for you to query the node's health status.

Contributor Author

Connecting to an external service introduces significant security implications, particularly around authentication and secure credential injection within containers. Given these complexities, I don't think we should try to future-proof for this scenario now. It's different enough that we should address it when that work becomes concrete.

Contributor Author

time_range itself is a good one to have and I would have liked to support it. Currently, the underlying HC service does have a time_range for parsing, say, syslog, but those values are hard-coded; if we really want to change that, we would have to go through a cycle with that team for justification, etc. Note that I don't think we have a strong use case yet.

msg = f"Checking health status of {self._this_node}."
self._record(message=msg)
# Perform GPU health check
# Perform GPU and Infra node health checks
Contributor

Just a heads up: I am also touching the "ensure_node_is_healthy" code in my NIC link state health check. We might need to resolve the conflict during merge.

self._grpc = _grpc_mod
self._pb2 = _pb2_mod
self._pb2_grpc = _pb2_grpc_mod
except Exception as e:
Contributor

We don't need to carry the whole grpc module here. We should fully leverage the object-oriented design provided by gRPC. In this particular scenario, we should subclass HealthCheckServiceServicer and use add_HealthCheckServiceServicer_to_server; refer to nvhcd_pb2_grpc.py. At the end of the day, all we care about is that there is a health check gRPC server and that we can invoke its gRPC APIs.

You can refer to how VACE integrates with the gRPC service framework as a working example:
https://gitlab-master.nvidia.com/ngcc/vace/-/blob/main/src/vace/server/rpc_service.py?ref_type=heads#L36
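A sketch of that pattern, assuming the generated nvhcd_pb2 / nvhcd_pb2_grpc modules; only HealthCheckServiceServicer and add_HealthCheckServiceServicer_to_server are named in the comment, while the RPC method RunHealthCheck and the response shape below are assumptions:

```python
# Sketch of the servicer pattern the comment describes; RunHealthCheck,
# HealthCheckResponse(healthy=...), and run_check are assumptions.
from concurrent import futures
import grpc
import nvhcd_pb2
import nvhcd_pb2_grpc

class HealthCheckService(nvhcd_pb2_grpc.HealthCheckServiceServicer):
    def RunHealthCheck(self, request, context):
        # request.args carries the check-category strings
        # (per HealthCheckRequest in nvhcd.proto)
        ok = all(run_check(arg) for arg in request.args)
        return nvhcd_pb2.HealthCheckResponse(healthy=ok)

def run_check(category: str) -> bool:
    return True  # placeholder for invoking the actual health check script

def serve(socket_path: str) -> None:
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=2))
    nvhcd_pb2_grpc.add_HealthCheckServiceServicer_to_server(HealthCheckService(), server)
    server.add_insecure_port(f"unix://{socket_path}")
    server.start()
    server.wait_for_termination()
```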

return True

# If socket does not exist, assume service not deployed and return True (non-fatal/optional check)
if not os.path.exists(self.socket_path):
Contributor

This check is not necessary. If the path doesn't exist, the gRPC handle will be None, which we have already checked.

- On gRPC connectivity errors, return False.
"""
# If gRPC client cannot be constructed, return True (non-fatal/optional check)
if self._grpc is None or self._pb2 is None or self._pb2_grpc is None:
Contributor

We can just check if the health check gRPC service module is instantiated correctly or not.
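In other words (a sketch; the class shape and the _stub attribute are illustrative, not the PR's actual code):

```python
# Sketch of the simplification: gate on a single client handle instead of
# three separate module references.
class NodeHealthCheck:
    def __init__(self, stub=None):
        self._stub = stub  # generated gRPC stub, or None if setup failed

    def is_healthy(self) -> bool:
        if self._stub is None:
            return True  # client unavailable: optional check passes
        return self._call_service()

    def _call_service(self) -> bool:
        return True  # placeholder for the actual RPC
```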

// HealthCheckRequest contains the arguments to pass to the health check script
message HealthCheckRequest {
// Arguments to pass to the health check script
repeated string args = 1;
Contributor

It is a category of health checks that we can request the health check service to run. Should we make it an enum instead of a string?

Contributor Author

It buys us a slightly better abstraction at the cost of another round on the server side, with regeneration of the .deb and all. I don't think it's worth the delay. We should document the valid strings that indicate the check category; I will update that.
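For example, the documented strings could be mirrored client-side; the category names below are placeholders, not the actual documented values:

```python
import enum

# Hypothetical client-side catalog of the documented category strings;
# the real values must come from the health check service documentation.
class HealthCheckCategory(str, enum.Enum):
    GPU = "gpu"
    NETWORK = "network"
    STORAGE = "storage"

# Sent as the repeated `args` field of HealthCheckRequest, e.g.:
# nvhcd_pb2.HealthCheckRequest(args=[HealthCheckCategory.GPU.value])
```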


return True

except Exception as e:
Contributor

The gRPC service itself can also raise exceptions for errors in its own gRPC layer. We should catch those and have more explicit gRPC error handling.
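For example, connectivity-style status codes could keep the non-fatal semantics while other codes fail the node; a sketch using the standard grpc.RpcError / StatusCode API, where the exact policy split is an assumption:

```python
import grpc

# Sketch: map gRPC status codes onto the PR's pass/fail policy.
def health_result_from_rpc_error(err: grpc.RpcError) -> bool:
    """True = treat node as healthy (non-fatal), False = unhealthy."""
    if err.code() in (grpc.StatusCode.UNAVAILABLE,
                      grpc.StatusCode.DEADLINE_EXCEEDED):
        return True   # service down or slow: optional check passes
    return False      # INTERNAL, UNKNOWN, etc.: explicit failure, node unhealthy
```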

Contributor Author

What would the more explicit gRPC error handling be? What can be done if the UDS is specified but unavailable, unresponsive, or failing in any of a myriad of other ways? nvrx can potentially have notification support; is that what you are alluding to? Is this related to the Slack channel notification work?
