-
Notifications
You must be signed in to change notification settings - Fork 37
Infra HC service over UDS #227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
|
||
| Details: | ||
|
|
||
| * ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the implementation, the health check is doing gRPC over UDS. We should just focus on conveying it is a health check service that InJob is using. So we can change this config parameter to
"--ft-node-health-check-endpoint" and automatically parse fro UDS, IP, or DNS. The naming would also be consistent to the existing health-check flag "--ft-node-health-check-interval" in the current code.
| Details: | ||
|
|
||
| * ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path. | ||
| * The launcher propagates this value via the ``INFRAHCD_SOCKET`` environment variable. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's avoid using the ENV. It is not necessary.
|
|
||
| * ``--infrahc-socket`` (alias: ``--infrahc_socket``) sets the InfraHCD Unix socket path. | ||
| * The launcher propagates this value via the ``INFRAHCD_SOCKET`` environment variable. | ||
| * The rendezvous implementations call ``InfraNodeHealthCheck`` which will connect to this socket. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to emphasize "Infra"? We should just make it generic to some NodeHealthCheck service.
| without restarting remaining workers, e.g., with the :doc:`../inprocess/index`. | ||
| For details on how ``min-healthy`` policy interacts with :doc:`../inprocess/index` see :doc:`integration/inprocess`. | ||
|
|
||
| Infra health check service (InfraHCD) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer just "Node health check service". In the future, the service can be somewhere in the cloud. It takes an argument of the "node_id" or "node_name" and "time_range" for you to query about the node health status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Connecting to an external service introduces significant security implications, particularly around authentication and secure credential injection within containers. Given these complexities, I don't think we should try to future-proof for this scenario now. It's different enough that we should address it when that work becomes concrete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
time_range itself is a good one to have and I would have liked to support it. Currently, the underlying HC service does have time_range for parsing say syslog but those are hard coded; and if we really want to change it then would have to go through a cycle with them for justification etc. Note that I don't think we have a strong use case yet.
| msg = f"Checking health status of {self._this_node}." | ||
| self._record(message=msg) | ||
| # Perform GPU health check | ||
| # Perform GPU and Infra node health checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a heads up I am also touching the "ensure_node_is_healthy" code in my NIC link state health check. We might need to resolve the conflict during merge.
| self._grpc = _grpc_mod | ||
| self._pb2 = _pb2_mod | ||
| self._pb2_grpc = _pb2_grpc_mod | ||
| except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to carry the whole grpc module here. We should fully leverage the object-oriented design provided by the gRPC. In this particular scenario, we should subclass the HealthCheckServiceServicer and use the add_HealthCheckServiceServicer_to_server. Refer to nvhcd_pb2_grpc.py. At the end of the day, all we care is that there is a health check grpc server, and we can invoke its gRPC APIs.
You can refer to how VACE integrates with gRPC service framework as a working example.
https://gitlab-master.nvidia.com/ngcc/vace/-/blob/main/src/vace/server/rpc_service.py?ref_type=heads#L36
| return True | ||
|
|
||
| # If socket does not exist, assume service not deployed and return True (non-fatal/optional check) | ||
| if not os.path.exists(self.socket_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is not necessary. If the path doesn't exist, the gRPC handle would be none, which we have already checked.
| - On gRPC connectivity errors, return False. | ||
| """ | ||
| # If gRPC client cannot be constructed, return True (non-fatal/optional check) | ||
| if self._grpc is None or self._pb2 is None or self._pb2_grpc is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just check if the health check gRPC service module is instantiated correctly or not.
| // HealthCheckRequest contains the arguments to pass to the health check script | ||
| message HealthCheckRequest { | ||
| // Arguments to pass to the health check script | ||
| repeated string args = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a category of health checks that we can request the health check service to run. Should we make it an enum instead of string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It buys us a slightly better abstraction over having to do another round on the server side, with regeneration of .deb et all. I don't think its worth the delay. The valid strings which indicate the check category, we should document and will update it.
|
|
||
| return True | ||
|
|
||
| except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The gRPC service itself can also report exception for any of the errors in its own gRPC layer. We should catch and have more explicit gRPC error handling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would be the more explicit gRPC error handling? What can be done if the UDS is specified but n/a or not responsive or any other myriad errors? nvrx can potentially have notification support, is that what are alluding to? is this related to the slack channel notification work?
The FT launcher accepts an optional argument to point to the Infra health check service (InfraHCD) Unix domain socket. When provided, the launcher exports the socket path to child processes and the rendezvous handlers will use it in their node health checks.
NVRX requests a complete Slurm epilog health-check run before deciding to restart workload.
--infrahc-socket(alias:--infrahc_socket) sets the InfraHCD Unix socket path.INFRAHCD_SOCKETenvironment variable.InfraNodeHealthCheckwhich will connect to this socket.