Add CNV Prometheus metrics monitor #4458

machadovilaca · 2025-11-07T11:19:58Z

Which issue this PR addresses:

Fixes https://issues.redhat.com/browse/ARO-22503

What this PR does / why we need it:

This PR adds a cluster monitor exporter for Prometheus metrics. As of this PR, it queries the cluster's Prometheus instance for kubevirt_vmi_info metrics and emits them as custom metrics with all the same labels.

Test plan for issue:

Added comprehensive unit tests and manually verified that it works correctly on a live cluster.

Is there any documentation that needs to be updated for this PR?

No documentation update is needed for this PR

How do you know this will function as expected in production?

The monitor relies on error handling, logging, and failure isolation.

Signed-off-by: machadovilaca <[email protected]>

pkg/monitor/cluster/prometheusmetrics.go

+					return pm.dialPrometheus(ctx, i, port)
+				},
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true,


stevekuznetsov · 2025-11-11T14:36:42Z

pkg/monitor/cluster/cluster.go

 		mon.emitIngressAndAPIServerCertificateExpiry,
 		mon.emitEtcdCertificateExpiry,
 		mon.emitPrometheusAlerts, // at the end for now because it's the slowest/least reliable
+		mon.emitPrometheusMetrics,


based on the comment for the one above, do we want to place this elsewhere? does the ordering matter? will this not run if something earlier in the order fails?

It does matter. IIRC there is a window of 60s so I would put this above that last one. (There are plans to improve this in the works)

It does matter. Putting it at the end ensures the previous checks are guaranteed to finish. Especially on large clusters, Prometheus checks can time out. This is why I suggested a different approach via email.

We should put this below authentication type. That check cannot time out.

stevekuznetsov · 2025-11-11T14:37:02Z

pkg/monitor/cluster/prometheusmetrics.go

+	prometheusQueryURL string
+}
+
+type prometheusQueryResponse struct {


I would prefer you import their types and use their SDK

stevekuznetsov · 2025-11-11T14:37:40Z

pkg/monitor/cluster/prometheusmetrics.go

+	Value  []any             `json:"value"`
+}
+
+const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s"


this doesn't look like something that should be const, and certainly should not have format-strings or query params in it. please use the Go stdlib to create query params and format the url.URL as a string
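For illustration, a minimal sketch of what building the URL with the standard library could look like. The helper name `buildQueryURL` and passing the base address as a parameter are assumptions, not code from this PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildQueryURL (hypothetical helper) assembles an instant-query URL for the
// Prometheus HTTP API. The base address is a parameter rather than a constant
// so that tests can point it at a local server.
func buildQueryURL(base, query string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parsing Prometheus base URL %q: %w", base, err)
	}
	// JoinPath appends path elements to whatever path the base already has.
	u = u.JoinPath("api", "v1", "query")

	// url.Values takes care of escaping braces, quotes and other PromQL syntax.
	params := url.Values{}
	params.Set("query", query)
	u.RawQuery = params.Encode()

	return u.String(), nil
}

func main() {
	s, err := buildQueryURL("https://prometheus-k8s.openshift-monitoring.svc:9091", "kubevirt_vmi_info")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s)
}
```

Because the query goes through `url.Values`, a selector such as `up{job="x"}` is percent-encoded instead of being spliced into a format string.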

stevekuznetsov · 2025-11-11T14:37:53Z

pkg/monitor/cluster/prometheusmetrics.go

+const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s"
+
+func (mon *Monitor) emitPrometheusMetrics(ctx context.Context) error {
+	mon.log.Debugf("running emitPrometheusMetrics")


this is a low-value log, please omit it

stevekuznetsov · 2025-11-11T14:38:35Z

pkg/monitor/cluster/prometheusmetrics.go

+func (pm *prometheusMetrics) connectToPrometheus(ctx context.Context) error {
+	var err error
+
+	for i := range 2 {


why 2? what does the retry behavior look like for a Monitor in general?

stevekuznetsov · 2025-11-11T14:38:59Z

pkg/monitor/cluster/prometheusmetrics.go

+				DialContext: func(ctx context.Context, network, address string) (net.Conn, error) {
+					_, port, err := net.SplitHostPort(address)
+					if err != nil {
+						return nil, err


general comment: please add some context to errors before returning
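A small sketch of the kind of wrapping meant here, using `fmt.Errorf` with `%w` so callers can still unwrap the original error. The helper name `portFromAddress` is made up for the example.

```go
package main

import (
	"fmt"
	"net"
)

// portFromAddress (hypothetical helper) extracts the port from a host:port
// address and wraps any failure with the address that caused it.
func portFromAddress(address string) (string, error) {
	_, port, err := net.SplitHostPort(address)
	if err != nil {
		// %w keeps the underlying error available to errors.Is / errors.As.
		return "", fmt.Errorf("splitting host and port from %q: %w", address, err)
	}
	return port, nil
}

func main() {
	if _, err := portFromAddress("no-port-here"); err != nil {
		fmt.Println("error:", err)
	}
	port, err := portFromAddress("prometheus-k8s.openshift-monitoring.svc:9091")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("port:", port)
}
```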

stevekuznetsov · 2025-11-11T14:39:30Z

pkg/monitor/cluster/prometheusmetrics.go

+					return pm.dialPrometheus(ctx, i, port)
+				},
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true,


why? at minimum add some input config for this to turn off tls in testing, but certainly not for prod
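One possible shape for this, as a sketch only: verify against a CA bundle by default and make the insecure path an explicit opt-in. The function name, the `insecureForTesting` flag, and where the CA bundle comes from are all assumptions; the PR does not say which CA signs the Prometheus serving certificate.

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
)

// newTLSConfig (hypothetical helper) builds a client TLS configuration that
// verifies the server against a PEM-encoded CA bundle. insecureForTesting is
// an assumed input switch and should never be set in production.
func newTLSConfig(caPEM []byte, serverName string, insecureForTesting bool) (*tls.Config, error) {
	if insecureForTesting {
		return &tls.Config{InsecureSkipVerify: true}, nil
	}

	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("no certificates found in CA bundle")
	}

	return &tls.Config{
		RootCAs: pool,
		// ServerName is set explicitly because a custom DialContext may
		// connect to an address that differs from the certificate's name.
		ServerName: serverName,
		MinVersion: tls.VersionTLS12,
	}, nil
}

func main() {
	// An empty bundle is rejected instead of silently disabling verification.
	_, err := newTLSConfig(nil, "prometheus-k8s.openshift-monitoring.svc", false)
	fmt.Println("error:", err)
}
```

A code scanner may still flag the literal `InsecureSkipVerify: true` in the testing branch, so keeping that branch out of production builds entirely is another option.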

stevekuznetsov · 2025-11-11T14:40:02Z

pkg/monitor/cluster/prometheusmetrics.go

+			},
+		}
+
+		_, err = pm.queryPrometheus(ctx, fmt.Sprintf("prometheus_build_info{pod='prometheus-k8s-%d'}", i))


why are we throwing away the result?
why do we choose one specific prometheus pod?

stevekuznetsov · 2025-11-11T14:40:57Z

pkg/monitor/cluster/prometheusmetrics.go

+
+	token := pm.mon.restconfig.BearerToken
+	if token == "" && pm.mon.restconfig.BearerTokenFile != "" {
+		tokenBytes, err := os.ReadFile(pm.mon.restconfig.BearerTokenFile)


how often does the token change? seems like you should be loading it once at startup (or rotating it) but not during every single query

stevekuznetsov · 2025-11-11T14:41:14Z

pkg/monitor/cluster/prometheusmetrics.go

+		token = string(tokenBytes)
+	}
+
+	if token != "" {


if token is empty, isn't this an error?
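A sketch of resolving the token in one place and treating an empty result as an error. `resolveBearerToken` is a hypothetical helper; whether it is called once at startup or on a rotation schedule depends on how often the token changes, which this thread leaves open.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"strings"
)

// resolveBearerToken (hypothetical helper) returns the inline token if set,
// otherwise reads it from file. An empty result is reported as an error
// rather than sending an unauthenticated request.
func resolveBearerToken(inline, file string) (string, error) {
	if inline != "" {
		return inline, nil
	}
	if file == "" {
		return "", errors.New("no bearer token or bearer token file configured")
	}

	raw, err := os.ReadFile(file)
	if err != nil {
		return "", fmt.Errorf("reading bearer token file %q: %w", file, err)
	}

	token := strings.TrimSpace(string(raw))
	if token == "" {
		return "", fmt.Errorf("bearer token file %q is empty", file)
	}
	return token, nil
}

func main() {
	// Neither source is configured, so this reports an error.
	_, err := resolveBearerToken("", "")
	fmt.Println("error:", err)
}
```

The caller can store the returned token on the struct so individual queries do not touch the filesystem.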

stevekuznetsov · 2025-11-11T14:41:20Z

pkg/monitor/cluster/prometheusmetrics.go

+	if err != nil {
+		return nil, err
+	}
+	defer resp.Body.Close()


handle your errors
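One way to read this comment is that the error returned by `Close` is being dropped. A sketch of surfacing it through a named return value follows; `readAndClose` and `newFailingCloser` are invented for the example, and the failing closer exists only to exercise the error path.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"strings"
)

// readAndClose (hypothetical helper) reads the whole body and reports a
// failure to close it instead of discarding that error.
func readAndClose(body io.ReadCloser) (data []byte, err error) {
	defer func() {
		if cerr := body.Close(); cerr != nil {
			// errors.Join keeps both the read error (if any) and the close error.
			err = errors.Join(err, fmt.Errorf("closing response body: %w", cerr))
		}
	}()

	data, err = io.ReadAll(body)
	if err != nil {
		return nil, fmt.Errorf("reading response body: %w", err)
	}
	return data, nil
}

// failingCloser is a stand-in body whose Close always fails.
type failingCloser struct{ io.Reader }

func (failingCloser) Close() error { return errors.New("close failed") }

// newFailingCloser wraps a string in a body that fails on Close.
func newFailingCloser(s string) io.ReadCloser {
	return failingCloser{strings.NewReader(s)}
}

func main() {
	_, err := readAndClose(newFailingCloser("ok"))
	fmt.Println("error:", err)

	data, err := readAndClose(io.NopCloser(strings.NewReader("ok")))
	fmt.Println(string(data), err)
}
```

With a real `*http.Response`, the same deferred function would wrap `resp.Body.Close()`.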

stevekuznetsov · 2025-11-11T14:43:05Z

pkg/monitor/cluster/prometheusmetrics.go

+	return queryResp.Data.Result, nil
+}
+
+func (pm *prometheusMetrics) emitCNVMetrics(ctx context.Context) error {


can we use the meta-apis to list all kubevirt-related metrics, and, for each, figure out what kind they are (gauge, counter, etc), then generically query & emit all of them without having to list them here?
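A rough sketch of the discovery half of this idea, assuming the Prometheus `/api/v1/metadata` endpoint is the meta-API meant here. It only decodes a metadata response and filters by name prefix; the sample payload in `main` is made up, and the sketch uses `encoding/json` purely to stay self-contained rather than the upstream client types requested elsewhere in this review.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// metadataResponse mirrors the shape of a /api/v1/metadata reply:
// a map from metric name to one or more metadata entries.
type metadataResponse struct {
	Status string `json:"status"`
	Data   map[string][]struct {
		Type string `json:"type"`
		Help string `json:"help"`
		Unit string `json:"unit"`
	} `json:"data"`
}

// metricTypesWithPrefix (hypothetical helper) returns metric name -> type
// (gauge, counter, ...) for every metric whose name starts with prefix.
func metricTypesWithPrefix(body []byte, prefix string) (map[string]string, error) {
	var resp metadataResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decoding metadata response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("metadata request returned status %q", resp.Status)
	}

	types := make(map[string]string)
	for name, entries := range resp.Data {
		if !strings.HasPrefix(name, prefix) || len(entries) == 0 {
			continue
		}
		// Targets can disagree on metadata; this sketch takes the first entry.
		types[name] = entries[0].Type
	}
	return types, nil
}

func main() {
	// Made-up sample payload for illustration only.
	sample := []byte(`{"status":"success","data":{
		"kubevirt_vmi_info":[{"type":"gauge","help":"sample help","unit":""}],
		"up":[{"type":"gauge","help":"sample help","unit":""}]}}`)

	types, err := metricTypesWithPrefix(sample, "kubevirt_")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(types)
}
```

Each discovered name could then be queried and emitted according to its type, so the list of metrics would not need to be hard-coded.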

hlipsig

Overall, a few notes, especially on the ordering. Please put this check at the end.

Agree with all of Steve's comments. I am curious, though, why this approach was chosen over the one recommended via email?

hlipsig · 2025-11-12T19:14:47Z

pkg/monitor/cluster/prometheusmetrics.go

+					return pm.dialPrometheus(ctx, i, port)
+				},
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true,


Why are we skipping verification?

hlipsig · 2025-11-12T19:19:53Z

pkg/monitor/cluster/cluster.go

 		mon.emitIngressAndAPIServerCertificateExpiry,
 		mon.emitEtcdCertificateExpiry,
 		mon.emitPrometheusAlerts, // at the end for now because it's the slowest/least reliable
+		mon.emitPrometheusMetrics,


We should put this below authentication type. That check cannot time out.

hlipsig · 2025-11-12T20:01:09Z

One additional note: this PR comes from a fork. We will need it re-opened from a branch, as there are rules against forks in Azure. If you cannot make a branch, let me know; you should see an invite soon to accept contributor access.

Add CNV Prometheus metrics monitor

f3ef09c

Signed-off-by: machadovilaca <[email protected]>

machadovilaca requested review from bennerv, cadenmarchese, fahlmant, hawkowl, hlipsig, jharrington22, kimorris27, mociarain, mrWinston, pepedocs, rogbas, sankur-codes, tiguelu, tsatam, wanghaoran1988 and yjst2012 as code owners November 7, 2025 11:20

github-advanced-security bot · 2025-11-11

pkg/monitor/cluster/prometheusmetrics.go

+					return pm.dialPrometheus(ctx, i, port)
+				},
+				TLSClientConfig: &tls.Config{
+					InsecureSkipVerify: true,


Check failure: Code scanning / CodeQL: Disabled TLS certificate check (severity: High)

InsecureSkipVerify should not be used in production code.


hlipsig requested changes Nov 12, 2025


4 participants