Skip to content

Conversation

@machadovilaca
Copy link
Collaborator

Which issue this PR addresses:

Fixes https://issues.redhat.com/browse/ARO-22503

What this PR does / why we need it:

This PR adds a cluster monitor exporter for Prometheus metrics. As of this PR, it queries the cluster's Prometheus instance for kubevirt_vmi_info metrics and emits them as custom metrics with all the same labels.

Test plan for issue:

Added comprehensive unit tests and manually verified that it works correctly on a live cluster.

Is there any documentation that needs to be updated for this PR?

No documentation update is needed for this PR

How do you know this will function as expected in production?

error handling, logging, and failure isolation

return pm.dialPrometheus(ctx, i, port)
},
TLSClientConfig: &tls.Config{
InsecureSkipVerify: true,

Check failure

Code scanning / CodeQL

Disabled TLS certificate check High

InsecureSkipVerify should not be used in production code.
mon.emitIngressAndAPIServerCertificateExpiry,
mon.emitEtcdCertificateExpiry,
mon.emitPrometheusAlerts, // at the end for now because it's the slowest/least reliable
mon.emitPrometheusMetrics,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

based on the comment for the one above, do we want to place this elsewhere? does the ordering matter? will this not run if something earlier in the order fails?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does matter. IIRC there is a window of 60s so I would put this above that last one. (There are plans to improve this in the works)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does matter. Putting it at the end ensure the previous checks are guaranteed to finish. Especially on large clusters prometheus checks can time out. This is why I suggested a different approach via email.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should put this below authentication type. That check cannot time out.

prometheusQueryURL string
}

type prometheusQueryResponse struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer you import their types and use their SDK

Value []any `json:"value"`
}

const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look like something that should be const, and certainly should not have format-strings or query params in it. please use the Go stdlib to create query params and format the url.URL as a string

const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s"

func (mon *Monitor) emitPrometheusMetrics(ctx context.Context) error {
mon.log.Debugf("running emitPrometheusMetrics")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a low-value log, please omit it

func (pm *prometheusMetrics) connectToPrometheus(ctx context.Context) error {
var err error

for i := range 2 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 2? what does the retry behavior look like for a Monitor in general?

DialContext: func(ctx context.Context, network, address string) (net.Conn, error) {
_, port, err := net.SplitHostPort(address)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general comment: please add some context to errors before returning

return pm.dialPrometheus(ctx, i, port)
},
TLSClientConfig: &tls.Config{
InsecureSkipVerify: true,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? at minimum add some input config for this to turn off tls in testing, but certainly not for prod

},
}

_, err = pm.queryPrometheus(ctx, fmt.Sprintf("prometheus_build_info{pod='prometheus-k8s-%d'}", i))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are we throwing away the result?
why do we choose one specific prometheus pod?


token := pm.mon.restconfig.BearerToken
if token == "" && pm.mon.restconfig.BearerTokenFile != "" {
tokenBytes, err := os.ReadFile(pm.mon.restconfig.BearerTokenFile)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how often does the token change? seems like you should be loading it once at startup (or rotating it) but not during every single query

token = string(tokenBytes)
}

if token != "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if token is empty, isn't this an error?

if err != nil {
return nil, err
}
defer resp.Body.Close()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle your errors

return queryResp.Data.Result, nil
}

func (pm *prometheusMetrics) emitCNVMetrics(ctx context.Context) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use the meta-apis to list all kubevirt-related metrics, and, for each, figure out what kind they are (gauge, counter, etc), then generically query & emit all of them without having to list them here?

Copy link
Collaborator

@hlipsig hlipsig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall a few notes especially on the ordering. Please put this check at the end.

Agree with all of Steve's comments. I am curious though why this way was the final approach as opposed to the one recommended via email?

return pm.dialPrometheus(ctx, i, port)
},
TLSClientConfig: &tls.Config{
InsecureSkipVerify: true,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we skipping verification?

mon.emitIngressAndAPIServerCertificateExpiry,
mon.emitEtcdCertificateExpiry,
mon.emitPrometheusAlerts, // at the end for now because it's the slowest/least reliable
mon.emitPrometheusMetrics,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should put this below authentication type. That check cannot time out.

@hlipsig
Copy link
Collaborator

hlipsig commented Nov 12, 2025

One additional note. This PR comes from a fork. We will need this re-opened from a Branch, as there are rules against forks in Azure. If you cannot make a branch let me know you should see an invite soon to accept contributor access.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants