Add CNV Prometheus metrics monitor #4458
Conversation
Signed-off-by: machadovilaca <[email protected]>
```go
	mon.emitIngressAndAPIServerCertificateExpiry,
	mon.emitEtcdCertificateExpiry,
	mon.emitPrometheusAlerts, // at the end for now because it's the slowest/least reliable
	mon.emitPrometheusMetrics,
```
based on the comment for the one above, do we want to place this elsewhere? does the ordering matter? will this not run if something earlier in the order fails?
It does matter. IIRC there is a window of 60s, so I would put this above that last one. (There are plans in the works to improve this.)
It does matter. Putting it at the end ensures the previous checks are guaranteed to finish. Especially on large clusters, Prometheus checks can time out. This is why I suggested a different approach via email.
We should put this below authentication type. That check cannot time out.
```go
	prometheusQueryURL string
}

type prometheusQueryResponse struct {
```
I would prefer you import their types and use their SDK
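For reference, a minimal sketch of what that could look like with the upstream client_golang SDK — the helper name and error handling here are illustrative, not the PR's code:

```go
import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// queryWithSDK uses the SDK's client and result types instead of a
// hand-rolled response struct.
func queryWithSDK(ctx context.Context, address, query string) (model.Vector, error) {
	client, err := api.NewClient(api.Config{Address: address})
	if err != nil {
		return nil, fmt.Errorf("creating Prometheus client: %w", err)
	}

	result, _, err := v1.NewAPI(client).Query(ctx, query, time.Now())
	if err != nil {
		return nil, fmt.Errorf("querying Prometheus: %w", err)
	}

	// Instant queries return a vector of samples.
	vector, ok := result.(model.Vector)
	if !ok {
		return nil, fmt.Errorf("unexpected result type %s", result.Type())
	}
	return vector, nil
}
```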
```go
	Value []any `json:"value"`
}

const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s"
```
this doesn't look like something that should be const, and certainly should not have format-strings or query params in it. please use the Go stdlib to create query params and format the url.URL as a string
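For example, something along these lines — `buildQueryURL` is a hypothetical helper reusing the host from the constant above, and `net/url` takes care of escaping the PromQL query:

```go
import "net/url"

// buildQueryURL constructs the query endpoint URL with the stdlib
// instead of a format string.
func buildQueryURL(query string) string {
	u := url.URL{
		Scheme:   "https",
		Host:     "prometheus-k8s.openshift-monitoring.svc:9091",
		Path:     "/api/v1/query",
		RawQuery: url.Values{"query": []string{query}}.Encode(),
	}
	return u.String()
}
```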
```go
const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s"

func (mon *Monitor) emitPrometheusMetrics(ctx context.Context) error {
	mon.log.Debugf("running emitPrometheusMetrics")
```
this is a low-value log, please omit it
```go
func (pm *prometheusMetrics) connectToPrometheus(ctx context.Context) error {
	var err error

	for i := range 2 {
```
why 2? what does the retry behavior look like for a Monitor in general?
```go
DialContext: func(ctx context.Context, network, address string) (net.Conn, error) {
	_, port, err := net.SplitHostPort(address)
	if err != nil {
		return nil, err
```
general comment: please add some context to errors before returning
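For instance, wrapping with `fmt.Errorf` and `%w` — an illustrative helper, not the PR's code:

```go
import (
	"fmt"
	"net"
)

// portFromAddress shows the suggestion applied to the snippet above:
// the returned error now says what was being parsed and why.
func portFromAddress(address string) (string, error) {
	_, port, err := net.SplitHostPort(address)
	if err != nil {
		return "", fmt.Errorf("splitting Prometheus dial address %q: %w", address, err)
	}
	return port, nil
}
```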
```go
	return pm.dialPrometheus(ctx, i, port)
},
TLSClientConfig: &tls.Config{
	InsecureSkipVerify: true,
```
why? at minimum add some input config for this to turn off tls in testing, but certainly not for prod
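A possible shape for that, assuming a config struct is threaded through (names here are hypothetical):

```go
import "crypto/tls"

// prometheusConfig gates certificate verification: it stays on by
// default and can only be disabled explicitly, e.g. in tests.
type prometheusConfig struct {
	insecureSkipVerify bool
}

func newTLSConfig(cfg prometheusConfig) *tls.Config {
	return &tls.Config{
		InsecureSkipVerify: cfg.insecureSkipVerify, // false unless a test opts in
		MinVersion:         tls.VersionTLS12,
	}
}
```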
```go
		},
	}

	_, err = pm.queryPrometheus(ctx, fmt.Sprintf("prometheus_build_info{pod='prometheus-k8s-%d'}", i))
```
why are we throwing away the result?
why do we choose one specific prometheus pod?
```go
token := pm.mon.restconfig.BearerToken
if token == "" && pm.mon.restconfig.BearerTokenFile != "" {
	tokenBytes, err := os.ReadFile(pm.mon.restconfig.BearerTokenFile)
```
how often does the token change? seems like you should be loading it once at startup (or rotating it) but not during every single query
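One possible sketch, assuming a small cache type owned by the monitor (all names hypothetical):

```go
import (
	"os"
	"sync"
	"time"
)

// tokenCache reads the token file once, then refreshes after a TTL
// instead of on every query.
type tokenCache struct {
	mu       sync.Mutex
	path     string
	ttl      time.Duration
	token    string
	loadedAt time.Time
}

func (c *tokenCache) get() (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.token != "" && time.Since(c.loadedAt) < c.ttl {
		return c.token, nil
	}
	b, err := os.ReadFile(c.path)
	if err != nil {
		return "", err
	}
	c.token, c.loadedAt = string(b), time.Now()
	return c.token, nil
}
```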
```go
	token = string(tokenBytes)
}

if token != "" {
```
if token is empty, isn't this an error?
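i.e. something like this fragment, applied in place of the `if token != "" {` guard above (a sketch, assuming the surrounding function returns an error and the `errors` package is imported):

```go
// Treat a missing token as a hard failure rather than silently
// sending an unauthenticated request.
if token == "" {
	return nil, errors.New("no bearer token available for Prometheus query")
}
// ...then attach the Authorization header unconditionally.
```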
```go
if err != nil {
	return nil, err
}
defer resp.Body.Close()
```
handle your errors
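For example, instead of the bare `defer` — a sketch that assumes the logger field used elsewhere in this file:

```go
// Log the Close error instead of discarding it.
defer func() {
	if cerr := resp.Body.Close(); cerr != nil {
		pm.mon.log.Warnf("closing Prometheus response body: %v", cerr)
	}
}()
```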
```go
	return queryResp.Data.Result, nil
}

func (pm *prometheusMetrics) emitCNVMetrics(ctx context.Context) error {
```
can we use the meta-apis to list all kubevirt-related metrics, and, for each, figure out what kind they are (gauge, counter, etc), then generically query & emit all of them without having to list them here?
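Prometheus does expose metric metadata over its HTTP API, and the client_golang v1 API wraps it; a rough sketch of the generic approach (the helper name and prefix filter are assumptions):

```go
import (
	"context"
	"strings"

	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// kubevirtMetricTypes lists metric metadata, keeps kubevirt_* names,
// and returns name -> type so each metric can be queried and emitted
// without a hard-coded list.
func kubevirtMetricTypes(ctx context.Context, papi v1.API) (map[string]v1.MetricType, error) {
	md, err := papi.Metadata(ctx, "", "") // empty args: all metrics, no limit
	if err != nil {
		return nil, err
	}
	out := map[string]v1.MetricType{}
	for name, entries := range md {
		if strings.HasPrefix(name, "kubevirt_") && len(entries) > 0 {
			out[name] = entries[0].Type // gauge, counter, histogram, ...
		}
	}
	return out, nil
}
```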
hlipsig left a comment
Overall, a few notes, especially on the ordering. Please put this check at the end.
Agree with all of Steve's comments. I am curious, though: why was this the final approach, as opposed to the one recommended via email?
```go
	return pm.dialPrometheus(ctx, i, port)
},
TLSClientConfig: &tls.Config{
	InsecureSkipVerify: true,
```
Why are we skipping verification?
One additional note: this PR comes from a fork. We will need it re-opened from a branch, as there are rules against forks in Azure. If you cannot make a branch, let me know; you should see an invite soon to accept contributor access.
Which issue this PR addresses:
Fixes https://issues.redhat.com/browse/ARO-22503
What this PR does / why we need it:
This PR adds a cluster monitor exporter for Prometheus metrics. As of this PR, it queries the cluster's Prometheus instance for `kubevirt_vmi_info` metrics and emits them as custom metrics with all the same labels.
Test plan for issue:
Added comprehensive unit tests and manually verified that it works correctly on a live cluster.
Is there any documentation that needs to be updated for this PR?
No documentation update is needed for this PR.
How do you know this will function as expected in production?
The monitor includes error handling, logging, and failure isolation, and was verified manually on a live cluster.