-
Notifications
You must be signed in to change notification settings - Fork 191
Add CNV Prometheus metrics monitor #4458
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,153 @@ | ||
| package cluster | ||
|
|
||
| // Copyright (c) Microsoft Corporation. | ||
| // Licensed under the Apache License 2.0. | ||
|
|
||
| import ( | ||
| "context" | ||
| "crypto/tls" | ||
| "encoding/json" | ||
| "fmt" | ||
| "io" | ||
| "net" | ||
| "net/http" | ||
| "net/url" | ||
| "os" | ||
|
|
||
| "github.com/Azure/ARO-RP/pkg/util/portforward" | ||
| ) | ||
|
|
||
| type prometheusMetrics struct { | ||
| mon *Monitor | ||
| client *http.Client | ||
| prometheusQueryURL string | ||
| } | ||
|
|
||
| type prometheusQueryResponse struct { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would prefer you import their types and use their SDK |
||
| Status string `json:"status"` | ||
| Data struct { | ||
| ResultType string `json:"resultType"` | ||
| Result []prometheusResult `json:"result"` | ||
| } `json:"data"` | ||
| } | ||
|
|
||
| type prometheusResult struct { | ||
| Metric map[string]string `json:"metric"` | ||
| Value []any `json:"value"` | ||
| } | ||
|
|
||
| const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s" | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this doesn't look like something that should be |
||
|
|
||
| func (mon *Monitor) emitPrometheusMetrics(ctx context.Context) error { | ||
| mon.log.Debugf("running emitPrometheusMetrics") | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this is a low-value log, please omit it |
||
| pm := &prometheusMetrics{ | ||
| mon: mon, | ||
| prometheusQueryURL: prometheusQueryURL, | ||
| } | ||
|
|
||
| err := pm.connectToPrometheus(ctx) | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| return pm.emitCNVMetrics(ctx) | ||
| } | ||
|
|
||
| func (pm *prometheusMetrics) connectToPrometheus(ctx context.Context) error { | ||
| var err error | ||
|
|
||
| for i := range 2 { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why 2? what does the retry behavior look like for a |
||
| pm.client = &http.Client{ | ||
| Transport: &http.Transport{ | ||
| DialContext: func(ctx context.Context, network, address string) (net.Conn, error) { | ||
| _, port, err := net.SplitHostPort(address) | ||
| if err != nil { | ||
| return nil, err | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. general comment: please add some context to errors before returning |
||
| } | ||
|
|
||
| return pm.dialPrometheus(ctx, i, port) | ||
| }, | ||
| TLSClientConfig: &tls.Config{ | ||
| InsecureSkipVerify: true, | ||
Check failureCode scanning / CodeQL Disabled TLS certificate check High
InsecureSkipVerify should not be used in production code.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why? at minimum add some input config for this to turn off tls in testing, but certainly not for prod
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why are we skipping verification? |
||
| }, | ||
| DisableKeepAlives: true, | ||
| }, | ||
| } | ||
|
|
||
| _, err = pm.queryPrometheus(ctx, fmt.Sprintf("prometheus_build_info{pod='prometheus-k8s-%d'}", i)) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why are we throwing away the result? |
||
| if err == nil { | ||
| pm.mon.log.Debugf("successfully queried prometheus-k8s-%d", i) | ||
| return nil | ||
| } | ||
| pm.mon.log.Debugf("failed to query prometheus-k8s-%d: %+v", i, err) | ||
| } | ||
|
|
||
| return fmt.Errorf("failed to connect to any prometheus pod: %w", err) | ||
| } | ||
|
|
||
| func (pm *prometheusMetrics) dialPrometheus(ctx context.Context, podIndex int, port string) (net.Conn, error) { | ||
| podName := fmt.Sprintf("prometheus-k8s-%d", podIndex) | ||
| return portforward.DialContext(ctx, pm.mon.log, pm.mon.restconfig, "openshift-monitoring", podName, port) | ||
| } | ||
|
|
||
| func (pm *prometheusMetrics) queryPrometheus(ctx context.Context, query string) ([]prometheusResult, error) { | ||
| queryURL := fmt.Sprintf(pm.prometheusQueryURL, url.QueryEscape(query)) | ||
| req, err := http.NewRequestWithContext(ctx, http.MethodGet, queryURL, nil) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| token := pm.mon.restconfig.BearerToken | ||
| if token == "" && pm.mon.restconfig.BearerTokenFile != "" { | ||
| tokenBytes, err := os.ReadFile(pm.mon.restconfig.BearerTokenFile) | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. how often does the token change? seems like you should be loading it once at startup (or rotating it) but not during every single query |
||
| if err != nil { | ||
| return nil, fmt.Errorf("load bearer token file: %w", err) | ||
| } | ||
| token = string(tokenBytes) | ||
| } | ||
|
|
||
| if token != "" { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. if token is empty, isn't this an error? |
||
| req.Header.Set("Authorization", "Bearer "+token) | ||
| } | ||
|
|
||
| resp, err := pm.client.Do(req) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
| defer resp.Body.Close() | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. handle your errors |
||
|
|
||
| if resp.StatusCode != http.StatusOK { | ||
| respBody, err := io.ReadAll(resp.Body) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to read response body on unexpected status code %d: %w", resp.StatusCode, err) | ||
| } | ||
| return nil, fmt.Errorf("unexpected status code %d %s: %s", resp.StatusCode, resp.Status, string(respBody)) | ||
| } | ||
|
|
||
| var queryResp prometheusQueryResponse | ||
| err = json.NewDecoder(resp.Body).Decode(&queryResp) | ||
| if err != nil { | ||
| return nil, err | ||
| } | ||
|
|
||
| if queryResp.Status != "success" { | ||
| return nil, fmt.Errorf("prometheus query failed with status %s: %+v", queryResp.Status, queryResp.Data) | ||
| } | ||
|
|
||
| return queryResp.Data.Result, nil | ||
| } | ||
|
|
||
| func (pm *prometheusMetrics) emitCNVMetrics(ctx context.Context) error { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. can we use the meta-apis to list all kubevirt-related metrics, and, for each, figure out what kind they are (gauge, counter, etc), then generically query & emit all of them without having to list them here? |
||
| pm.mon.log.Debugf("emitting CNV metrics") | ||
| results, err := pm.queryPrometheus(ctx, `{__name__="kubevirt_vmi_info"}`) | ||
| if err != nil { | ||
| return err | ||
| } | ||
|
|
||
| for _, result := range results { | ||
| pm.mon.log.Debugf("emitting metric cnv.kubevirt.vmi.info with labels %+v", result.Metric) | ||
| pm.mon.emitGauge("cnv.kubevirt.vmi.info", 1, result.Metric) | ||
| } | ||
|
|
||
| return nil | ||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on the comment for the one above, do we want to place this elsewhere? does the ordering matter? will this not run if something earlier in the order fails?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does matter. IIRC there is a window of 60s so I would put this above that last one. (There are plans to improve this in the works)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does matter. Putting it at the end ensure the previous checks are guaranteed to finish. Especially on large clusters prometheus checks can time out. This is why I suggested a different approach via email.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should put this below authentication type. That check cannot time out.