1 change: 1 addition & 0 deletions pkg/monitor/cluster/cluster.go

@@ -187,6 +187,7 @@ func NewMonitor(log *logrus.Entry, restConfig *rest.Config, oc *api.OpenShiftClu
mon.emitIngressAndAPIServerCertificateExpiry,
mon.emitEtcdCertificateExpiry,
mon.emitPrometheusAlerts, // at the end for now because it's the slowest/least reliable
mon.emitPrometheusMetrics,
**Contributor:** Based on the comment for the one above, do we want to place this elsewhere? Does the ordering matter? Will this not run if something earlier in the order fails?

**Member:** It does matter. IIRC there is a window of 60s, so I would put this above that last one. (There are plans to improve this in the works.)

**Collaborator:** It does matter. Putting it at the end ensures the previous checks are guaranteed to finish. Especially on large clusters, Prometheus checks can time out. This is why I suggested a different approach via email.

**Collaborator:** We should put this below the authentication type check. That check cannot time out.

mon.emitCWPStatus,
mon.emitClusterAuthenticationType,
}
153 changes: 153 additions & 0 deletions pkg/monitor/cluster/prometheusmetrics.go
@@ -0,0 +1,153 @@
package cluster

// Copyright (c) Microsoft Corporation.
// Licensed under the Apache License 2.0.

import (
"context"
"crypto/tls"
"encoding/json"
"fmt"
"io"
"net"
"net/http"
"net/url"
"os"

"github.com/Azure/ARO-RP/pkg/util/portforward"
)

type prometheusMetrics struct {
mon *Monitor
client *http.Client
prometheusQueryURL string
}

type prometheusQueryResponse struct {
**Contributor:** I would prefer you import their types and use their SDK.

Status string `json:"status"`
Data struct {
ResultType string `json:"resultType"`
Result []prometheusResult `json:"result"`
} `json:"data"`
}

type prometheusResult struct {
Metric map[string]string `json:"metric"`
Value []any `json:"value"`
}
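As a sanity check of the hand-rolled response types above (pending the reviewer's suggestion to use the Prometheus SDK types instead), here is a minimal, self-contained sketch decoding an illustrative instant-query payload; the sample JSON is made up for demonstration, not real cluster output:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Mirrors the hand-rolled types from the PR.
type prometheusQueryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string             `json:"resultType"`
		Result     []prometheusResult `json:"result"`
	} `json:"data"`
}

type prometheusResult struct {
	Metric map[string]string `json:"metric"`
	Value  []any             `json:"value"`
}

// decodeQueryResponse unmarshals a raw /api/v1/query response body.
func decodeQueryResponse(raw string) (prometheusQueryResponse, error) {
	var resp prometheusQueryResponse
	err := json.Unmarshal([]byte(raw), &resp)
	return resp, err
}

func main() {
	// Illustrative payload shaped like a Prometheus instant-query vector result.
	sample := `{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"kubevirt_vmi_info","name":"test-vm"},"value":[1700000000,"1"]}]}}`
	resp, err := decodeQueryResponse(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(resp.Status, resp.Data.Result[0].Metric["name"])
}
```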

const prometheusQueryURL = "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=%s"
**Contributor:** This doesn't look like something that should be const, and it certainly should not have format strings or query params in it. Please use the Go stdlib to create query params and format the url.URL as a string.
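A minimal sketch of what the reviewer is asking for: building the query URL with `url.URL` and `url.Values` instead of `fmt.Sprintf` on a const. The scheme, host, and path are taken from the const above:

```go
package main

import (
	"fmt"
	"net/url"
)

// buildPrometheusQueryURL assembles the instant-query URL with net/url so the
// query parameter is escaped properly, rather than substituting an escaped
// string into a format-string constant.
func buildPrometheusQueryURL(query string) string {
	u := url.URL{
		Scheme: "https",
		Host:   "prometheus-k8s.openshift-monitoring.svc:9091",
		Path:   "/api/v1/query",
	}
	q := url.Values{}
	q.Set("query", query) // Encode handles percent-escaping
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(buildPrometheusQueryURL(`{__name__="kubevirt_vmi_info"}`))
}
```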


func (mon *Monitor) emitPrometheusMetrics(ctx context.Context) error {
mon.log.Debugf("running emitPrometheusMetrics")
**Contributor:** This is a low-value log; please omit it.

pm := &prometheusMetrics{
mon: mon,
prometheusQueryURL: prometheusQueryURL,
}

err := pm.connectToPrometheus(ctx)
if err != nil {
return err
}

return pm.emitCNVMetrics(ctx)
}

func (pm *prometheusMetrics) connectToPrometheus(ctx context.Context) error {
var err error

for i := range 2 {
**Contributor:** Why 2? What does the retry behavior look like for a Monitor in general?

pm.client = &http.Client{
Transport: &http.Transport{
DialContext: func(ctx context.Context, network, address string) (net.Conn, error) {
_, port, err := net.SplitHostPort(address)
if err != nil {
return nil, err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general comment: please add some context to errors before returning

}

return pm.dialPrometheus(ctx, i, port)
},
TLSClientConfig: &tls.Config{
InsecureSkipVerify: true,

> **Check failure — Code scanning / CodeQL: Disabled TLS certificate check (High)**
> InsecureSkipVerify should not be used in production code.

**Contributor:** Why? At minimum, add some input config for this to turn off TLS verification in testing, but certainly not for prod.

**Collaborator:** Why are we skipping verification?

},
DisableKeepAlives: true,
},
}

_, err = pm.queryPrometheus(ctx, fmt.Sprintf("prometheus_build_info{pod='prometheus-k8s-%d'}", i))
**Contributor:** Why are we throwing away the result? Why do we choose one specific Prometheus pod?

if err == nil {
pm.mon.log.Debugf("successfully queried prometheus-k8s-%d", i)
return nil
}
pm.mon.log.Debugf("failed to query prometheus-k8s-%d: %+v", i, err)
}

return fmt.Errorf("failed to connect to any prometheus pod: %w", err)
}

func (pm *prometheusMetrics) dialPrometheus(ctx context.Context, podIndex int, port string) (net.Conn, error) {
podName := fmt.Sprintf("prometheus-k8s-%d", podIndex)
return portforward.DialContext(ctx, pm.mon.log, pm.mon.restconfig, "openshift-monitoring", podName, port)
}

func (pm *prometheusMetrics) queryPrometheus(ctx context.Context, query string) ([]prometheusResult, error) {
queryURL := fmt.Sprintf(pm.prometheusQueryURL, url.QueryEscape(query))
req, err := http.NewRequestWithContext(ctx, http.MethodGet, queryURL, nil)
if err != nil {
return nil, err
}

token := pm.mon.restconfig.BearerToken
if token == "" && pm.mon.restconfig.BearerTokenFile != "" {
tokenBytes, err := os.ReadFile(pm.mon.restconfig.BearerTokenFile)
**Contributor:** How often does the token change? It seems like you should be loading it once at startup (or rotating it), not during every single query.

if err != nil {
return nil, fmt.Errorf("load bearer token file: %w", err)
}
token = string(tokenBytes)
}

if token != "" {
**Contributor:** If the token is empty, isn't this an error?

req.Header.Set("Authorization", "Bearer "+token)
}

resp, err := pm.client.Do(req)
if err != nil {
return nil, err
}
defer resp.Body.Close()
**Contributor:** Handle your errors.


if resp.StatusCode != http.StatusOK {
respBody, err := io.ReadAll(resp.Body)
if err != nil {
return nil, fmt.Errorf("failed to read response body on unexpected status code %d: %w", resp.StatusCode, err)
}
return nil, fmt.Errorf("unexpected status code %d %s: %s", resp.StatusCode, resp.Status, string(respBody))
}

var queryResp prometheusQueryResponse
err = json.NewDecoder(resp.Body).Decode(&queryResp)
if err != nil {
return nil, err
}

if queryResp.Status != "success" {
return nil, fmt.Errorf("prometheus query failed with status %s: %+v", queryResp.Status, queryResp.Data)
}

return queryResp.Data.Result, nil
}

func (pm *prometheusMetrics) emitCNVMetrics(ctx context.Context) error {
**Contributor:** Can we use the meta-APIs to list all kubevirt-related metrics and, for each, figure out what kind they are (gauge, counter, etc.), then generically query and emit all of them without having to list them here?

pm.mon.log.Debugf("emitting CNV metrics")
results, err := pm.queryPrometheus(ctx, `{__name__="kubevirt_vmi_info"}`)
if err != nil {
return err
}

for _, result := range results {
pm.mon.log.Debugf("emitting metric cnv.kubevirt.vmi.info with labels %+v", result.Metric)
pm.mon.emitGauge("cnv.kubevirt.vmi.info", 1, result.Metric)
}

return nil
}