Background

This page describes OVN monitoring in Genestack.
Most OVN monitoring in Genestack comes from Kube-OVN's model
- As the Kube-OVN documentation indicates, this includes both:
  - control plane information
  - data plane network quality information
- Kube-OVN has instrumention for Promethus
  - so Genestack documentation directs installing the k8s ServiceMonitors so that Promethus can discover these metrics.

Links

Genestack documentation on installing Kube-OVN monitoring
As mentioned above, this simply installs the ServiceMonitors so that Prometheus in Genestack can discover the metrics exported by Kube-OVN.
you can see the ServiceMonitors installed here - in particular, it has ServiceMonitors for components: - kube-ovn-cni - kube-ovn-controller - kube-ovn-monitor - kube-ovn-pinger
```
    You can see a architectural descriptions of these components
    [here](https://kubeovn.github.io/docs/stable/en/reference/architecture/#core-controller-and-agent)
```
Kube-OVN User Guide's "Monitor and Dashboard" section
- the information runs a bit sparse in the User Guide; note the reference manual link (in the User Guide itself, and next link here below) for more detailed information on the provided metrics.
Kube-OVN Reference Manual "Metrics"
- This describes the monitoring metrics provided by Kube-OVN

Metrics

Viewing the metrics

In a full Genestack installation, you can view Prometheus metrics:

by using Prometheus' UI
by querying Prometheus' HTTPS API
by using Grafana
- In particular, Kube-OVN provides pre-defined Grafana dashboards installed in Genestack.

Going in-depth on these would go beyond the scope of this document, but sections below provide some brief coverage.

Prometheus' data model

Prometheus' data model and design-for-scale tends to make interactive PromQL queries cumbersome. In general usage, you will find that Prometheus data works better for feeding into other tools, like the Alertmanager for alerting, and Grafana for visualization.

Prometheus UI

A full Genestack installation includes the Prometheus UI. The Prometheus UI prominently displays a search bar that takes PromQL expressions.

You can easily see the available Kube-OVN metrics by opening the Metrics Explorer (click the globe icon) and typing kube_ovn_.

While this has some limited utility for getting a low-level view of individual metrics, you will generally find it more useful to look at the Grafana dashboards as described below.

As mentioned above, the Kube-OVN documentation details the collected metrics here

Prometheus API

You will probably need a strong understanding of the Prometheus data model and PromQL to use the Prometheus API directly, and will likely find little use for using the API interactively.

However, where you have a working kubectl, you can do something like the following to use curl on the Prometheus API with minor adaptations for your installation:

# Run kubectl proxy in the background
kubectl proxy &

# You will probably find the -g/--globoff option to curl useful to stop curl
# itself from interpreting the characters {} [] in URLs.
#
# Additionally, these characters technically require escaping in URLs, so you
# might want to use --data-urlencode

curl -sS -gG \
http://localhost:8001/api/v1/namespaces/prometheus/services/prometheus-operated:9090/proxy/api/v1/query \
--data-urlencode 'query=kube_ovn_ovn_status' | jq .

Grafana

As mentioned previously, Kube-OVN provides various dashboards with information on both the control plane and the data plane network quality.

These dashboards contain a lot of information, so in many cases, you will likely use them for troubleshooting by expanding to a large timeframe to see when irregularities may have started occurring.

You can see the documentation from Kube-OVN on these dashboards here

Some additional details on each of these dashboards follows.

Controller dashboard

In a typical Genestack installation, you should see 3 controllers up here. The dashboard displays the number of up controllers prominently.

The graphs show information about the kube-ovn-controller pods in the kube-system namespace, mostly identified by their ClusterIP on the k8s service network. You can typically see them along with the ClusterIPs that identify them individually on the dashboard like:

kubectl -n kube-system get pod -l app=kube-ovn-controller -o wide

Kube-OVN-CNI dashboard

In this case, CNI refers to k8s' container network interface which allows k8s to use various plugins (such as Kube-OVN) for cluster networking, as described here

Like the Controller dashboard, it displays the number of pods up prominently. It should have 1 pod for each k8s node, so it should match the count of your k8s nodes:

# tail to skip header lines
kubectl get node | tail -n +2 | wc -l

These metrics belong to the 'control plane' metrics, and this dashboard will probably work well by using a large timeframe to find anomalous behavior as previously described.

OVN dashboard

This dashboard displays some metrics from the OVN, like the number of logical switches and logical ports. It shows a chassis count that should match the number of nodes in your cluster, and a flag (or technically a count, but you will see "1") for "OVN DB Up".

This dashboard looks useful as previously described for looking for anomalous behavior that emerged across a timeframe.

OVS dashboard

This dashboard displays information from OVS from each k8s node.

It sometimes uses names, but sometimes pod IPs.

OVS activity across nodes might not necessarily have a strong correlation, so on this dashboard, you might take particular note that you can click the node of interest on the legend for each graph, or choose a particular instance for all of the graphs at the top.

You may need to collate the ClusterIPs with a node, which you can do with something like:

kubectl get node -n kube-system -o wide | grep 10.10.10.10

which will display the name of the node with the pinger pod. kube-ovn-pinger collects OVS status information, so while you see the pinger pod when checking the IP, the information from the dashboard may come from the pinger pod, but pertains to OVS, not the pinger pod.

Pinger dashboard

This dashboard prominently displays the OVS up count, OVN-Controller up count, and API server accessible count. These should all have the same number, equal to your number of nodes.

Inconsistent port binding nums

The Kube-OVN documentation for this metric says "The number of mismatch port bindings between ovs and ovn-sb" and provides no additional information. However, in Genestack, you find ovn metadata 'localport' types in the NB that don't need an SB binding, which increases this count. Kube-OVN's OVN itself often gets used for k8s alone and would in that case often have a 0 count here, but Genestack uses OpenStack Neutron as a second CMS for Kube-OVN's OVN installation, resulting in the existence of ports that increase this count but don't indicate a problem, so Genestack will generally have a significant count for each compute node for this metric.