This page explains the OVN claim storm
alert in Genestack.
## Background information
OVN has a distributed architecture without a central controller. The
ovn-controller process on each chassis (in OVN terminology; every Kubernetes
node, since Genestack uses Kube-OVN) runs the same control logic, and
coordination happens through the OVN south database. This introduces some of
the complexity of a distributed system into what often acts and appears like
centralized control.
OVN makes use of some ports not bound to any particular chassis, especially on
the gateway nodes. OVN may move those ports to a different chassis if a
gateway node goes down, or if BFD (bidirectional forwarding detection) shows
poor link quality for a chassis. In some edge cases, the ovn-controller on
different nodes might each determine that it should have a port, and each
chassis will then claim the port as quickly as it can. That doesn't normally
happen, and it can have a range of edge-case root causes, such as a NIC
malfunctioning in a way that escapes detection by BFD. The architecture of OVN
seems to make it hard to ensure that this condition can never occur, however
rarely, so OVN itself implements a rate limit: no chassis will try to claim
the same port more than once every 0.5 s, as seen in this commit.
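During normal operation, each of these movable ports stays bound to exactly
one chassis. A minimal way to see the current bindings, assuming you have the
Kube-OVN kubectl plugin mentioned in the Remediation section below:

```shell
# ovn-sbctl "show" via the Kube-OVN kubectl plugin lists each chassis and the
# ports currently bound to it; during a claim storm, cr-lrp-* ports hop
# between chassis instead of staying put.
kubectl ko sbctl show
```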
However, a typical production-grade Genestack installation will probably have
at least 3 gateway nodes, and each public IP CIDR block added individually to
the installation will have an associated `cr-lrp` port on the gateway nodes,
not bound to any particular chassis and so free to move between the gateway
nodes. These ports allow OpenStack Nova instances without a floating IP to
access the Internet via NAT (to, say, pull operating system patches), so that
an instance only needs a floating IP to make services available on a public
Internet address. So, consider 3 gateway nodes, 5 CIDR blocks, and a 0.5 s
rate limit per port per node. In the worst case, with each node trying to
claim every port not bound to a chassis as quickly as possible, each port gets
claimed about six times per second, for a load of 30 claims per second to
commit to the OVN south DB. (In fact, it seems to take only one bad node to
push this close to the theoretical maximum, since the bad node claims as often
as possible, and every other node has equal claim.) In this scenario, the
affected ports themselves move between chassis too quickly to actually work,
and the OVN south DB itself gets overloaded. In that case, instances without
floating IPs would not have Internet access, and the high load on the south DB
would likely result in provisioning failures for new OpenStack Nova instances
and new Kubernetes pods.
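In equation form, that worst case works out as:

$$
\text{claims/s} = N_{\text{nodes}} \times \frac{1\ \text{claim}}{0.5\ \text{s}} \times N_{\text{ports}} = 3 \times 2 \times 5 = 30
$$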
## Symptoms and identification
The alert will normally catch this condition; however, for reference, and to
help identify the individual nodes with the problem:
### Network agents `Alive` status

`openstack network agent list` usually has output like below (aside from the
made-up UUIDs and node names):
```text
+--------------------------------------+------------------------------+------------------------+-------------------+-------+-------+----------------------------+
| ID                                   | Agent Type                   | Host                   | Availability Zone | Alive | State | Binary                     |
+--------------------------------------+------------------------------+------------------------+-------------------+-------+-------+----------------------------+
| deadbeef-dead-beef-dead-deadbeef0001 | OVN Controller agent         | node01.domain.name     |                   | :-)   | UP    | ovn-controller             |
| deadbeef-dead-beef-dead-deadbeef0002 | OVN Controller agent         | node02.domain.name     | nova              | :-)   | UP    | ovn-controller             |
| deadbeef-dead-beef-dead-deadbeef0003 | OVN Metadata agent           | node02.domain.name     | nova              | :-)   | UP    | neutron-ovn-metadata-agent |
| deadbeef-dead-beef-dead-deadbeef0004 | OVN Controller agent         | node03.domain.name     | nova              | :-)   | UP    | ovn-controller             |
| deadbeef-dead-beef-dead-deadbeef0005 | OVN Metadata agent           | node03.domain.name     | nova              | :-)   | UP    | neutron-ovn-metadata-agent |
| deadbeef-dead-beef-dead-deadbeef0006 | OVN Controller agent         | node04.domain.name     |                   | :-)   | UP    | ovn-controller             |
| deadbeef-dead-beef-dead-deadbeef0007 | OVN Controller agent         | node05.domain.name     | nova              | :-)   | UP    | ovn-controller             |
| deadbeef-dead-beef-dead-deadbeef0008 | OVN Metadata agent           | node05.domain.name     | nova              | :-)   | UP    | neutron-ovn-metadata-agent |
| deadbeef-dead-beef-dead-deadbeef0009 | OVN Controller agent         | node06.domain.name     | nova              | :-)   | UP    | ovn-controller             |
+--------------------------------------+------------------------------+------------------------+-------------------+-------+-------+----------------------------+
```
For minor technical reasons, these probably don't technically qualify as real
Neutron agents, but in either case, this information gets queried from the OVN
south DB, which gets overloaded, so the output of this command will likely
show `XXX` under the `Alive` column, although agents do continue to show `UP`
for `State`.

Since this happens because of the south DB, this command doesn't help identify
affected nodes: all agents will likely show `XXX` for `Alive`.
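If you just want to scan those columns, the OpenStack client's standard output
options should work; for example (a sketch; column names as shown above):

```shell
# Show only each agent's host, liveness, and admin state; during a claim
# storm, expect XXX for Alive on every agent while State stays UP.
openstack network agent list -f value -c Host -c Alive -c State
```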
### Log lines
The alert checks log lines, but you will have to identify which gateway nodes
have the issue. The log lines happen on the `ovs-ovn` pods of the gateway
nodes and, in full, look like this:
```text
2024-09-05T16:38:54.711Z|19953|binding|INFO|Claiming lport cr-lrp-deadbeef-dead-beef-dead-deadbeef0001 for this chassis.
2024-09-05T16:38:54.711Z|19954|binding|INFO|cr-lrp-deadbeef-dead-beef-dead-deadbeef0001: Claiming de:ad:be:ef:de:01 1.0.1.0/24
2024-09-05T16:39:38.870Z|19955|binding|INFO|Claiming lport cr-lrp-deadbeef-dead-beef-dead-deadbeef0002 for this chassis.
2024-09-05T16:39:38.870Z|19956|binding|INFO|cr-lrp-deadbeef-dead-beef-dead-deadbeef0002: Claiming de:ad:be:ef:de:02 1.0.2.0/24
2024-09-05T16:40:32.813Z|19957|binding|INFO|Claiming lport cr-lrp-deadbeef-dead-beef-dead-deadbeef0003 for this chassis.
2024-09-05T16:40:32.813Z|19958|binding|INFO|cr-lrp-deadbeef-dead-beef-dead-deadbeef0003: Claiming de:ad:be:ef:de:03 1.0.3.0/24
2024-09-05T16:41:52.669Z|19959|binding|INFO|Claiming lport cr-lrp-deadbeef-dead-beef-dead-deadbeef0004 for this chassis.
2024-09-05T16:41:52.669Z|19960|binding|INFO|cr-lrp-deadbeef-dead-beef-dead-deadbeef0004: Claiming de:ad:be:ef:de:04 1.0.4.0/24
2024-09-05T16:42:33.762Z|19961|binding|INFO|Claiming lport cr-lrp-deadbeef-dead-beef-dead-deadbeef0004 for this chassis.
```
You will probably see these densely packed and continuously generated, with no
other log lines between them, and with an interval of less than 1 second
between consecutive port bindings.

Log lines like this happen during normal operation, but the ports don't tend
to move around more than once every 5 minutes, so you may see a block like
this for every `cr-lrp` port, and so one for every CIDR block of public IPs
you use; during a claim storm, however, you will see the same ports and CIDRs
getting bound continuously and consecutively.
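To identify the gateway nodes involved, you can count recent claim lines per
pod. A hedged sketch, assuming the `ovs-ovn` pods run in the `kube-system`
namespace with the label `app=ovs`, as in a default Kube-OVN deployment;
adjust for your installation:

```shell
# Count "Claiming lport" log lines per ovs-ovn pod over the last 5 minutes;
# a node caught in a claim storm shows a far higher count than its peers.
for pod in $(kubectl -n kube-system get pods -l app=ovs -o name); do
  printf '%s: ' "$pod"
  kubectl -n kube-system logs "$pod" --since=5m | grep -c 'Claiming lport'
done
```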
## Remediation
This likely happens as an aggravation of a pre-existing problem, so it may
take some investigation to identify any particular root cause. Draining and
rebooting an affected node may resolve the issue temporarily, or seemingly
permanently if the issue occurred due to something transient and
unidentifiable.
Some tips and recommendations:
- Ensure you ran `host-setup.yml` as indicated in the Genestack installation
  documentation, which adjusts some kernel networking variables. This playbook
  works idempotently, and you can run it again on the nodes to make sure, and
  mostly see `ok` tasks instead of `changed`.
  As an example, if you used `infra-deploy.yaml`, on your launcher node, you
  might run something like:

    ```shell
    sudo su -l
    cd /opt/genestack/ansible && \
      source ../scripts/genestack.rc && \
      ansible-playbook playbooks/host-setup.yml \
        -i /etc/genestack/inventory/openstack-flex-inventory.ini \
        --limit openstack-flex-node-1.cluster.local
    ```
  adjusted for your installation and however you needed to run the playbook.
  (Root on the launcher node created by `infra-deploy.yaml` normally has a
  venv, etc. for Ansible when you do a root login, as happens with
  `sudo su -l`.)
- Ensure you have up-to-date kernels.
    - In particular, a bug in Linux 5.15-113 and 5.15-119, resolved in Linux
      6.8, resulted in a problem electing OVN north and south database (NB and
      SB) leaders, although that probably shouldn't directly trigger this
      issue.
- Ensure you have the best and most up-to-date drivers for your NICs.
- Check BFD. As mentioned, OVN uses BFD to help determine when it needs to
  move ports. You might run something like the following (the exact commands
  below are a sketch using the Kube-OVN kubectl plugin's `ovs-vsctl` and
  `ovs-appctl` wrappers; substitute your gateway node's name):
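    ```shell
    # OVSDB-centric overview of OVS for the node; tunnel interfaces with BFD
    # enabled include a bfd_status line in this output. (Example command.)
    kubectl ko vsctl <gateway-node> show
    ```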
  which shows BFD status on ports in-line to the OVSDB-centric overview of OVS
  for the node, and/or:
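    ```shell
    # Query BFD session state directly (and only BFD) from ovs-vswitchd.
    # (Example command.)
    kubectl ko appctl <gateway-node> bfd/show
    ```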
  to check directly and only BFD (assuming you have installed the Kube-OVN
  kubectl plugin as described in docs/ovn-troubleshooting.md's Kube-OVN
  kubectl plugin section), and investigate any potential BFD issues.
- Check the health and configuration of NIC(s), bonds, etc. at the operating
  system level.
- Check switch and switch port configuration.
- If you have a separate interface allowing you to reach a gateway node via
  SSH, you can down the interface(s) with the Geneve tunnels on individual
  gateway nodes one at a time and observe whether downing the interfaces of
  any particular node(s) stops the claim storm, as sketched below.
    - OVN will take care of moving the ports to another gateway node if you
      have multiple gateway nodes. When you can do this without losing your
      connection (you could perhaps even use the server's OOB console), you
      effectively take one gateway node temporarily out of rotation.
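  For example (a hypothetical sketch; the name of the interface carrying the
  Geneve tunnels varies per installation):

    ```shell
    # On one gateway node, temporarily down the Geneve underlay interface,
    # observe whether the claim storm stops, then restore the interface.
    ip link set dev <geneve-underlay-interface> down
    # ...watch the ovs-ovn logs on the remaining gateway nodes...
    ip link set dev <geneve-underlay-interface> up
    ```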
- Drain and reboot any suspect gateway node(s) (or all, one at a time, if
  necessary):

    ```shell
    # Evict movable pods; DaemonSet pods (including Kube-OVN's) stay put.
    # On older kubectl releases, --delete-emptydir-data was named
    # --delete-local-data.
    kubectl drain <gateway-node> --ignore-daemonsets --delete-emptydir-data --force
    # reboot the node, then, after it returns from the reboot:
    kubectl uncordon <gateway-node>
    ```
- Since the issue has a strong chance of having occurred as an aggravation of
  an existing, possibly even otherwise relatively benign problem, you should
  perform other general and generic troubleshooting, such as reading system
  logs and `dmesg` output, and checking for hardware errors reported by DRAC,
  iLO, or other OOB management solutions.
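  As an illustration (generic commands, not Genestack-specific):

    ```shell
    # Recent high-priority messages from the system journal.
    journalctl --priority err --since '1 hour ago'

    # Kernel warnings and errors, e.g. NIC or bonding driver complaints.
    dmesg --level=err,warn | tail -n 50
    ```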