Background

By default, Genestack runs a pod that takes daily OVN snapshots in the kube-system namespace, where the other centralized OVN components also run. The snapshots are stored on the persistent volume associated with the ovndb-backup PersistentVolumeClaim, and snapshots older than 30 days are deleted.

You should primarily follow the Kube-OVN documentation on backup and recovery and consider the information here supplementary.

Backup

A default Genestack installation creates a k8s CronJob in the kube-system namespace, alongside the other central OVN components, that stores snapshots of the OVN NB and SB databases in the PersistentVolume bound to the PersistentVolumeClaim named ovndb-backup. Storing snapshots on a persistent volume this way matches the convention for MariaDB in Genestack.
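To confirm that the backup job and its volume exist on your cluster, something like the following may help. This is only a sketch: the function name show_ovn_backup_resources is hypothetical, and because the CronJob's name is not given here, it simply lists every CronJob in the namespace.

```shell
# Hypothetical helper: list the CronJobs in kube-system and show the
# ovndb-backup PersistentVolumeClaim named in this document.
show_ovn_backup_resources() {
  kubectl -n kube-system get cronjob
  kubectl -n kube-system get pvc ovndb-backup
}
```

Running show_ovn_backup_resources should show the PVC as Bound if the default installation created it.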

Restoration and recovery

You may wish to ship these snapshots off of the cluster to a permanent location, since cluster problems could interfere with your ability to retrieve them from the PersistentVolume when you need them.

Recovering when a majority of OVN DB nodes work fine

If a majority of the k8s nodes running ovn-central work fine, you can just follow the directions in the Kube-OVN documentation for kicking a node out. Things mostly work normally in this case because OVSDB HA uses the Raft consensus algorithm, which only requires a majority of the nodes for full functionality, so you don't have to do anything strange or extreme to recover. You essentially kick the bad node out and let it recover.
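As a rough sketch of what kicking a node out can look like, the wrappers below call the kubectl-ko plugin that the Kube-OVN documentation uses for this. They assume the plugin is installed; the function names are hypothetical, and the server ID has to come from the status output of your own cluster. Follow the Kube-OVN documentation for the remaining cleanup on the bad node.

```shell
# Sketch only: assumes the kubectl-ko plugin from Kube-OVN is installed.

# Show the Raft cluster status for the NB or SB database.
# Usage: ovn_raft_status nb|sb
ovn_raft_status() {
  case "$1" in
    nb|sb) kubectl ko "$1" status ;;
    *) echo "usage: ovn_raft_status nb|sb" >&2; return 2 ;;
  esac
}

# Kick one server out of the NB or SB Raft cluster.
# Usage: ovn_kick nb|sb <server-id>
ovn_kick() {
  case "$1" in
    nb|sb) ;;
    *) echo "usage: ovn_kick nb|sb <server-id>" >&2; return 2 ;;
  esac
  [ -n "$2" ] || { echo "usage: ovn_kick nb|sb <server-id>" >&2; return 2; }
  kubectl ko "$1" kick "$2"
}
```

Run ovn_raft_status first, read the ID of the failed server from its output, then pass that ID to ovn_kick.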

Recovering from a majority of OVN DB node failures or a total cluster failure

You probably shouldn't use this section if you don't have a majority OVN DB node failure. Just kick out the minority of bad nodes as indicated above instead. Use this section to recover from a failure of the majority of nodes.

As a first step, you will need to get database files to run the recovery. You can try to use files on your nodes as described below, or use one of the backup snapshots.

Trying to use OVN DB files in `/etc/origin/ovn` on the k8s nodes

You can use the information in this section to try to get the files to use for your recovery from your running k8s nodes.

The Kube-OVN documentation shows trying to use OVN DB files from /etc/origin/ovn on the k8s nodes. You can try this, or skip this section and use a backup snapshot as shown below if you have one. Trying the files on the nodes first costs you little: if that doesn't seem to work, you can still switch to the latest snapshot backup from the CronJob, since restoring from the snapshot backup fully rebuilds the database.

The directions in the Kube-OVN documentation use docker run to get a working ovsdb-tool to try to work with the OVN DB files on the nodes, but k8s installations mostly use CRI-O, containerd, or other container runtimes, so you probably can't pull the image and run it with docker as shown. I will cover this and some alternatives below.

Finding the first node

The Kube-OVN documentation directs you to pick the node running the ovn-central pod associated with the first IP of the NODE_IPS environment variable. You should find the NODE_IPS environment variable defined on an ovn-central pod or on the ovn-central Deployment. Assuming you can run kubectl commands, the following example gets the node IPs from the Deployment:

kubectl get deployment -n kube-system ovn-central  -o yaml | grep -A1 'name: NODE_IPS'

        - name: NODE_IPS
          value: 10.130.140.246,10.130.140.250,10.130.140.252

Then find the k8s node with the first IP. You can see your k8s nodes and their IPs with the command kubectl get node -o wide:

kubectl get node -o wide | grep 10.130.140.246

k8s-controller01   Ready      control-plane   3d17h   v1.28.6   10.130.140.246   <none>        Ubuntu 22.04.3 LTS   6.5.0-17-generic    containerd://1.7.11
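The two lookups above can be combined into small helpers. This is a sketch under assumptions: the function names are hypothetical, and node_name_for_ip relies on INTERNAL-IP being the sixth column of kubectl get node -o wide, as in the sample output above.

```shell
# Print the first address from a comma-separated NODE_IPS value.
first_node_ip() {
  printf '%s\n' "$1" | cut -d, -f1
}

# Read NODE_IPS from the ovn-central Deployment, checking every container's env.
get_node_ips() {
  kubectl get deployment -n kube-system ovn-central \
    -o jsonpath='{.spec.template.spec.containers[*].env[?(@.name=="NODE_IPS")].value}'
}

# Print the name of the k8s node whose INTERNAL-IP column matches the address.
node_name_for_ip() {
  kubectl get node -o wide | awk -v ip="$1" '$6 == ip { print $1 }'
}
```

For example, node_name_for_ip "$(first_node_ip "$(get_node_ips)")" should print the name of the node to use in the steps that follow.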

Trying to create a pod for `ovsdb-tool`

Since your k8s cluster probably doesn't use Docker itself, you can try to create a pod as an alternative to docker run instead of running a container directly. Try this before scaling your OVN replicas down to 0, as not having ovn-central available should interfere with pod creation. Even if you haven't scaled your replicas down, however, the broken ovn-central might still prevent k8s from creating the pod.

Read the notes below the pod manifest for edits you may need to make:

apiVersion: v1
kind: Pod
metadata:
  name: ovn-central-kubectl
  namespace: kube-system
spec:
  serviceAccount: "ovn"
  serviceAccountName: "ovn"
  nodeName: <full name first _k8s_ node from NODE_IPS>
  tolerations:
  - key: node-role.kubernetes.io/control-plane
    operator: "Exists"
    effect: "NoSchedule"
  volumes:
  - name: host-config-ovn
    hostPath:
      path: /etc/origin/ovn
      type: ""
  - name: backup
    persistentVolumeClaim:
      claimName: ovndb-backup
  containers:
  - name: ovn-central-kubectl
    command:
      - "/usr/bin/sleep"
    args:
      - "infinity"
    image: docker.io/kubeovn/kube-ovn:v1.13.13
    volumeMounts:
    - mountPath: /etc/ovn
      name: host-config-ovn
    - mountPath: /backup
      name: backup

You also have to make sure to get the pod on the k8s node with the first IP of NODE_IPS from your ovn-central installation, as the Kube-OVN documentation indicates, so see the section on "finding the first node" above to fill in <full name first _k8s_ node from NODE_IPS> in the example pod manifest above.

You can save this to a YAML file, and kubectl apply -f <file>.
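A minimal sketch of applying the manifest and waiting for the pod follows. The function name is hypothetical and the 120-second timeout is an arbitrary choice; the pod name and namespace come from the example manifest above.

```shell
# Hypothetical helper: apply the saved manifest, then wait for the pod from
# the example manifest to become Ready. $1 is the path to the YAML file.
apply_ovsdb_tool_pod() {
  kubectl apply -f "$1" &&
    kubectl -n kube-system wait --for=condition=Ready \
      pod/ovn-central-kubectl --timeout=120s
}
```

If the wait times out, kubectl -n kube-system describe pod ovn-central-kubectl may show why the pod did not start.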

You may need to delete the backup entries under .spec.volumes and .spec.containers[].volumeMounts if you don't have that volume (although a default Genestack installation stores the scheduled snapshots there) or if trying to use it causes problems. If the mount works, you can kubectl cp a previous backup off of it to restore.

Additionally, you may need to delete the tolerations in the manifest if you untainted your controllers.

To reiterate, if you reached this step, this pod creation may not work because of your ovn-central problems, but a default Genestack installation can't docker run the container directly as shown in the Kube-OVN documentation because it probably uses containerd instead of Docker. When I tried creating a pod like this with ovn-central scaled to 0 pods, the pod stayed in ContainerCreating status.

If creating this pod worked, scale your replicas to 0, use ovsdb-tool to make the files you will use for the restore (for both the north and south DBs), then jump to Full recovery as described below and in the Kube-OVN documentation.
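As a sketch of the ovsdb-tool step inside the pod, the helper below converts the clustered databases to standalone files with ovsdb-tool cluster-to-standalone. The function name is hypothetical, and the file names ovnnb_db.db and ovnsb_db.db are assumptions, so check the contents of /etc/ovn in the pod first.

```shell
# Sketch: convert the clustered NB and SB databases to standalone files inside
# the ovn-central-kubectl pod. The file names are assumptions; verify them with
# `kubectl -n kube-system exec ovn-central-kubectl -- ls /etc/ovn` beforehand.
make_standalone_dbs() {
  for db in ovnnb_db ovnsb_db; do
    # Argument order: output (standalone) file first, clustered input second.
    kubectl -n kube-system exec ovn-central-kubectl -- \
      ovsdb-tool cluster-to-standalone \
      "/etc/ovn/${db}_standalone.db" "/etc/ovn/${db}.db" || return 1
  done
}
```

Because /etc/ovn in the pod is the node's /etc/origin/ovn, the standalone files end up on the node next to the originals.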

`ovsdb-tool` from your Linux distribution's packaging system

The docker run approach may not work on your cluster, and the pod creation may not work because of your broken OVN. If you still want to try to use the OVN DB files on your k8s nodes instead of going to one of your snapshot backups, you can install your distribution's package that provides ovsdb-tool (openvswitch-common on Ubuntu). You risk, and will probably have, a slight version mismatch with the OVS version within your normal ovn-central pods. OVSDB has a stable format, so this likely will not cause any problems. Still, you should probably prefer restoring a previously saved snapshot over using an ovsdb-tool with a slightly mismatched version, and consider the mismatched version only if you don't have other options.
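A sketch of this path, run directly on the k8s node, might look like the following. It assumes ovsdb-tool is already installed (for example with sudo apt-get install openvswitch-common on Ubuntu); the function name and the ovnnb_db.db and ovnsb_db.db file names are assumptions. It writes the standalone copies to a separate directory so the files under /etc/origin/ovn stay untouched.

```shell
# Sketch for the k8s node itself; assumes ovsdb-tool is already installed.
# $1: directory holding the clustered DB files (default /etc/origin/ovn).
# $2: directory for the standalone copies (default: current directory).
host_standalone_dbs() {
  src="${1:-/etc/origin/ovn}"
  dest="${2:-.}"
  # Print the tool version so you can compare it with the ovn-central pods.
  ovsdb-tool --version || return 1
  for db in ovnnb_db ovnsb_db; do
    [ -f "$src/$db.db" ] || { echo "missing $src/$db.db" >&2; return 1; }
    ovsdb-tool cluster-to-standalone \
      "$dest/${db}_standalone.db" "$src/$db.db" || return 1
  done
}
```

For example, host_standalone_dbs /etc/origin/ovn /root/ovn-recovery would leave both standalone files in /root/ovn-recovery, a directory name chosen here only for illustration.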

Conclusion of using the OVN DB files on your k8s nodes

The entire section on using the OVN DB files from your nodes just gives you an alternative to a planned snapshot backup for getting something to restore the database from. From here forward, the directions converge with full recovery as described below and in the full Kube-OVN documentation.

Full recovery

You start here when you have the north database and south database files you want to use for your recovery, whether you retrieved them from one of your k8s nodes as described above or got them from one of your snapshots. Technically, the south database can be rebuilt from the north database alone, but if you have a matching pair, restoring the south DB as well saves the time a full rebuild would take. It also avoids relying on the ability to rebuild the south DB in case something goes wrong.

If you just have your PersistentVolume with the snapshots, you can try to create a pod as shown in the example manifest above with the PersistentVolume mounted and kubectl cp the files off.
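Assuming the example pod is running with the backup volume mounted at /backup, copying a snapshot off could be sketched as below. The function names are hypothetical, the snapshot file names depend on what the CronJob wrote, and kubectl cp needs tar to be available in the container.

```shell
# List the snapshots on the backup volume mounted in the example pod.
list_ovn_snapshots() {
  kubectl -n kube-system exec ovn-central-kubectl -- ls -l /backup
}

# Copy one snapshot (path relative to /backup) into the current directory.
# Usage: fetch_ovn_snapshot <file-name-from-list_ovn_snapshots>
fetch_ovn_snapshot() {
  [ -n "$1" ] || { echo "usage: fetch_ovn_snapshot <file>" >&2; return 2; }
  kubectl cp "kube-system/ovn-central-kubectl:/backup/$1" "./$(basename "$1")"
}
```

Run list_ovn_snapshots to see what is available, then pass the chosen file name to fetch_ovn_snapshot.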

However you got the files, full recovery from here forward works exactly as described in the Kube-OVN documentation, which, at a high level, starts with scaling your replicas down to 0.
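For reference, scaling the ovn-central Deployment can be sketched as follows. The function name is hypothetical; the Deployment name and namespace are the ones used earlier in this document.

```shell
# Hypothetical helper: set the replica count of the ovn-central Deployment.
# Usage: scale_ovn_central <replicas>
scale_ovn_central() {
  case "$1" in
    ''|*[!0-9]*) echo "usage: scale_ovn_central <replicas>" >&2; return 2 ;;
  esac
  kubectl scale deployment -n kube-system ovn-central --replicas="$1"
}
```

Use scale_ovn_central 0 to start the recovery, and note your original replica count beforehand so you can restore it when the Kube-OVN documentation tells you to scale back up.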