Grouch mumbling about computers.


Messing up backups

This is the story of how I managed to trash my server, all my VMs and two databases while upgrading Proxmox on a boring Sunday afternoon.

Upgrading the host

A few weeks ago Proxmox 6.0 was released and I decided to upgrade; this included a stretch -> buster upgrade.
I was quite confident that the upgrade would go smoothly, as I have upgraded this installation all the way from Wheezy... oh wow, it's been a long time.

The upgrade didn't go quite right: the kernel would silently hang after a reboot. Trying to boot previous kernels didn't help, and after two days of fighting in a chroot I opted to simply reinstall the system without taking any precautions; I was confident in my backups.

The mistake

Turns out that, while the data itself was safe, Proxmox's configuration lives on an in-memory filesystem mounted on /etc/pve, and my backup script calls rsync with -x (--one-file-system). Since rsync does not cross filesystem boundaries with that flag, /etc/pve was never backed up.
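A check along these lines would have flagged the gap before it mattered: list every mount point nested under the backup root, since those are exactly what rsync -x skips. This is only a sketch; nested_mounts is a name made up for this example, and it assumes a /proc/mounts-style table on stdin (device, mount point, type, options).

```sh
#!/bin/sh
# Print mount points strictly below ROOT, reading /proc/mounts-style
# lines on stdin. nested_mounts is a hypothetical helper, not part of rsync.
nested_mounts() {
    root="${1%/}"   # drop a trailing slash; "/" becomes the empty string
    awk -v root="$root" '
        {
            mp = $2   # second field is the mount point
            # keep only paths that start with "ROOT/" and are not "/" itself
            if (mp != "/" && mp != root && index(mp, root "/") == 1) print mp
        }'
}
```

On a live host it could be run as `nested_mounts / < /proc/mounts`; anything it prints needs its own entry in the backup job.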

The files that live within /etc/pve are purely metadata about the containers: which storage is used, number of cores, memory size, VLAN, and mounts.

While losing this metadata was quite annoying, it was not the end of the world, as all of the containers were created with some ansible playbooks a few years ago:

commit ab5015c7cd11a31c7a7159a0384c627962ff6439
Author: David
Date:   Sun Dec 18 20:53:07 2016 -0300

    init dns container

A small upside?

For now, to keep this from happening again, I've added /etc/pve to the list of filesystems to back up and moved the creation of VMs/containers to ansible as well. An example snippet:

- hostname: web
  disk_size: 8
  cores: 4
  memory: 2048
  interfaces:
    - name: eth0
      bridge: vmbr20
  mounts: '{"mp0":"/storage/ownclouddata,mp=/ownclouddata"}'

Having metadata in a static representation has the (unintended) side effect that doing static-analysis is also easier.
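As a rough illustration, a tiny check can total the memory assigned across all definitions and compare it with what the host actually has. This is a sketch only: it assumes every entry carries a flat `memory: N` line like the snippet above, and total_memory is a hypothetical helper, not part of the playbooks.

```sh
#!/bin/sh
# Sum every "memory: N" value found on stdin and print the total (in MB,
# assuming the definitions use MB as in the example snippet).
total_memory() {
    awk '$1 == "memory:" { sum += $2 } END { print sum + 0 }'
}
```

It would be fed the definitions file on stdin, e.g. `total_memory < vms.yml`, where vms.yml is a placeholder name for wherever the list lives.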

Re-creating the VMs is then a simple call to the proxmox module in ansible, in a loop:

- name: create
  proxmox:
    node: "bigserver"
    api_user: "{{ api_user }}"
    api_password: "{{ api_password }}"
    hostname: "{{ item.hostname }}"
    storage: "storage"
    cpus: "1" # numa nodes
    pubkey: "{{ pubkey }}"
    ostemplate: "{{ item.template | default(_template) }}"
    unprivileged: "{{ item.unprivileged | default('yes') }}"
    cores: "{{ item.cores | default(1)}}"
    memory: "{{ item.memory | default(2048) }}"
    onboot: "{{ item.onboot | default(1) }}"
    disk: "{{ item.disk_size | default(3) }}"
    netif: "{{lookup('proxmox_interface_format', item.interfaces)}}"
    state: present
  tags: [lxc_setup]
  loop: '{{vms}}'

Restoring data

Once the VMs were re-created, I had to recover data from a few stateful containers. All of the data was accessible from the host, as the filesystems are ZFS subvolumes and they remained intact.


Restoring influx data was quite easy:

  • install influx
  • stop influx
  • overwrite /var/lib/influxdb/{data,wal}
  • run restore command
  • start influx

The restore command was:

root@db:~# sudo -u influxdb influx_inspect buildtsi -datadir /var/lib/influxdb/data/ -waldir /var/lib/influxdb/wal/


Restoring Gogs was also fairly trivial; I only had to restore three sets of files:

  • sqlite database
  • gogs daemon config
  • repositories


Restoring MySQL was a disaster. By this point it was well past midnight and I made a grave mistake: I copied the brand-new (empty) metadata files over the original metadata files, making the problem much worse.

With information from multiple sources I managed to re-generate the metadata (frm files). These are the commands I ran, as taken from my shell history:

 # make a local copy of the data to work on
 scp -r root@bigserver:/tank/proxmox-images/subvol-105-disk-1/var/lib/mysql/owncloud/ .
 # to run mysqlfrm you need to have the mysql binaries installed locally
 sudo apt install mysql-client mysqld
 # run a test to see the output
 mysqlfrm --server=root:root@db owncloud:oc_accounts.frm --port=3307
 # this looks fine; it simply outputs the `CREATE TABLE` commands

 # generate the table schema for all tables
 for f in *.frm; do mysqlfrm --server=root:root@db owncloud:"$f" --port=3308 >> results.sql; echo "$f"; done
 # 2 tables failed randomly -- re-running the command fixed it
 mysqlfrm --server=root:root@db owncloud:oc_properties.frm --port=3308 >> results.sql
 mysqlfrm --server=root:root@db owncloud:oc_retention.frm --port=3308 >> results.sql
 # a few statements were missing the trailing ; so add it to make this valid SQL
 sed 's/COMPRESSED$/COMPRESSED;/' results.sql > rr.sql
 # import the sql file to create the tables
 mysql -u root -proot -h db owncloud < rr.sql
 # discard the newly created (empty) tablespaces
 for f in *.frm; do echo "$f"; fname=$(echo "$f" | cut -d. -f1); mysql -u root -proot -h db owncloud -e "alter table owncloud.$fname DISCARD TABLESPACE;"; done
 # overwrite them with the original data
 for f in *.ibd; do scp "$f" root@db:/var/lib/mysql/owncloud/"$f"; done
 # re-import the tablespaces
 for f in *.frm; do echo "$f"; fname=$(echo "$f" | cut -d. -f1); mysql -u root -proot -h db owncloud -e "alter table owncloud.$fname IMPORT TABLESPACE;"; done

This got the database back in working order, though it was quite stressful.
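The two ALTER TABLE loops above differ only in one keyword, so in hindsight they could be generated up front and reviewed before anything touches the server. A minimal sketch, assuming one .frm file per table in a local directory; tablespace_sql is a hypothetical helper name, and the output is meant to be inspected and then piped into the mysql client.

```sh
#!/bin/sh
# Print one "ALTER TABLE ... <ACTION> TABLESPACE;" statement per .frm file.
# Usage: tablespace_sql DISCARD|IMPORT DIR DATABASE
tablespace_sql() {
    action="$1"
    dir="$2"
    db="$3"
    for f in "$dir"/*.frm; do
        [ -e "$f" ] || continue          # skip the literal pattern if nothing matches
        table=$(basename "$f" .frm)      # strip the directory and the .frm suffix
        printf 'ALTER TABLE `%s`.`%s` %s TABLESPACE;\n' "$db" "$table" "$action"
    done
}
```

For example, `tablespace_sql DISCARD . owncloud > discard.sql` writes the statements to a file that can be read through before running it.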


For the rest of the VMs (music, web servers, reverse proxies, etc.) it was just a matter of re-running the ansible playbooks against them.
It worked quite well, though I had to work around a few differences caused by the base image changing from jessie to buster.


Backups are not backups until tested. This showed that while the data I have is kind-of safe, the cost of drives dying (and thus losing all the metadata as well) would be quite high. I intend to revisit the backup mechanism in the near future:

  • Back up /etc/pve
  • Full mysql backup
  • Full influxdb backup
  • Full postgres backup

I will see if it makes sense to try out some semi-automated environment recoveries somehow.