Mumbling about computers

Docker-based images on baremetal

2021-01-23 [ systems-deployment ]

Exploring how to achieve some of the properties of containerized deployments on bare metal servers.

The properties I'm interested in for this post are:

  • Fast deployment (+ fast rollbacks)
  • Testing different OS versions with production workloads

I investigated before how to generate and flash custom OS images, and we can keep building based on those findings.

The changes will be:

  • Building the images with Docker instead of bash scripts
  • Multiple images in one disk

The change to Docker comes from tooling better being available, integrations with CI and to avoid reimplementing the layer+caching logic.

Docker based OS image

To contrast how easy it easy to start building the base image with docker, we can look at my initial Dockerfile:

FROM centos:7.9.2009
RUN yum install -y systemd kernel

Something that I know before even giving the base image a spin is that there is no init manager (systemd) and no kernel on container base images, because it is completely unnecessary.

We can build this to get at the files that are inside

docker build -f centos7.9 -t 7.9 .

Sadly, you can't export an image's filesystem[1], you have to export a container's filesystem -- the difference here being that a container is an instance of an image.

$ CONTAINER_ID=$(docker run -itd 7.9 /bin/sh)
$ docker export $CONTAINER_ID | sudo tar -C 79
$ docker stop $CONTAINER_ID

With the container data we want to put on disk, we have to do the same as in this post so I will just list the repeated steps and go into more detail for the new ones.

Create the disk

dd if=/dev/zero of=$DISK bs=1G count=3

Partition it (2 partitions, will use one per image)

parted -s -a optimal $DISK mklabel msdos
parted -s -a optimal $DISK mkpart primary 0% 50% mkpart primary 50% 100% set 1 boot on

For some reason, running the other commands at the same time as mklabel had no effect.

Write GRUB to the boot sector

dd if=/usr/lib/grub/i386-pc/boot.img of=$DISK bs=446 count=1 conv=notrunc
dd if=core.img of=$DISK bs=512 seek=1 conv=notrunc

Now, for each partition we have to mount it and write the image data to it.

$ parted -s $DISK unit B print all | grep -A 2 Number
Number  Start        End          Size         Type     File system  Flags
 1      1048576B     1610612735B  1609564160B  primary  xfs          boot
 2      1610612736B  3221225471B  1610612736B  primary  xfs

With the start + size fields we can mount each partition in a loop device

$ LOOP=$(sudo losetup -f)
$ sudo losetup $LOOP $DISK --offset $OFFSET --maxsize $SIZE
$ sudo mkfs.xfs -f $LOOP
$ sudo mount $LOOP $DIR

Copy the grub modules and grub config

$ sudo mkdir -p $DIR/boot/grub/i386-pc
$ sudo cp /usr/lib/grub/i386-pc/* 79/boot/grub/i386-pc
$ sudo cp embedded_combined 79/boot/grub/grub.cfg

$ sudo umount 79
$ sudo losetup -D $LOOP

The grub config has to have one entry per partition, pointing to their respective kernel and initrd files:

set timeout=1
menuentry "centos 8.3" {
    set root=(hd0,msdos2)
    linux  /boot/a09b09cbf0ce4eea811f98ace1b0f17d/4.18.0-240.10.1.el8_3.x86_64/linux root=/dev/vda2 rw console=tty0 console=ttyS0,115200
    initrd /boot/a09b09cbf0ce4eea811f98ace1b0f17d/4.18.0-240.10.1.el8_3.x86_64/initrd
menuentry "centos 7.9" {
    set root=(hd0,msdos1)
    linux  /boot/vmlinuz-3.10.0-1160.11.1.el7.x86_64 root=/dev/vda1 rw console=tty0 console=ttyS0,115200 modprobe.blacklist=floppy
    initrd /boot/initramfs-3.10.0-1160.11.1.el7.x86_64.img

Problems booting the image

Initrd does not support virtio disks

At least on the CentOS7.9 image, the virtio drivers are not built into the initrd, so it can't see the disk.

In the dockerfile we can re-generate the initrd with dracut, although given that we are running in a container we have to specify a bunch of options about on which kernel this will actually run (not the currently running kernel, as is the default).

RUN dracut --kmoddir /lib/modules/3.10.0-1160.11.1.el7.x86_64/ --kver 3.10.0-1160.11.1.el7.x86_64 --add-drivers "virtio_blk virtio_scsi xfs" --no-hostonly -M -f /boot/initramfs-3.10.0-1160.11.1.el7.x86_64.img

Initrd does not support XFS

Similar to previous step, xfs is unsupported so we have to tweak the dracut call with an extra driver (xfs) and --filesystems xfs

Older CentOS can't mount newer XFS versions (as built on the host machine)

When booting, the step that tried to boot the new fs from the initramfs would fail with a weird message:

XFS superblock has read-only compatible features (0x4) enabled

This can be mitigated by disabling reflink when building the fs: mkfs.xfs -m reflink=0, found here.

Default systemd target does not exist

Default target is graphical-target, which is not installed/configured so boot stalls

$ ls -lhrt ./usr/lib/systemd/system/
lrwxrwxrwx 1 root root 16 Dec 18 00:30 ./usr/lib/systemd/system/ ->

Fix by adding a step to the Dockerfile:

RUN cd usr/lib/systemd/system && ln -sf

Boot seems to progress normally but it never displays a TTY

The Welcome to CentOS message along with a bunch of [OK] Service XYZ messages fly through and suddenly... stall, without ever showing a login prompt.

Enable extra debug on systemd:

On the kernel cmdline, add systemd.log_level=debug systemd.log_target=console console=ttyS0,38400 console=tty1

which will spam the console endlessly... and uselessly as you can't read through so much text.

Luckily, we can run this disk as it is in a VM, and redirect the VM's serial device to a file:

$ kvm -drive file=$DISK,format=raw,if=virtio -m 2048 -chardev stdio,id=char0,logfile=serial.log,signal=off -serial chardev:char0

In this file, grepping for tty yields a very interesting message: Cannot add dependency job, ignoring: Unit is masked.

The tty job is disabled! Of course we won't see a log-in prompt!! Hours of debug are solved with a simple line:

RUN systemctl unmask

Toggling between OS versions

At this point we have everything we need to write multiple OS images to the same disk and boot to each.

The simplest option here would be to upgrade the GRUB defaults and just reboot, but that takes "long" -- depending on hardware configuration a reboot can take up to 20 minutes, 19 of which are spent on POST. Luckily there's a way to "warm reboot" by loading a new kernel and jumping into it, this process is called kernel execute (kexec for short) and it has similar characteristics to exec that is, switch a running image (process, kernel) with a new one and execute it.

With everything in place, kexec is super straight forward to run:

grep $other_disk /boot/grub/grub.cfg | awk '{$1=$2=""; print $0}'
kexec -l /$other_disk/boot/vmlinuz* --initrd=/$other_disk/boot/init* --append="$args"
kexec -e

That's it! you are now rebooting into another kernel.

Small demo:

[1] flattened; you can get every layer individually with docker save