
Running Untrusted Workloads on K8s + Container Linux (Part 1)


tl;dr - I kinda succeeded in getting simplistic VM-level isolation working on a Container Linux-powered Kubernetes cluster, with lots of failures along the way. This post is cobbled-together notes from the exploration stage, which ultimately led to an extremely hackish CoreOS VM powered by qemu, running inside a privileged Kubernetes pod, itself running on top of a dedicated CoreOS machine. The notes behind this post are very old – I've actually already switched to Ubuntu server for my Kubernetes cluster – but I figured it was worth editing and releasing them for anyone experimenting with CoreOS Container Linux or Flatcar Linux.

UPDATE (09/08/2018)

A reader of this article named Fred recently contacted me to let me know that he made an Ansible playbook with tasks for manipulating cloud images. In particular, the cloudinit-iso.yml and templates folder are the places to look for some inspiration. Thanks again to Fred for pointing this out – I wanted to pass this along to anyone who is interested in doing more proper/principled cloud image modification.

This blog post is part of a multi-part series:

  1. Part 1 - Hacking my way to container+VM inception (this post)
  2. Part 2 - rkt alternate stage1 experiments
  3. Part 3 - giving kata-runtime a fair shake

One of the great things about the Kubernetes ecosystem (and containerization in general) is the innovation it's spurred in the container runtime space. The programs tasked with running containers ("container runtimes") have standardized, and new container runtimes have become more established. On the standardization front, Container <something> Interfaces are pushing things forward – runtimes have the Container Runtime Interface to look to, networking has the Container Networking Interface, and storage solutions have the Container Storage Interface. It's probably worth repeating that containers are just sandboxed and isolated/restricted processes – BSD Jails, Solaris Zones, and Linux's LXC features are all ways to isolate processes, and Docker came along to put an approachable face and ergonomic CLI on all of it. Half of the time, depending on what you're trying to protect, containers don't even contain. Since containers are sandboxed/isolated processes, the usual process security mechanisms apply – namely seccomp and AppArmor, with some exciting new entries like gVisor.

One of the most exciting things about CRI (and abstraction with interfaces in general) is the ability for different runtimes to provide different methods for the sandboxing that containers are supposed to provide. To get more concrete, projects like frakti and kata-containers (itself a merger of Intel's Clear Containers and Hyper's runV) offer containerization in VMs rather than in normal processes.

NOTE: While Frakti v1 actually focused on providing a shim over multiple runtimes, Frakti v2 is aiming to focus on being primarily a containerd plugin.

One of the many, many projects I want to build is a service that makes it easy to spin up Kubernetes clusters on top of an existing Kubernetes cluster. Essentially this means building GKE, but with as much of the heavy lifting as possible done by the underlying Kubernetes cluster. The smaller internal clusters would greatly benefit from being VM-level isolated, so I took some time to explore the untrusted-workload space in Kubernetes and check out what my options were. I had some success, but mostly failed at the various things I wanted to try.

Methodology

I first set out to survey the landscape and see what options are available to me, and here’s what I found:

LXD machine containers + a k8s Operator - Actually using distribution-level support for LXD machine containers, and creating a Kubernetes operator (which would likely need to be privileged, and use things like hostNetwork and hostPaths) that could spin up LXD machine containers. Roughly the steps would look like this:

  • Create a k8s cluster
  • Create an Operator that watches for a CRD like MiniCluster, TenantCluster, or even KubernetesCluster
  • Ensure that the operator spins up and manages LXD machine images and sets them up initially to serve as kubernetes nodes (a sketch of the kind of resource it might watch for follows below)
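To make the idea concrete, here's the rough shape of the resource I imagined the operator watching for – everything in it (the API group, the kind, the fields) is hypothetical, nothing here exists:

cat <<'EOF' | kubectl apply -f -
apiVersion: example.com/v1alpha1
kind: TenantCluster
metadata:
  name: team-a
spec:
  nodes: 3                    # LXD machine containers to run as k8s nodes
  kubernetesVersion: "1.10"
EOF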

Runtime-supported isolation - Using proper runtime support (for example containerd's untrusted workload options, or frakti + kata-containers) to run pods that are automatically placed in VMs when the container runtime starts them.

QEMU-in-a-pod - Actually running qemu in a pod. I didn't know this was really viable until seeing a random dockerfile that purported to be able to start qemu. This approach was interesting to me because VMs, of course, are processes – so it makes perfect sense that they'd be containerizable. This is likely the simplest approach conceptually: no additional runtimes needed, no lxd, just a kubernetes pod whose image does nothing but run qemu. Resources given to the pod automatically become the VM's resources (so making a bigger VM isn't a matter of creating a configuration language to use with a CRD – I can just use the pod's facilities and make sure the VM uses everything).

NKube - I found a somewhat defunct-looking project called nkube that seems to handle my goals more directly. It's great news that kubernetes-in-kubernetes has been done a lot, because it made it easier to find resources.

A bunch of these approaches seem to rely heavily on docker-in-docker, and while I don't actually use docker as my container runtime (I use containerd and am very happy with it), it was good to see people discuss it and learn what they went through.

After surveying the field, I figured I'd try Method 2 (runtime support) first, since it seemed the most official/correct way to get it done. Method 1 (lxc+lxd) was pretty attractive after watching an amazing talk on how it worked from the people behind SilphRoad, but lxc/lxd didn't statically compile at the time. I posted on LXD's forum about it and also made an issue on LXC's github. As of this writing, lxc now builds statically (the Github issue is now closed), but at the time I was exploring it was still unsupported – so trying to install lxc+lxd on Container Linux seemed like a death sentence. Method 3 (QEMU-in-a-pod) also looked really appealing and easy, but it didn't seem like the "right" way to do things, so I decided to give Method 2 (runtime support) a try first.

Exploring method 2: Runtime-supported VM-level isolation

During my exploration I tried to use runV on its own, as kata-containers wasn't quite ready for primetime yet, and I briefly looked at the clear-containers project but was put off that they didn't have generic installation instructions. Along with runv I'd of course need to install qemu (and statically build it, because Container Linux is minimal), so I downloaded the qemu source from the wiki. At first I incorrectly assumed that multiarch's static qemu binary builds would be beneficial to me – they contain the user-mode binaries (as in qemu-x86_64), but what I needed was the -system binaries (as in qemu-system-x86_64).
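For reference, the static -system build I was attempting looked something like this (a sketch – the qemu version and exact configure flags are from memory, and as noted below it didn't pan out the way I expected):

(inside a build container)
wget https://download.qemu.org/qemu-2.12.0.tar.xz
tar xf qemu-2.12.0.tar.xz && cd qemu-2.12.0
./configure --static --target-list=x86_64-softmmu   # only build the x86_64 system emulator
make -j4
file x86_64-softmmu/qemu-system-x86_64              # hoping for "statically linked"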

After (wrongfully) thinking I had the qemu stuff sorted, I went ahead with trying to install runv, which meant installing hyperstart. Unfortunately there weren't easy-to-find static binaries for it, and all the instructions were for other linux distributions (with no easy/generic build) – I needed to try and build it myself.

Long story short, attempting to build a completely static hyperstart binary went absolutely terribly. My usual trick is to start with alpine linux in a container and work through the build instructions, fiddling with build switches to try and get things to build statically. That approach didn't work at all. hyperstart depends on a site.com/script | bash type install, and it basically just tells you that alpine isn't supported, so I threw my hands up. As I didn't want to try the static build on a different distribution (notably because of the lack of musl libc), I decided I'd try some of the other methods and maybe come back to runv with proper runtime support another time. A cursory search for anyone who had gotten runv/hyperstart working on Container Linux turned up nothing, which kind of confirmed my hastily formed biases.

Successes: 0, Failures: 1.

Exploring method 1: LXD + Operator (brief attempt)

This is the most difficult/code-intensive solution, as it involves writing an operator (which likely must run in privileged mode) that will create, manage, and ultimately tear down lxc-powered machine containers on the node. I identified some gotchas up front:

  • Security is going to be hard – this operator is going to need/have root access to create LXD OS containers
  • Writing an operator is hard to get right (nowadays there are various libraries and frameworks to make writing operators easier; when I was looking it was a little more daunting)
  • Resources used by the CRD must be reflected in the Kubernetes system, otherwise over-allocation is going to happen SUPER quick – I wasn't exactly sure how this worked

After looking into it for a bit, it looked like this approach suffers from the same issues as Method 2 – I need to install the lxc and lxd binaries on my Container Linux machine, and I can't find anywhere that's done it successfully. At this point I started to wonder if I could mix this with Method 3, and run lxc/lxd from an OS that already has it (let's say Ubuntu server), but from a privileged container with access to the host system.

Turns out that isn’t possible according to an SO post. The basic reasoning is that docker (and presumably containerd) prevents syscalls and features that would be expressly necessary for something like LXD to work properly. The overwhelming majority of resources I could find were about running containers inside LXD and not the other way around. At this stage I reallly wasn’t looking to fight any windmills but I did want to give this approach a fair shake, and regardless I wanted to see if I could install lxc/lxd natively (as I supposedly can’t just run it in a container)on container linux, so I chose to try and do that.

Attempting to statically build lxc/lxd

The lxd Github repository was easy to access and pull down – so my first idea was to try and build it in Alpine Linux, with a healthy helping of compiler flags and musl libc. I had to hack a bit to even get started:

  • Use of https://github.com/gosexy/gettext/ (which is a C binding) means you need to export some variables so that it can find the constituents of the gettext-dev alpine package. I needed to add some flags (export CGO_LDFLAGS="-lintl -L/usr/lib" && export CGO_GCCFLAGS="-I/usr/include"). It took me a while to figure this out, but eventually I ran into a helpful github issue (the full setup is sketched below).
  • The lxc-dev alpine package is required (https://github.com/lxc/go-lxc/issues/44)
  • The go alpine package is obviously required
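Putting those pieces together, the setup inside the Alpine container looked roughly like this (reconstructed from notes – package names are as I remember them):

(inside an alpine container)
apk add --update go git gcc musl-dev gettext-dev lxc-dev
export CGO_LDFLAGS="-lintl -L/usr/lib"
export CGO_GCCFLAGS="-I/usr/include"
go get -d github.com/lxc/lxd
cd "$GOPATH/src/github.com/lxc/lxd" && make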

After going through the build, a quick check using file revealed that I wasn't doing it right – the generated binary @ $GOPATH/bin was actually dynamically linked… Time to put some more elbow grease into making sure it's static. It was surprisingly hard to find instructions on how to ensure static linking in Golang, as it had been a while since I'd done a golang project.
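This is the kind of check that exposed the problem:

file "$GOPATH/bin/lxd"   # reported "dynamically linked" -- not what I wanted
ldd "$GOPATH/bin/lxd"    # a truly static binary prints something like "not a dynamic executable"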

After the refresher course on statically building golang programs, I was primed to edit the Makefile – I figured only one line really needed to change, the default target:

CGO_ENABLED=0 GOOS=linux go install -a -ldflags '-extldflags "-static"' -v $(TAGS) -tags logdebug $(DEBUG) ./...
#go install -v $(TAGS) -tags logdebug $(DEBUG) ./...

As is tradition, this initial attempt failed – the build errored because of two packages in particular:

# github.com/lxc/lxd/shared
shared/archive_linux.go:79: undefined: DeviceTotalMemory
shared/network_linux.go:20: undefined: ExecReaderToChannel
# github.com/CanonicalLtd/go-sqlite3
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:18: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:26: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:27: undefined: namedValue
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:29: undefined: namedValue
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:35: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:36: undefined: namedValue
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:44: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:49: undefined: SQLiteConn
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:54: undefined: SQLiteStmt
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:63: undefined: SQLiteStmt
../../CanonicalLtd/go-sqlite3/sqlite3_go18.go:36: too many errors

At this point I started to wonder if what I was doing was as reasonable as it seemed, so I decided to ask a question on the lxc discussion forum and wait until I got an answer to decide how to proceed. Trying to statically build go-sqlite3 and the shared stuff from lxc seemed too much like trying to boil the ocean.

Since I was stuck, I got enticed by a different rabbit hole: trying to run lxc from inside a privileged container (instead of trying to build it statically to run from the system itself).

Sidetrack: Running lxc/lxd from inside a privileged container

Since I was stuck on getting static binaries for lxc and lxd to use at the system level, I also briefly looked into whether you can run lxd inside a container (despite what I read in the aforementioned SO post). Even if I can't run the whole thing inside the container, it might be useful to run the client in the container and patch the system's lxc socket into the container so it can do its work.

It was relatively easy to install lxc & lxd inside the container, though I did run into some roadblocks:

  • Lots of dependencies (effectively apt-get update && apt-get install iproute2 lxc lxd lxd-client ca-certificates was all I needed)
  • ipv6 support had to be disabled (this seems to be due to the kernel for the ubuntu image not being compiled with ipv6 support although iptables has it, according to a github issue on the subject), so I needed to disable networking in the init command invocation – see the sketch below
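The in-container setup amounted to something like this (a rough reconstruction – I believe the networking questions come from lxd's init wizard, which --auto skips):

(inside a privileged ubuntu container)
apt-get update && apt-get install -y iproute2 lxc lxd lxd-client ca-certificates
lxd init --auto   # non-interactive init; no network bridge configured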

Once I actually tried to run a container though, I ran into a cgroups issue:

root@7e301829204c:/# lxc launch ubuntu:16.04 first
Creating first
The container you are starting doesn't have any network attached to it.
  To create a new network, use: lxc network create
  To attach a network to a container, use: lxc network attach
Starting first
EROR[05-22|09:02:20] Failed starting container                action=start created=2018-05-22T09:02:17+0000 ephemeral=false name=first stateful=false used=1970-01-01T00:00:00+0000
Error: Failed to run: /usr/lib/lxd/lxd forkstart first /var/lib/lxd/containers /var/log/lxd/first/lxc.conf:
Try `lxc info --show-log local:first` for more info
root@7e301829204c:/# lxc info --show-log local:first
Name: first
Remote: unix://
Architecture: x86_64
Created: 2018/05/22 09:02 UTC
Status: Stopped
Type: persistent
Profiles: default
Log:
lxc 20180522090220.864 ERROR    lxc_start - start.c:lxc_spawn:1553 - Failed initializing cgroup support
lxc 20180522090220.864 ERROR    lxc_start - start.c:__lxc_start:1866 - Failed to spawn container "first"
lxc 20180522090220.864 ERROR    lxc_container - lxccontainer.c:wait_on_daemonized_start:824 - Received container state "ABORTING" instead of "RUNNING"

Looks like despite the fact that I started the container in privileged mode, the cgroup isolation is still interfering. At this point I decided to stop and try working with Method 3 (QEMU-in-a-pod). Hopefully someone will get back to me about building lxc/lxd statically, since it's not great that I have to do this in the first place.

Successes: 0, Failures: 2.

Method 4: nkube (brief dismissal)

I took a look at the nkube repo and all but instantly decided I didn’t want to try and use it. I figured I’d only try this IFF (if and only if) Method 3 and basically all other avenues failed.

Successes: 0, Failures: 2, Skips: 1.

Method 3: qemu-in-a-pod

As is almost always the case, the simplest/easiest method is the one I found myself reduced to pinning my hopes on. As a refresher, this approach is basically just running QEMU inside a pod. I found a dockerfile that runs qemu, so I'll basically be doing the minimum amount of effort to get it running.

My initial revision for the pod’s resource config looked like this:

---
apiVersion: v1
kind: Pod
metadata:
  name: qemu-test
spec:
  containers:
  - name: qemu
    image: tianon/qemu
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 1000m
        memory: 512Mi
      requests:
        cpu: 1000m
        memory: 512Mi
    ports:
    - containerPort: 22
      protocol: TCP
    env:
    - name: QEMU_HDA
      value: /data/hda.qcow2
    - name: QEMU_HDA_SIZE
      value: "10G"
    - name: QEMU_PU
      value: "1"
    - name: QEMU_RAM
      value: "512"
    - name: QEMU_CDROM
      value: /images/debian.iso
    - name: QEMU_BOOT
      value: 'order=d'
    - name: QEMU_PORTS
      value: '2375 2376'
    volumeMounts:
    - name: data
      mountPath: /data
    - name: iso
      mountPath: /images
      readOnly: true
  volumes:
  - name: data
    emptyDir: {}
  - name: iso
    hostPath:
      path: /qemu
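With the config in hand, kicking the tires looked something like this (assuming the spec above is saved as qemu-test.yaml – the filename is mine):

kubectl apply -f qemu-test.yaml
kubectl logs -f qemu-test          # watch qemu's console/serial output
kubectl exec -it qemu-test -- sh   # poke around next to the qemu process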

Note that I’m actually starting without use of /dev/kvm, the virtualization device, which greatly speeds up VMs. Without access to it the vm is much much slower, but can still function. Other than that small hidden caveat, the resource config is pretty straight forward – I did need to download some images and host them @ /qemu on the host system but other than that things went very smoothly. A lot of my time was spent inside the pod (via kubectl exec -it), poking around and trying to figure out what was going on.

DEBUG: Getting qemu working in the pod

Of course, things didn't go perfectly right away. The first problem was that although qemu was running inside the container, there wasn't much output, and I couldn't SSH into the machine that was running. It turned out that instead of using a CD-ROM (which you would normally use to install ubuntu), it would be much easier to use an Ubuntu cloud image with qemu and save myself a lot of time/manual work. Taking care of that made the output much, much better, and let me skip the install-from-livecd step and actually boot an ubuntu image itself. Here's what the output at the end of the log looked like:

   [  OK  ] Started Getty on tty1.
   [  OK  ] Reached target Login Prompts.
   [  OK  ] Started LSB: automatic crash report generation.
   [  OK  ] Started LSB: Record successful boot for GRUB.
   [  OK  ] Started Authorization Manager.
   [  OK  ] Started Accounts Service.
Ubuntu 18.04 LTS ubuntu ttyS0
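For reference, switching to the cloud image was just a matter of downloading the qcow2 from Ubuntu's cloud image site and pointing QEMU_HDA at it – this is the same bionic image that shows up in the final pod spec later:

wget https://cloud-images.ubuntu.com/bionic/current/bionic-server-cloudimg-amd64.img
qemu-img info bionic-server-cloudimg-amd64.img   # already a bootable disk image, no installer step needed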

Even after this successful boot I couldn't actually log in to the machine, even though I knew it had started and was running – the reason being that cloud images don't actually have username/password login enabled (which is nice from a security standpoint); they ONLY use SSH creds (via ~/.ssh/authorized_keys) for auth. So this meant I needed to go back to the (container building) drawing board and figure out how to make sure my SSH creds were set up inside the image itself. While it was a bit of a chore to have to go back, it was nice to get a look at how to implement something not too different from what bigger providers like AWS must do when people provision instances with pre-selected SSH keys.

DEBUG: Modifying my QEMU image for easy SSH access

At this point I could confirm via the logs that booting the pre-loaded image (in the hostPath) in qemu was successful – now it was time to ensure the right SSH keys were inserted into the image so I could try SSHing in. I didn't (and still don't) have much experience with programmatic VM image modification, so I found and started looking into a program called uvtool, along with some other resources.

Turns out there is also a Hashicorp tool built for this, called Packer. I looked into Packer, but it seemed a bit heavyweight – I didn't want to install it on my system, so I found the packer docker image that could run it. I was mostly concerned with using packer to build a qemu image, so the relevant documentation was that of its QEMU builder.

Looking at how much there was to learn with packer, I started to get a little skittish and looked for another way. All I was trying to do was inject a single file (~/.ssh/authorized_keys) into an already existing ubuntu cloud image… Packer seemed like overkill at this point. As I started looking for other resources, I came across a few that were enlightening.

Eventually I learned that someone had solved this, and neither packer nor qemu-img was necessarily the answer for the simple thing I wanted to do – libguestfs provides a suite of tools that were perfect! In particular, libguestfs's virt-copy-in command was exactly what I was looking for. The suite of tools can also be easily used in a containerized environment, so I don't have to install it on my system. Now, instead of using uvtool or packer, the plan was to use virt-copy-in from inside a transient container to change the image (mounted into the container). The libguestfs documentation wasn't super clear on whether the image would change in-place or if a new one would be created, but I wasn't really too worried about it.

Before getting started on actually running virt-copy-in though, I figured it was time to stop and have a think about which image I really wanted to use.

DEBUG: Rethinking the base image for qemu to run

Ubuntu isn’t a bad image to piock, but I found myself wondering how hard it would be to run Alpine (which is likely what I’d run in production). I was away from the computer for a bit and realized that rather than building images of ubunutu, I’d much rather be building images of alpine, especially if all I’m going to use it for is to run Kubernetes. Turns out Alpine has a tool for this. Then I had an even better idea – why download the image, then start running shit to install docker when I could just pick an image that’s small, already has docker and other container tools, since I already know how to initialize kubernetes from relative scratch (and I’ve written a blog post about it)! If you’re wondering This image is CoreOS!

Turns out CoreOS actually supports booting with qemu, which makes things pretty easy. So if you've been following along at home, we're now building a CoreOS image to run in a container on top of Kubernetes (the container orchestration system) running on top of CoreOS. I should be able to do the following:

  • Grab a CoreOS image
  • Use my pre-developed ignition.yaml (which I already checked will properly boot up a single node load-bearing cluster) from the aforementioned previous post
  • Start it in a pod, using the qemu image
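For context, the way an Ignition config is supposed to get into a Container Linux VM under qemu is the firmware config device – per the CoreOS docs, the invocation looks roughly like this (a sketch; exact flags from memory):

qemu-system-x86_64 \
    -machine accel=kvm -m 1024 \
    -drive if=virtio,file=coreos_production_qemu_image.img \
    -fw_cfg name=opt/com.coreos/config,file=./config.ign \
    -nographic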

The next issue I ran into was that the qemu firmware config (which is used to load ignition config files) wasn't working as expected. I found a github issue that seemed related, which pointed at proper KVM support not being enabled as the issue. Up until now I'd been using an unprivileged container, hoping that I could get all this working with minimal security impact, but it looked like I was going to have to go with a privileged container. I even saw some indications that it might not be as insecure as I thought to give a less-than-trusted VM process access to /dev/kvm.

Along the way I also found another guide where someone was doing the exact same thing. Of course, it turns out the person attempting it, @dghubble, was a CoreOS person, who even produced a video about the experiment. It was great to find this guide, since what he was trying to do is exactly what I'm doing, and he's definitely got a lot more knowledge of CoreOS, so there was a lot to learn from.

Coming back to my hackfest of a project, I was still having problems with the firmware config not being properly picked up, and after reading the github issue that was filed for it, I realized that it's actually not possible to expose only a subset of system devices (of which /dev/kvm is one) to a container. It does look like people are cognizant of the issue/need though, which is nice. Initially I thought Kubernetes's support for Device Plugins would be of help, after reading a few resources.

Unfortunately, neither of those approaches is a solution, because device-plugin devices very explicitly can't be shared amongst multiple pods. Obviously, /dev/kvm needs to be shared. There might be something I could do with symlinking, but I doubt it would work out. Unfortunately this means that if I want passable performance, I'm going to need to embrace using a privileged container. I did post about it in the github issue, but for now I'm just going to proceed with using a privileged pod. The good news is that it worked and the VM was way, way faster – CoreOS booted in seconds, and while it didn't fix the firmware config loading problem, it was nice to see such a speedup.

DEBUG: More image modification trial and error

Ultimately I failed in getting qemu to pick up my fw_cfg option, so I resorted to writing the authorized_keys file into the image myself. I found an amazing blog post that made it very clear, and set to writing some instructions that should have worked, but didn't, because libvirt wasn't available in the build container I was using. Here's what I ended up running:

(inside a fedora image, the coreos `toolbox` command)
# dnf install libguestfs-tools
# mkdir -p share/oem/
# cp only-ssh-ignition.json share/oem/config.ign
# virt-copy-in -a coreos_production_qemu_image.img share/ /usr/

While these commands modified the image in place (yuck), they actually didn't work. It was better to use the CoreOS helper script referenced in the qemu documentation. I kubectl exec'd into the qemu pod and ran the following:

(inside the qemu pod itself, which is an alpine image)
# apk add --update qemu-system-x86_64 bzip2 wget
# wget https://stable.release.core-os.net/amd64-usr/current/coreos_production_qemu.sh
# chmod +x coreos_production_qemu.sh
# ln -s /images/coreos_production_qemu_image.img coreos_production_qemu_image.img
# cp /images/only-ssh-ignition.json config.ign
# # Try with just script (modified image) => NOPE
# #./coreos_production_qemu.sh -i config.ign -- -nographic
# # COPY your id_rsa.pub onto the server
# ./coreos_production_qemu.sh -a ~/.ssh/authorized_keys -- -nographic

With this I could finally build the appropriate image and get the VM running properly, which was great! It's been a long, long process, but I finally have a CoreOS VM running inside a pod on Kubernetes, on a CoreOS system (albeit with questionable security).

Successes: 1, Failures: 2, Skips: 1.

Inception attempt and refining the process

Before I got too excited I figured I’d try a little inception and start a docker container inside the VM:

core@coreos_production_qemu-1688-5-3 ~ $ docker run alpine /bin/sh
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
ff3a5c916c92: Pull complete
Digest: sha256:7df6db5aa61ae9480f52f0b3a06a140ab98d427f86d8d5de0bedab9b8df6b1c0
Status: Downloaded newer image for alpine:latest
core@coreos_production_qemu-1688-5-3 ~ $ docker run -it alpine /bin/sh
/ #

For those of you keeping track at home, we're now in a docker container inside a CoreOS VM running inside a pod on kubernetes on a CoreOS system, and it's pretty responsive (thanks in large part to containers being so light)!

I figured all this success was too much, so I decided to try without /dev/kvm support, and found that I actually needed to enable a switch to qemu-system-x86_64 – the -s switch – to get qemu to work without access to /dev/kvm. The good news is that it worked, but the bad news is that it was terribly slow! Either way, success on two fronts, even though the non-/dev/kvm version is very likely unusable.

Successes: 2, Failures: 2, Skips: 1.

The final test: starting the pod and SSHing in

Now that I’ve gotten it working very experimentally (kubectl execing into the pod and fiddling), the final step was to put all this into a pod spec and start it, then SSH in without any fiddling.

Here’s what the final YAML config for the pod looked like:

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: qemu-ssh-keys
data:
  authorized_keys: |
        ssh-rsa <GIBBERISH> user@machine

---
apiVersion: v1
kind: Pod
metadata:
  name: qemu-test
spec:
  containers:
  - name: qemu
    image: alpine
    #image: tianon/qemu
    # Really any image that builds & runs QEMU could go here,
    # turns out it's not that hard necessarily, even alpine can do it
    imagePullPolicy: IfNotPresent
    resources:
      limits:
        cpu: 1000m
        memory: 1Gi
      requests:
        cpu: 1000m
        memory: 1Gi
    #securityContext:
    #  privileged: true
    ### doesn't work
    ## securityContext:
    ##   runAsUser: 500 # on the machine, the core user is 500
    ##   runAsGroup: 78 # on the machine, the kvm group happens to be 78
    ports:
      - containerPort: 22
        protocol: TCP
    ## never got the fw_cfg cmd to work properly :(
    # command:
    #   - start-qemu
    #   - -fw_cfg
    #   - "name=opt/com.coreos/config,file=/images/only-ssh-ignition.json"
    #   - -nographic
    command: [ "/bin/ash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
    env:
      - name: QEMU_HDA
        # value: /data/hda.qcow2
        # value: /images/bionic-server-cloudimg-amd64.img
        value: /images/coreos_production_qemu_image.img
      - name: QEMU_HDA_SIZE
        value: "10G"
      - name: QEMU_PU
        value: "1"
      - name: QEMU_RAM
        value: "1024"
      # - name: QEMU_CDROM
      #   value: /images/debian.iso
      # - name: QEMU_BOOT
      #   value: 'order=d'
      - name: QEMU_NO_SERIAL
        value: "1"
      - name: QEMU_PORTS
        value: '2375 2376'
      ##
      ## CoreOS Ignition config can be set, using the firmware config
      ## don't even have to copy in SSH configs!
      ##
    volumeMounts:
      - name: data
        mountPath: /data
      - name: ssh-keys
        mountPath: /ssh-keys
      - name: iso
        mountPath: /images
      - name: kvm-device
        mountPath: /dev/kvm
        # readOnly: true
  volumes:
    - name: data
      emptyDir: {}
    - name: ssh-keys
      configMap:
        name: qemu-ssh-keys
    - name: iso
      hostPath:
        path: /qemu
    - name: kvm-device
      hostPath:
        path: /dev/kvm

## TO RUN THIS:
#apk add --update qemu-system-x86_64 bzip2 wget
#wget https://stable.release.core-os.net/amd64-usr/current/coreos_production_qemu.sh
#chmod +x coreos_production_qemu.sh
#ln -s /images/coreos_production_qemu_image.img coreos_production_qemu_image.img
# mkdir -p ~/.ssh
#cat /ssh-keys/authorized_keys >> ~/.ssh/authorized_keys

After starting the pod it was as easy as running the following commands:

$ kubectl port-forward qemu-test 2222:22
$ ssh localhost -p 2222

As you might have guessed, it worked! After all the experimentation I did inside the container, all it took was making sure the changes to my process actually stuck, and sure enough, QEMU booted up.

Security?

Theoretically, this setup should be safer than a regular pod, because the user is exposed to the VM INSIDE the pod, so there are two levels of security to escape. Combined with the usual best practices for kubernetes pod security (PodSecurityPolicy, seccomp, apparmor, etc), it might even be production-worthy. However, it's terrible that we have to use a privileged pod – it makes things so much worse, because a container + VM escape now equals a node compromise.

Performance?

At first I wasn’t sure how to benchmark the performance, since I haven’t done much work on this side of the ops spectrum – I figured the important things to check were:

  • Network
  • Device access (IOPS)
  • Instruction processing speed (CPU ops)

I started looking around for tools to test these things, and found a few. For testing the network I figured I could use siege, maybe spinning up a very simple web server and pinging it. For device access, I found lots more resources.

But it was really when searching for how to test CPU ops that I found the best tool for the job so far. I came across a Stack Exchange question which led me to stress-ng, a fantastic tool that checks CPU and IO metrics. It looked like the command I needed to run was something like this:

stress-ng --cpu 1 --io 2 --vm-bytes 1G --timeout 60s --metrics-brief

So I did, for all three environments:

Regular resource-constrained pod run in k8s on the machine (machine > k8s > container, NO qemu). CMD: stress-ng --cpu 1 --io 2 --vm-bytes 1G --timeout 60s --metrics-brief. RESULTS:

stress-ng: info:  [52] dispatching hogs: 1 cpu, 2 io
stress-ng: info:  [52] successful run completed in 60.07s (1 min, 0.07 secs)
stress-ng: info:  [52] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [52]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [52] cpu               15169     60.04     58.31      0.00       252.63       260.14
stress-ng: info:  [52] io                 4342     60.07      0.00      0.68        72.29      6385.29

This result is really weird – I kind of believe the large jump in CPU ops, but the dip in IO is hard to explain… I need to understand the tool more.

QEMU VM in the resource-constrained pod with /dev/kvm support (machine > k8s > container > qemu vm). CMD: ./stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief. RESULTS:

stress-ng: info:  [1285] dispatching hogs: 1 cpu, 2 io, 1 vm
stress-ng: error: [1290] stress-ng-vm: gave up trying to mmap, no available memory
stress-ng: info:  [1285] successful run completed in 60.09s (1 min, 0.09 secs)
stress-ng: info:  [1285] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [1285]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [1285] cpu                3515     60.09     14.62      0.34        58.50       234.96
stress-ng: info:  [1285] io               147104     60.00      0.08      3.85      2451.73     37431.04
stress-ng: info:  [1285] vm                    0     10.09      0.00      0.00         0.00         0.00

I should note that I copied the stress-ng executable out of a container built from a Dockerfile in the stress-ng project. All I had to do was build that Dockerfile, copy the static binary out of it, and run it on the CoreOS VM inside the pod.
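Building an image just to extract a binary from it is a handy little pattern – roughly this (the path of the binary inside the image is illustrative):

docker build -t stress-ng-static ./stress-ng    # build the project's Dockerfile
id=$(docker create stress-ng-static)            # create (but don't start) a container
docker cp "$id":/stress-ng ./stress-ng          # copy the static binary out
docker rm "$id"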

Container inside the VM without KVM support (machine > k8s > container > VM (no /dev/kvm) > stress-ng container). CMD: docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief. RESULTS:

core@coreos_production_qemu-1688-5-3 ~ $ docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief
Unable to find image 'alexeiled/stress-ng:latest' locally
latest: Pulling from alexeiled/stress-ng
1160f4abea84: Pull complete
110786018a74: Pull complete
Digest: sha256:105518acaa868016746e0bd6d58e9145a3a437792971409daf37490dbfc24ea2
Status: Downloaded newer image for alexeiled/stress-ng:latest
stress-ng: info:  [1] dispatching hogs: 1 cpu, 2 io, 1 vm
stress-ng: error: [9] stress-ng-vm: gave up trying to mmap, no available memory
stress-ng: info:  [1] successful run completed in 60.25s (1 min, 0.25 secs)
stress-ng: info:  [1] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [1]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [1] cpu                  96     60.07      7.86      4.90         1.60         7.52
stress-ng: info:  [1] io                21908     60.00      0.21     22.63       365.12       959.19
stress-ng: info:  [1] vm                    0     10.38      0.00      0.03         0.00         0.00

Container inside the VM with KVM support on the machine (machine > k8s > container > VM (+/dev/kvm) > stress-ng container). CMD: docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief. RESULTS:

core@coreos_production_qemu-1688-5-3 ~ $ docker run -it --rm alexeiled/stress-ng --cpu 1 --io 2 --vm 1 --vm-bytes 1G --timeout 60s --metrics-brief
stress-ng: info:  [1] dispatching hogs: 1 cpu, 2 io, 1 vm
stress-ng: error: [11] stress-ng-vm: gave up trying to mmap, no available memory
stress-ng: info:  [1] successful run completed in 60.04s (1 min, 0.04 secs)
stress-ng: info:  [1] stressor       bogo ops real time  usr time  sys time   bogo ops/s   bogo ops/s
stress-ng: info:  [1]                           (secs)    (secs)    (secs)   (real time) (usr+sys time)
stress-ng: info:  [1] cpu                3508     60.03     14.54      0.28        58.43       236.71
stress-ng: info:  [1] io               104088     60.00      0.06      3.90      1734.80     26284.85
stress-ng: info:  [1] vm                    0     10.01      0.00      0.00         0.00         0.00

There’s a lot of caveats to the testing I’ve done and the numbers here, so take them with a boulder of salt but here are some observations/highlights:

  • The memory allocation failures were likely because the pod asked for 1GB of memory and the VM used 1GB – there probably wasn’t much left over (if anything at all really) for qemu to use.
  • There’s a huge bump in CPU performance, which when /dev/kvm is enabled, as one would expect, a ~35x difference in ops, a ~33x difference in ops/sec
  • Also, IO was better too, ~4x difference there with ~27x in per second - not sure why the ops/sec is so different
  • Some more testing would be a good idea, the VM tests that stress-ng runs are probably pretty important for a VM, but it’s quite obvious that the /dev/kvm speedup is huge (it might be better to run more benchmarks on /dev/kvm in a pod vs different stage1 container).
  • In the end, I’m still at the mercy of whether or not the --device command like in docker will be supported in kubernetes. it’s probably too insecure to run privileged containers.
  • The differential between the binary being run from hte VM itself and the docker container in it isminimal, 3515 ops vs 3508, ops per second were actually higher in the container (io was noticably different though, ~1.4x in both overall ops and ops/s)

SIDETRACK: rkt supports annotation-based stage-1 switching

While I was neck-deep in trying to figure things out, I was reminded of the fact that rkt (an alternative container runtime, and the first reasonable alternative to docker) actually supports multiple "stage 1" images, which means it can support VMs! Making that work is more like Method 2 (runtime support), but I realized this would be another avenue to explore if/when I work back around to getting a more legitimate/safe setup working, as running qemu as the main process in a privileged pod is probably not advisable long term.

Turns out rkt, the very first runtime I started a Kubernetes cluster with, supports alternative stage1s via annotations now! This wasn't the case when I first looked at it, and it's huge, because it means I could explore that avenue for running untrusted workloads – I just have to switch out containerd for rkt, use the annotations provided, and theoretically get easy runtime-level support working. The PR that introduced the support was nice to read through as well.
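If memory serves from the rktnetes docs and that PR, opting a pod into a different stage1 looks something like the following (treat this as a sketch, not verified config – stage1-kvm is the VM-isolated stage1):

cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vm-isolated-test
  annotations:
    # read by the kubelet's rkt runtime when the pod starts
    rkt.alpha.kubernetes.io/stage1-name-override: coreos.com/rkt/stage1-kvm
spec:
  containers:
  - name: app
    image: alpine
    command: ["sleep", "3600"]
EOF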

If this works it would be much better than the current working solution (running qemu in a privileged pod), but I'll leave that for Part 2, since this post is already way too long!

Wrapup

It’s been a long wild ride, but in the end the only way I could get started even getting a glimpse at running a VM through Kubernetes on my container linux machine was through the dirty dirty hack of running a qemu in a privileged pod, with lots of trial and error. I went on a wild goose chase through the internet but ultimately got at least one super hacky method of untrusted workloads running on Kubernetes. It’s a bit dizzying standing on the tower of abstractions being used in this post, but as long as the tower isn’t crumbling I think it was a good build.

In the end, the easiest thing to get working was running QEMU from a regular pod, but the tradeoff between security and performance is too great – I need to look into other solutions. I'm finding that I'm somewhat restricted in my use of CoreOS, and in the future I think I'm going to go with a distribution that supports more of the technologies I consider options (namely lxc/lxd) without so much hard work (SPOILER: I DID).