My Virtualized Router

This post is mostly for me to remember what I did, but feel free to follow along.

Original Post

Recently I decided to switch jobs, for a number of reasons that aren't germane to this post. I haven't had any proper time off for years so this time I decided to take a big chunk of time between leaving my old job and starting my new one. Two and a half months, to be specific.

I'm spending this time doing a few things. First, I'm being more present with my family. I haven't been the kind of dad or husband that I want to be lately and I'm trying my best to fix that. Second, I fired up my Xbox One and started playing Forza Horizon 5. It's ludicrous and mindless in the best possible way.

Third, the topic of this post: I'm building a virtualized router out of a Dell T20 server and a bunch of eBay'd networking gear.

But why?

Yeah, good question. There's a bunch of answers. Comcast is selling us 1.2Gbps service and I want to be able to use it all. I want a reliable failover WAN situation because Comcast goes out for about 5 minutes multiple times a day and that's really annoying in meetings. For example, I was in the middle of a job interview, deep in thought while on camera with two people from a company you've heard of, when Comcast fell over. I tethered to my phone quick but by the time I was reconnected I had lost my train of thought.

Beyond more and better, I want to get some more hands-on experience with some tech that I only sort of tangentially know. Specifically, I've been running a Proxmox host for a couple years and it's been solid, but I don't know a lot about the guts. I've also been running a UniFi Security Gateway and then an EdgeRouter, but I feel like I don't know how they actually do their job. I also want to play around with a thing called Open vSwitch within Proxmox and this seems like a good opportunity.

Also it's fun and my official job until early 2022 is to follow the dopamine.

Hardware Stack

The hardware is a mix of stuff I had on hand and a few things I've picked up:

  • Dell T20 minitower server with Xeon E3-1225v3 3.2GHz and 32GB of ECC memory
  • Samsung 500GB SSD (side note: SSDs have gotten ridiculously cheap since I last looked at them)
  • Intel X520-DA2 Dual SFP+ 10Gbps network card
  • Wiitek 10Gbase-T SFP+ interface
  • passive SFP+ DAC cable
  • Intel PRO/1000 VT quad port gigabit ethernet card
  • Motorola MB8611 DOCSIS 3.1 multi-gig cable modem
  • Netgear LB1120 LTE modem

The network card decisions deserve some explanation. Comcast gives us a 1.2Gbps cable connection handled by the MB8611 modem. That modem has a 2.5Gbase-T ethernet connection.

One way to handle this would have been a 2.5Gbase-T ethernet card in the router. This would have been a little cheaper but the fast connection would have ended at the router. I want to share the speed with my other server so I need another faster-than-gigabit port while also preserving the ability to, one day, maybe, upgrade to Comcast Gigabit Pro 3 Gbps fiber service. I'm also constrained by the T20's selection of three PCIe slots: 16x, 4x, and 1x. If I tried to do something with 2.5Gbase-T cards I probably would have run out of slots, but with the dual port SFP+ card in the 16x slot and the quad gigabit card in the 4x slot I'm fine.

I actually got two of those dual SFP+ cards, one for the router and another for my other server, which will cross-connect with a passive DAC (direct attach copper) cable.

Software Stack

After putting the SFP+ card and SSD into the server I installed Proxmox VE 7.1 on the server and got started evaluating router distros.

The very first thing I installed was OPNsense, a FreeBSD-derived routing and firewall system. It came up wanting to be on the same IP as our current router (192.168.1.1), which was problematic for a bit until I figured out what was going on. After installing it I clicked around a bit and read some docs and decided that I wasn't really going to learn what I wanted to learn from it.

Next I installed VyOS, a routing and firewall package derived from Debian and Vyatta, which itself was strongly inspired by Juniper's router OS. This was sort of bewildering and overwhelming and after messing around a bit I moved onto the next thing.

Third, I installed NixOS and futzed around with the config. NixOS is a Linux distribution that uses Nix to deploy and configure software. This is interesting but also weird and it doesn't get me multi-WAN failover out of the box. I'd have to build that myself, which is not super appealing.

I think what I'm going to do is reinstall VyOS and actually commit to learning how the CLI works. It gets me everything I want out of the box, it's just slightly more inscrutable.

I'm also planning on running a couple of ancillary "network service" type VMs on this machine:

  • pi-hole for DNS and network-wide ad blocking
  • UniFi controller for our existing UniFi gear (mostly APs, some switches)
  • Ingress nginx proxy (this probably deserves its own post)

Open vSwitch

One additional thing that I want to play with is Open vSwitch. This is a software network switch that lives inside Proxmox and ties everything together. It acts like a managed, VLAN-aware hardware switch, just implemented entirely in software. It's optional within Proxmox but from what I've read it gives significantly better performance, which is aesthetically attractive if not strictly necessary. Nothing about this project is strictly necessary, though, so I feel justified.

What's next?

  • Set up a basic Open vSwitch configuration within Proxmox
  • Install VyOS and get it working as a basic router
  • When the quad port ethernet card arrives, install it and hook it up to the vSwitch and VyOS
  • Roll out to production!?

Progress Update 2021-11-25

I accomplished a couple of things last night and today:

I got a basic Open vSwitch config working within Proxmox! This was a bit of an ordeal because I installed a package that really shouldn't be installed, because apparently it breaks the entire network stack if you install it. So, protip, just do what the tutorial says and don't get fancy.

Here's the config, for posterity:

auto lo
iface lo inet loopback

# LAN interface, auto-tagged as VLAN-1
auto eno1
iface eno1 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr0
    ovs_options vlan_mode=native-untagged tag=1

# WAN1 10GBase-T SFP+ Module, auto-tagged as VLAN-100
auto enp1s0f1
iface enp1s0f1 inet manual
    ovs_type OVSPort
    ovs_bridge vmbr0
    ovs_options vlan_mode=native-untagged tag=100

# Internal interface for the hypervisor itself attached to VLAN-1
auto vlan1
iface vlan1 inet static
    address 192.168.1.120/24
    gateway 192.168.1.1
    ovs_type OVSIntPort
    ovs_bridge vmbr0
    ovs_mtu 1500
    ovs_options tag=1

# Just one OVSBridge, the software equivalent of a managed, VLAN-aware switch
auto vmbr0
iface vmbr0 inet manual
    ovs_type OVSBridge
    ovs_ports eno1 enp1s0f1 vlan1

OVS on Proxmox works like this:

  • Traffic comes into an OVSBridge through physical ports, represented by OVSPorts, as well as OVSIntPorts. These ports can have packets come in already tagged or tag them themselves (or both?)
  • Traffic to and from VMs transits via the virtual network adapters attached to each VM. These network adapters can be assigned to a VLAN or not. If not, they're assumed to be a trunk port (all VLANs).
  • The only things with IP addresses assigned are OVSIntPorts and the VM NICs.
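As far as I understand the Open vSwitch docs, the "already tagged or tag them themselves" question comes down to the port's vlan_mode. Here's a rough Python model of what happens to a frame on the way in. The function name is made up for illustration, and it assumes the port's trunks list is empty (meaning all VLANs are allowed). It's a sketch of my reading of the docs, not OVS code.

```python
from typing import Optional


def ingress_vlan(vlan_mode: str, tag: Optional[int], frame_vid: Optional[int]) -> Optional[int]:
    """Return the VLAN a frame lands in when it enters a port, or None if dropped.

    frame_vid is the 802.1Q VLAN ID on the incoming frame, or None if untagged.
    Assumption: the port's `trunks` list is empty, so every VLAN is allowed.
    """
    # A frame with VLAN ID 0 (priority-tagged) is treated like an untagged one.
    untagged = frame_vid is None or frame_vid == 0

    if vlan_mode == "access":
        # Access ports accept only untagged frames and put them in `tag`.
        return tag if untagged else None
    if vlan_mode == "trunk":
        # Trunk ports keep the frame's own tag; untagged frames land in VLAN 0.
        return 0 if untagged else frame_vid
    if vlan_mode in ("native-untagged", "native-tagged"):
        # Like a trunk, except untagged frames join the native VLAN (`tag`).
        return tag if untagged else frame_vid
    raise ValueError(f"unknown vlan_mode: {vlan_mode}")


# eno1 in the config above: native-untagged with tag=1
for vid in (None, 1, 100):
    print(vid, "->", ingress_vlan("native-untagged", 1, vid))
```

A port with a tag but no explicit vlan_mode, like the vlan1 OVSIntPort above, defaults to access mode. The two native modes only differ on the way out: native-untagged strips the tag from native-VLAN frames, native-tagged keeps it.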

I set up two LXC containers, one for pi-hole and another for the UniFi console. After setting up pi-hole I went into the EdgeRouter config and told it to send DNS traffic there, and boy howdy is it interesting how much trash the various things on the network are talking to.

The UniFi console was a bit of work. I installed it using the 5.14.23 script from here and then restored a backup from my remote console. After that I told the EdgeRouter to broadcast the new console's IP as the inform URL, which mostly worked. I had to forget one AP from the old controller and re-adopt it on the new one, and one of the switches (which happens to be the basement switch that sits in the critical path between the house where I was sitting and the office where the servers are) needed some hand-holding, i.e. SSH'ing into it and running set-inform manually.

Tonight I set up Proxmox to relay email through Postmark with this tutorial and set up a scheduled weekly snapshot of all the VMs on the machine.


Progress Update 2021-11-30

Over the past few days I've tackled a couple router-adjacent mini projects.

10G Connection

As I described above, I want to have a fast connection between my primary VM host and my network services host. My first attempt at this was to cross-connect the two machines with a pair of SFP+ network cards and a DAC cable, but unfortunately no matter what I did I couldn't make this work. I ended up buying a Mikrotik CRS305 4 SFP+ port, 1 gigabit port switch and another DAC, and then waiting for it to show up.

Setting up the CRS305 was a little bit fascinating. It can be powered in a few ways, including a bundled 12V wall wart and PoE over the gigabit port. The PoE option would have been perfect for my network so that's what I tried first, but a ridiculous thing happened as soon as I tried to connect a DAC cable to it. Every time a powered DAC touched an SFP+ port on the switch OR if a DAC plugged into the switch touched an SFP+ port or any other metal port on the computer, the switch immediately reset itself. If I tried to power the switch up after both sides were plugged in it wouldn't power up.

After exploring this for a good 45 minutes I gave up on PoE and tried the wall wart, which of course immediately worked. A friend who heard this story has speculated that it could be a grounding issue and recommended an outlet tester. When it arrives on Friday I'll test the outlet and hope that I don't have to go repair it in just about the most awkward spot in the house.

Anyway, after all of that I got both servers hooked up with DACs to the Mikrotik and back to the rest of the network through the gigabit port. Performance is... kind of weird? I would have expected ~10Gbps with iperf3 between the machines but I was only getting ~7.7Gbps. Not sure what's up with that but it's fast enough that I don't really care at the moment.
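If I do come back to this, one thing that might be worth trying is comparing a single stream against several parallel ones (iperf3 -P), since a single TCP stream can be limited by one CPU core. iperf3 can emit JSON with -J, so a small Python helper can pull the receiver-side number out of a run. The helper name and the sample report below are made up for illustration, and the sample is trimmed to the one field the helper reads.

```python
import json


def received_gbps(iperf3_json: str) -> float:
    """Extract receiver-side throughput in Gbps from `iperf3 -J` TCP output."""
    report = json.loads(iperf3_json)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1e9


# Hypothetical, heavily trimmed report; a real one has many more fields.
sample = json.dumps({"end": {"sum_received": {"bits_per_second": 7.7e9}}})
print(f"{received_gbps(sample):.1f} Gbps")
```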

Backup Internet

One aspect of this project that I've been pretty excited about is having transparent LTE backup internet. Comcast here is pretty unreliable, with momentary outages throughout the day and occasional hours-long outages throughout the year. An LTE modem set up as failover would mean fewer interruptions and less chance of an important Zoom meeting getting thrown off course.
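The basic idea behind failover is simple enough to sketch: probe the primary link, switch to the backup after a few failures in a row, and switch back only after a few successes in a row so that a flapping link doesn't bounce traffic around. This is a toy Python model, not how VyOS implements it. The class name, link names, and thresholds are all made up.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Track primary WAN health and pick which link should carry traffic."""

    failure_threshold: int = 3  # consecutive failed probes before failing over
    success_threshold: int = 3  # consecutive good probes before failing back
    primary_up: bool = True
    _streak: int = 0            # probes in a row that disagree with current state

    def record_probe(self, ok: bool) -> str:
        """Feed one probe result for the primary link; return the active link."""
        if ok == self.primary_up:
            # Result agrees with the current state, so reset the counter.
            self._streak = 0
        else:
            self._streak += 1
            needed = self.success_threshold if ok else self.failure_threshold
            if self._streak >= needed:
                self.primary_up = ok
                self._streak = 0
        return "cable" if self.primary_up else "lte"


monitor = FailoverMonitor()
for result in (False, False, False, True, True, True):
    print(result, "->", monitor.record_probe(result))
```

The asymmetric thresholds are the knob worth playing with: fail over quickly so a meeting survives, fail back slowly so a half-recovered cable connection doesn't cause a second interruption.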

My original plan was to purchase a cheap Android phone with a prepaid LTE plan on Verizon's network, transplant the SIM into my Netgear LB1120 LTE modem, and hook it up as failover. The backup plan was to USB tether the phone directly to the router.

Surprise! Neither of these plans actually works.

Transplanting the SIM initially seemed to work, but after some time on the network the modem started doing this awful thing where it would cycle the ethernet port every 30 seconds or so. My current theory is that the modem's IMEI has been hellbanned from the Verizon network for not being a compatible piece of equipment. The additional fun part of this plan is that I needed another ethernet port on the machine so I bought a quad port card and waited for it to show up, and now that's just kind of sitting useless.

The backup plan for backup internet also doesn't work for two similarly annoying reasons. First, VyOS 1.3 doesn't seem to have the right drivers for USB tethered Android phones. I'm not sure if I'm just holding it wrong or what, but when I set up an Ubuntu 20.04 VM it got a working connection right away.

Second, that connection isn't usable for my purposes as it only has a working IPv6 address. The IPv4 address seems to not route at all.

The backup to the backup plan, at this point, is to buy a new Netgear LM1200 modem which is specifically advertised as being compatible with Verizon. That is going to wait a bit because at this point I've spent more money than I wanted to and backup internet is not a project completion requirement.

Current status:

  • DHCP and DNS moved from the ER-X-SFP to a pi-hole LXC container
  • UniFi controller moved from an offsite VM to an LXC container
  • Quad port card installed
  • Open vSwitch configured well enough to get the router up and running
  • VyOS VM set up
    • trunked (all VLANs) connection to OVS and VLANs split out as virtual interfaces
    • WAN-LAN and WAN-LOCAL firewalls set up
    • SNAT from VLAN-1 (the rest of the network) to VLAN-1000 (the WAN connection)

Next steps:

  • Take a pre-announced downtime to move the cable modem and ER-X-SFP to the office
  • During that downtime, transfer the gateway IP from the ER-X-SFP to the VyOS VM
  • Do a bunch of testing to make sure it all works
  • Order a new LTE modem I guess

Progress Update 2021-12-10

I think this project is almost done. I've had it in production for about a week now without any major problems, but getting it there had a number of bumps.

LTE USB tethering does work, actually

I finally got USB tethering to work. Instead of passing the USB device through I created a USB ethernet device on the hypervisor and joined that device to the OVS bridge. After that it came right up in the VyOS VM and I was able to use it as a WAN device.

After I figured that out I scheduled a downtime for a morning where I knew no one else would be in the house.

OVS + Comcast = no DHCP!?

My downtime plan was pretty simple:

  1. Move the modem and ER-X-SFP from where it was to where it needed to be next to the new router machine in my office.
  2. Bring the ER-X-SFP back up connected to the Comcast modem, verify connectivity
  3. Change the ER-X-SFP static IP to something other than 192.168.1.1
  4. Change the VyOS VM static IP to 192.168.1.1
  5. Power down the Comcast modem, move it to the router machine, and power it back up
  6. Verify connectivity

I got all the way through to step 6 and then encountered an infuriating problem. The VyOS machine wasn't getting an IP address. I could see it making DHCP discover requests on the interface with tcpdump but the modem would just never respond.

After googling and futzing and sitting on the LTE connection for basically the entire day, way past the outage window, I finally figured it out. The problem was the OVS bridge. For whatever reason it wasn't allowing DHCP broadcast packets through. I don't know if the problem is inherent to OVS or if there's a setting somewhere and at this point I don't care.
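For context, a DHCPv4 client that doesn't have an address yet sends its DISCOVER from UDP port 68 to 255.255.255.255 port 67, so anything in the path that eats broadcasts will break it. Here's a minimal Python sketch of what that payload looks like, following the RFC 2131 layout. The function name, MAC address, and transaction ID are made up, and it only builds the bytes rather than sending anything.

```python
import struct

DHCP_SERVER_PORT = 67
DHCP_CLIENT_PORT = 68
BROADCAST_ADDR = "255.255.255.255"
DHCP_MAGIC_COOKIE = bytes([99, 130, 83, 99])


def build_dhcp_discover(mac: bytes, xid: int) -> bytes:
    """Build a minimal DHCPDISCOVER payload (BOOTP header plus options)."""
    header = struct.pack(
        "!BBBBIHH4s4s4s4s16s64s128s",
        1,            # op: BOOTREQUEST
        1,            # htype: Ethernet
        6,            # hlen: MAC address length
        0,            # hops
        xid,          # transaction ID chosen by the client
        0,            # secs
        0x8000,       # flags: ask the server to broadcast its reply
        b"\x00" * 4,  # ciaddr: the client has no address yet
        b"\x00" * 4,  # yiaddr
        b"\x00" * 4,  # siaddr
        b"\x00" * 4,  # giaddr
        mac,          # chaddr, zero-padded to 16 bytes by struct
        b"",          # sname, zero-filled
        b"",          # file, zero-filled
    )
    options = bytes([53, 1, 1, 255])  # option 53 = DHCPDISCOVER, then END
    return header + DHCP_MAGIC_COOKIE + options


# Made-up locally administered MAC and transaction ID.
packet = build_dhcp_discover(bytes.fromhex("020000000001"), 0x12345678)
print(len(packet), "bytes to", (BROADCAST_ADDR, DHCP_SERVER_PORT))
```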

After creating a new non-OVS Linux bridge, attaching the WAN interface, and attaching that to the VyOS VM, suddenly I had an IP and connectivity.

Failover works... kind of

My big reason for wanting failover is that when Comcast goes down VMSave goes down. This is aesthetically vexing. Someone finds out about VMSave, tries to use it, and it doesn't work or they can't even get to the page. It adds to people's grief, exactly the opposite of what I intend for it to do.

The LTE tether was working great, except that the nginx proxy running in Fly that sits in front of VMSave wasn't able to connect back to the server. Or rather, it could connect because I could ping both sides across the WireGuard tunnel, but I couldn't pass any other kind of packet.

After a few minutes with tcpdump I discovered that something somewhere was corrupting checksums in the packets transiting the WireGuard tunnel. I could see the request and VMSave would start responding but the connection would hang after a single packet.
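The check that fails here is the standard Internet checksum from RFC 1071: add everything up as 16-bit words, fold the carries back in, and take the one's complement. A receiver that gets a mismatch drops the packet. Here's a small Python version run over an IPv4 header. The header bytes are an illustrative example, not a capture from my network. TCP and UDP use the same arithmetic over a pseudo-header plus the segment.

```python
def internet_checksum(data: bytes) -> int:
    """Compute the 16-bit one's-complement Internet checksum (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


# Illustrative IPv4 header with the checksum field (bytes 10-11) zeroed out.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(header)))  # prints 0xb861
```

Running the same function over a header that already has the correct checksum filled in returns 0, which is how a receiver verifies it. Flip any bit in transit and the result is nonzero.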

Done for now

After seeing that and thinking through the implications of a triple NAT (CGNAT, Android kernel, VyOS) I decided to send the phone back and order a new Verizon-certified LTE modem. Amazon's troubles earlier this week have delayed this portion but I'm willing to call the project done enough for now. I have low ping times and hourly speed tests showing greater than 1Gbps the majority of the time so I'm happy.