June 4, 2020

How to create a multi-node Linux Cluster Using Ubuntu - Part 1

I'm using MAAS, Juju, LXD, Docker, Ceph, and an overly complicated network setup. The network consists of multiple gigabit links for each node and four QSFP 40 GbE links, using two dual-port Mellanox ConnectX-3 adapters, configured as a mesh. MAAS gets its own network (so it can stomp all over DNS/DHCP without impacting my wife's Netflix watching). The Ethernet links are bonded together for fail-over/performance, and a bridge is created back to the main LAN to provide storage and web services to local clients.

Several services are being migrated from an existing standalone LXD/LXC server into the new cluster. These include various media libraries, web hosts, blog sites, and photo sharing, plus a Nextcloud instance I use instead of Google Drive.

I don't know why I do this to myself.

Prerequisites

I've already set up a very basic (completely insecure) HTTP server using NGINX for ISO/network booting on my existing LXD server. The process (if you have LXD set up already) is as follows:

Web server for ISO loading

  • Launch the container
    lxc launch ubuntu:focal loader
  • Open the container
    lxc exec loader bash
  • Install web server
    sudo apt install nginx
  • Copy the ISOs for utilities and Linux distro of choice
    scp [email protected]:"/home/user/isos/ubuntu 20.04 server.iso" "/var/www/html/ubuntu20.iso"
  • Verify the ISO can be accessed in your browser by visiting:
    http://loader/ubuntu20.iso
    The ISO should start downloading; cancel the download (unless you love duplicating files for no reason).
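
The steps above can also be scripted from the LXD host without opening a shell in the container. Treat this as a rough sketch rather than the way I actually did it: the setup_loader function, the run wrapper, and the DRY_RUN switch are names made up for illustration, and lxc file push stands in for the scp step on the assumption that the ISO already sits on the LXD host.

```shell
# Sketch: build the ISO web server container from the LXD host.
# run, setup_loader and DRY_RUN are illustrative names, not part of LXD.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: setup_loader /path/to/local.iso name-on-web-server.iso
setup_loader() {
    iso_src=$1
    iso_name=$2
    run lxc launch ubuntu:focal loader
    # Wait for cloud-init so the container has networking before apt runs.
    run lxc exec loader -- cloud-init status --wait
    # Commands inside the container already run as root, so no sudo is needed.
    run lxc exec loader -- apt-get update
    run lxc exec loader -- apt-get install -y nginx
    # Copy the ISO from the LXD host into the nginx document root.
    run lxc file push "$iso_src" "loader/var/www/html/$iso_name"
}
```

Running it with DRY_RUN=1 first prints the commands without touching anything, which is a cheap way to check the paths before committing.
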
Notes

ISOs are located at http://loader/ and will only be used to update server firmware and load the first server. I only really have two ISOs to get this up and running. The servers I acquired for this project were used, and some of the firmware was out of date, so I used the HP Supplementary Update DVD to flash new firmware. There is a final version that supports the HP Gen8 servers I'm using, and I downloaded it before the HPE active support agreement was needed. Don't ask me for a copy.

After the MAAS controller is up, it will provision the worker nodes and control DNS and DHCP on the MAAS network. Make a note of your URL paths to the ISOs, so you can pop them into the iLO scripted URL window later:

ISO URLs

  • HP Firmware http://loader/hp.iso
  • Ubuntu 20.04 Server http://loader/ubuntu20.iso
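
Before pasting these into iLO, it's worth confirming each URL actually answers. A minimal sketch, assuming curl is installed on the machine you're checking from; check_iso_urls is a made-up helper name, and it only sends a HEAD request so nothing gets downloaded.

```shell
# Sketch: confirm each ISO URL answers before pasting it into iLO.
# check_iso_urls is an illustrative helper name.
# usage: check_iso_urls http://loader/hp.iso http://loader/ubuntu20.iso
check_iso_urls() {
    failed=0
    for url in "$@"; do
        # -f: fail on HTTP errors, -s: quiet, -I: HEAD request only
        if curl -fsI "$url" >/dev/null; then
            echo "OK      $url"
        else
            echo "FAILED  $url"
            failed=1
        fi
    done
    # Non-zero exit status if any URL could not be fetched.
    return "$failed"
}
```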

Loading ISOs into iLO

This process is the same for the firmware update ISO and the Ubuntu DVD.

  • Navigate to the iLO URL and log in
  • Click on Overview
  • Next to Integrated Remote Console, click HTML5 (note: if you don't have an HTML5 option, you need to use Internet Explorer or the Java console; updating to iLO v2.70 or higher will enable the HTML5 console, and you can find it... places... on the internet)
  • Click the little disc thing
  • Click CD/DVD
  • Scripted Media URL
  • Enter your URL from above and hit OK
  • Boot the server

Updating HP DL380 G8 firmware

  • Load the HP ISO
  • Boot the server
  • When prompted choose Automatic Firmware Update (unless you need to change storage controller modes or other options)
  • Wait (it will take a while, and the server will auto-reboot when done)
NOTE:

With HP Gen8 servers, the P410i storage controller can be put into HBA-only mode for better use with ZFS or software RAID. This must be done in interactive mode.

WARNING:

Before you decide that's a good idea: if you put the controller in HBA mode, the server no longer boots from internal drives. You must boot from an SD card on the system board or USB flash media (totally possible and recommended for ESXi/Proxmox/other uses, but not good for our use case here).

Install Ubuntu 20.04

We'll install a basic Ubuntu 20.04 system and configure our network interfaces. Follow the previous URL/ISO loading directions, using the Ubuntu ISO this time around.

NOTE: If you are using Mellanox/HP QSFP cards, the following step is required for Linux to handle the card properly:
  • Boot from the Ubuntu ISO
  • When the purple Ubuntu loading screen comes up, press F6
  • When you see the Install, Check Disk, etc. selection screen, press e to edit the grub command line
  • Add the following to prevent PCI memory reallocation, which allows the Mellanox ConnectX-3 card to be seen
    pci=realloc=off
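
The edit above only affects the installer boot. Whether the installer carries the parameter over to the installed system is something I'd check rather than assume (look at /proc/cmdline after the first reboot). If the card goes missing again, one way to make the setting stick is to append it to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub. The sketch below does that; add_kernel_param is a made-up helper, not a GRUB tool.

```shell
# Sketch: append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT.
# add_kernel_param is an illustrative helper, not part of GRUB.
# usage (as root): add_kernel_param /etc/default/grub pci=realloc=off
add_kernel_param() {
    file=$1
    param=$2
    # Do nothing if the parameter is already on the line.
    if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*$param" "$file"; then
        return 0
    fi
    # Keep a backup, then rewrite the file with the parameter appended.
    # If the file has no GRUB_CMDLINE_LINUX_DEFAULT line, nothing is added.
    cp "$file" "$file.bak" || return 1
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" \
        "$file.bak" > "$file"
}
```

After changing /etc/default/grub, run sudo update-grub and reboot for the new command line to take effect.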

Storage configuration

I assume there are two disks (SSD/SATA/SAS or other) already installed and configured using the HP storage controller in RAID1 for reliability. If your configuration is different, you're on your own.

Configuring a basic RAID1 using HP P410i

Press F6? while your machine is booting, then add both available disks (you do have two disks, right?) as a new logical volume in RAID1 mode.

Setup

Walk through the setup steps and install to the first disk (the hardware controller's first and only RAID1 volume). I really should automate setup with a preseed file, but as I'm only doing it for a single server, I'm too lazy.
Make sure a network device for the Mellanox card shows up during the network config step; mine was enos01, in addition to the four other ethX Ethernet devices. We're just looking for basic network setup here; the complicated part comes later.

Assuming you don't see any errors, let the machine update itself and prompt you to reboot.

Configuration

Now for the good stuff: we start to configure the meat and potatoes. I'll start with networking, as my devices kept switching names and pissing me off.

Netplan.io

The new way! No longer are you editing /etc/network/interfaces to set up your network; now you need a YAML validator! So much better... mostly, after you kinda understand it and get it going.

Below is the config I used to force device names by matching the MAC address of each device. I tried to avoid names Linux would pick on its own. You might be able to skip all this silliness if your network device names are stable. With the differing speeds in "link up" status between the different adapters, mine jump around like kids on a trampoline.
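
To match on MAC addresses you first need to collect them. Here's a small sketch that reads them from sysfs; list_macs is a made-up helper name, and the optional directory argument exists only so the function can be pointed at a test directory.

```shell
# Sketch: print "name mac" for every network device, to fill in the
# match/macaddress entries. list_macs is an illustrative helper name.
# usage: list_macs
list_macs() {
    sysfs=${1:-/sys/class/net}
    for dev in "$sysfs"/*; do
        # Skip entries without an address file (e.g. bonding_masters).
        [ -r "$dev/address" ] || continue
        name=${dev##*/}
        # Skip the loopback device.
        [ "$name" = "lo" ] && continue
        printf '%s %s\n' "$name" "$(cat "$dev/address")"
    done
}
```

Run this before the bonds exist. Once an interface joins a bond it usually reports the bond's MAC here instead of its own; ethtool -P <device> shows the permanent hardware address in that case.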

I'm also using balance-alb instead of the more typical balance-rr because out-of-order packets were causing my SSH sessions to flake out all over. balance-tlb is limited in that it only balances outgoing traffic and receives on a single link; balance-alb also balances incoming traffic, which keeps the load more even. I have an LACP (802.3ad) capable switch for enterprise-grade link aggregation, but it's not supported on both of my switches, and I want this config to be mostly free of hardware restrictions. Nothing like the only layer-3 switch you have taking a dive and the entire network not working until you replace it; whereas with this config, I can swap cables into any old switch and keep going.

All the edits are in the Netplan YAML files; edit the default or create your own:
sudo nano /etc/netplan/00-installer-config.yaml

Network config example

Here's the mostly finished config; it's fairly easy to figure out what everything does just by reading through:

network:
  bonds:
    bond-lan:
      dhcp4: true
      dhcp6: true
      interfaces:
      - eth_0
      - eth_1
      parameters:
        mode: balance-alb
    bond-maas:
      dhcp4: no
      dhcp6: no
      interfaces:
      - eth_2
      - eth_3
      addresses:
        - 10.0.0.2/24
      gateway4: 10.0.0.1
      nameservers:
        search: [local]
        addresses: [10.0.0.1]
      parameters:
        mode: balance-alb
    bond-mesh:
      dhcp4: no
      dhcp6: no
      interfaces:
      - eth_fiber0
      - eth_fiber1
      addresses:
        - 10.0.10.1/24
      parameters:
        mode: broadcast
  bridges:
    br0:
      dhcp4: yes
      dhcp6: yes
      interfaces:
        - bond-lan
  ethernets:
    eth_0:
      dhcp4: true
      dhcp6: true
      match:
        macaddress: 00:24:81:82:25:b0
      set-name: eth_0
    eth_1:
      dhcp4: true
      dhcp6: true
      match:
        macaddress: 00:24:81:82:25:b1
      set-name: eth_1
    eth_2:
      dhcp4: true
      dhcp6: true
      match:
        macaddress: 00:24:81:82:25:b2
      set-name: eth_2
    eth_3:
      dhcp4: true
      dhcp6: true
      match:
        macaddress: 00:24:81:82:25:b3
      set-name: eth_3
    eth_fiber0:
      #addresses:
      #  - 10.0.10.1/24
      match:
        macaddress: 24:be:05:b0:51:91
      set-name: eth_fiber0
    eth_fiber1:
      #addresses:
      #  - 10.0.10.2/24
      match:
        macaddress: 24:be:05:b0:51:92
      set-name: eth_fiber1
  version: 2

I'm setting up a broadcast-mode bond for the mesh network. There is a performance overhead with this method, but as I'm using all Mellanox/HP 544+QSFP cards, they should be capable of 56 Gbps, so I've got some wiggle room without impacting performance too much. This would not fly for an enterprise production setup, but then again, you'd have the budget/power for a full FDR QSFP switch fabric. My eBay scalping deals got me cheap adapters with direct-attach copper, not cheap switches.

Deploy the network config

If you'd like to test first:
sudo netplan --debug generate
Implement the config (cross your fingers and hope your SSH session doesn't die):
sudo netplan apply
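
Once the config is applied, the bonding driver reports what each bond is really doing under /proc/net/bonding. A minimal sketch for pulling out the mode; bond_mode is a made-up helper name, and the second argument exists only so it can be tested against a fake directory.

```shell
# Sketch: report the mode a bond is actually running in.
# bond_mode is an illustrative helper name.
# usage: bond_mode bond-lan
bond_mode() {
    bond=$1
    procdir=${2:-/proc/net/bonding}
    # The driver prints a line such as "Bonding Mode: adaptive load balancing".
    sed -n 's/^Bonding Mode: //p' "$procdir/$bond"
}
```

The driver uses descriptive names rather than the Netplan keywords: balance-alb shows up as "adaptive load balancing" and broadcast as "fault-tolerance (broadcast)". If you're nervous about losing SSH, sudo netplan try applies the config and rolls back unless you confirm, though as far as I know it can't fully revert bond and bridge parameter changes, so don't lean on it too hard for this setup.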

I usually reboot at this point. Now that we have a steady MAC on the bridge, we can reserve an address for it.

Set a static IP on the DHCP server

This is just for br0/bond-lan, so its IP will quit jumping around:

  • MAC 3e:60:d4:69:3e:3d
  • Reservation @ 192.168.1.5

Mellanox Kernel Performance tuning

Now we'll add some tuning parameters to the kernel with sysctl. You may want to test these first and tweak to your liking. These are directly from Mellanox (https://community.mellanox.com/s/article/performance-tuning-for-mellanox-adapters).

NOTE:

While some of these settings impact system-wide parameters, quite a few tweaks only apply to IPv4 traffic and not IPv6. This doesn't matter for our IPv4 mesh network, but you may need to adjust.

I'll create a new sysctl config file for the settings. This allows us to remove/re-add large configuration changes with a single file and keep our changes out of the normal system settings:
sudo nano /etc/sysctl.d/20-mellanox-tweaks.conf

Paste in the following to enable every Mellanox tweak:

# Disable the TCP timestamps option for better CPU utilization:
net.ipv4.tcp_timestamps=0 
# Enable the TCP selective acks option for better throughput:
net.ipv4.tcp_sack=1 
# Increase the maximum length of processor input queues:
net.core.netdev_max_backlog=250000
# Increase the TCP maximum and default buffer sizes using setsockopt():
net.core.rmem_max=4194304
net.core.wmem_max=4194304
net.core.rmem_default=4194304
net.core.wmem_default=4194304
net.core.optmem_max=4194304
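# Increase memory thresholds to prevent packet dropping
# (no quotes here: values in a sysctl.d file are read literally):
net.ipv4.tcp_rmem=4096 87380 4194304
net.ipv4.tcp_wmem=4096 65536 4194304
# Enable low latency mode for TCP (legacy option, no effect on recent kernels):
net.ipv4.tcp_low_latency=1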
# The following variable is used to tell the kernel how much of the socket buffer space should be used for TCP window size, and how much to save for an application buffer.
net.ipv4.tcp_adv_win_scale=1

Apply your changes

sudo sysctl --system

Check your changes

Use the below command to check that your tweaks have been applied:
sudo sysctl -a
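
sysctl -a dumps every kernel parameter, which makes the handful we changed hard to spot. As an alternative, here's a sketch that reads the keys back from the config file and compares each one with the running value; check_sysctl_file is a made-up helper name, and it assumes plain key=value lines with no quoting.

```shell
# Sketch: compare each key=value line in a sysctl config file with the
# value the kernel currently reports. check_sysctl_file is an illustrative name.
# usage: check_sysctl_file /etc/sysctl.d/20-mellanox-tweaks.conf
check_sysctl_file() {
    conf=$1
    status=0
    while IFS='=' read -r key want; do
        # Trim whitespace from the key; skip blank lines and comments.
        key=$(printf '%s' "$key" | tr -d ' \t')
        case $key in
            ''|'#'*) continue ;;
        esac
        # Collapse tabs/spaces so "4096<TAB>87380" compares equal to "4096 87380".
        want=$(printf '%s' "$want" | tr -s ' \t' ' ' | sed 's/^ //; s/ $//')
        have=$(sysctl -n "$key" 2>/dev/null | tr -s ' \t' ' ' | sed 's/^ //; s/ $//')
        if [ "$have" = "$want" ]; then
            echo "ok       $key = $have"
        else
            echo "MISMATCH $key: want '$want', have '$have'"
            status=1
        fi
    done < "$conf"
    # Non-zero exit status if any setting differs.
    return "$status"
}
```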

... Cluster adventures will continue in part 2