Resurrecting a 13-year-old high school project: the BSoD lock screen

One day, during the winter break away from work, I was reminded of this ancient program I built in high school—a Blue Screen of Death (henceforth BSoD) simulator. It really has been a while since I’ve covered coding topics on this blog, so I thought it might be time to dig up the program and resurrect it.

But you might ask the very natural question: why did I write such a program? For that, I blame the school. You see, I had to deal with school computers a lot back in the day, not being one of those “cool kids” who brought their own laptops to school. Naturally, the school computers ran Windows XP1. For some reason, the school administrators decided to deploy group policy to deny screen locking.

This posed a conundrum: if I had to step away from the computer—like say, answering the call of nature—I was confronted with two deeply uncomfortable choices:

  1. Save all my work, close all applications, and log out, then wait forever to log back in on those slow computers, while hoping that no one else took the fastest one of the bunch2; or
  2. Leave the computer unattended, and if someone does something sketchy as a prank, I’d be the one in trouble.

One day, I saw a school computer stuck with a BSoD, and no one ever touched it, and that gave me an idea: what if I wrote my own custom “lock screen” program that masqueraded as a BSoD? And thus was the program born.

What did it do?

The program obviously needed to display a BSoD in full screen mode. The first iteration of the program simply loaded a PNG downloaded from Wikipedia, specifically this picture. After a while, it became very easy to tell that the screen was fake, because it looked the same every time. It really didn’t help that a real BSoD shows the current Unix time3 in hexadecimal at the very end. For example, the Wikipedia picture says DateStamp 3d6dd67c, which is 1,030,608,508 seconds after the Unix epoch, or 2002-08-29 08:08:28 UTC. Given that in the 2010s, no timestamp should start with 3 in hexadecimal, this easily gave the game away to those in the know.

So naturally, the program had to render an actual BSoD, generating random but believable values for every field, using the current time in the DateStamp field.

It also needed several features to be effective as a lock screen, like completely blocking access to the system while it’s running, or what’s the point? So it had to:

  1. be a topmost window in addition to being full screen, blocking all regular windows from view;
  2. prevent random system popups, such as the built-in sticky keys prompt, which inevitably showed up when testers spammed the keyboard;
  3. prevent the task manager from being launched to kill the program;
  4. prevent the program from being killed some other way;
  5. block most keyloggers through the use of a “secure desktop”, since I’d be typing in a password;
  6. prevent someone from logging me out without my authorization; and
  7. prevent the system from being shut down (except by pulling the plug).

The exact way these functionalities are accomplished will be described later, when we dive into the implementation details.

What it can’t do is intercept Ctrl+Alt+Delete, which is handled by winlogon.exe with no exceptions. Microsoft specifically designed it as a “secure attention sequence” to help the user confirm that they are typing their password into the real login screen and not a fake program phishing for the password. On school computers, pressing Ctrl+Alt+Delete shows a screen asking the user for various options, which unfortunately could not be blocked and looked something like this:

Windows XP security screen

(Naturally, the “Lock Computer” option was greyed out on the actual school computers, or I wouldn’t have to write this program.)

Still, it did the best it could, and surprisingly few people know about the intricacies of Ctrl+Alt+Delete.

The development environment

Programs are the product of their environment, and my BSoD program is no exception. For something that needs to make a lot of Windows API calls directly, any higher-level language like Python or Java is automatically out. The obvious candidates were C and C++.

I wanted to be able to develop this program on a school computer, in case I found a bug that I wanted to fix immediately without waiting to go home, so it had to work on a tiny toolchain that I could bring to school and not take up too much space. For this purpose, I chose the Microsoft Visual C++ 6.0 toolchain (henceforth VC6) from 19984, which, after debloating, compressed down to around 9 MiB with 7-Zip.

In this vein, I tried to make the final executable as lightweight as possible, so C++ was automatically out. That meant I had to write this program in C, and since VC6 pre-dated the C99 standard, the only acceptable form of C was C89. This had several annoying quirks, such as requiring all variable declarations to be hoisted manually to the top of every scope, no exceptions, or requiring the loop variable of a for-loop to be declared separately, but was otherwise reasonably similar to modern C.

Since the compiler predated 64-bit Windows, and the school computers were exclusively 32-bit anyway, no attempt was made to support compiling in 64-bit mode. The program would run fine on 64-bit Windows anyway through WoW64.

VC6 also had the advantage of supporting dynamic linking for a smaller executable without requiring a runtime redistributable to be installed. That often isn’t the case for newer versions of Visual C++, and whether each school computer had each redistributable installed was completely random. This worked because when dynamically linking the C runtime, VC6 simply used msvcrt.dll, which is part of the operating system. It’s quite understandable why Microsoft gave up on this approach, since it made it impossible to add newer features without causing “DLL hell” or application compatibility issues, but still… Naturally, it was possible to statically link the C library, but that added to the executable size.

In the end, though, none of the C library linking stuff mattered, because I decided to ditch the C library entirely to achieve the minimum possible executable size…

Raw Windows API programming

Most people who have written a Windows GUI program would know that the program entry point is:

int WinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPSTR lpCmdLine, int nShowCmd);

However, this is not strictly true. As Raymond pointed out, this is actually a function provided by the C runtime library to replicate the behaviour of 16-bit Windows, and the real entry point of the program is:

DWORD APIENTRY RawEntryPoint(void);

Partly inspired by that blog post, this program did the insane thing of ditching the C library and programming directly to RawEntryPoint. This meant that no C library function could be used at all, only Windows APIs. Because I chose C instead of C++, this was actually feasible; otherwise, I would have had to avoid most C++ features like the plague, since the runtime is implicitly used by most of them.

On newer compilers, several security features, such as guards against stack buffer overflows, had to be turned off, as those require runtime support. Fortunately, VC6 didn’t have those features anyway…

In retrospect, this was totally not worth it, but that’s how the program was constructed. Clearly, I had too much time in high school…

Seeing the program in action

You can download the latest source code from GitHub and compile it yourself, if you have a copy of Visual C++ lying around. Most versions should work, but for the authentic experience, you should use VC6. A VC6-compiled binary is available on my Jenkins instance5.

It should look like this, with a random driver and error code selected:

Screenshot of bsod.exe

To exit, hold down Alt and press the following keys in sequence: F2, F4, F6, F8, 1, 3, 5, 7. Then, press Ctrl+Alt+Shift+Delete.

Enter the password quantum5.ca to exit.

Lessons learned

Given that 13 years have gone by since the program was made, it’s inevitable that I would do some things differently now.

Code-wise, I didn’t have to make too many changes to satisfy my current self. The biggest issue was using funny magic numbers instead of defining them as proper constants, but that was easily rectified. If I were writing this now, I probably wouldn’t ditch the C library, but it really wasn’t that big of a burden for this application, as you shall see.

The real lesson lies in the non-coding portion, and underscores how writing code is probably the easiest part of software development. There are many things I would have done differently:

  1. Using source control. Apparently, my younger self didn’t bother with any kind of source control and just emailed the source code around in plain text. This made it quite difficult to determine the canonical latest version of this program. It is entirely possible that a version with more fixes here got lost over the years. These days, it’s completely unimaginable for me to not use Git to manage everything.
  2. Using a proper build system, even if it’s just Makefiles. Apparently, my younger self just had a command line to compile the program written down separately, which depended on a bunch of generated files that were not obvious how to regenerate, such as a compiled .res file. This required a bit of forensic work to reconstruct the .rc file so I could compile it again. Even worse, the flags to compile the source under VC6 were written down in a completely different place compared to newer versions of the compiler… It really didn’t help that the source code was sent around as text in emails, without any supporting files. In the modernized Git repo, there is a proper Makefile, and a separate one for legacy VC6. Both of these will build the program completely from scratch, instead of relying on distributing intermediate files.
  3. Using continuous integration. Apparently, my younger self had a ton of slightly different executables floating around in a directory somewhere, and it’s not clear which one is the “good” one. This was clearly insane. The modernized version has GitHub Actions set up to build every commit to ensure the repository successfully compiles under modern compilers, without dependency on any local, uncommitted files. I also have Jenkins set up specifically with VC6 to generate builds with the authentic original toolchain for every commit, ensuring that the latest executable is always available.
  4. Using an autoformatter, like clang-format. My younger self decided to manually format all the code, resulting in a coding style that’s not quite consistent. While it doesn’t matter as much for a single-person project like this, having an autoformatter is essential in larger projects involving multiple people to avoid lots of different styles mixing and blending together. The current GitHub version has a .clang-format file defining the expected style.

Nevertheless, I am pleasantly surprised by my younger self at close to half my current age!

Annotated source code

Before we dive in, it’s important to keep in mind that the program is structured to have a bunch of macros that could be defined at compile time to control the behaviour:

  • NOAUTOKILL, which disables the automatic exiting after 50 seconds logic. This is invaluable when developing; and
  • NOTASKMGR, which disables the task manager. The way it does so is kinda sketchy, so this gives an opt-out.

Without further ado, let’s look at the source code, which canonically resides on GitHub.

Headers

#define _WIN32_WINNT 0x0500
#define WINVER       0x0500

#include <windows.h>

First, we define version macros. In this case, we ask windows.h to expose every function available in Windows 2000, which has everything we needed.

#include <aclapi.h>
#include <shlwapi.h>

We then include two more Windows header files, one for ACL operations, and the other for the wnsprintf function. This is an obscure sprintf variant that I would never normally use, except sprintf isn’t available due to libc being ditched. Borrowing the version from shlwapi.dll was the natural solution.

Macros

#define ARRAY_SIZE(x) (sizeof(x) / sizeof *(x))

We define this handy ARRAY_SIZE macro so we can just pass in an array and compute its size as a constant. The way this works is by calling sizeof on the whole array to get its size in bytes, then calling sizeof to get the size of the first element of the array in bytes, and dividing. sizeof *(x) yields the size of the first element because arrays decay to the pointer to the first element in most contexts, including dereferencing, so *(x) yields the first element.

This macro will be broken if the argument isn’t an actual C array. If I cared more, I would make a better version that errors out when something else is passed in, but oh well.

#define WM_ENFORCE_FOCUS (WM_APP + 0)

#define TM_DISPLAY   0xBEEF
#define TM_AUTOKILL  0xDEAD
#define TM_FORCEDESK 0xFAC

#define AUTOKILL_TIMEOUT 50000
#define DISPLAY_DELAY    1000
#define FORCE_INTERVAL   1000

#define HDLG_MSGBOX ((HWND) 0xDEADBEEF)
#define IDC_EDIT1   1024

We then define a bunch of constants, including a window message, a bunch of timer messages, a bunch of time intervals, and some other values. Everything other than WM_ENFORCE_FOCUS manifested as magic numbers in the code. I’ve preserved the original values, but gave them proper names.

IDC_EDIT1 is notably duplicated in bsod.rc, the resource file. If I had more than one constant, I would have defined a header and included it in both bsod.rc and bsod.c.

Compatibility

#if defined(_MSC_VER) && _MSC_VER <= 1200
#define wnsprintf wnsprintfA
int wnsprintfA(PSTR pszDest, int cchDest, PCSTR pszFmt, ...);
typedef unsigned char *RPC_CSTR;
#endif

To ensure it builds on the ancient VC6 with its outdated Windows SDK headers, we define the prototype of wnsprintf, which is actually wnsprintfA in the DLL due to it being the non-Unicode version. Windows headers normally define a macro to replace the name with either the A or W version, depending on whether the UNICODE macro is defined.

We also define the RPC_CSTR type, since we’ll be using an RPC function later, and that type is somehow missing in VC6.

typedef BOOL(WINAPI *LPFN_SHUTDOWNBLOCKREASONCREATE)(HWND, LPCWSTR);
typedef BOOL(WINAPI *LPFN_SHUTDOWNBLOCKREASONDESTROY)(HWND);

This defines the function pointer types for ShutdownBlockReasonCreate and ShutdownBlockReasonDestroy functions introduced in Windows Vista, in anticipation of the school upgrading to Windows 7, which did happen eventually. These will be needed to actually block shutdown on newer versions of Windows.

Function prototypes

LRESULT CALLBACK WndProc(HWND, UINT, WPARAM, LPARAM);
INT_PTR CALLBACK DlgProc(HWND, UINT, WPARAM, LPARAM);
LRESULT CALLBACK LowLevelKeyboardProc(int, WPARAM, LPARAM);
LRESULT CALLBACK LowLevelMouseProc(int, WPARAM, LPARAM);

We now define a bunch of function prototypes.

Constants

#define PASSWORD_LENGTH (sizeof(szRealPassword) - 1)
const char szRealPassword[] = {0x54, 0x50, 0x44, 0x4b, 0x51, 0x50,
                               0x48, 0x10, 0xb,  0x46, 0x44, 0x00};
const char szClassName[] = "BlueScreenOfDeath";

This defines two constants, one for the encoded password, and one for the window class name.

The password is quantum5.ca, encoded by XORing with the number 37. I don’t remember why it was chosen, other than perhaps because it’s a prime number.

Variables

We now declare a bunch of global variables:

HINSTANCE hInst;
HWND hwnd;  // Main window
HWND scwnd; // Static bitmap control
HWND hdlg;  // Password popup

HACCEL hAccel;
HHOOK hhkKeyboard, hhkMouse;
HDESK hOldDesk, hNewDesk;
char szDeskName[40];

LPFN_SHUTDOWNBLOCKREASONCREATE fShutdownBlockReasonCreate;
LPFN_SHUTDOWNBLOCKREASONDESTROY fShutdownBlockReasonDestroy;

Note that using global variables like this is not recommended, but in this case, there’s no way one instance of this program will ever show more than one window, so whatever.

hInst exists solely because a bunch of Windows APIs ask for HINSTANCE due to 16-bit compatibility reasons. Normally, you’d save the hInstance passed to WinMain, but we don’t have that. You’ll see how we calculate it later.

The HWND variables are self-explanatory.

The HACCEL is for storing a handle to the accelerators, and we will be using keyboard accelerator tables instead of implementing complex parsing logic for keystrokes.

HHOOK variables are used to create low-level keyboard and mouse hooks.

HDESK variables are used to store the handle to the original desktop and a new “secure desktop.” szDeskName stores the name of the secure desktop.

We also define variables to store the pointers to ShutdownBlockReasonCreate and ShutdownBlockReasonDestroy functions, if they exist.

#ifdef NOTASKMGR
HKEY hSystemPolicy;
#endif

If the task manager is to be disabled, we define an HKEY to store a handle to a certain registry key.

Keyboard accelerator table

ACCEL accel[] = {
    {FALT | FVIRTKEY,                     '1',       0xBE00},
    {FALT | FVIRTKEY,                     '3',       0xBE01},
    {FALT | FVIRTKEY,                     '5',       0xBE02},
    {FALT | FVIRTKEY,                     '7',       0xBE03},
    {FALT | FVIRTKEY,                     VK_F2,     0xBE04},
    {FALT | FVIRTKEY,                     VK_F4,     0xBE05},
    {FALT | FVIRTKEY,                     VK_F6,     0xBE06},
    {FALT | FVIRTKEY,                     VK_F8,     0xBE07},
    {FALT | FCONTROL | FSHIFT | FVIRTKEY, VK_DELETE, 0xDEAD},
};
BOOL bAccel[ARRAY_SIZE(accel) - 1];

Here, we define a keyboard accelerator table, uncreatively called accel. This will eventually be passed to CreateAcceleratorTable. We define 8 shortcut keys to be pressed, generating WM_COMMAND messages with codes 0xBE00 to 0xBE07 in wParam. These can be triggered by holding down Alt and then pressing the 1, 3, 5, 7, F2, F4, F6, and F8 keys. bAccel tracks which of these keys have been pressed so far.

Finally, there’s a shortcut for Ctrl+Alt+Shift+Delete, which triggers a WM_COMMAND with 0xDEAD, that signals the program to trigger the exit routine.

Helper functions

void GenerateUUID(LPSTR szUuid) {
    UUID bUuid;
    RPC_CSTR rstrUUID;

    UuidCreate(&bUuid);
    UuidToString(&bUuid, &rstrUUID);
    lstrcpy(szUuid, (LPCSTR) rstrUUID);
    RpcStringFree(&rstrUUID);
}

This function generates a UUID, which will be used to create a random and guaranteed-to-be-unique name for a secure desktop. It uses a bunch of functions from rpcrt4.dll, like UuidCreate and UuidToString, to generate the UUID, and we use lstrcpy to copy it into the buffer passed in, before freeing the weird RPC_CSTR. Note that lstrcpy from kernel32.dll is used because we can’t use strcpy in the C library.

int UnixTime() {
    union {
        __int64 scalar;
        FILETIME ft;
    } time;
    GetSystemTimeAsFileTime(&time.ft);
    return (int) ((time.scalar - 116444736000000000i64) / 10000000i64);
}

This is effectively a reimplementation of the time function from the C standard library, which we can’t use. So instead, we use GetSystemTimeAsFileTime, which returns the current time as a 64-bit integer, divided into low and high parts as 32-bit DWORDs in the FILETIME structure. To make it easily interpretable, we convert it to __int64—Microsoft’s extension type for 64-bit integers in the days before long long was properly supported—through a union, which is the safe way to ensure it’s properly aligned.

Note that FILETIME is defined as measuring the time in 100 ns intervals since midnight UTC on January 1, 1601. To convert this to Unix time, we need to subtract 116,444,736,000,000,000 such intervals, i.e. 11.6 billion seconds, to make the zero point the Unix epoch, and then divide by 10 million to convert to seconds.

Note that this implementation technically uses the Microsoft Visual C++ library function _alldiv to perform the 64-bit division. However, that function does not introduce any further dependencies on the C library, most notably the initialization code, which is the main source of bloat. If you really hate the C library, the following alternative implementation in x86 assembly will avoid it entirely:

__declspec(naked) int UnixTime(void) {
    __asm {
        sub     esp,    8
        push    esp
        call    dword ptr GetSystemTimeAsFileTime
        pop     eax
        pop     edx
        sub     eax,    3577643008
        sbb     edx,    27111902
        mov     ecx,    10000000
        idiv    ecx
        ret
    }
}

But I see no need to make the code unreadable for the exact same executable size… Anyways, let’s go on:

int Random() {
    static int seed = 0;
    if (!seed)
        seed = UnixTime();
    seed = 1103515245 * seed + 12345;
    seed &= 0x7FFFFFFF;
    return seed;
}

We then use this crappy random number generator that uses the current Unix time as a seed. This isn’t going to win any awards or be suitable for cryptographic use, but it’s good enough for this program.

For those curious, this implementation is a linear congruential generator, specifically with the parameters used in glibc, with the output capped to be positive through bitwise AND with 0x7FFFFFFF.

Protection functions

And then we have a bunch of functions to make sure the program doesn’t get killed…

#ifdef NOTASKMGR
void DisableTaskManager(void) {
    DWORD dwOne = 1;
    if (hSystemPolicy)
        RegSetValueEx(hSystemPolicy, "DisableTaskMgr", 0, REG_DWORD, (LPBYTE) &dwOne,
                      sizeof(DWORD));
}

void EnableTaskManager(void) {
    DWORD dwZero = 0;
    if (hSystemPolicy)
        RegSetValueEx(hSystemPolicy, "DisableTaskMgr", 0, REG_DWORD, (LPBYTE) &dwZero,
                      sizeof(DWORD));
}
#endif

These functions disable and enable the task manager by setting a registry key that would normally be controlled by group policy. Specifically, it sets a value named DisableTaskMgr under the key in hSystemPolicy, which will be opened to HKEY_CURRENT_USER\Software\Microsoft\Windows\CurrentVersion\Policies\System.

This is only active if the NOTASKMGR macro is defined.

DWORD ProtectProcess(void) {
    ACL acl;

    if (!InitializeAcl(&acl, sizeof acl, ACL_REVISION))
        return GetLastError();

    return SetSecurityInfo(GetCurrentProcess(), SE_KERNEL_OBJECT, DACL_SECURITY_INFORMATION, NULL,
                           NULL, &acl, NULL);
}

This function creates a blank ACL and sets it on the current process object. Since a blank ACL does not grant any permissions, it means that no one has the permission to perform any operations on this process, for “nefarious” things such as terminating it.

Fun fact: I learned this trick from this crapware that the school had installed on the computers. I had a batch file that killed all the startup programs to squeeze out every drop of performance on the ancient school computers, and that one program couldn’t be killed. Naturally, I studied how it worked, and it did so by updating the ACL to deny everyone PROCESS_TERMINATE, i.e. the ability to call TerminateProcess on it. It had a fatal flaw: it didn’t stop someone from calling CreateRemoteThread on the process. Naturally, I wrote a helper program that created a remote thread in the crapware that ran ExitProcess, and lo and behold, the crapware was successfully killed.

Since I am using a blank ACL here, it uses a lot less code than updating the ACL as used by that crapware, and there’s the added bonus that it denies all sketchy operations, including CreateRemoteThread and any other debugging operations…

Also note that ACLs are supposed to be variable length and heap-allocated, with stuff coming after the ACL structure in the same block of memory, but the degenerate case of the empty ACL fits on the stack just fine.

STICKYKEYS StartupStickyKeys = {sizeof(STICKYKEYS), 0};
TOGGLEKEYS StartupToggleKeys = {sizeof(TOGGLEKEYS), 0};
FILTERKEYS StartupFilterKeys = {sizeof(FILTERKEYS), 0};

void AllowAccessibilityShortcutKeys(BOOL bAllowKeys) {
    if (bAllowKeys) {
        SystemParametersInfo(SPI_SETSTICKYKEYS, sizeof(STICKYKEYS), &StartupStickyKeys, 0);
        SystemParametersInfo(SPI_SETTOGGLEKEYS, sizeof(TOGGLEKEYS), &StartupToggleKeys, 0);
        SystemParametersInfo(SPI_SETFILTERKEYS, sizeof(FILTERKEYS), &StartupFilterKeys, 0);
    } else {
        STICKYKEYS skOff = StartupStickyKeys;
        TOGGLEKEYS tkOff = StartupToggleKeys;
        FILTERKEYS fkOff = StartupFilterKeys;

        if ((skOff.dwFlags & SKF_STICKYKEYSON) == 0) {
            skOff.dwFlags &= ~SKF_HOTKEYACTIVE;
            skOff.dwFlags &= ~SKF_CONFIRMHOTKEY;
            SystemParametersInfo(SPI_SETSTICKYKEYS, sizeof(STICKYKEYS), &skOff, 0);
        }
        if ((tkOff.dwFlags & TKF_TOGGLEKEYSON) == 0) {
            tkOff.dwFlags &= ~TKF_HOTKEYACTIVE;
            tkOff.dwFlags &= ~TKF_CONFIRMHOTKEY;
            SystemParametersInfo(SPI_SETTOGGLEKEYS, sizeof(TOGGLEKEYS), &tkOff, 0);
        }
        if ((fkOff.dwFlags & FKF_FILTERKEYSON) == 0) {
            fkOff.dwFlags &= ~FKF_HOTKEYACTIVE;
            fkOff.dwFlags &= ~FKF_CONFIRMHOTKEY;
            SystemParametersInfo(SPI_SETFILTERKEYS, sizeof(FILTERKEYS), &fkOff, 0);
        }
    }
}

This is the helper function that toggles the Windows accessibility shortcuts, specifically for sticky keys, toggle keys, and filter keys. We define three structures to store the state of each, which will be populated on startup. The helper function, when told to allow, will set those three back to the original state with SystemParametersInfo. When told to disallow, it will remove the bits for the keyboard shortcut and the confirmation dialogue from the flags for each.

Generating the blue screen

LPSTR bsod1 = "\r\n\
A problem has been detected and Windows has been shut down to prevent damage\r\n\
to your computer.\r\n\
\r\n\
The problem seems to be caused by the following file: ";

LPSTR bsod2 = "\r\n\r\n";

LPSTR bsod3 = "\r\n\
\r\n\
If this is the first time you've seen this stop error screen,\r\n\
restart your computer. If this screen appears again, follow\r\n\
these steps:\r\n\
\r\n\
Check to make sure any new hardware or software is properly installed.\r\n\
If this is a new installation, ask your hardware or software manufacturer\r\n\
for any Windows updates you might need.\r\n\
\r\n\
If problems continue, disable or remove any newly installed hardware\r\n\
or software. Disable BIOS memory options such as caching or shadowing.\r\n\
If you need to use Safe Mode to remove or disable components, restart\r\n\
your computer, press F8 to select Advanced Startup Options, and then\r\n\
select Safe Mode.\r\n\
\r\n\
Technical information:\r\n\
\r\n";

LPSTR bsod4 = "*** STOP: 0x%08X (0x%08X,0x%08X,0x%08X,0x%08X)";
LPSTR bsod5 = "\r\n\r\n\r\n***  ";
LPSTR bsod6 = "%s - Address %08X base at %08X, DateStamp %08x";

First, we have a bunch of string fragments that comprise the text shown on the blue screen. These are divided into six chunks, and bsod4 and bsod6 are actually printf-style format strings.

LPSTR lpBadDrivers[] = {
    "HTTP.SYS",  "SPCMDCON.SYS", "NTFS.SYS",   "ACPI.SYS",  "AMDK8.SYS", "ATI2MTAG.SYS",
    "CDROM.SYS", "BEEP.SYS",     "BOWSER.SYS", "EVBDX.SYS", "TCPIP.SYS", "RDPDR.SYS",
};

We then have a bunch of system drivers that could be blamed for the crash. I am not sure why most of these were selected in particular, but I think:

  • ATI2MTAG.SYS was selected because the school computers had ATI6 graphics cards, and the display driver often crashed;
  • BEEP.SYS was selected because I had code in MusicKeyboard that directly interfaced with it; and
  • BOWSER.SYS was selected because it had a weird name that was not a typo.

typedef struct {
    LPSTR name;
    DWORD code;
} BUG_CHECK_CODE;

BUG_CHECK_CODE lpErrorCodes[] = {
    {"INVALID_SOFTWARE_INTERRUPT",  0x07},
    {"KMODE_EXCEPTION_NOT_HANDLED", 0x1E},
    {"PAGE_FAULT_IN_NONPAGED_AREA", 0x50},
    {"KERNEL_STACK_INPAGE_ERROR",   0x77},
    {"KERNEL_DATA_INPAGE_ERROR",    0x7A},
};

Then we have a bunch of common error codes. Note that these are called BUG_CHECK_CODE because “bug check” is the proper name for the BSoD, and in fact, the kernel function that triggers them is called KeBugCheck. We need both the string name and the internal code, because the code will be shown in the STOP: line.

HBITMAP RenderBSoD(void) {
    HBITMAP hbmp;
    HDC hdc;
    HBRUSH hBrush = CreateSolidBrush(RGB(0, 0, 128));
    RECT rect = {0, 0, 640, 480};
    HFONT hFont;
    char bsod[2048];
    char buf[1024];
    LPSTR lpName;
    BUG_CHECK_CODE bcc;
    DWORD dwAddress;
    int i, k;

Now we have the RenderBSoD function, which returns an HBITMAP containing the rendered image for the BSoD. Note that we hoist all the variable declarations to the top of the function due to C89 limitations.

Most of these variables are self-explanatory or will soon be obvious. The hBrush is the solid blue brush that will be used to paint the screen blue. bsod is the buffer that will hold the final generated text, and buf is scratch space for formatting individual fragments.

    // Initialize RNG
    k = Random() & 0xFF;
    for (i = 0; i < k; ++i)
        Random();

We then initialize the random number generator somewhat by taking a random byte and skipping ahead by that many numbers. This makes the output a bit more random.

    hdc = CreateCompatibleDC(GetDC(hwnd));
    hbmp = CreateCompatibleBitmap(GetDC(hwnd), 640, 480);
    hFont = CreateFont(14, 8, 0, 0, FW_NORMAL, 0, 0, 0, ANSI_CHARSET, OUT_RASTER_PRECIS,
                       CLIP_DEFAULT_PRECIS, NONANTIALIASED_QUALITY, FF_MODERN, "Lucida Console");

We then create an HDC that’s compatible with our main window, whose handle will be in hwnd when this function is called. The DC, or device context, defines the attributes of the output device, which in this case is just a screen, and contains some state for rendering parameters.

We then create a 640×480 bitmap that’s similarly compatible, onto which the BSoD will be rendered. Since 640×480 is the screen resolution that Windows falls back to when rendering the actual BSoD, this will make it look correct.

We then create the font. After some experimentation and comparison with a real BSoD, I found that Lucida Console with height 14 and width 8 in GDI logical units looked indistinguishable from the real one. It is very important to pass NONANTIALIASED_QUALITY since there is no smoothing or ClearType™ on the BSoD, and this ensures that the text looks as ugly as it does on the real screen.

    lstrcpy(bsod, bsod1);
    lpName = lpBadDrivers[Random() % ARRAY_SIZE(lpBadDrivers)];
    bcc = lpErrorCodes[Random() % ARRAY_SIZE(lpErrorCodes)];
    lstrcat(bsod, lpName);
    lstrcat(bsod, bsod2);
    lstrcat(bsod, bcc.name);
    lstrcat(bsod, bsod3);

We populate bsod from the string fragments, inserting the driver name and the error code string as necessary. The random number generator is used to pick a random driver and code from the lists.

Note that we start the string with lstrcpy and then add onto it with lstrcat. These are just the kernel32.dll versions of strcpy and strcat, which we can’t use due to shunning the C library.

    switch (Random() % 4) {
    case 0:
        wnsprintf(buf, ARRAY_SIZE(buf), bsod4, bcc.code, Random() | 1 << 31, Random() & 0xF,
                  Random() | 1 << 31, 0);
        break;
    case 1:
        wnsprintf(buf, ARRAY_SIZE(buf), bsod4, bcc.code, Random() | 1 << 31, Random() | 1 << 31,
                  Random() & 0xF, Random() & 0xF);
        break;
    case 2:
        wnsprintf(buf, ARRAY_SIZE(buf), bsod4, bcc.code, Random() | 1 << 31, 0, Random() & 0xF,
                  Random() & 0xF);
        break;
    default:
        wnsprintf(buf, ARRAY_SIZE(buf), bsod4, bcc.code, Random() | 1 << 31, Random() | 1 << 31,
                  Random() | 1 << 31, Random() | 1 << 31);
        break;
    }
    lstrcat(bsod, buf);

We generate a temporary string for the STOP: line from bsod4 with wnsprintf (since we can’t use sprintf for aforementioned reasons), and then append it. There are four different ways the bug check arguments could be populated, since they aren’t always random numbers. If I cared more, I probably could have made the numbers more realistic based on the error codes, but right now, this just ensures that there are some small numbers, some big numbers, and some zeroes in the parameters. We also guarantee the high bit is set on the first parameter, which makes it look like a memory address in the upper half of the address space, commonly used for kernel mode addresses.

    lstrcat(bsod, bsod5);
    dwAddress = Random() | 1 << 31;
    wnsprintf(buf, ARRAY_SIZE(buf), bsod6, lpName, dwAddress, dwAddress & 0xFFFF0000, UnixTime());
    lstrcat(bsod, buf);

Finally, we add the last two lines, including the module name again and some addresses. We generate a random faulting address, set the module base to the start of its 64 KiB block, and plug in the Unix time.

At this point, bsod contains all the text to be shown.

    SelectObject(hdc, hbmp);
    SelectObject(hdc, hFont);
    FillRect(hdc, &rect, hBrush);
    SetBkColor(hdc, RGB(0, 0, 128));
    SetTextColor(hdc, RGB(255, 255, 255));
    DrawText(hdc, bsod, -1, &rect, 0);

We now load the bitmap and font into the HDC, then fill the entire bitmap with the blue background brush. We then set the background and text colours, before drawing the text.

    DeleteDC(hdc);
    DeleteObject(hBrush);
    return hbmp;

We finally clean up the objects we’ve created, then return the bitmap created.

Entry point

Now we come to the startup sequence:

DWORD APIENTRY RawEntryPoint() {
    MSG messages;
    WNDCLASSEX wincl;
    HMODULE user32;

We declare stack variables for processing window messages and a window class to be registered, then an HMODULE to get a handle to user32.dll, whence we will load ShutdownBlockReasonCreate and ShutdownBlockReasonDestroy for newer versions of Windows.

    hInst = (HINSTANCE) GetModuleHandle(NULL);
    GenerateUUID(szDeskName);

    ProtectProcess();

We initialize hInst. In 32- and 64-bit Windows, the HINSTANCE is just the base address of the executable, which we obtain as above.

We then initialize szDeskName to be the secure desktop’s unique name, and protect the process from being killed.

    // Save the current sticky/toggle/filter key settings so they can be
    // restored later
    SystemParametersInfo(SPI_GETSTICKYKEYS, sizeof(STICKYKEYS), &StartupStickyKeys, 0);
    SystemParametersInfo(SPI_GETTOGGLEKEYS, sizeof(TOGGLEKEYS), &StartupToggleKeys, 0);
    SystemParametersInfo(SPI_GETFILTERKEYS, sizeof(FILTERKEYS), &StartupFilterKeys, 0);

We save the initial state for the accessibility features too.

    hOldDesk = GetThreadDesktop(GetCurrentThreadId());
    hNewDesk = CreateDesktop(szDeskName, NULL, NULL, 0, GENERIC_ALL, NULL);
    SetThreadDesktop(hNewDesk);
    SwitchDesktop(hNewDesk);

We get the current desktop and then create a new one, then set the current thread to use that desktop, before switching the screen over to the new desktop.

If you have used Windows Vista or newer, you may remember the UAC prompt that dims the entire screen, and may have heard it vaguely referred to as a “secure desktop”. This does essentially the same thing: we create a new desktop and switch to it. Since most normal GUI objects are desktop-scoped, including things like windows, menus, and most importantly, hooks, creating a new desktop isolates the windows on it. In the case of the secure desktop for UAC, no weird messages can be sent to windows on the new desktop, e.g. to simulate the “yes” button being clicked, and low-level keyboard hooks (which are how most keyloggers work) can’t be used to intercept the user’s keystrokes. This is how we defeat keyloggers.

The dimmed desktop for the UAC prompt is actually just a screenshot of the old desktop, dimmed and rendered on a full-screen window. Here, we don’t bother.

As a bonus, since Explorer isn’t running on the secure desktop, it’s extra hard to find a way to break out of the BSoD.

#ifdef NOTASKMGR
    if (RegCreateKeyEx(HKEY_CURRENT_USER,
                       "Software\\Microsoft\\Windows\\CurrentVersion\\Policies\\System", 0, NULL, 0,
                       KEY_SET_VALUE, NULL, &hSystemPolicy, NULL)) {
        hSystemPolicy = NULL;
    }
#endif

If we are disabling the task manager, we need to load up hSystemPolicy. We open the named registry key, creating it if it doesn’t exist. Should we somehow fail, we set hSystemPolicy to NULL, in which case DisableTaskManager will fail gracefully.

    wincl.hInstance = hInst;
    wincl.lpszClassName = szClassName;
    wincl.lpfnWndProc = WndProc;
    wincl.style = CS_DBLCLKS;
    wincl.cbSize = sizeof(WNDCLASSEX);

    wincl.hIcon = LoadIcon(NULL, MAKEINTRESOURCE(1));
    wincl.hIconSm = LoadIcon(NULL, MAKEINTRESOURCE(1));
    wincl.hCursor = NULL;
    wincl.lpszMenuName = NULL;
    wincl.cbClsExtra = 0;
    wincl.cbWndExtra = 0;
    wincl.hbrBackground = (HBRUSH) GetStockObject(BLACK_BRUSH);

    if (!RegisterClassEx(&wincl))
        return 0;

We now populate the window class before registering it. This is mostly boilerplate, except that we set the background to BLACK_BRUSH to simulate the flash of black while the display mode changes, before the BSoD shows up.

    user32 = GetModuleHandle("user32");
    fShutdownBlockReasonCreate =
        (LPFN_SHUTDOWNBLOCKREASONCREATE) GetProcAddress(user32, "ShutdownBlockReasonCreate");
    fShutdownBlockReasonDestroy =
        (LPFN_SHUTDOWNBLOCKREASONDESTROY) GetProcAddress(user32, "ShutdownBlockReasonDestroy");

We get an HMODULE to user32.dll and then load up ShutdownBlockReasonCreate and ShutdownBlockReasonDestroy.

    hhkKeyboard = SetWindowsHookEx(WH_KEYBOARD_LL, LowLevelKeyboardProc, hInst, 0);
    hhkMouse = SetWindowsHookEx(WH_MOUSE_LL, LowLevelMouseProc, hInst, 0);

We also create low-level keyboard and mouse hooks to disable certain keys and all mouse movements.

    hwnd = CreateWindowEx(0, szClassName, "Blue Screen of Death", WS_POPUP, CW_USEDEFAULT,
                          CW_USEDEFAULT, 640, 480, NULL, NULL, hInst, NULL);

    ShowWindow(hwnd, SW_MAXIMIZE);

We create the window, and show it maximized.

    hAccel = CreateAcceleratorTable(accel, ARRAY_SIZE(accel));
    while (GetMessage(&messages, NULL, 0, 0) > 0) {
        if (!TranslateAccelerator(hwnd, hAccel, &messages)) {
            TranslateMessage(&messages);
            DispatchMessage(&messages);
        }
    }

We now run the classic Windows message loop, except we construct the accelerator table ahead of time and call TranslateAccelerator to interpret the keyboard accelerators. If that function returns true, the message should not be processed further, per the documentation, so we skip TranslateMessage and DispatchMessage in that case.

    ExitProcess(messages.wParam);
}

We close out RawEntryPoint with an ExitProcess call, since Windows by default passes the return value to ExitThread and leaves all other threads in the process running. Since threads may start for whatever reason these days, the process might linger on indefinitely, and we ensure this isn’t the case via ExitProcess.

The exit code is the wParam of the last message processed. The only way the message loop can terminate is if it receives WM_QUIT, which is generated by the PostQuitMessage function. Effectively, we exit the program with the value passed to PostQuitMessage.

Low-level mouse hook

This is the low-level mouse hook, which just swallows all mouse events:

LRESULT CALLBACK LowLevelMouseProc(int nCode, WPARAM wParam, LPARAM lParam) {
    if (nCode >= 0) {
        return 1;
    }
    return CallNextHookEx(NULL, nCode, wParam, lParam);
}

Low-level keyboard hook

This is the low-level keyboard hook, which swallows all keys that are neither used in any of our accelerators nor required in the password dialogue:

LRESULT CALLBACK LowLevelKeyboardProc(int nCode, WPARAM wParam, LPARAM lParam) {
    if (nCode >= 0) {
        KBDLLHOOKSTRUCT *key = (KBDLLHOOKSTRUCT *) lParam;

        switch (key->vkCode) {
        case VK_LWIN:
        case VK_RWIN:
        case VK_TAB:
        case VK_ESCAPE:
        case VK_LBUTTON:
        case VK_RBUTTON:
        case VK_CANCEL:
        case VK_MBUTTON:
        case VK_CLEAR:
        case VK_PAUSE:
        case VK_CAPITAL:
        case VK_KANA:
        case VK_JUNJA:
        case VK_FINAL:
        case VK_HANJA:
        case VK_NONCONVERT:
        case VK_MODECHANGE:
        case VK_ACCEPT:
        case VK_END:
        case VK_HOME:
        case VK_LEFT:
        case VK_UP:
        case VK_RIGHT:
        case VK_DOWN:
        case VK_SELECT:
        case VK_PRINT:
        case VK_EXECUTE:
        case VK_SNAPSHOT:
        case VK_INSERT:
        case VK_HELP:
        case VK_APPS:
        case VK_SLEEP:
        case VK_NUMLOCK:
        case VK_SCROLL:
        case VK_PROCESSKEY:
        case VK_PACKET:
        case VK_ATTN:
            return 1;
        }
    }
    return CallNextHookEx(NULL, nCode, wParam, lParam);
}

Window procedure

The window procedure is the event handler for the window, effectively. It starts like this:

LRESULT CALLBACK WndProc(HWND hwnd, UINT message, WPARAM wParam, LPARAM lParam) {
    int i;
    switch (message) {

We define a loop variable for use later due to C89 and use a switch statement to match on the message ID.

    case WM_CREATE:
        scwnd = CreateWindowEx(0, "STATIC", "", SS_BITMAP | WS_CHILD | WS_VISIBLE, 0, 0, 640, 480,
                               hwnd, (HMENU) -1, NULL, NULL);
#ifndef NOAUTOKILL
        SetTimer(hwnd, TM_AUTOKILL, AUTOKILL_TIMEOUT, NULL);
#endif
        SetTimer(hwnd, TM_DISPLAY, DISPLAY_DELAY, NULL);
        SetTimer(hwnd, TM_FORCEDESK, FORCE_INTERVAL, NULL);

        SetCursor(NULL);

        if (fShutdownBlockReasonCreate)
            fShutdownBlockReasonCreate(hwnd, L"You can't shutdown with a BSoD running.");

        // Force to front
        SetWindowPos(hwnd, HWND_TOPMOST, 0, 0, 0, 0,
                     SWP_ASYNCWINDOWPOS | SWP_NOACTIVATE | SWP_NOMOVE | SWP_NOSIZE);
        SetForegroundWindow(hwnd);
        LockSetForegroundWindow(1);

        AllowAccessibilityShortcutKeys(FALSE);
#ifdef NOTASKMGR
        DisableTaskManager();
#endif
        break;

In our WM_CREATE handler, which is called upon window creation, we create a static control to display our blue screen bitmap.

We also set a bunch of timers:

  1. If NOAUTOKILL is not defined, then we set up a timer to signal the program to exit, which is invaluable for debugging;
  2. We set a timer to display the BSoD after initially blacking out the screen; and
  3. We set a timer to forcefully switch the desktop to our secure desktop, in case someone switches it back somehow.

We also hide the cursor, since a mouse cursor on the BSoD gives the game away pretty quickly.

We then create a shutdown blocking reason. Note that fShutdownBlockReasonCreate takes exclusively Unicode strings, so we needed to use a const wchar_t * literal with the L prefix on the string.

We then move the window to the front through SetWindowPos, with HWND_TOPMOST to guarantee the window shows up above all normal windows. We set it to the foreground window and prevent any other process from taking over with LockSetForegroundWindow.

Finally, we disable accessibility shortcuts and the task manager.

    case WM_SHOWWINDOW:
        return 0;

We ignore all WM_SHOWWINDOW messages by returning 0 immediately, skipping the processing in DefWindowProc, which may hide the current window.

    case WM_TIMER:
        switch (wParam) {

We now handle the timers.

        case TM_DISPLAY: {
            RECT rectClient;

            KillTimer(hwnd, TM_DISPLAY);
            SendMessage(scwnd, STM_SETIMAGE, (WPARAM) IMAGE_BITMAP, (LPARAM) RenderBSoD());
            GetClientRect(hwnd, &rectClient);
            SetWindowPos(scwnd, 0, 0, 0, rectClient.right, rectClient.bottom,
                         SWP_NOMOVE | SWP_NOZORDER | SWP_NOACTIVATE);
            InvalidateRect(hwnd, NULL, TRUE);
            break;
        }

For TM_DISPLAY, we stop the timer from firing again, then render the BSoD and set it as the bitmap shown by the static control. We then make the static control cover the whole screen, before forcing the whole window to repaint.

#ifndef NOAUTOKILL
        case TM_AUTOKILL:
            KillTimer(hwnd, TM_AUTOKILL);
            DestroyWindow(hwnd);
            break;
#endif

If the safety timeout triggers, we stop the timer from triggering again, and then call DestroyWindow to tear everything down.

        case TM_FORCEDESK:
            SwitchDesktop(hNewDesk);
            break;
        }
        break;

This is pretty self-explanatory.

    case WM_DESTROY:
        if (fShutdownBlockReasonDestroy)
            fShutdownBlockReasonDestroy(hwnd);

        UnhookWindowsHookEx(hhkKeyboard);
        UnhookWindowsHookEx(hhkMouse);

        DestroyAcceleratorTable(hAccel);
        LockSetForegroundWindow(0);
        AllowAccessibilityShortcutKeys(TRUE);
        SetThreadDesktop(hOldDesk);
        SwitchDesktop(hOldDesk);
        CloseDesktop(hNewDesk);
#ifdef NOTASKMGR
        EnableTaskManager();
#endif
        PostQuitMessage(0);
        break;

Upon receiving WM_DESTROY, we perform cleanup:

  • remove the shutdown block on newer Windows;
  • unhook the mouse and keyboard;
  • clean up the accelerator table;
  • unlock the foreground window;
  • reset the accessibility shortcuts;
  • switch the desktop back;
  • clean up the secure desktop;
  • re-enable the task manager if we disabled it; and
  • exit the message loop and the program.

    case WM_CLOSE:
        switch (DialogBox(hInst, MAKEINTRESOURCE(32), hwnd, DlgProc)) {
        case 1:
            DestroyWindow(hwnd);
            break;
        case 2:
            hdlg = HDLG_MSGBOX;
            MessageBox(hwnd,
                       "You got the password wrong!\n"
                       "Good luck guessing!",
                       "Error!", MB_ICONERROR);
            hdlg = NULL;
            break;
        default:
            hdlg = HDLG_MSGBOX;
            MessageBox(hwnd,
                       "You just abandoned the perfect chance to exit!\n"
                       "Good luck trying!",
                       "Error!", MB_ICONERROR);
            hdlg = NULL;
        }
        break;

Upon receiving WM_CLOSE, which means someone is trying to close the window, we invoke DialogBox to create a dialogue from our resource template and use DlgProc as the dialogue procedure.

If the dialogue procedure returns 1, signalling the password is correct, we destroy the window. If it returns 2, we show a taunting message for an incorrect password, and any other value means the dialogue box was closed, and we show a different taunting message.

For reference, the dialogue template is in bsod.rc and looks like:

#include <winuser.h>

#define IDC_EDIT1  1024
#define IDC_STATIC -1

32 DIALOGEX 0, 0, 256, 73
STYLE DS_SETFONT | DS_MODALFRAME | DS_FIXEDSYS | WS_POPUP | WS_CAPTION | WS_SYSMENU
CAPTION "Dialog"
FONT 8, "MS Shell Dlg", 400, 0, 0x1
BEGIN
    EDITTEXT        IDC_EDIT1,47,33,201,14,ES_AUTOHSCROLL | ES_PASSWORD
    LTEXT           "It seems like you are trying to exit a Blue Screen of Death, which is impossible. But if you are a hacker and knows the password, you might be able to.",IDC_STATIC,7,7,240,27
    LTEXT           "Password:",IDC_STATIC,7,36,38,8,0,WS_EX_TRANSPARENT
    PUSHBUTTON      "&OK",IDOK,145,49,50,14
    PUSHBUTTON      "&Cancel",IDCANCEL,198,49,50,14
END

Back to the window procedure…

    case WM_KEYDOWN:
        return 0;

We skip the weird WM_KEYDOWN handling for F10 in DefWindowProc by doing this.

    case WM_COMMAND:
    case WM_SYSCOMMAND:
        if (HIWORD(wParam) == 1) {
            if (LOWORD(wParam) == 0xDEAD) {
                for (i = 0; i < ARRAY_SIZE(bAccel); ++i)
                    if (!bAccel[i])
                        return 0;

                SendMessage(hwnd, WM_CLOSE, 0, 0);
            } else if ((LOWORD(wParam) & 0xFF00) == 0xBE00) {
                int index = LOWORD(wParam) & 0xFF;
                if (index < ARRAY_SIZE(bAccel))
                    bAccel[index] = TRUE;
            }
        }
        break;

This handles the keyboard accelerators. Whether WM_COMMAND or WM_SYSCOMMAND is generated is kinda complex and it doesn’t really matter here, so we just treat them the same. If HIWORD(wParam) == 1, then it’s an accelerator key, and LOWORD(wParam) is what we put into the accelerator table above:

  • If it’s 0xDEAD and all other accelerators have been pressed, we send WM_CLOSE, initiating the procedure above to show the password prompt.
  • If it’s 0xBExx and the low-order byte is a valid index, we mark it as pressed.

    case WM_QUERYENDSESSION:
        return 0;

This tells Windows XP that the user shouldn’t be allowed to log out. Typically, this is done in response to unsaved documents, but we are abusing it to prevent logouts without the password.

Perhaps due to applications just returning 0 for all messages instead of calling DefWindowProc, or otherwise abusing WM_QUERYENDSESSION, Windows Vista introduced the whole ShutdownBlockReasonCreate thing…

    case WM_ENFORCE_FOCUS:
        if (GetForegroundWindow() != hdlg) {
            if (hdlg && hdlg != HDLG_MSGBOX) {
                SetFocus(hdlg);
                SetForegroundWindow(hdlg);
            } else if (!hdlg) {
                SetFocus(hwnd);
                SetForegroundWindow(hwnd);
                SetWindowPos(hwnd, HWND_TOPMOST, 0, 0, 0, 0, SWP_NOMOVE | SWP_NOSIZE);
            }
        }
        SetCursor(NULL);
        ShowWindow(hwnd, SW_SHOWMAXIMIZED);
        break;

This is a custom window message that we use to reset the foreground window, should we somehow stop being it. We set the focus on hdlg if it exists, otherwise the main window, which we ensure is topmost. However, if we are showing a message box, we don’t change the focus.

We also hide the cursor and maximize the window.

    case WM_ACTIVATE:
        if (LOWORD(wParam) != WA_INACTIVE)
            break;
        if (!HIWORD(wParam))
            break;

    case WM_NCACTIVATE:
    case WM_KILLFOCUS:
        PostMessage(hwnd, WM_ENFORCE_FOCUS, 0, 0);
        return 1;

If we detect the window has been deactivated somehow or lost focus, we post the message WM_ENFORCE_FOCUS to be handled the next time the message loop runs.

    case WM_SIZE:
        if (wParam != SIZE_MAXIMIZED)
            ShowWindow(hwnd, SW_SHOWMAXIMIZED);
        break;

If we receive a WM_SIZE telling us we are no longer maximized, undo that.

    default:
        return DefWindowProc(hwnd, message, wParam, lParam);
    }
    return 0;
}

And we wrap up the window procedure by calling DefWindowProc for all other messages, and return 0 if we’ve handled any other message that didn’t return early.

Dialogue procedure

And finally, we have a dialogue procedure for the password dialogue:

INT_PTR CALLBACK DlgProc(HWND hWndDlg, UINT msg, WPARAM wParam, LPARAM lParam) {
    UNREFERENCED_PARAMETER(lParam);

    switch (msg) {

This is how it starts. We use the UNREFERENCED_PARAMETER macro to avoid a warning being generated by Microsoft’s compilers at /W4 level of warnings, since we never touch lParam.

We then handle each dialogue message, which acts like window messages…

    case WM_INITDIALOG:
        hdlg = hWndDlg;
        break;

Upon WM_INITDIALOG, which is sent when the dialogue is initialized, we set hdlg to save the window handle.

    case WM_COMMAND:
        switch (wParam) {

Upon WM_COMMAND, which happens when a button is clicked, we switch on wParam, which contains the ID of the button being clicked.

        case IDOK: {
            DWORD dwLength, i;
            TCHAR szPassword[PASSWORD_LENGTH + 1];
            dwLength = GetDlgItemText(hWndDlg, IDC_EDIT1, szPassword, PASSWORD_LENGTH + 1);
            if (dwLength != PASSWORD_LENGTH) {
                EndDialog(hWndDlg, 2);
                hdlg = NULL;
                break;
            }

            for (i = 0; i < PASSWORD_LENGTH; ++i)
                szPassword[i] ^= 37;

            EndDialog(hWndDlg, lstrcmp(szPassword, szRealPassword) ? 2 : 1);
            hdlg = NULL;
            break;
        }

IDOK is the default ID for the OK button in a dialogue box, and we use GetDlgItemText to get the text for the dialogue item with IDC_EDIT1, which is our password edit control. We load the bytes into szPassword, which is just enough to fit the correct password.

If GetDlgItemText returns the wrong length, we end the dialogue with code 2, meaning incorrect password.

Then, we XOR every byte in the password with 37 to encode it, before using lstrcmp to compare it with the real password. lstrcmp is kernel32.dll’s version of strcmp, which again we can’t use due to shunning the C library.

We end the dialogue with code 1 if it’s correct, otherwise 2.

        case IDCANCEL:
            EndDialog(hWndDlg, 0);
            hdlg = NULL;
            break;
        }
        break;

For IDCANCEL, we end the dialogue with 0 to signal that it was cancelled.

    default:
        return FALSE;
    }
    return TRUE;
}

And finally, we return TRUE if we handled the message, and FALSE if we didn’t to trigger the default behaviour. This is where dialogue procedures differ from window ones, and honestly, I think the DefWindowProc approach is more flexible.

Conclusion

And that’s the end of this little program. I hope you learned a bit about programming in C or writing Windows applications, especially a funky one like this.

Notes

  1. Yes, this predated the Windows XP end-of-support date of April 8, 2014, but the school just continued running Windows XP anyway. It wasn’t until 2015 that all the computers were upgraded to Windows 7, and boy was it slow on those ancient Pentium Ds, which were around 10 years old by that time. That probably explains why they held onto Windows XP as long as possible.

    For those who are too young to remember, that was a reasonably innovative era in PC hardware, so 10-year-old computers were far less usable then than they are in 2026. 

  2. I obviously knew which computer was the fastest in every lab and made sure to secure it for my own use whenever possible. They were still super slow, especially in the Windows 7 era, but better than being stuck with something even worse. 

  3. Yes, Microsoft used Unix time on the BSoD. It is perhaps somewhat ironic, but they did make Xenix at one point… 

  4. Yes, it was an ancient toolchain, surprisingly lightweight and popular, even a decade and a half after its release. Pretty good for a toolchain as old as me. 

  5. For those of you wondering how I got VC6 to run on Jenkins, the answer is Wine, because I am not specifically deploying a Windows server to run VC6. My debloated VC6 package just ran perfectly fine. 

  6. For those too young to remember, AMD’s graphics card division was acquired from ATI. 

Quantum

2025: Year in Review

2025-12-31

For the past three years, I’ve been writing year-end reviews to look back upon the year that had gone by and reflect on what had happened. As yet another year has gone by, it is time to continue the tradition.

Like last year, I’ll divide this into several areas, as opposed to grouping by month as I had done in the years before:

  1. BGP and operating my own autonomous system;
  2. My homebrew CDN for this blog;
  3. My home server;
  4. My ADS-B feeder;
  5. My coding projects;
  6. My mechanical keyboard;
  7. My travel router project; and
  8. My music hobby.

Without further ado, let’s begin.

BGP and operating my own AS

Perhaps the most dramatic change for my AS is that ARIN finally issued me a /22 of IPv4 after being on the waiting list for two years. This allowed me to deploy my own IPv4 in four different locations:

  • Toronto, Canada;
  • Kansas City, Missouri, USA;
  • Fremont, California, USA; and
  • Amsterdam, Netherlands.

I’ve also repurposed my old ASN, AS200351, to demonstrate how new networks can announce routes to the Internet, as well as documenting how to join an Internet Exchange.

Connectivity-wise, AS54148 joined two new Internet exchanges:

  • AMS-IX, the Amsterdam Internet Exchange, which is one of the largest in the world. Special shoutout to AMS-IX’s Bright Networks Club, which gave small networks like mine a free 1 Gbps port, as well as Lionel Douglas at AMS-IX for making it happen.
  • NVIX, the North Virginia Internet Exchange.

Location-wise, I’ve added an IPv6-only node in Zürich, Switzerland to my anycast network. I also lost BGP on the Hong Kong node due to issues with the provider, which is unfortunate. Still, I have reasonable coverage in Asia with the Singapore and Tokyo nodes.

My homebrew CDN

My CDN has undergone some further location changes this year. The following nodes were added:

  • Seattle, Washington, United States;
  • Ashburn, Virginia, United States; and
  • a new “China-optimized” node in Tokyo, Japan.

I dropped the node in Bangalore, India due to provider issues. At this point, cheap servers don’t really seem to be worth the toll they inflict upon my sanity.

Perhaps the biggest change is that I decided that my CDN is good enough on its own and has a better track record than Cloudflare in terms of uptime, given the number of Cloudflare outages we’ve seen this year. For this reason, I’ve fully eliminated Cloudflare from all my domains.

I’ve also documented how to set up PowerDNS to pick the closest available server, which includes instructions for anycast deployment, as well as how to use Galera to ensure the control plane is highly available.

Speaking of PowerDNS, remember how I mentioned last year that it was annoying that I had to ensure MaxMind had the correct data, since PowerDNS insisted on looking up the server’s location from IP in the GeoLite2 database? Well, this year, I got sick of doing this and created my own pipeline with mmdb-editor to postprocess GeoLite2 databases downloaded from MaxMind to include the correct locations for my own servers. This ensured the location is always correct. As a bonus, I can also have geolocation for internal VPN IPs, since nothing stops me from putting private IPs into the database.

My home server

As documented, I’ve replaced the crappy RAM in my server with ECC RAM, which ensured that any memory corruption would be detected instead of creating mysterious failures. I am pleased to announce that I’ve not seen any errors with the RAM. Specifically:

# ras-mc-ctl --error-count
Label   CE  UE
DIMM_A2 0   0
DIMM_B1 0   0
DIMM_A2 0   0
DIMM_B1 0   0
DIMM_B2 0   0
DIMM_A1 0   0
DIMM_B2 0   0
DIMM_A1 0   0

However, I decided to wait until I had the energy before doing the same upgrade to my desktop. This has proved to be a mistake due to RAM prices going up, and I refuse to pay the current market price…

I’ve also replaced the case, going from some crappy 20-year-old case to the Antec 1200, which I wrote about here. The case has been holding up quite well and was definitely a worthy upgrade.

Unfortunately, one of the HDDs in the RAID 0 array I used for “unimportant data” in my server decided to give up the ghost, causing the entire array to be destroyed. This forced me to recreate the array with new drives as RAID 1 to avoid the annoyance of redownloading the data, which I used as an opportunity to document how to recreate the LVM cache setup.

Speaking of storage failures…

My ADS-B feeder

Earlier this year, I wrote about creating a multi-network ADS-B feeder to report planes flying nearby. Unfortunately, my ADS-B feeder ran on a Raspberry Pi, which suffered an SD card failure.

Since I didn’t want to reinstall Debian on the Raspberry Pi again, I instead opted to use my old Atomic Pi to handle the ADS-B dongle and run dump1090. The various feeders are now run directly on my server in a separate VM instead, reducing dependency on crappy storage on Pis.

Unfortunately, as it turned out, feeders provided by certain plane-tracking websites are ARM only, and both my server and the Atomic Pi are amd64 platforms. For this reason, I had to build x86 packages for Flightaware’s PiAware feeder, which proved annoying.

However, what was beyond the pale was AirNavRadar, whose rbfeeder only ran on Raspberry Pis and required fake sensor data to be emulated. It doesn’t build elsewhere. A Docker image for it exists that purportedly works on amd64, but upon further inspection, the Dockerfile simply installed qemu-user-static and armhf packages using Debian’s multiarch feature to run it under emulation. Another reason why you should never trust Docker images. For this reason, they are no longer getting a feed from me.

However, it’s not all doom and gloom. I would like to shout out to FlightRadar24 and Planefinder for providing amd64 Debian packages that just work out of the box.

My coding projects

I’ve also written a bunch of code this year outside of work:

  1. I created TOTP.fyi, which is a code generator for the time-based one-time password (TOTP) algorithm, commonly used by authenticator apps like Google Authenticator. This is meant to help developers easily test TOTP in their apps, such as by passing in codes that were valid in the past or will be valid in the future, but are currently invalid. Hopefully, this encourages more websites to implement 2FA correctly. I am always so disappointed when banks insist on using text messages vulnerable to SIM swapping, instead of something like TOTP, which is better. The gold standard is still security keys though, and it’s disappointing that so few websites support them.
  2. On that note, I also created a tiny Rust app that returns the current time, called qtime, which I plan to integrate into TOTP.fyi to warn users if their local clocks are out of sync and the generated TOTP codes are actually invalid. Maybe that integration will be done next year.
  3. I am also working on the new Looking Glass Indirect Display Driver (IDD), which enables Looking Glass to be used without the overhead of capturing the screen or a dummy HDMI or DisplayPort plug, as well as acting as a very nice way to interact with virtual machines without any GPUs. I am mostly focusing on implementing a UI to manage the driver inside the Windows virtual machine, rather than the driver itself.
  4. I’ve also implemented a version of the Mayan calendar, in a similar vein to my French Republican calendar. Next year, I’ll probably write some posts explaining how the Mayan calendar works.

Next year, I also plan to resurrect one of my high school projects, clean it up, and publish it properly on GitHub. I’ll not spoil the surprise for now.

My mechanical keyboard

On the keyboard front, I continue to suffer from various problems. My old Corsair K70 MK.2 keyboard with Cherry MX brown switches finally gave up the ghost, generating random keystrokes without any keys being touched.

I had to bring out the custom mechanical keyboard I built last year, which continued to suffer from the problems I complained about last year, except the weird key registering problem spread to even more keys. Thankfully, I was able to prevent keys from double registering most of the time by changing the debouncing algorithm in the firmware, which is an advantage of programmable firmware on keyboards.

Based on other users’ reports online, the problem appears to be quality control in Akko’s keyboard switches. It was a shame because I really enjoyed typing on the Akko Lavender Purple switches when they weren’t busy acting up.

As a result, I decided to get similar keyboard switches from a different vendor, and I eventually settled on the TTC Bluish White Silent tactile switches. They feel quite good to type on, although they are perhaps a bit too silent. Still, this allowed me to revert the debouncing hack in my keyboard firmware, although there’s nothing I can do about the backlight situation caused by the “south-facing” LEDs, since I believe side-printed keycaps look silly.

Fortunately, the de Quervain’s tenosynovitis hasn’t returned, and I am cautiously optimistic about the future of my keyboard.

My travel router project

My travel router ended up having issues connecting to various hotel WiFi due to portals and weird routing in the GL.iNet software. I think I’ll need to rebuild it on stock OpenWrt and set up my own policy-based routing rules instead of fighting GL.iNet’s “user-friendly” prescriptive defaults. I am sure it works perfectly for most users doing basic things, but I needed more control.

Unfortunately, due to recent events, I am not travelling as often now, and so I haven’t really had much motivation to fix the router.

My music hobby

And finally, my music hobby. As I mentioned last year, I was planning to learn to play Debussy’s La fille aux cheveux de lin (meaning “the girl with the flaxen hair”), which I finally managed to learn this year. I created this recording:

This didn’t really turn out as well as I had hoped, but honestly, I just haven’t really had the time to practice piano. I am not sure I’ll have the energy to learn another piece next year…

Conclusion

That’s about it for 2025. I am not sure what 2026 will bring, but I suppose we’ll find out.

If you like my content and would like to support me, you can do so via GitHub Sponsors, Ko-fi, or Stripe (CAD)—though of course, this is strictly optional.

Alternatively, you can also check out my referral links for services that I use. Most of them will pay me some sort of commission if you sign up with my link.

Quantum
Joining your first Internet Exchange

2025-11-05 · https://quantum5.ca/2025/11/05/joining-your-first-internet-exchange

Last time, I covered the process of announcing your very first route to the Internet via BGP, but that’s only the beginning. I promised to dive into the process of joining an Internet Exchange to bring better connectivity to a fledgling network, and now is the time.

For this exercise, I will be connecting the BGP VM running AS200351 to the Ontario Internet Exchange (ONIX), which has gladly provided me with a port for my test network to help me write this post. While ONIX isn’t the biggest Internet Exchange in Toronto (that honour belongs to TorIX), it nevertheless has a reasonable number of big peers, such as Hurricane Electric, which will come into play later.

Without further ado, let’s dive in.

What is an Internet Exchange?

I’ve written about this in more detail as part of my series on BGP, especially when I discussed autonomous systems. To summarize, an Internet Exchange (IX) is conceptually a really big switch into which many networks are connected, forming a massive peering LAN. Through this peering LAN, many networks can achieve a direct connection to each other and set up BGP sessions with each other without having to run a dedicated wire between each pair of networks. Instead, every network simply needs to run one wire to the IX, which reduces the number of wires required from quadratic in the number of interconnected networks to linear.

With BGP sessions, networks can announce routes to their own networks to each other, a process called peering. However, naïvely running direct BGP sessions between all networks still requires a quadratic number of BGP sessions. To help reduce this, most IXes offer route servers (RSes). Every member of the IX peers with the route server, announcing their own routes, and the route server sends back routes that all other peers announce to it. This reduces the number of BGP sessions from quadratic to linear in the number of networks on the IX.
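To put rough numbers on this, here is a quick back-of-the-envelope check in the shell (the member count of 100 is purely illustrative):

```shell
# Full mesh: every pair of the n members needs its own BGP session.
# Route server: each member only needs one session with the RS.
n=100
echo "full mesh: $(( n * (n - 1) / 2 )) sessions"   # prints "full mesh: 4950 sessions"
echo "route server: $n sessions"                    # prints "route server: 100 sessions"
```

The gap only widens as the IX grows, which is why route servers are so popular.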

For this reason, most networks peer with the route servers. However, some networks would prefer more fine-grained control over route distribution, and thus set up bilateral BGP sessions between each other anyway. These are frequently referred to as bilats. Large networks typically prefer this approach.

The IX peering LAN can also be used for more than just peering. For example, one network might offer another network BGP transit over a bilateral BGP session, which involves announcing every route on the Internet and providing full connectivity to the Internet.1

How to not get banned?

Before we talk about how to join an Internet Exchange, it is very important to discuss how to stay joined, i.e. not getting banned from an IX. You must have a firm grasp of this before connecting yourself to the IX.

While an Internet Exchange is effectively just a big LAN, it must not be carelessly treated as if it were a normal network. It is very important to follow several principles when dealing with Internet Exchanges, or you risk causing disruption to the entire Internet. I assure you, you do not want to wake up to news headlines of your network being accused of causing a major Internet outage. Violating the rules will also get you yelled at by the operators and your port disabled, or, with repeated offences, get you permanently banned.

While the exact policy depends on the IX in question, and you are highly encouraged to read the relevant policies, there are several principles by which basically all IXes operate:

  • Only sending IPv4, IPv6, and ARP traffic. Specifically, this is usually framed as an EtherType restriction to:
    • 0x0800: IPv4;
    • 0x86dd: IPv6; and
    • 0x0806: Address Resolution Protocol (ARP), a protocol to discover the MAC address associated with an IPv4 address.

    An Internet Exchange is designed to exchange IP traffic, not other random traffic!

  • No multicast or broadcast traffic, since they go to every member, and sending a lot of them can easily threaten the stability of the IX itself. The following exceptions are carved out, as they’re required for MAC address discovery:
    • ARP packets; and
    • ICMPv6 neighbour discovery (NDP), specifically neighbour solicitation (ICMPv6 type 135). Note that this does not permit you to send router solicitation or router advertisement messages! Doing so will cause a lot of problems.

    In fact, most IXes will very strongly prefer that you use very long timeouts for ARP and NDP caching to avoid generating too much broadcast traffic.

  • Do not query ARP/NDP for IPs not on the peering LAN. As a corollary to the previous point, querying ARP and NDP for IPs not on the peering LAN also generates spam traffic that harms the stability of the IX itself. This often happens if you add IPs not on the peering LAN to the IX interface, so do NOT do that.

  • No link-local protocols other than the aforementioned ARP and NDP protocols. Technically, the previous two requirements already forbid all of these, but it’s worth specifically calling out especially troublesome ones like spanning tree (STP), LLDP, and CDP.

    Instructions for Linux will be included in this post. If you are joining from any hardware router or L3 switch, make sure you consult the vendor documentation on all the protocols enabled by default. Depending on the IX, sending a single forbidden packet may result in your port getting instantly disabled.

  • Do not hijack other people’s IPs or MACs. Only ever use IP addresses that you have been assigned by the IX. Do not use ARP/NDP proxying. Do not respond to ARP/NDP for IPs that haven’t been assigned to you.

  • Do not offer L2 access to the peering LAN to others, unless previously authorized by the IX to operate an extension or resell ports. Many IXes enforce a limit of one MAC address per port and have strict ACLs for MACs.

  • Do not announce the peering LAN via BGP. The peering LAN is a private network between members of the IX and should not be reachable over the Internet.

  • Do not point default routes at other IX members. Specifically, you should only ever send traffic to a member if the destination address is covered by a route announced to you by that member, either directly or through the route servers.

  • Only send routes in your cone to peers and route servers. The norm for peering is to send your cone, i.e. routes belonging to your network and your downstream networks, as authorized by IRR and RPKI. For more details, see my post on route authorization. You should not send all routes in your routing table to peers or the route servers. You may offer transit over bilateral BGP sessions, in which case you may send the full Internet routing table, but you must never offer route servers transit.

  • Do not sniff traffic between any other IX members. This should go without saying, but do not violate the privacy of other IX members.

Note that this is by no means an exhaustive list. Doing anything crazy that causes instability for the IX or any network on it will have consequences. To learn more about bad IX traffic, you can consult this post from Ben Cartwright-Cox.
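As a defensive measure on your own side, you can also filter your own egress so that a misconfiguration never reaches the IX in the first place. Below is a minimal sketch using nftables; the table and chain names are mine, ens19 stands in for your IX-facing interface, and the egress hook requires Linux 5.16 or newer. This is optional belt-and-braces, not something any IX requires:

```
# Drop everything that isn't IPv4, IPv6, or ARP before it leaves
# the IX-facing interface (ens19 is an assumption; use your own).
table netdev ixguard {
    chain egress {
        type filter hook egress device "ens19" priority 0; policy accept;
        ether type { ip, ip6, arp } accept
        counter drop
    }
}
```

Load it with sudo nft -f, and remember that this only papers over problems: the real fix for forbidden traffic is always to disable the offending service.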

Gaining access to the peering LAN

Okay, with all the dangers out of the way, you can apply to join an IX and eventually obtain access to the peering LAN. This will depend heavily on how you plan to join the IX, mostly depending on whether you are using a VM or a dedi, and whether your server provider is a reseller for the IX.

Renting servers from an IX reseller

The easiest way to join an IX, and the one used by most hobby networks, is renting a VM (or dedicated server) from a provider that also resells access to a desired IX. Usually, this will be advertised as an option on the VMs, and some providers even sell IXP access VMs specifically designed to be hooked up to an IX.

When joining an IX this way, you would typically open a ticket with your server provider requesting access to the IX, and they will send you a form to fill out. Fill out the provided form, pay the relevant fees, and at some point, the peering LAN will be delivered to your server.

Typically, with VMs, the IX peering LAN will be delivered as a separate virtual interface on the VM. With dedicated servers, it’s usually delivered as a VLAN on the main network uplink, although it’s also possible for it to be plugged into a separate network port, if you have those available.
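As a hypothetical example, if the peering LAN were delivered as VLAN 1234 on enp1s0 (both values are made up; your provider will tell you the real ones), the ifupdown2 configuration might look like this:

```
auto enp1s0.1234
iface enp1s0.1234 inet static
    # Use the address actually allocated to you by the IX
    address 2001:db8:1234::5678/64
```

All the precautions for configuring the IX interface later in this post apply equally to such a VLAN subinterface.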

For this exercise, I asked ONIX if I could “resell” the connection I have on my dedicated server to my VM, and they said yes, then allocated an IPv6 address for AS200351 on the peering LAN: 2001:504:125:e1::bde. We will be using this later.

Apply to an IX on your own

This is definitely the harder option, but it’s necessary if you wish to join an IX that your provider doesn’t resell. I had to go through this process recently to join AMS-IX.

The first thing you should figure out is where your VM or dedicated server is located, and whether the IX is available in the same location. In my case, I am using iFog GmbH in Amsterdam, and the server is located in Digital Realty’s AMS17 datacentre, where AMS-IX is available through several “partners”, i.e. resellers.

If the desired IX is not available, not all is lost, since it might be possible to connect in a different location and have the traffic transported (or “hauled”) through someone else’s connection between datacentres. You would typically find a provider that would give you an L2 connection between two datacentres. You then connect to the provider, who then connects to the IX in the other location.

You should then figure out how much it costs to run a cross-connect (XC), which is basically a piece of optical fibre that goes between your server provider’s rack and the AMS-IX reseller’s (or directly to AMS-IX). In my case, an XC in DRT AMS17 came with a prohibitive monthly price, and therefore, it didn’t make sense to connect to AMS-IX in DRT AMS17.

However, not all is lost. I spoke with iFog GmbH for other options, and since they already had a connection between DRT AMS17 and NIKHEF (a nearby datacentre with no monthly fees for XCs), I can simply get an XC with AMS-IX in NIKHEF and the traffic hauled to me through iFog’s connection. This is a much more affordable option, and it’s the one I went with.

Knowing how to connect, I spoke to AMS-IX and they directed me to one of their resellers, A2B Internet. Once I sorted out the administrative stuff with them, they sent over a letter of authorization (LOA) authorizing an XC to be run between A2B Internet and Dynamic Quantum Networks in the next 30 days after issuance. I forwarded the LOA over to iFog, who then got someone in NIKHEF to run the XC between A2B’s rack and iFog’s.

With the XC in place, iFog hauled the connection as a VLAN through a bunch of his switches until it reached my server. Once done, I informed A2B of the fact, and they asked for the MAC address of my IX interface to configure the ACL. Once I produced that, I got onto AMS-IX’s peering LAN.

Okay, that would be the case if the XC worked. For some reason, iFog couldn’t see any light on the fibre, and I spent the next two weeks channelling messages between iFog and A2B, both 6 hours ahead of me due to timezones, until we finally determined that one of the fibre strands was damaged. After fixing that, I finally got onto AMS-IX. These things do happen occasionally, and I guess I just had bad luck. Most people I talked to seemed to have a much easier time with XCs, but you should be prepared for the worst if you are dealing with such real world things.

Configuring the IX interface

Note: on some IXes, you are placed onto a quarantine LAN first. If that’s the case, the configuration process is the same, just that they will verify you are doing things right before sending you over to the real peering LAN.

Like last time, we’ll use ifupdown2 to configure the interface. In this example, ONIX is on ens19, and we will bring it up by appending the following content to /etc/network/interfaces:

auto ens19
iface ens19 inet static
    # To prevent IP hijacking by people blindly copying this config,
    # this is not my actual IP on ONIX.
    # Change this to the IP and mask allocated by the IX.
    address 2001:db8:1234::5678/64
    # And if you have IPv4, put the IPv4 assigned by your IX to you here,
    # or otherwise remove this:
    address 192.0.2.123/24
    # WARNING: do NOT add any other IPs here. Your own IP goes onto lo, a
    # bridge, or a dummy interface. See the previous part.

    # Disables multicast, which is undesired on IX peering LANs.
    up ip link set multicast off dev "$IFACE"

Note that this specifically uses static IPs. We do not want DHCP on the IX peering LAN, as previously mentioned. If you are running any sort of DHCP server (such as dnsmasq) or route advertisement daemon (such as radvd) on the host, make sure it’s disabled for the peering LAN.
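For example, if dnsmasq happens to be running on the same machine, one way to keep it away from the peering LAN is to exclude the interface explicitly (ens19 as above; this is just a sketch, so check how your own setup is configured):

```
# /etc/dnsmasq.d/no-ix.conf
# Never listen on, or serve DHCP to, the IX-facing interface.
except-interface=ens19
```

For radvd, it suffices to make sure /etc/radvd.conf contains no interface block for ens19 at all, since radvd only advertises on interfaces it is explicitly configured for.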

Before bringing up the interface, configure the following sysctls by creating /etc/sysctl.d/onix.conf with this content:

# Prevents the kernel from answering ARP requests with addresses from
# other interfaces, which is undesirable on the IXP peering LAN.
net.ipv4.conf.ens19.arp_filter = 1
net.ipv4.conf.ens19.arp_ignore = 1
net.ipv4.conf.ens19.arp_announce = 1

# Disables IPv6 autoconfiguration on this interface.
net.ipv6.conf.ens19.autoconf = 0

# Do not accept any IPv6 route advertisements on this interface.
# These shouldn't exist on the peering LAN, but some IXes don't ban people
# as quickly as we'd like.
net.ipv6.conf.ens19.accept_ra = 0

# Disables router solicitations.
net.ipv6.conf.ens19.router_solicitations = -1

# Bump ARP and NDP timeouts to 4 hours to avoid generating broadcast traffic.
net.ipv4.neigh.ens19.base_reachable_time_ms = 14400000
net.ipv6.neigh.ens19.base_reachable_time_ms = 14400000
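
The 14400000 figure is not magic: it is simply four hours expressed in milliseconds:

```shell
# 4 hours x 60 minutes x 60 seconds x 1000 milliseconds
echo $((4 * 60 * 60 * 1000))   # prints 14400000
```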

With that configured, we can now reload sysctl config and bring up the interface:

sudo systemctl restart systemd-sysctl.service
sudo ifup ens19

If all goes well, you should now be able to ping the route servers. You can look up the IP addresses on the IX’s website, but in the case of ONIX, it’s 2001:504:125:e1::1 and 2001:504:125:e1::2. Let’s ping the first one:

$ ping 2001:504:125:e1::1
PING 2001:504:125:e1::1 (2001:504:125:e1::1) 56 data bytes
64 bytes from 2001:504:125:e1::1: icmp_seq=1 ttl=64 time=0.275 ms
64 bytes from 2001:504:125:e1::1: icmp_seq=2 ttl=64 time=0.173 ms
^C
--- 2001:504:125:e1::1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1025ms
rtt min/avg/max/mdev = 0.173/0.224/0.275/0.051 ms

We are online!

If this doesn’t work, make sure your IP configuration is correct. If that still doesn’t work, you should run tcpdump -i ens19 and check if you see any broadcast traffic. If you aren’t receiving anything, speak with your server provider to see if the port is connected correctly.

Checking for bad traffic

Before we continue, we should take this opportunity to check that we aren’t sending forbidden traffic to the IX. For this, you will need tcpdump:

sudo apt install tcpdump

Now, to sanity check that it can receive traffic:

$ sudo tcpdump -e -i ens19 'broadcast || multicast'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens19, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:12:45.728543 f0:64:26:f5:b2:0b (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 60: Request who-has 149.112.50.178 tell hurricane-electric.as6939.ip.onix.cx, length 46
21:12:45.769472 50:6b:4b:3e:33:a5 (oui Unknown) > 33:33:ff:00:00:17 (oui Unknown), ethertype IPv6 (0x86dd), length 86: paradoxnetworks.as52025.ip6.onix.cx > ff02::1:ff00:17: ICMP6, neighbor solicitation, who has private-user.as51019.ip6.onix.cx, length 32
21:12:45.798912 e4:8d:8c:fb:c0:62 (oui Unknown) > 33:33:ff:00:00:15 (oui Unknown), ethertype IPv6 (0x86dd), length 86: smishcraft.as210667.ip6.onix.cx > ff02::1:ff00:15: ICMP6, neighbor solicitation, who has as17290.onix.cx, length 32
21:12:45.811583 bc:24:11:f1:22:57 (oui Unknown) > 33:33:ff:00:00:61 (oui Unknown), ethertype IPv6 (0x86dd), length 86: as213768.onix.cx > ff02::1:ff00:61: ICMP6, neighbor solicitation, who has bgptools-rc.as212232.ip6.onix.cx, length 32
...

Yes, we can see broadcast ARP and NDP packets on ONIX. Now, let’s filter it down to the outgoing bad stuff. In the filter below, ip6[40] is the ICMPv6 type byte, which sits immediately after the fixed 40-byte IPv6 header, and type 135 is neighbour solicitation:

$ sudo tcpdump -Qout -e -i ens19 '(broadcast || multicast) && !arp && !(icmp6 && ip6[40] == 135)'
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on ens19, link-type EN10MB (Ethernet), snapshot length 262144 bytes

Leave this running for a while, say at least an hour. Make sure it doesn’t print anything more. Then press Ctrl+C to interrupt it, and make sure it prints 0 packets captured, like this:

0 packets captured

If you see any bad traffic, please address it immediately, before you get banned by the IX.

Set preferred IPs

To avoid making requests from IPs on the IX peering LAN on routes received from the route servers, you must set krt_prefsrc on protocol kernel to the desired default IP.

For example, we can uncomment the commented block in the IPv6 kernel protocol in /etc/bird/bird.conf from last time, resulting in something like this:

protocol kernel {
    scan time 60;
    ipv6 {
        export filter {
            if source = RTS_STATIC && proto = "default_v6" then reject;
            # FIXME change the IP to one of yours that you want to prefer
            if source = RTS_BGP then krt_prefsrc = 2602:fa43:f0::1;
            accept;
        };
    };
}

You will need to do so for IPv4 as well, if you have IPv4.
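As a sketch, the IPv4 kernel protocol would mirror the IPv6 one above. Note that the default_v4 protocol name and the IP here are my assumptions, so adjust them to match your actual configuration from last time:

```
protocol kernel {
    scan time 60;
    ipv4 {
        export filter {
            if source = RTS_STATIC && proto = "default_v4" then reject;
            # FIXME change the IP to one of your own IPv4 addresses
            if source = RTS_BGP then krt_prefsrc = 192.0.2.1;
            accept;
        };
    };
}
```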

Setting up BGP sessions with route servers

First, we should define an identifier for ONIX in filter_bgp.conf, in the section # FIXME: define your IXPs here. For example:

define IXP_ONIX = 100;

This will be used in BGP communities generated by the library to identify routes originated from the IX’s route servers.

We can now append new content to the bird.conf we constructed last time.

First, we define a bird protocol template for BGP with our ONIX IPs to avoid repeating the same IPs over and over again:

# Drop this entire section if you don't have IPv4.
# FIXME change the template name if joining something not ONIX.
template bgp onix_v4 {
    # Replace this IP with your ONIX IPv4
    local 192.0.2.1 as 200351;
    default bgp_local_pref 100;
}

# FIXME change the template name if joining something not ONIX.
template bgp onix_v6 {
    # Replace this IP with your ONIX IPv6
    local 2001:db8:1234::5678 as 200351;
    default bgp_local_pref 100;
}

Note that we set bgp_local_pref to 100 for all IX peers by default, since that’s what we said in the previous part we were using for direct peers.

Now, let’s define templates for route servers:

# Drop this entire section if you don't have IPv4.
# FIXME change the template name if joining something not ONIX.
template bgp onix_rs_v4 from onix_v4 {
    ipv4 {
        import keep filtered;
        # FIXME change IXP_ONIX to the constant defined for the IX
        # you are joining.
        import where import_ixp_trusted(IXP_ONIX);
        # FIXME change the ASN here to that of the IX route servers.
        export where export_cone(57369);
    };
}

# FIXME change the template name if joining something not ONIX.
template bgp onix_rs_v6 from onix_v6 {
    ipv6 {
        import keep filtered;
        # FIXME change IXP_ONIX to the constant defined for the IX
        # you are joining.
        import where import_ixp_trusted(IXP_ONIX);
        # FIXME change the ASN here to that of the IX route servers.
        export where export_cone(57369);
    };
}

Note that helper functions are included from filter_bgp.conf, which we downloaded last time from my bird-filter library. You can look there for the exact implementations if you are interested in the details.

Now, let’s define the BGP sessions with the route servers:

# Drop this entire section if you don't have IPv4.
# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_rs4a from onix_rs_v4 {
    # FIXME change the description if not joining ONIX.
    description "ONIX Route Server A (IPv4)";
    # FIXME change the IPv4 and ASN to that of the first route server.
    neighbor 149.112.50.1 as 57369;
}

# Drop this entire section if you don't have IPv4.
# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_rs4b from onix_rs_v4 {
    # FIXME change the description if not joining ONIX.
    description "ONIX Route Server B (IPv4)";
    # FIXME change the IPv4 and ASN to that of the second route server.
    neighbor 149.112.50.2 as 57369;
}

# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_rs6a from onix_rs_v6 {
    # FIXME change the description if not joining ONIX.
    description "ONIX Route Server A (IPv6)";
    # FIXME change the IPv6 and ASN to that of the first route server.
    neighbor 2001:504:125:e1::1 as 57369;
}

# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_rs6b from onix_rs_v6 {
    # FIXME change the description if not joining ONIX.
    description "ONIX Route Server B (IPv6)";
    # FIXME change the IPv6 and ASN to that of the second route server.
    neighbor 2001:504:125:e1::2 as 57369;
}

If you are joining an IX with only one route server, then omit the protocol block for the B route server.

Now let’s reload our configuration:

$ sudo birdc configure
BIRD 2.17.1 ready.
Reading configuration from /etc/bird/bird.conf
Reconfigured

Now we should see the route server BGP sessions:

$ sudo birdc s p
BIRD 2.17.1 ready.
Name       Proto      Table      State  Since         Info
...
onix_rs6a  BGP        ---        up     22:43:37.758  Established
onix_rs6b  BGP        ---        up     22:43:37.713  Established

And we are now peered with the route servers!

If you are on a quarantine LAN, now would be a good time to ask the IX to validate your setup and move you to the real peering LAN.

Common issues

If your BGP session isn’t showing as established, then something has gone wrong. There are several things to try (this also applies to bilateral sessions, which we will demonstrate later):

  1. Double check the IP addresses you are using. Make sure the ones on your side and the peer’s side are both correct.
  2. Check if you can ping the BGP neighbour address, as seen in birdc s p a. If you can’t ping the IP, it means the IP might be incorrect or the other side is down. Double check the IP and talk to the peer or the IX for more information if you believe the IP is correct.
  3. If you have a firewall, ensure TCP port 179 connections are allowed, as it is used for BGP.
  4. Check BIRD logs with journalctl -u bird for any errors.

If you can’t figure this out, ask your reseller (or the IX’s contact when going direct), or ask for help in the IPv6 Discord.

Setting up bilateral BGP sessions

For fun, I will also show how to configure a bilateral BGP session over ONIX, using AS54148 as an example:

# Drop this entire section if you don't have IPv4.
# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_dqn4 from onix_v4 {
    # FIXME change the description if not peering with me.
    description "Dynamic Quantum Networks (IPv4)";
    # FIXME change the IP and ASN to that of your desired peer.
    neighbor 149.112.50.51 as 54148;

    ipv4 {
        import keep filtered;
        # FIXME change the ASNs below to that of your desired peer.
        import where import_peer_trusted(54148);
        export where export_cone(54148);
    };
}

# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_dqn6 from onix_v6 {
    # FIXME change the description if not peering with me.
    description "Dynamic Quantum Networks (IPv6)";
    # FIXME change the IP and ASN to that of your desired peer.
    neighbor 2001:504:125:e1::51 as 54148;

    ipv6 {
        import keep filtered;
        # FIXME change the ASNs below to that of your desired peer.
        import where import_peer_trusted(54148);
        export where export_cone(54148);
    };
}

For simplicity’s sake, I will not use any IRR or RPKI filters, but note that this is not good practice. I may cover this later on my blog, but instructions are available in the README for my bird-filter repository.

Now let’s reload bird and validate the session:

$ sudo birdc configure
BIRD 2.17.1 ready.
Reading configuration from /etc/bird/bird.conf
Reconfigured
$ sudo birdc s p
BIRD 2.17.1 ready.
Name       Proto      Table      State  Since         Info
...
onix_dqn6  BGP        ---        up     23:20:17.861  Established

Check the common issues section above if you run into trouble.

Hurricane Electric

Hurricane Electric (AS6939) is a major global ISP, and you can peer with them over ONIX. In fact, they will typically reach out to you directly when your port goes up with something like this:

Hello AS200351,

We noticed you are live at ONIX and would like to initiate peering. If you would like to establish peering over this exchange or any other of our common exchanges please send me the details and when you have configured the session on your side. If you have any specific questions feel free to let me know.

I’ve included our information at the end of this message. If you need additional information, let me know and I’ll get it out to you ASAP.

[ONIX] AS6939 2001:504:125:e1::3

[ONIX] AS200351 2001:504:125:e1::bde

We look forward to hearing from you.

Thank you

Mindy Mosher
Hurricane Electric
AS6939

If so, you can just reply and say yes. If not, you can email [email protected] for a BGP session as well.

You can set up the session on your end like any other peers, as before:

# Drop this entire section if you don't have IPv4.
# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_he4 from onix_v4 {
    description "Hurricane Electric (IPv4)";
    # Update the IP here if not peering with HE over ONIX.
    neighbor 149.112.50.3 as 6939;

    ipv4 {
        import keep filtered;
        import where import_peer_trusted(6939);
        export where export_cone(6939);
    };
}

# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_he6 from onix_v6 {
    description "Hurricane Electric (IPv6)";
    # Update the IP here if not peering with HE over ONIX.
    neighbor 2001:504:125:e1::3 as 6939;

    ipv6 {
        import keep filtered;
        import where import_peer_trusted(6939);
        export where export_cone(6939);
    };
}

However, they will also happily offer you free IPv6 transit, allowing you to get a second upstream to multi-home your network. If you say yes to their offer, you can use the following block instead for IPv6:

# FIXME change the protocol and template names if not joining ONIX.
protocol bgp onix_he6 from onix_v6 {
    description "Hurricane Electric Transit (IPv6)";
    # Update the IP here if not peering with HE over ONIX.
    neighbor 2001:504:125:e1::3 as 6939;
    # We use 50 for upstreams, as mentioned last time.
    # You can use a slightly bigger number to prefer HE, or a smaller number
    # to deprefer them.
    default bgp_local_pref 50;

    ipv6 {
        import keep filtered;
        import where import_transit(6939, false);
        export where export_cone(6939);
    };
}

Once done, run sudo birdc configure to reconfigure bird. The BGP session will be down until HE configures it on their end. For this exercise, it took HE a grand total of 2 minutes to set up the BGP session after I emailed them, which is rather impressive. Anyways, once it’s configured on both ends:

$ sudo birdc s p
BIRD 2.17.1 ready.
Name       Proto      Table      State  Since         Info
...
onix_he6   BGP        ---        up     23:44:43.056  Established
$ sudo birdc s p a onix_he6
...
    Routes:         224535 imported, 0 filtered, 3 exported, 155101 preferred

Since this is a transit session, you should see over 200k routes.

Once again, check the common issues section above if you run into trouble.

After a while, you should see AS6939 as a direct transit provider on bgp.tools for your prefix, e.g. for 2a07:54c1:d351::/48. This might take a day or two, just like how long it previously took the route to appear on the Internet for the first time.

Conclusion

Hopefully, you’ve successfully joined your first Internet Exchange and made your network multi-homed. You might even have found yourself a backup transit provider. Note that networks other than Hurricane Electric might also offer free IP transit over IXes; you’ll just have to ask around.

Remember to double check that you aren’t sending random multicast or broadcast traffic with the tcpdump command above!

You can follow the same procedure to join more Internet Exchanges, if more are available in your location, or if you decide to get another server.

Here are some providers I use that sell IX access:

Finally, if you run into trouble, feel free to ask for help in the IPv6 Discord.

Notes

  1. This is usually the case, but due to several high-profile peering disputes, not every transit provider is able to provide the complete Internet routing table. 

Announcing your first routes to the Internet via BGP

2025-10-08 · https://quantum5.ca/2025/10/08/announcing-your-first-routes-to-internet-via-bgp

A while back, I wrote about what I wish I knew when I got my ASN, which has helped many embark upon the epic quest of hobby networking. I’ve also written plenty about the theory of Internet routing. However, what was conspicuously missing was an introductory practical guide to make use of a new ASN and IP space. I’ve decided to rectify this gap now.

For this exercise, I will breathe new life into my original ASN, which I’ve since replaced globally with the new ARIN ASN 54148 for my main network.1 Now, I will set up AS200351 on a new BGP VM2 on AS54148’s infrastructure as a downstream, i.e. AS54148 will be the service provider that connects AS200351 to the Internet.

Prerequisites

If you want to follow along, you must have your own ASN and IP prefix. If not, please go back to the previous post in the series and figure that part out first.

You must also have a BGP-capable upstream. If you don’t, you can get a cheap VPS for $5/month or less. Here are several providers with which I’ve had good experiences, but of course, your mileage may vary:

There are more providers available on bgp.cheap and bgp.services, which are maintained by community members, but I can’t vouch for anything listed therein personally. Please exercise your own judgement before spending your own hard-earned money.

When selecting upstreams, the most important factor is probably latency, especially if you plan to use a tunnel to use your own IPs at home. Find test IPs (or looking glass IPs) for candidate locations from candidate providers, compare the ping times from your location, and prefer the ones with the lowest ping.
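For example, a small shell loop can rank candidates by average round-trip time. The IPs below are placeholders from the documentation ranges; substitute the actual test IPs you collected:

```shell
# Rank candidate test IPs by average RTT (placeholder IPs; replace with the
# providers' real test IPs). Unreachable hosts get a 9999 sentinel so they
# sort to the bottom.
for ip in 203.0.113.1 198.51.100.1; do
    # The awk pattern matches both Linux ("rtt ...") and BSD ("round-trip ...")
    # summary lines; splitting on "/", field 5 is the average.
    avg=$(ping -c 5 -q -W 1 "$ip" 2>/dev/null | awk -F/ '/^(rtt|round-trip)/ {print $5}')
    echo "${avg:-9999} $ip"
done | sort -n
```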

Also, look for diverse connectivity. A general rule of thumb is to look up the test IPs on bgp.tools and see if there are many diverse paths to tier 1 ISPs (as opposed to reaching all other tier 1s through one tier 1), and also look up the ASN there to see which Internet exchanges they are on. Prefer the ones with multiple distinct tier 1 paths and more Internet exchange connectivity, especially ones local to the server location.

Platform

For the BGP VM, I am running the latest Debian stable release at the time of writing, which is Debian 13, codenamed trixie. If you are following along with some kind of VPS or a dedicated server, I would advise using something similar. Installation of Debian is out of scope for this post and is something you will need to arrange with your server provider, perhaps through the control panel.

We will be using the Linux kernel to do any actual packet forwarding, which is plenty for a hobby network, instead of using something more niche like Vector Packet Processing (VPP).

For the routing daemon, we will be using the BIRD 2.x series, which is the most stable and battle-tested version of BIRD at the time of writing. Specifically, we will use version 2.17.1, which comes with trixie, but 2.0.12, which came with bookworm (Debian 12), would work just as well.

We will also be using my BIRD filter library to do some basic route filtering. We’ll save setting up IRR and RPKI filtering for the next time.

Preparing IP space

Before announcing IP space, you need to make sure that IRR and RPKI on the prefixes are set up correctly so that everyone else on the Internet will accept your route.

For this exercise, I will use the following real IPv6 space:

  • 2602:fa43:f0::/48 (ARIN); and
  • 2a07:54c1:d351::/48 (RIPE).

This will allow me to show how to configure IP space on both ARIN and RIPE.

IRR

Setting up IRR entries for the IP space depends on whether the IP space has been reassigned to you or not. If you rented IP space from an LIR and they chose to not delegate the prefix fully to you, then you would need to inform them about which prefix you intend to announce from which ASN.

If you have received space directly from ARIN, you can follow ARIN’s documentation on IRR-online to create a route6 object, e.g. for 2602:fa43:f0::/48 and AS200351. The end result should look something like:

$ whois -h rr.arin.net 2602:fa43:f0::/48
route6:         2602:fa43:f0::/48
origin:         AS200351
descr:          DQN Test Network
admin-c:        DQNA-ARIN
tech-c:         DQNOC-ARIN
mnt-by:         MNT-GC-1348
created:        2025-10-05T21:08:36Z
last-modified:  2025-10-05T21:08:36Z
source:         ARIN

If you have received space directly from RIPE, or your LIR has granted you mnt-by or mnt-routes access to the prefix, you can just log into the RIPE database and go to this page (RIPE database → Create an object → route6) and enter the IPv6 prefix and ASN (with the AS prefix), then press “submit”. You should get something like:

$ whois -h whois.ripe.net 2a07:54c1:d351::/48AS200351
...
route6:         2a07:54c1:d351::/48
origin:         AS200351
mnt-by:         QUANTUM-MNT
created:        2025-10-06T03:58:36Z
last-modified:  2025-10-06T03:58:36Z
source:         RIPE
...

RPKI

Setting up RPKI Route Origin Authorizations depends strongly on whether you received the space directly from an RIR or from an LIR reselling it to you.

If you have received IP space directly from an RIR, consult the RIR’s documentation for using hosted RPKI, e.g. ARIN’s guide with videos or RIPE’s documentation. Delegated RPKI is also possible, but not something I would recommend for a beginner. If you insist, look at the Krill documentation.

Basically, what you are trying to do in the UI is to create a Route Origin Authorization (ROA) that permits your ASN to announce your desired prefixes. In my case, I would need an ROA that authorizes AS200351 to announce 2602:fa43:f0::/48, and another for the same ASN and 2a07:54c1:d351::/48.

If you received IP space from someone else, inform the reseller about which prefix you plan to announce from which ASN, and they will take care of it for you.

IPv4

Since IPv4 space doesn’t come cheap and I don’t exactly have spare IPv4 prefixes, I will just use the documentation prefix 192.0.2.0/24 as an example. The IRR and RPKI process is basically the same for IPv4, except that instead of creating a route6 object, you’d create a route object.
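For illustration only, the resulting route object would mirror the route6 object shown earlier, something like this (of course, the documentation prefix can’t actually be registered; this just shows the shape of the object):

```console
$ whois -h rr.arin.net 192.0.2.0/24
route:          192.0.2.0/24
origin:         AS200351
descr:          DQN Test Network
mnt-by:         MNT-GC-1348
source:         ARIN
```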

Requesting a BGP session

The other thing to do before we can start announcing IPs to the Internet is to inform the desired upstreams and get them to set up a BGP session.

Since I run AS54148, I don’t need to talk with myself, but this is what your upstream might ask before they set up the BGP session, and to save time, it would be good to do this pre-emptively:

  1. Create an as-set, which I’ve described last time, but to summarize: create the equivalent of AS200351:as-all for your ASN. Mine looks something like:

    $ whois -h rr.arin.net AS200351:as-all
    as-set:         AS200351:AS-ALL
    members:        AS200351
    ...
    source:         ARIN
    ...
    

    At this stage, simply include your own ASN as a member. You may include other ASNs once you are upstreaming other ASes, but that’s left as an exercise for the reader.

  2. Edit your aut-num to include something like this, with AS54148 replaced with your upstream’s ASN and AS200351 replaced with your ASN:
    import:         from AS54148 accept ANY
    mp-import:      afi any.unicast from AS54148 accept ANY
    export:         to AS54148 announce AS200351:as-all
    mp-export:      afi any.unicast to AS54148 announce AS200351:as-all
    

    This declares to the world that you intend for your upstream of choice to be your upstream and that you aren’t an impostor trying to trick someone into providing transit for an ASN that isn’t yours. Repeat this for each upstream ASN.

    Note that on ARIN, you may have to create the aut-num object in the IRR.3 The IRR-online documentation also explains how to do this.

  3. Some upstreams, especially in the Asia–Pacific region, may ask for a letter of authorization (LOA) to announce the IPs. Unless an upstream explicitly requires one, I wouldn’t bother preparing it ahead of time. If you do, you can use a website like loa.tools to generate one or use the following template:

    To whom it may concern,

    This letter serves as authorization for Example ISP (AS64500) to announce the following IP ranges:

    • 192.0.2.0/24
    • 2001:db8::/32

    As a representative of Example LIR, the owner of the aforementioned IP ranges, I hereby declare that I am authorized to represent and sign this letter of authorization. This authorization shall remain in effect until revoked or modified in writing.

    Should you have any questions about this request, please email me at [email protected] or call me at +1 (555) 555-5555.

    Sincerely,

    Signature for John Smith

    John Smith
    CEO of Example LIR

    If your space is leased from another LIR, they would need to be the ones to generate such an LOA, which may prove to be a tedious process on their end. Therefore, don’t ask unless absolutely necessary.

  4. Decide if you want to receive a default route or a full table over BGP. A full table means the complete Internet routing table, containing every single route that’s on the Internet. It is typically only necessary if you have multiple upstreams and would like to use the best upstream for each destination, and it uses a lot more resources. When in doubt, ask for a default route instead.

    Note that some upstreams may insist on giving you a full table anyway, in which case you can filter out the unwanted routes with a filter to save on resources.

Once you’ve done the preparations, send the following information to your upstream, often by ticket, but sometimes they may have special automated forms for you to fill out:

  1. Your ASN;
  2. Your as-set (optional if you don’t want to do downstreams, and some providers may not allow downstreaming anyway);
  3. Whether you want default routes or a BGP full table;
  4. A list of prefixes you wish to announce, since some upstreams don’t generate filters from IRR or RPKI for whatever reason;4
  5. A list of downstream ASNs if downstreaming, since some upstreams want an explicit list instead of letting you upstream arbitrary ASNs for fear of route hijacking;
  6. The IP addresses on your end where you want the session to be established. Typically, this is the IPv4 and IPv6 address allocated to your server, but if your server comes with multiple IP addresses, this requires clarification; and
  7. A BGP password, if your upstream insists on you providing one. This is actually the secret key used for TCP MD5 per RFC 2385 and doesn’t really offer much security, but some upstreams will want you to provide one, while others will generate one for you.

For AS200351, this is what I would tell AS54148, i.e. myself:

  1. ASN: AS200351;
  2. AS-set: AS200351:as-all;
  3. Default routes only;
  4. Prefixes to announce:
    • 192.0.2.0/24,
    • 2602:fa43:f0::/48, and
    • 2a07:54c1:d351::/48;
  5. No downstreams;
  6. IP addresses on my end: 203.0.113.2 and 2001:db8::2; and
  7. No BGP password.

If all goes well, your upstream will tell you the IP addresses for the BGP session on their end. Otherwise, answer any additional questions they might have.

For this exercise, pretend this is what AS54148 told AS200351:

The BGP session is set up. IP addresses on our end are 203.0.113.1 and 2001:db8::1.

Installing BIRD

On Debian, installing BIRD 2.x is as simple as:

sudo apt install bird2

For the basic setup, download my filter_bgp.conf as /etc/bird/filter_bgp.conf, making sure to change MY_ASN to your ASN:

sudo wget -O /etc/bird/filter_bgp.conf https://raw.githubusercontent.com/quantum5/bird-filter/refs/heads/master/filter_bgp.conf
sudo sed -i -e '/define MY_ASN/c\define MY_ASN = 200351;' /etc/bird/filter_bgp.conf

(Or just edit filter_bgp.conf with a normal editor like a sane person.)

Then, replace /etc/bird/bird.conf with the following BIRD configuration template:

log syslog all;

# This is a 32-bit identifier that should be unique among your peers.
# FIXME: Change this to one of your router's IPv4 addresses if you can.
# If you have none, pick something random from 240.0.0.0/4.
# For this exercise, I am using the IPv4 BGP session address.
router id 203.0.113.2;

protocol kernel {
    scan time 60;
    ipv4 {
        # Note: this just avoids exporting the virtual default route in
        # filter_bgp.conf to the kernel.
        export where source != RTS_STATIC || proto != "default_v4";

        # NOTE: this basic export above doesn't make the routes inserted into
        # the kernel prefer your own IPs. Things will work fine with your
        # server's IP assigned by the provider if you have a single upstream
        # but strange things will happen if you have more than one peer.
        # Instead, to use your own IP as the default source IP for outgoing
        # connections on your system, add an IP from your range to an interface
        # on your system (which is explained later), remove the line above, and
        # use the block below, changing 192.0.2.1 to the IP used.
        #
        # export filter {
        #     if source = RTS_STATIC && proto = "default_v4" then reject;
        #     # FIXME change the IP to one of yours that you want to prefer
        #     if source = RTS_BGP then krt_prefsrc = 192.0.2.1;
        #     accept;
        # };
    };
}

protocol kernel {
    scan time 60;
    ipv6 {
        # Note: same as above.
        export where source != RTS_STATIC || proto != "default_v6";

        # NOTE: similar to above, use the following block to change the default
        # source IP for outgoing connections.
        #
        # export filter {
        #     if source = RTS_STATIC && proto = "default_v6" then reject;
        #     # FIXME change the IP to one of yours that you want to prefer
        #     if source = RTS_BGP then krt_prefsrc = 2602:fa43:f0::1;
        #     accept;
        # };
    };
}

protocol device {
    scan time 60;
}

include "filter_bgp.conf";

protocol static node_v4 {
    ipv4;

    # The "reject" here tells bird to drop all traffic to the prefix by default.
    # If you have more specific routes, they will take precedence and be used.
    # So this really just tells bird (and the kernel) to drop traffic for
    # UNUSED portions of your prefix instead of sending it back out to your
    # upstream and creating a routing loop.
    #
    # FIXME: Replace these with your own IP addresses, or delete if there are none.
    route 192.0.2.0/24 reject;
}

protocol static node_v6 {
    ipv6;

    # See above for what "reject" means.
    # FIXME: Replace these with your own IP addresses.
    route 2602:fa43:f0::/48 reject;
    route 2a07:54c1:d351::/48 reject;
}

# And now for our upstreams:
# FIXME: rename the BGP protocol to contain the name of your upstream.
protocol bgp dqn_v4 {
    # FIXME: change this description.
    description "AS54148 Upstream (IPv4)";

    # FIXME: change the IP addresses and the ASNs here.
    local 203.0.113.2 as 200351;
    neighbor 203.0.113.1 as 54148;

    # FIXME: if your upstream wants a BGP password, uncomment this and
    # change the password, otherwise delete:
    # password "hunter2";

    default bgp_local_pref 50;

    ipv4 {
        import keep filtered;

        # Note that import_transit(upstream_asn, true) means accepting
        # default routes. If you want to reject default routes and take a full
        # table, pass false.
        # FIXME: change the ASNs here to your upstream's.
        import where import_transit(54148, true);
        export where export_cone(54148);
    };
}

# FIXME: rename the BGP protocol to contain the name of your upstream.
protocol bgp dqn_v6 {
    # FIXME: change this description.
    description "AS54148 Upstream (IPv6)";

    # FIXME: change the IP addresses and the ASNs here.
    local 2001:db8::2 as 200351;
    neighbor 2001:db8::1 as 54148;

    # FIXME: if your upstream wants a BGP password, uncomment this and
    # change the password, otherwise delete:
    # password "hunter2";

    default bgp_local_pref 50;

    ipv6 {
        import keep filtered;

        # Meaning of true is defined above.
        # FIXME: change the ASN here to your upstream's.
        import where import_transit(54148, true);
        export where export_cone(54148);
    };
}

Note that if you are already running BIRD for other reasons, you will need to merge the configuration. You should have a single instance of protocol device and two instances of protocol kernel, one for IPv4 and the other for IPv6. You may need to merge the export filters for the latter.

As for bgp_local_pref, a higher value means routes received from that BGP session are more preferred. The exact values are specific to each AS, but I would recommend starting with something like this:

  • 50 for upstreams;
  • 90 for IXPs;
  • 100 for direct peers; and
  • 120 for downstreams.

If you have multiple upstreams, you can use 60 for a more preferred upstream as needed, for example.

If your upstream insists on sending you a full table and you don’t want it, you can change the import filter to net.len = 0 && import_transit([asn], true).

If you want to take a full table and reject any default route, say because you have multiple upstreams, you can use import_transit([asn], false). This will insert around a million IPv4 routes and 230k IPv6 routes, so be sure that’s what you want.
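In context, the two variants from the preceding paragraphs would slot into the channel block like this:

```conf
ipv6 {
    import keep filtered;

    # Full table offered, but we only want the default route:
    import where net.len = 0 && import_transit(54148, true);
    # Or: take the full table and reject any default route:
    # import where import_transit(54148, false);

    export where export_cone(54148);
};
```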

Finally, double check that you’ve fixed all the things tagged FIXME in the config.

Once you are done, run sudo birdc configure. If there are any errors, it will tell you, in which case, fix your configuration.
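On success, it replies with something like this (the version line will match your installed BIRD):

```console
$ sudo birdc configure
BIRD 2.17.1 ready.
Reading configuration from /etc/bird/bird.conf
Reconfigured
```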

Otherwise, you should see the BGP session come up with sudo birdc show protocols (or just sudo birdc s p):

$ sudo birdc s p
BIRD 2.17.1 ready.
Name       Proto      Table      State  Since         Info
device1    Device     ---        up     03:42:46.786  
default_v4 Static     master4    up     03:48:38.022  
default_v6 Static     master6    up     03:48:38.022  
node_v4    Static     master4    up     03:48:38.022  
node_v6    Static     master6    up     03:48:38.022  
dqn_v4     BGP        ---        up     03:48:38.172  Established   
dqn_v6     BGP        ---        up     03:49:14.381  Established   
kernel1    Kernel     master4    up     03:42:46.786  
kernel2    Kernel     master6    up     03:42:46.786  

As you can see, the sessions are established!

You can see the details for each protocol with sudo birdc show protocol all [name] (or sudo birdc s p a [name]), which might look like:

$ sudo birdc s p a  dqn_v6
BIRD 2.17.1 ready.
Name       Proto      Table      State  Since         Info
dqn_v6     BGP        ---        up     03:49:14.381  Established   
  Description:    AS54148 Upstream (IPv6)
  BGP state:          Established
    Neighbor address: 2001:db8::1
    Neighbor AS:      54148
    Local AS:         200351
    Neighbor ID:      203.0.113.1
    Local capabilities
      Multiprotocol
        AF announced: ipv6
      Route refresh
      Graceful restart
      4-octet AS numbers
      Enhanced refresh
      Long-lived graceful restart
    Neighbor capabilities
      Multiprotocol
        AF announced: ipv6
      Route refresh
      Graceful restart
      4-octet AS numbers
      Enhanced refresh
      Long-lived graceful restart
    Session:          external AS4
    Source address:   2001:db8::2
    Hold timer:       200.709/240
    Keepalive timer:  22.049/80
    Send hold timer:  321.314/480
  Channel ipv6
    State:          UP
    Table:          master6
    Preference:     100
    Input filter:   (unnamed)
    Output filter:  (unnamed)
    Routes:         1 imported, 0 filtered, 2 exported, 0 preferred
    Route change stats:     received   rejected   filtered    ignored   accepted
      Import updates:              1          0          0          0          1
      Import withdraws:            0          0        ---          0          0
      Export updates:              3          0          1        ---          2
      Export withdraws:            0        ---        ---        ---          0
    BGP Next hop:   2001:db8::2 fe80::1234:5678:9abc:def0

Confirm that the number of routes imported and exported is as expected under Routes:. Here, as expected, we have:

  1. A single imported route: the default route;
  2. Two exported routes, one for each IPv6 prefix.

You can ignore the Route change stats section, as the numbers are not very meaningful.
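Beyond the counters, you can list the routes themselves with birdc show route. With only a default route imported, the output might look roughly like this (illustrative; your timestamps, next hops, and interface names will differ):

```console
$ sudo birdc show route protocol dqn_v6
BIRD 2.17.1 ready.
Table master6:
::/0                 unicast [dqn_v6 03:49:14.381] * (100) [AS54148i]
	via fe80::1234:5678:9abc:def0 on eth0
```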

Common BGP session issues

If your BGP session isn’t showing as established, then something has gone wrong. There are several things to try:

  1. Check if you can ping the BGP neighbour address, as seen in birdc s p a. If you can’t ping the IP, it means the IP might be incorrect and you should verify it with your upstream. If it’s correct, something is terribly wrong with the routing on your system. Fixing that is left as an exercise for the reader, since I don’t know how you got into that state.
  2. Check if the BGP neighbour requires multiple hops to reach. You can run traceroute or mtr on the IP address. If you see more than one hop show up, then you will need to add multihop [number of hops] into the protocol bgp block.
  3. If you have a firewall, ensure TCP port 179 connections are allowed, as it is used for BGP.
  4. Check BIRD logs with journalctl -u bird for any errors.
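On the firewall point: if you use nftables, rules along these lines would permit the session. This is a sketch using the example neighbour addresses from this post; merge it into your existing ruleset rather than using it standalone:

```conf
# nftables fragment: allow inbound BGP (TCP port 179) from the example
# upstream's neighbour addresses. Adjust addresses, table, and chain names
# to match your own ruleset.
table inet filter {
    chain input {
        type filter hook input priority 0;
        ip saddr 203.0.113.1 tcp dport 179 accept
        ip6 saddr 2001:db8::1 tcp dport 179 accept
    }
}
```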

If you can’t figure this out, ask your upstream, or ask for help in the IPv6 Discord.

Seeing your new prefix on the Internet

You can check that your prefix is visible in the global routing table by going to bgp.tools and entering your prefix, then looking at the connectivity tab.

For example, for the ARIN test prefix used here, you can see that the prefix is fully reachable from the Internet.

If you instead see something like this, it means your prefix isn’t visible on the Internet:

bgp.tools showing a route isn't on DFZ

You should probably give it some time, say a day or two. If it’s still not visible, you should:

  1. Validate that you’ve set up IRR and RPKI correctly with IRR explorer. Enter your IP prefix, and it should tell you if anything is wrong;
  2. Attempt to restart the BGP session with your upstream, e.g. sudo birdc restart dqn_v6, and wait another day or two; and
  3. If nothing worked, you probably need to ask your upstream to update their filters.

Start using your own IPs

Now that your IP prefix has been announced to the Internet, it is time to start using it. This is OS-dependent, but I will use Debian as the example with ifupdown2, which supports multiple IP addresses per interface and can reload without taking down the entire interface, making it a lot more suitable than traditional ifupdown for routers. To install ifupdown2, run:

sudo apt install ifupdown2

This will uninstall the old ifupdown. At this point, either reboot the system or manually run sudo ifup on every interface on the system (including lo, the loopback interface), so that ifupdown2 tracks the interface state.

Simple method to use your prefix on one server

With that done, the simplest way to use it on your server is to add the IP to some interface, such as lo, by making this change in /etc/network/interfaces:

...
auto lo
iface lo inet loopback
    # FIXME replace these with some address inside your prefix.
    address 192.0.2.1
    address 2602:fa43:f0::1
    address 2a07:54c1:d351::1
...

The iface lo inet loopback line should already be there and you can just add the address lines afterwards. Once that’s done, you can run sudo ifreload -c. If all goes well, the IPs added should now be pingable from the Internet.

WARNING: Under NO circumstances should you add your own IPs to an interface connected to someone else’s network, such as your upstream. Therefore, do NOT add your own IPs to the uplink interface. Use a bridge, lo, or a dummy interface instead.
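If you’d rather keep lo untouched, ifupdown2 can create a dummy interface for this purpose. A sketch using the example addresses from above:

```conf
auto dummy0
iface dummy0
    link-type dummy
    # FIXME replace these with some address inside your prefix.
    address 192.0.2.1/32
    address 2602:fa43:f0::1/128
```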

Using a bridge

Of course, you probably want to use your IPs on more than a single server. For example, if you want to run some virtual machines or systemd-nspawn containers, you will be attaching VMs or containers to a bridge interface. To do this, you will need to create such a bridge interface and add IPs to it that would serve as the default gateway.

First, install the bridge-utils package:

sudo apt install bridge-utils

For example, to create the bridge br0, add the following stanza to /etc/network/interfaces:

auto br0
iface br0 inet static
    # FIXME replace these with some address inside your prefix.
    address 192.0.2.1/24
    address 2602:fa43:f0::1/64

    # If you want to attach physical interfaces to the bridge, list them instead
    # of `none` here.
    bridge-ports none
    bridge-stp off
    bridge-fd 0

    # This prevents `systemd-nspawn` and common VM interfaces from being kicked
    # off the bridge when running `ifreload -c`.
    bridge-ports-condone-regex ^vnet|^vb-|^br0p0$

    # Note that due to Linux bridge limitations, some interface has to be added
    # to the bridge for it to be considered "up". If you are using
    # `bridge-ports none`, you need to add a dummy interface. Otherwise, the IP
    # addresses on the interface would be considered unreachable.
    # This adds that dummy interface, and the `|| true` ensures it doesn't
    # error out when running `ifreload -c`.
    # You obviously don't need this if you don't use `bridge-ports none`.
    up ip link add name "$IFACE"p0 up master "$IFACE" type dummy || true

Remember to delete the lo addresses if you experimented with that approach earlier.

Note that we use a single /64 of IPv6 for this bridge since that’s all that’s needed for a LAN. You have to announce at least a /48 over BGP, which leaves you with 65535 other /64s to assign to different bridges or other purposes.

For IPv4, you can similarly partition your /24 into smaller blocks. For example, if you only need to add 5 to 13 devices on your network, you can use a /28 instead, using the remainder of the space for other purposes.
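The arithmetic behind these numbers can be checked in any shell:

```shell
# Number of /64s inside a /48: 2^(64-48)
echo $(( 1 << (64 - 48) ))          # 65536
# Addresses in an IPv4 /28, minus network and broadcast:
echo $(( (1 << (32 - 28)) - 2 ))    # 14 usable, i.e. 13 devices plus a gateway
```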

Also note that any devices added to the bridge will need to have static IPs configured. Using DHCP and SLAAC is out of scope and left as an exercise for the reader. Something like dnsmasq or radvd comes in handy, and they come with documentation.
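That said, if you do want SLAAC on the bridge, a minimal radvd configuration might look like the sketch below, assuming the example /64 from the bridge stanza above; consult the radvd documentation before relying on it:

```conf
# /etc/radvd.conf: advertise the bridge's /64 so attached hosts can
# autoconfigure their own addresses via SLAAC.
interface br0
{
    AdvSendAdvert on;
    prefix 2602:fa43:f0::/64
    {
        AdvOnLink on;
        AdvAutonomous on;
    };
};
```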

Now, you can bring this interface up:

sudo ifup br0
sudo ifreload -c

You’ll also need to turn on IP forwarding. Create /etc/sysctl.d/forwarding.conf with the following contents:

net.ipv6.conf.all.forwarding = 1
# Delete the next line if not using IPv4.
net.ipv4.ip_forward = 1

Run sudo systemctl restart systemd-sysctl to make the changes take effect.

Your IP addresses on the bridge, along with anything else added to the bridge, should now be reachable from the Internet.

Tunnelling

You can obviously also tunnel portions of the IP space. For example, you can run OpenVPN in tap mode and add the tap interface to the bridge to virtually hook up any computer to it and thus use your own IPs, or run WireGuard so that you can route subsets of your IP ranges to other locations. The possibilities are endless.

Here’s a quick WireGuard example. Install wireguard-tools on the server and the client:

sudo apt install wireguard-tools

On the server side, generate a key pair:

$ wg genkey
EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M=
$ echo EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M= | wg pubkey
mJv/wYe0PZeOek8ZEsXQ3PAkBzK73kJerikNtDukTW4=

On the client side, generate another key pair:

$ wg genkey
UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY=
$ echo UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY= | wg pubkey
a5MlLRW8tY0NOFhHH8GxUUbytWUGvJiLqGYw5eXn9gI=

Then, on the server, create /etc/wireguard/wg_br0.conf (or whatever you would like to call the interface):

[Interface]
# Change the port to something else if you are already using it for something
# else, including another WireGuard instance.
ListenPort = 51820

# FIXME change these addresses to your own.
# Note WireGuard is an L3 VPN and can't be bridged, so it requires a separate
# network. Hence, we are using a /25 for IPv4 and a separate /64 for IPv6.
Address = 192.0.2.129/25, 2602:fa43:f0:1::1/64

# FIXME generate your own private key and use it.
PrivateKey = EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M=

[Peer]
# FIXME use your own client's public key.
PublicKey = a5MlLRW8tY0NOFhHH8GxUUbytWUGvJiLqGYw5eXn9gI=

# FIXME change this to IPs used by your client.
# This allows the client to use a single IP, which should be from the same IP
# block as that in `Address` above. You can expand this to something larger to
# allow the client to use more IP space. You can also allow entire subnets, such as
# 2602:fa43:f0:2::/64, so that the client has an entire LAN to work with,
# e.g. with a bridge. These don't have to be the same range as `Address`.
AllowedIPs = 192.0.2.130/32, 2602:fa43:f0:1::2/128

# You can repeat [Peer] sections as necessary for more peers.
# Note that WireGuard routes based on AllowedIPs, so make sure they don't
# overlap. Also, ensure that you use different private keys for each client and
# put their public keys in their [Peer] sections.

Now we enable WireGuard:

sudo systemctl enable --now wg-quick@wg_br0

Similar to bridges above, enable forwarding with sysctl.

On the client, create the corresponding configuration file, such as /etc/wireguard/wg_myip.conf:

[Interface]
# These are the IP addresses you want to add on the WireGuard interface,
# as well as anything else that should be routed on the same LAN.
# If you are tunnelling entire separate subnets like 2602:fa43:f0:2::/64, that
# should probably go on its own bridge interface and not here.
# FIXME change this to IPs used by your client.
Address = 192.0.2.130/25, 2602:fa43:f0:1::2/64

# FIXME generate your own private key and use it.
PrivateKey = UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY=

[Peer]
# FIXME use your own server's public key.
PublicKey = mJv/wYe0PZeOek8ZEsXQ3PAkBzK73kJerikNtDukTW4=

# This adds a default route to the server.
# You can delete 0.0.0.0/0 if you aren't using IPv4.
AllowedIPs = 0.0.0.0/0, ::/0

# FIXME replace with an IP+port on your server that's reachable from the client.
Endpoint = 192.0.2.1:51820

Similarly, we enable WireGuard:

sudo systemctl enable --now wg-quick@wg_myip

If all goes well, your equivalent of 2602:fa43:f0:1::2 should be reachable from the Internet.
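To verify the tunnel, wg show reports the last handshake and transfer counters once traffic has flowed. On the client, it might look roughly like this (illustrative output; the client's listening port is chosen at random):

```console
$ sudo wg show wg_myip
interface: wg_myip
  public key: a5MlLRW8tY0NOFhHH8GxUUbytWUGvJiLqGYw5eXn9gI=
  private key: (hidden)
  listening port: 40123

peer: mJv/wYe0PZeOek8ZEsXQ3PAkBzK73kJerikNtDukTW4=
  endpoint: 192.0.2.1:51820
  allowed ips: 0.0.0.0/0, ::/0
  latest handshake: 12 seconds ago
  transfer: 1.05 KiB received, 2.38 KiB sent
```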

Doing anything else is left as an exercise for the reader.

Conclusion

Hopefully, you managed to announce your first prefix to the Internet and hooked up a bunch of other devices to use it. Congratulations! You are now on the Internet as a fully autonomous system with your own IP addresses, instead of using someone else’s IPs.

You can expand upon this to advertise the same prefix in multiple locations for anycasting, or to construct a backbone to support unicast routing to multiple locations on the same prefix.

If you run into trouble, feel free to ask for help in the IPv6 Discord.

Next time, I will cover how to connect to an Internet Exchange by having AS200351 join ONIX.

Notes

  1. For those who don’t know, I switched to AS54148 because it’s a 16-bit ASN and is therefore compatible with classic BGP communities, and it’s not cursed like AS200351 with inter-RIR transfer messiness, which all started because RIPE wanted to levy extra fees upon existing ASNs… 

  2. This is basically equivalent to any BGP-capable VPS providers you might find, except I am running it myself and not charging myself money for it. It’s also not really different software-wise compared to if you had a dedicated server or colocation service. 

  3. In ARIN, IRR is optional and separate from the Whois database. Meanwhile, in RIPE, Whois and IRR are the same database, and the aut-num is the object for the ASN and is created when the ASN is issued. 

  4. That is not to say that creating the IRR entries is useless when using these upstreams. Oftentimes, their upstreams will be generating their filters from IRR, or they might manually confirm that the IRR entries exist before adding you to their filters manually. 

]]>
Quantum
Building a global WireGuard mesh “backbone” network with OSPF2025-09-07T21:40:46-04:002025-09-07T21:40:46-04:00https://quantum5.ca/2025/09/07/building-a-global-wireguard-mesh-backbone-network-with-ospfYears ago, before I tried to build my own autonomous system and run BGP, I had a few servers in different locations,1 and I wanted them to be able to talk to each other over some sort of encrypted VPN, allowing plaintext protocols to be run between locations securely. WireGuard was a good option, but the classic deployment would require all servers to connect to one main server. That was a poor fit when I had multiple servers on each side of the Atlantic: if I put the main server on one side, then two servers on the other side talking to each other would have to cross the Atlantic twice. That simply wouldn’t do.

Instead, I thought to myself: what if I had a bunch of WireGuard tunnels between the different locations I had, and then have something intelligent select an optimal path between nodes? Open Shortest Path First (OSPF) seemed like the perfect option, selecting routes for the lowest total cost (typically latency) using Dijkstra’s algorithm, while routing around any failed links. Better yet, it did so without requiring every location to be connected to every other location. This meant that I could connect nodes on each side of the Atlantic to each other for low latency, while creating redundant trans-Atlantic links. It was also able to route traffic faster than the direct connection at times, when the direct path used horrible routing.2

Years later, when I started building my own BGP network, I ran into the problem of trying to use the same IPv4 /24 in multiple locations.3 While I could announce the same prefix from multiple locations as anycast, if I want to route an IP to a single host, I’d need something with which to route packets to that host, no matter which location it entered my network. Typically, a backbone network is used for this purpose. For large networks, this involves actual optical fibre between locations, but we can make do with tunnels for smaller networks. To keep things simple, I turned my existing WireGuard OSPF mesh into my very own “backbone”.

In this post, we’ll explore how to use WireGuard, how to use OSPF, and how to use them to construct a backbone network and an encrypted VPN connecting distinct sites.

Background

To understand this post, a basic understanding of IP and BGP networking is required. It might help to read the BGP series that I’ve written on this blog, especially the first introductory post.

You should also have a decent understanding of the OSI model, dividing networking protocols into layers of encapsulation. A very quick summary is this:

  • L2 is the data link layer and includes things like Ethernet, which can encapsulate L3 protocols like IP, but could also encapsulate other protocols. The exact protocol is determined by the EtherType, which is 0x0800 for IPv4 and 0x86DD for IPv6.
  • L3 is the network layer, consisting of protocols like IP, which encapsulate L4 protocols like TCP and UDP. The exact protocol is determined by the IP protocol number, e.g. 6 for TCP and 17 for UDP.
  • L4 is the transport layer, consisting of protocols like TCP and UDP, which encapsulate familiar protocols like HTTP over TCP port 80 or HTTPS over TCP port 443.

We’ll also be focused on using Linux software routers, not hardware routers. Examples will be for Debian, not things like Cisco console commands.

Choice of technology

It is important to note that building this mesh/backbone network involves two distinct pieces—the underlying transport and the routing protocol.

For the transport, we can use actual L2 links between locations, which probably involves renting an L2 service, a wave,4 or dark fibre from a vendor, which is very expensive. Alternatively, we can use some sort of tunnel, such as WireGuard, which is free. We’ll talk a bit more about the consequences later.

The routing protocol decides which link is used to reach which destination. This is the piece that manages the routes and selects the optimal one to each destination based on latency. Most routing protocols would do, but popular options for this kind of thing include OSPF and BGP. I chose OSPF because the default behaviour is finding the lowest cost route based on the connections that are available, whereas BGP would require a bit more convincing.

To run routing protocols, we’ll also need a routing daemon. For this exercise, I chose bird, because in addition to OSPF, it can also handle other protocols like BGP. This would prove to be a good choice once I started building my own autonomous system, since the same daemon could be used for all my routing needs. Specifically, I am using the 2.x series, because it’s more battle-tested than the new 3.x series, and it’s more memory efficient, able to take in the full Internet routing table with 1 GB of RAM.

Ultimately, transport and routing are independent pieces and can be swapped out. If you are doing something like this, you can use a different tunnel and still use OSPF as your routing protocol, or use WireGuard and some other routing protocol instead. You can also use a different routing daemon instead, but note that bird’s OSPF implementation is reputed to not be very compatible with other implementations.

Choice of transport

To understand the consequences of our choice of transport, we must first understand what a maximum transmission unit—or as it’s commonly abbreviated, MTU—is. This is effectively the maximum size of an IP packet that can be sent over a link, in bytes.

Traditionally, IP over Ethernet has an MTU of 1500. This is basically the expected MTU on the Internet when you aren’t using tunnels. Sometimes, ISPs use PPPoE, which is effectively a form of tunnelling and has an 8-byte overhead, resulting in an MTU of 1492.5 This may cause packet loss if path MTU discovery is broken to certain destinations, causing TCP to connect but be unable to transmit any data, resulting in a “blackhole” connection. This problem is solved by applying TCP MSS clamping, shrinking the maximum segment size of TCP so that the resulting IP packets fit in the MTU.
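As an illustration, MSS clamping is typically a one-line firewall rule on the router. Here’s a sketch using iptables, where ppp0 is a hypothetical PPPoE uplink interface (adjust for your setup and firewall of choice):

```shell
# Rewrite the MSS in forwarded TCP SYN packets so segments fit the path MTU.
# "ppp0" is a hypothetical PPPoE interface name.
iptables -t mangle -A FORWARD -o ppp0 -p tcp --tcp-flags SYN,RST SYN \
    -j TCPMSS --clamp-mss-to-pmtu
```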

If you are using a real L2 link, it probably will have an MTU of at least 1500, and oftentimes, larger jumbo packets are supported. However, this, along with everything else, is highly dependent on your L2 provider. For the rest of this post, we’ll focus on tunnels, since those are a lot more accessible and standard.

There are two types of tunnels based on which OSI layer they encapsulate:

  • L2 tunnels encapsulate full L2 frames, such as Ethernet. This allows connected devices to appear directly on the LAN and work with things like Ethernet broadcasts, but has more overhead due to the Ethernet header.
  • L3 tunnels encapsulate only the IP packet and have less overhead.

Since we are routing between locations, we don’t actually need L2 access when L3 tunnels will do the job just fine, so we’ll ignore L2 tunnels and focus only on L3 ones. There are several common L3 tunnel options:

  1. Simple IP-over-IP: This is a tunnel using a special IP protocol number to encapsulate another IP packet inside—4 for IPv4 and 41 for IPv6. The MTU overhead is just the size of the IP packet header, which is 20 bytes for IPv4 and 40 bytes for IPv6, resulting in an MTU of 1480 when transporting over IPv4 and 1460 over IPv6 with full Ethernet MTU. This has the smallest possible overhead and yields the largest possible MTU, but there is zero encryption or checksum, and it doesn’t work with most NATs.6 Furthermore, only one tunnel can exist between any pair of IP addresses.

    Traditionally, an IP-over-IP tunnel can only do either IPv4 or IPv6 inside the tunnel, but modern Linux has removed the restriction. For example, assuming you are on 192.0.2.1 and want to create a tunnel named v4transport to 192.0.2.2, a tunnel encapsulating both IPv4 and IPv6 with an IPv4 transport can be created with:

    ip link add name v4transport type sit mode any local 192.0.2.1 remote 192.0.2.2 ttl 255
    

    Similarly, an IPv6-based tunnel named v6transport from 2001:db8::1 to 2001:db8::2 can be created with:

    ip link add name v6transport type ip6tnl mode any local 2001:db8::1 remote 2001:db8::2 ttl 255 encaplimit none
    

    The same command with local and remote flipped would need to be run on the other end of the tunnel.

  2. Generic routing encapsulation (GRE): This is another tunnel with a separate header inside the IP packet, containing an EtherType, allowing non-IP protocols to be encapsulated. It optionally also has:
    • checksums for packet integrity;
    • sequence numbers to prevent out-of-order delivery; and
    • a key, allowing multiple tunnels to run between the same source and destination IP address pair.

    The basic header has 4 bytes of overhead, and enabling checksums, sequence numbers, and a key each requires 4 additional bytes. This results in a variable overhead between 4 and 16 bytes. Since GRE is encapsulated inside IP as protocol number 47, the resulting MTU is between 1464 and 1476 over IPv4 and between 1444 and 1456 over IPv6 with full Ethernet MTU.

    A GRE tunnel could be created on Linux as follows, assuming the same endpoints as the IP-over-IP example before:

    ip link add name v4gre type gre local 192.0.2.1 remote 192.0.2.2 ttl 255
    ip link add name v6gre type ip6gre local 2001:db8::1 remote 2001:db8::2 ttl 255 encaplimit none
    
  3. WireGuard: This is an encrypted tunnel over UDP, supporting only IPv4 and IPv6 inside the tunnel, and usable over both IPv4 and IPv6 transport. Using UDP, it can punch through NAT easily, as long as one end has a public IP to connect to. It naturally has more overhead, including the IP header, the 8-byte UDP header, and 32 bytes of additional protocol overhead, resulting in an MTU of 1440 over IPv4 and 1420 over IPv6. To ensure wide compatibility, no matter the underlying transport, the default MTU is 1420. It also has more CPU overhead due to the cryptography.

    WireGuard uses public key cryptography and allows peers to be defined with their public key. It also has some basic routing support, allowing multiple peers to connect and be routed based on AllowedIPs in the peer configuration. When using a routing protocol like OSPF over it, we have to use it peer-to-peer and allow all possible IPs over the peer.
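The MTU figures quoted above follow directly from the header sizes. A quick shell sanity check, assuming nothing beyond the standard 1500-byte Ethernet MTU:

```shell
# Base Ethernet MTU 1500; IPv4 header 20, IPv6 header 40,
# UDP header 8, WireGuard protocol overhead 32.
echo "IPIP over IPv4:      $((1500 - 20))"           # 1480
echo "GRE over IPv4 (min): $((1500 - 20 - 4))"       # 1476
echo "WireGuard over IPv4: $((1500 - 20 - 8 - 32))"  # 1440
echo "WireGuard over IPv6: $((1500 - 40 - 8 - 32))"  # 1420
```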

There are obviously other tunnelling protocols, such as FOU7, GUE, VXLAN, and OpenVPN, but using those is left as an exercise for the reader.

For this post, we’ll use WireGuard as the example, as that’s what I am using. The actual routing protocol setup should apply to any tunnel type you may choose to use.

Note that WireGuard is only recommended if you need encrypted transport. If you are just building a tunnelled backbone network between locations for data that’s sent over plaintext over the Internet already, IP-over-IP is probably a better bet due to its simplicity and lower overhead. I’ll also show a quick IP-over-IP example.

Example layout

For our example, we will define three routers—A, B, and C—each in a distinct location, but the idea can easily be extended to many more locations. On a bigger scale, it might make sense to write a configuration generator, but that’s left as an exercise for the reader.

We’ll assume the following public IPs for each server:

  • A: 192.0.2.1 and 2001:db8::1
  • B: 192.0.2.2 and 2001:db8::2
  • C: 192.0.2.3 and 2001:db8::3

For the VPN portion, we’ll use the following IP allocations:

  • A: 10.137.0.0/24 and 2001:db8:1000::/48
  • B: 10.137.1.0/24 and 2001:db8:1001::/48
  • C: 10.137.2.0/24 and 2001:db8:1002::/48

For the backbone, we’ll assume that all servers are advertising 198.51.100.0/24 and 2001:db8:2000::/488, but partitioning it as such:

  • A: 198.51.100.0/29 and 2001:db8:2000::/64
  • B: 198.51.100.8/29 and 2001:db8:2000:1::/64
  • C: 198.51.100.16/29 and 2001:db8:2000:2::/64

For tunnels, we’ll use link-local addresses for IPv6. For IPv4, we’ll allocate /31s for point-to-point links per RFC 3021, taken from 203.0.113.0/24.

We’ll connect all three sites with tunnels in this example, but note that this actually isn’t necessary and specifically isn’t required in bigger setups. The only real requirement is that it must be possible to reach all locations via some sequence of tunnels.

Setting up WireGuard tunnels

For this example, we’ll use Debian’s wireguard-tools as an example. It comes with the wg tool for configuring tunnels, and the wg-quick command and the systemd unit [email protected] to configure a bunch of tunnels with an INI-like config file. The process may differ on other distros.

To create a WireGuard tunnel, we first have to generate a private key and the corresponding public key for each end. This can be done with the wg tool:

$ wg genkey 
EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M=
$ echo EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M= | wg pubkey 
mJv/wYe0PZeOek8ZEsXQ3PAkBzK73kJerikNtDukTW4=

We’ll use this private key for the A end of the tunnel between A and B. Obviously, use your own keys instead of copying the ones I generated while writing this post.

Let’s also generate a key pair for the B end of that tunnel:

$ wg genkey 
UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY=
$ echo UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY= | wg pubkey 
a5MlLRW8tY0NOFhHH8GxUUbytWUGvJiLqGYw5eXn9gI=

Since WireGuard runs over UDP, it is important for one side to be listening on a certain IP and port and the other side to connect to that. We’ll let A be the listening end, using port 15000.

Now, on A, we create /etc/wireguard/wg_b.conf with the following contents:

[Interface]
ListenPort = 15000
Address = 203.0.113.0/31, fe80::a:b:1/64
Table = off
PrivateKey = EFr1rNiP5NsYJYp+J1v5v+D4w9VO7HJwDuH/sgyOv1M=
MTU = 1420

[Peer]
PublicKey = a5MlLRW8tY0NOFhHH8GxUUbytWUGvJiLqGYw5eXn9gI=
AllowedIPs = 0.0.0.0/0, ::/0

Note that WireGuard interfaces don’t come with a link-local address, so we make one up. I am using distinct link-local IP addresses for each tunnel for some reason, and I can’t remember if I ran into problems doing fe80::1 and fe80::2 with WireGuard for every tunnel. You can try using fe80::1 and fe80::2 if you feel bold enough, and let me know in the comments if it worked.

Further note that we use Table = off to avoid WireGuard adjusting the routing table based on AllowedIPs, since we intend to run our own routing protocol instead of sending all traffic on the system to the other end.

We can then start the A end of the tunnel with:

sudo systemctl enable --now wg-quick@wg_b.service

If you have a firewall, you will need to open port 15000 to all IPs (or at least, any IP whence B may choose to connect).

On B, we create /etc/wireguard/wg_a.conf with something similar:

[Interface]
Address = 203.0.113.1/31, fe80::a:b:2/64
Table = off
PrivateKey = UEgGif7OyKe59WA9BNciAFDBmT0Jw+M7Wf1HYQCOzlY=
MTU = 1420

[Peer]
PublicKey = mJv/wYe0PZeOek8ZEsXQ3PAkBzK73kJerikNtDukTW4=
AllowedIPs = 0.0.0.0/0, ::/0
Endpoint = 192.0.2.1:15000

Start the B end of the tunnel with:

sudo systemctl enable --now wg-quick@wg_a.service

At this point, A and B should be able to talk to each other. You should be able to ping 203.0.113.1 and fe80::a:b:2%wg_b on A, and 203.0.113.0 and fe80::a:b:1%wg_a on B.

You’ll of course need to repeat this process for any other pair of hosts that you need to connect. In this example, a tunnel is required between B and C, and also A and C, using very similar configurations. Remember that ListenPort has to be different for each WireGuard tunnel on the same host, and the UDP port must not be used by anything else on the system.
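For instance, the A end of the A–C tunnel might look like the following sketch as /etc/wireguard/wg_c.conf on A. The port 15001, the 203.0.113.2/31 pair, and the fe80::a:c:1 link-local address are my assumptions following the same conventions as above; substitute your own keys:

```ini
[Interface]
ListenPort = 15001
Address = 203.0.113.2/31, fe80::a:c:1/64
Table = off
PrivateKey = <A's private key for this tunnel>
MTU = 1420

[Peer]
PublicKey = <C's public key for this tunnel>
AllowedIPs = 0.0.0.0/0, ::/0
```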

Aside: Setting up IP-over-IP tunnels

You can replicate the WireGuard setup with IP-over-IP tunnels. For this example, we’ll use the ifupdown2 package on Debian and set up the tunnels with IPv4 transport, but the concept should be easily generalizable to IPv6 or even GRE.

On A, add the following block to /etc/network/interfaces:

auto sit_b
iface sit_b
    pre-up ip link add name "$IFACE" type sit mode any remote 192.0.2.2 local 192.0.2.1 ttl 255
    post-down ip link delete "$IFACE"
    address 203.0.113.0/31
    address fe80::a:b:1/64
    mtu 1480

Then bring the tunnel up with sudo ifup sit_b.

Similarly, on B, add the following block to /etc/network/interfaces:

auto sit_a
iface sit_a
    pre-up ip link add name "$IFACE" type sit mode any remote 192.0.2.1 local 192.0.2.2 ttl 255
    post-down ip link delete "$IFACE"
    address 203.0.113.1/31
    address fe80::a:b:2/64
    mtu 1480

Then bring the tunnel up with sudo ifup sit_a.

Note that if you have a firewall, you’ll need to allow IP protocols 4 and 41 between the hosts. It’s very important to note that these aren’t port numbers!
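For example, with iptables on A, rules along these lines would admit the tunnel traffic (192.0.2.2 being B’s address from the example; adapt to your firewall of choice):

```shell
# Allow IPv4-in-IP (protocol 4) and IPv6-in-IP (protocol 41) from B.
# Note: -p takes an IP protocol here, not a port.
sudo iptables -A INPUT -s 192.0.2.2 -p 4 -j ACCEPT
sudo iptables -A INPUT -s 192.0.2.2 -p 41 -j ACCEPT
```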

Setting up OSPF

After setting up the underlying transport links between the hosts, it’s time to set up the routing protocol. For this exercise, we are using bird2, so let’s get that installed on each host:

sudo apt install bird2

We then replace /etc/bird/bird.conf with the following block on A:

log syslog all;

# Change this to an IPv4 address on the server. It should ideally be unique.
router id 192.0.2.1;

protocol kernel {
    scan time 60;
    ipv4 {
        export where source = RTS_OSPF;
    };
}

protocol kernel {
    scan time 60;
    ipv6 {
        export where source = RTS_OSPF;
    };
}

protocol device {
    scan time 60;
}

protocol ospf v3 {
    ipv4 {
        import all;
        export none;
    };

    area 0 {
        # Change these to the prefixes you want to run OSPF on.
        stubnet 10.137.0.0/24;
        stubnet 198.51.100.0/29;

        interface "wg_b" {
            type ptp;
            cost 10; # change this based on the actual latency
            hello 5; retransmit 2; wait 10; dead 20;
        };

        interface "wg_c" {
            type ptp;
            cost 50; # change this based on the actual latency
            hello 5; retransmit 2; wait 10; dead 20;
        };
    };
}

protocol ospf v3 {
    ipv6 {
        import all;
        export none;
    };

    area 0 {
        # Change these to the prefixes you want to run OSPF on.
        stubnet 2001:db8:1000::/48;
        stubnet 2001:db8:2000::/64;

        interface "wg_b" {
            type ptp;
            cost 10; # change this based on the actual latency
            hello 5; retransmit 2; wait 10; dead 20;
        };

        interface "wg_c" {
            type ptp;
            cost 50; # change this based on the actual latency
            hello 5; retransmit 2; wait 10; dead 20;
        };
    };
}

For laziness, we are using import all and export none and declaring the prefixes we want in OSPF with stubnet, but you can export routes obtained from other protocols (such as BGP) into OSPF by configuring an appropriate export filter. Doing this is left as an exercise for the reader.

Naturally, if you are only after the VPN, don’t bother with /29s, and if you only want the backbone to split your /24 for BGP, don’t bother with the VPN prefixes.

For the kernel protocol, we are only exporting routes obtained via OSPF. A better export filter is required if you are using bird for other purposes, such as BGP. Also note that routes exported into OSPF on another host will show up with source = RTS_OSPF_EXT1 or RTS_OSPF_EXT2, so be prepared to allow those if you are exporting external routes into OSPF, e.g. with export where source = RTS_OSPF || source = RTS_OSPF_EXT1.

For the cost, it needs to be based on the latency between hosts. I typically use half of the ping9 between the tunnel endpoints in milliseconds. For example, if the ping between A and B is 20 ms, I’d use a cost of 10. This will allow OSPF to find the optimal path based on latency between endpoints. You can periodically recalculate this latency and update the bird configuration to make this more dynamic and reflective of real network conditions.10
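To make the half-the-ping rule concrete, here’s a hypothetical helper (rtt_to_cost is my own name, not part of any tool) that turns a measured round-trip time in milliseconds into an OSPF cost:

```shell
# Hypothetical helper: halve the RTT in milliseconds (rounding .5 up),
# with a floor of 1 since OSPF costs must be positive.
rtt_to_cost() {
    # strip any fractional part, then integer-round half of it
    cost=$(( (${1%.*} + 1) / 2 ))
    if [ "$cost" -lt 1 ]; then cost=1; fi
    echo "$cost"
}

rtt_to_cost 20  # prints 10
```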

Also, hello 5; retransmit 2; wait 10; dead 20; is configuring various timeouts for OSPF. This is a really popular set of numbers, since the default is commonly deemed way too generous and slow at detecting outages. You can naturally also use BFD to detect outages even quicker, but that’s left as an exercise for the reader.
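If you do want BFD, a minimal bird2 sketch might look like the following; the interface pattern and intervals are illustrative, not recommendations:

```
protocol bfd {
    interface "wg_*" {
        min rx interval 100 ms;
        min tx interval 100 ms;
        multiplier 5;
    };
}
```

You would then add bfd yes; inside each OSPF interface block so OSPF uses BFD for neighbour liveness.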

Finally, if you are using OSPF over an unencrypted tunnel, you are advised to turn on authentication so that people can’t just inject random packets into the tunnel by spoofing. This can be done by adding the following snippet into each interface block:

authentication cryptographic;
# Change this to something unique.
# It has to be the same on both ends of the same tunnel.
password "hunter2";

Once bird is configured, reload it by running sudo birdc configure. It should tell you if there are any syntax errors and reload the configuration if it’s valid.

Repeat this process on every router, and you should be able to see the routes to other hosts in ip route. For example, on A, you should see something like:

$ ip -4 route | grep 'proto bird'
10.137.1.0/24 via 203.0.113.1 dev wg_b proto bird metric 32
10.137.2.0/24 via 203.0.113.3 dev wg_c proto bird metric 32
198.51.100.8/29 via 203.0.113.1 dev wg_b proto bird metric 32
198.51.100.16/29 via 203.0.113.3 dev wg_c proto bird metric 32
$ ip -6 route | grep 'proto bird'
2001:db8:1001::/48 via fe80::a:b:2 dev wg_b proto bird metric 32 pref medium
2001:db8:1002::/48 via fe80::a:c:2 dev wg_c proto bird metric 32 pref medium
2001:db8:2000:1::/64 via fe80::a:b:2 dev wg_b proto bird metric 32 pref medium
2001:db8:2000:2::/64 via fe80::a:c:2 dev wg_c proto bird metric 32 pref medium

If you don’t, something has gone terribly wrong. Check sudo journalctl -u bird.service to see if there are any errors.

You can also try debugging with sudo birdc show ospf neighbors and see if any neighbours are found. If not, then OSPF traffic is blocked somehow. If you are using a firewall, remember to allow IP protocol 89 (remember, this is not a port number!). Otherwise, run tcpdump and hope you can figure it out.

If neighbours are found, then check out the topology by running sudo birdc show ospf state ospf1 and sudo birdc show ospf state ospf2 to see if it’s seeing the networks you expect. If not, something has gone wrong with defining the networks.

Finally, double check birdc show route protocol ospf1 and birdc show route protocol ospf2 to see if the routes made their way to the internal bird routing table. If so, then the problem has to do with exporting the routes to the kernel. Otherwise, something is wrong with importing routes from the OSPF protocol, and you may want to double check your import filter.

Turn on forwarding

At this point, you may find yourself unable to reach endpoints on the other networks, even though the routes exist. This is because you need to turn on IP forwarding. You’ll need to configure the following sysctls:

  • net.ipv4.ip_forward=1 and
  • net.ipv6.conf.all.forwarding=1.

On Debian, this can be done by uncommenting these lines in /etc/sysctl.conf and then running sudo sysctl -p. At this point, you should be able to ping endpoints in the other locations.

If you are using BGP and announcing 198.51.100.0/24 in all locations, you should be able to see that 198.51.100.0/29 goes to A from the entire world. Useful tools for verifying this include ping.sx, ping.pe, and mtr.tools.

Conclusion

At this point, you should have your own mesh network that intelligently routes based on the lowest latency, and this can be used for a multi-site encrypted VPN or a backbone network, depending on your needs.

With OSPF, it can detect outages and route traffic around them. For example, if the link between A and C is down, it can send traffic from A to B to C. Similarly, if the latency from A to C is greater than the sum of the latency from A to B and B to C, then it would take the indirect route for better latency.

Armed with something like this, the possibilities are endless. In my case, I use the encrypted VPN for many things, such as running the MariaDB replication and Galera for my PowerDNS anycast cluster, while cramming unicast IPs for a bunch of different locations onto a single /24. I hope you found this post useful in building your own network.

Notes

  1. This was back when I was a student, so my main concern was how cheap the servers were, not whether they were close by. 

  2. And this was the moment I learned how ISPs, especially really cheap hosting ones, have terrible “scenic” routing instead of short and direct ones, which eventually started my journey towards playing around with my own BGP network. For more details about route selection, see my post on the topic.

  3. Remember that /24 is the minimum announcement size for IPv4, but due to IPv4 exhaustion, it is not in plentiful supply. So instead of getting a /24 for each location, you’d often want to fit all your locations into a single /24 to save money if they don’t actually need that many addresses. 

  4. By “wave”, we typically mean renting a specific wavelength on an existing optical fibre owned by someone else. 

  5. To work around this MTU issue, the ISP could bump the MTU on the underlying network to 1508, resulting in an MTU of 1500 inside the PPPoE connection. My home ISP, Bell Canada, does this. This is called “baby jumbo”, because “jumbo frames” is used to refer to any Ethernet frame encapsulating an IP packet size larger than 1500, but typically, jumbo frames are closer to 9000 bytes and not 1508. 

  6. Network address translation, or NAT, is a hack to deal with IPv4 address exhaustion, breaking end-to-end connectivity in the process. Unless it’s a one-to-one mapping used by certain cloud providers to allow the same IP address to be pointed to different hosts, it will break a lot of tunnelling protocols, such as IP-over-IP or GRE. On residential routers, the DMZ host option may or may not work for the tunnel. Note that with NAT, the local IP address for the tunnel should be the private address inside the NAT. 

  7. Note that FOU can’t encapsulate both IPv4 and IPv6 packets over the same tunnel, at least on Linux. For some reason, the Linux implementation makes it an encapsulated variant of IP-over-IP and an IP protocol number is required when creating FOU, and only one such number is allowed, and it’s unable to just read the first four bits of the IP header to see if it’s a 4 or 6. Therefore, it will not work with OSPFv3 for IPv4, since for that, OSPF communications happen over IPv6. You can try separate tunnels and use OSPFv2 for the IPv4 part, but that’s not worth it. If you need UDP encapsulation and don’t want encryption, GUE is the better bet. If you want encryption, definitely go with WireGuard. 

  8. Note that /48 is the minimum announcement size of IPv6. I can’t really think of a very good reason to do this instead of advertising separate /48s from each site, since IPv6 is cheap. 

  9. I use half the ping since ping shows the round-trip time. Using half the ping makes it so that the total cost of the path is the time it takes to get the packet one-way to the destination, assuming the route is symmetric. 

  10. I have a script that does this every hour. Since my network is officially called “Dynamic Quantum Networks,” it’d be awkward if it’s not at least a bit dynamic. 

]]>
Quantum
Enabling highly available global anycast DNS modifications with Galera2025-08-10T01:17:37-04:002025-08-10T01:17:37-04:00https://quantum5.ca/2025/08/10/enabling-highly-available-global-anycast-dns-modifications-with-galeraLast time, we set up a global anycast PowerDNS cluster with geolocation and health checks. This enables it to provide high availability for other services, always sending the user to a node that’s up.

However, there was one flaw—there is a single master node, which is the single point of failure for writes. When it goes down, it becomes impossible to make changes to the DNS, even though the “slave”1 nodes in the anycast will continue to serve the zone, including performing availability checks.

Last time, I also hinted at MariaDB Galera clusters becoming important. This time, we’ll leverage Galera, a multi-master solution for MariaDB, to eliminate the dependency on the single master. We’ll also make poweradmin highly available so that we can edit DNS records through a nice UI, instead of running API requests through curl or executing SQL queries directly.

Without further ado, let’s dive in.

Table of contents

  1. Motivation
  2. What is Galera?
  3. Choosing a topology
  4. Converting to a Galera cluster
  5. Making the slave nodes select an available master
  6. Deploying additional poweradmin instances
  7. High availability for poweradmin
  8. Conclusion

Motivation

The master PowerDNS node going down wasn’t just a theoretical problem. I was inspired to eliminate that single point of failure because of something that actually happened a few weeks ago.

As it happened, the datacentre my master node was in suffered a failure with their cooling system, resulting in the following intake temperature graph for my server:

Temperature graph showing that the intake temperature hit 53 °C

I can’t imagine it was very comfortable inside the datacentre at 53 °C, and a lot of networking equipment inside agreed, shutting down to prevent damage. This brought my server down, including the PowerDNS master running in a VM. After this incident, I decided to make the master node highly available.

What is Galera?

Galera is a multi-master solution for MariaDB. Every node in a Galera cluster can be written to, and the data has to be replicated to a majority of nodes in the cluster before the write is deemed successful. Galera requires at least 3 nodes, with an odd number of nodes being preferred to avoid split-brain situations, which can happen after a network partition and prevent the cluster from figuring out which side of the partition has the majority and can continue serving writes. A lightweight node can be run with Galera Arbitrator (garbd), which participates in consensus decisions but doesn’t have a copy of the database. Therefore, the minimum Galera cluster is two database servers and one garbd.

Also note that only the InnoDB storage engine is supported with Galera. It’s the default engine, and there’s very little reason to use any other engine with MariaDB unless you are doing something really special.

It is possible to mix Galera and standard replication, of which we made heavy use last time, such as by having replication slaves replicate from Galera cluster nodes.

Choosing a topology

My first idea was to use a Galera cluster involving every node, replacing standard replication. This resulted in a Galera cluster with 10+ nodes spread around the globe. Due to network reliability issues with transcontinental links, the cluster regularly disconnected and had to resync, with transactions sometimes taking over a minute to commit.

That was definitely a bad idea, but it proved that using standard replication is the right choice for cross-continental replication.

Therefore, I instead chose to deploy three Galera nodes, one in Kansas City, one in Toronto, and another in Montréal. At these distances, Galera performs perfectly fine for occasional DNS updates. The result is something like this, with red indicating the changes from last time:

Diagram showing the data flowing from PowerDNS master → interconnected Galera masters → load balancers → MariaDB slaves → PowerDNS slaves

Note that with three nodes, the cluster can only tolerate a single node failing. It would no longer allow any writes when two nodes fail, as the remaining node would no longer be part of a majority to establish consensus. So naturally, you want your Galera nodes to be in separate locations to avoid a single event, such as a natural disaster or a regional power outage, disrupting multiple nodes simultaneously.

The slave nodes will continue to use standard replication, replicating from any available master. This is not possible with the default MariaDB setup, but I found a solution for it anyway, which we’ll discuss later.

Converting the master node to a Galera cluster

We assume that we spin up two additional masters, at 192.0.2.2 and 192.0.2.3, respectively. Remember, you need at least three nodes in a Galera cluster, at least two of which must be MariaDB. We will use three MariaDB nodes. Using garbd is left as an exercise for the reader.

Converting the master node to Galera was surprisingly simple. All I had to do was insert a bunch of lines into /etc/mysql/mariadb.conf.d/71-galera.conf, which should be created on every Galera node:

[mariadbd]
# Turns on Galera
wsrep_on                 = 1
# You may need to change the .so path based on your operating system
wsrep_provider           = /usr/lib/libgalera_smm.so
# You might want to change this cluster name
wsrep_cluster_name       = example-cluster
wsrep_slave_threads      = 8
# List out all the nodes in the cluster, including the current one.
# Galera is smart enough to filter that one out, so you can just keep this file
# in sync on all masters.
wsrep_cluster_address    = gcomm://192.0.2.1,192.0.2.2,192.0.2.3
wsrep_sst_method         = mariabackup
wsrep_sst_auth           = mysql:
# Some tuning to tolerate higher latencies, since we aren't replicating in the same datacentre
wsrep_provider_options   = 'gcs.max_packet_size=1048576; evs.send_window=512; evs.user_send_window=512; gcs.fc_limit=40; gcs.fc_factor=0.8'

# Required MariaDB settings
binlog_format            = row
default_storage_engine   = InnoDB
innodb_autoinc_lock_mode = 2

# Galera replication
wsrep_gtid_mode          = ON
wsrep_gtid_domain_id     = 0
log_slave_updates        = ON
log_bin                  = powerdns
# You want the same server_id on all Galera nodes, as they effectively function
# as one for the purposes of standard replication.
server_id                = 1

# Enable writes
read_only                = 0

Most of /etc/mysql/mariadb.conf.d/72-master.cnf has been made redundant. Only bind_address should be kept. You should define bind_address on all Galera nodes so that they can be replicated from.

For good measure, you should also give each server (master or slave, Galera or not) a unique gtid_domain_id in either 72-master.cnf or 72-slave.cnf, distinct from wsrep_gtid_domain_id and from each other, though this is not strictly necessary if all writes happen through Galera.
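For illustration, the assignments might look like this (the numbers are arbitrary examples; all that matters is that every value is unique across the fleet and differs from wsrep_gtid_domain_id):

```
# on a Galera master node
gtid_domain_id = 101

# on a slave node
gtid_domain_id = 110
```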

Note that for wsrep_sst_method = mariabackup to function, mariadb-backup needs to be able to authenticate when executed by the mariadbd process running under the mysql user. Instead of using a password, we can use unix_socket authentication, which requires the 'mysql'@'localhost' user to be created and granted certain permissions. Run the following commands in sudo mariadb on your existing master to create the user:

CREATE USER 'mysql'@'localhost' IDENTIFIED VIA unix_socket;
GRANT RELOAD, PROCESS, LOCK TABLES, BINLOG MONITOR ON *.* TO 'mysql'@'localhost';

If you are using a firewall, you’ll need to allow TCP ports 4567 (for regular Galera communications), 4568 (for incremental state transfers), and 4444 (for regular state transfers) between Galera nodes.2
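As an illustration, assuming nftables and Galera nodes on the 192.0.2.0/24 VPN used throughout this series, the corresponding rules might look like this:

```
# inside your inet filter input chain
ip saddr 192.0.2.0/24 tcp dport { 4444, 4567, 4568 } accept
# only needed if you use multicast
ip saddr 192.0.2.0/24 udp dport 4567 accept
```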

Now, we are ready to restart the MariaDB master in Galera mode:

sudo systemctl stop mariadb.service
sudo galera_new_cluster

On the new masters, simply restarting MariaDB will cause it to join the Galera cluster:

sudo systemctl restart mariadb.service

Note that this will destroy all data currently on those servers, so back those up if you have anything important.

You can run sudo mariadb on any Galera master and run SHOW STATUS LIKE 'wsrep_cluster_%'; to see the status. You should see wsrep_cluster_size=3 (or however many nodes you actually have) and wsrep_cluster_status=Primary. This means the cluster is alive and ready for operation.

You should probably also set up some sort of monitoring for this, such as using mysqld-exporter and Prometheus, or perhaps use the wsrep_notify_cmd script, but doing any of this is left as an exercise for the reader.
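For instance, a Prometheus scrape job covering a mysqld-exporter instance on each master might look something like this (the job name, targets, and the exporter's default port 9104 are assumptions, not something from this setup):

```yaml
scrape_configs:
  - job_name: galera
    static_configs:
      - targets:
          - '192.0.2.1:9104'
          - '192.0.2.2:9104'
          - '192.0.2.3:9104'
```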

If it didn’t work, check sudo journalctl -u mariadb.service for errors.

Making the slave nodes select an available master

We note that CHANGE MASTER TO can only set a single MASTER_HOST, so if we pointed it at any one Galera node, replication would stop if that node died. Since the cluster can still be updated through the other masters, this would cause stale data to be served, which is not ideal.

I first thought about using something like haproxy or the nginx stream proxy, which can connect to a different node when the connection to the upstream fails, and making the slave replicate through that. However, I quickly realized that MariaDB MaxScale is a much better solution. It supports several forms of high availability for MariaDB, such as failing over the standard replication we discussed last time. More importantly, it can also proxy to arbitrary Galera nodes with awareness of Galera state, so it always routes requests to a functional member of the cluster.

While MaxScale is a paid product and each release is licensed under a proprietary licence, the Business Source License (BSL), each release branch reverts to GPLv2 or newer three years after its initial release. This means that at the time of writing, MaxScale 21.06 has reverted to GPL and can be freely used without restrictions, while still receiving patches. Since we don’t really need any newer features, this suits us just fine.

So we need to install MaxScale 21.06 on every slave. You can use MariaDB’s repository setup script, or just run the following commands:

curl -o /etc/apt/keyrings/mariadb-keyring-2019.gpg https://supplychain.mariadb.com/mariadb-keyring-2019.gpg
cat > /etc/apt/sources.list.d/maxscale.sources <<'EOF'
X-Repolib-Name: MaxScale
Types: deb
URIs: https://dlm.mariadb.com/repo/maxscale/21.06/debian
Suites: bookworm
Components: main
Signed-By: /etc/apt/keyrings/mariadb-keyring-2019.gpg
EOF
apt update
apt install -y maxscale

We then replace /etc/maxscale.cnf on every slave:

[maxscale]
threads=1
skip_name_resolve=true

[galera1]
type=server
address=192.0.2.1
port=3306

[galera2]
type=server
address=192.0.2.2
port=3306

[galera3]
type=server
address=192.0.2.3
port=3306

[galera-monitor]
type=monitor
module=galeramon
servers=galera1,galera2,galera3
user=maxscale_monitor
password=hunter2
monitor_interval=2s

[galera-reader]
type=service
router=readconnroute
servers=galera1,galera2,galera3
router_options=synced
user=maxscale
password=hunter2

[galera-replicator]
type=listener
service=galera-reader
address=127.0.0.1
port=3307

This effectively makes MaxScale listen on 127.0.0.1:3307, which, when connected to, will route to any available Galera node.
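To sanity-check the setup once MaxScale is running and the users below have been created, you can ask MaxScale which backends it considers usable, or connect through the listener and see which node answers (this assumes the replication user from last time exists; wsrep_node_name is a standard Galera system variable):

```shell
# List the backend servers and their states as seen by galera-monitor
sudo maxctrl list servers

# Connect through the local listener; the answering node identifies itself
mariadb -h 127.0.0.1 -P 3307 -u replication -p -e 'SELECT @@wsrep_node_name;'
```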

Also note that the communication between MaxScale and MariaDB nodes is unencrypted, so this is only suitable over LAN or VPN. For a direct WAN connection, you should probably set up TLS.

Note that MaxScale requires two users to be created, so let’s create them on the Galera cluster with sudo mariadb:

-- The user maxscale_monitor is used exclusively to monitor the cluster.
-- Usual quirks regarding host restrictions and passwords apply.
CREATE USER 'maxscale_monitor'@'%' IDENTIFIED BY 'hunter2';
GRANT SLAVE MONITOR ON *.* TO 'maxscale_monitor'@'%';

-- The user maxscale is required for MaxScale to read system tables and
-- authenticate clients.
-- Usual quirks regarding host restrictions and passwords apply.
CREATE USER 'maxscale'@'%' IDENTIFIED BY 'hunter2';
GRANT SELECT ON mysql.user TO 'maxscale'@'%';
GRANT SELECT ON mysql.db TO 'maxscale'@'%';
GRANT SELECT ON mysql.tables_priv TO 'maxscale'@'%';
GRANT SELECT ON mysql.columns_priv TO 'maxscale'@'%';
GRANT SELECT ON mysql.procs_priv TO 'maxscale'@'%';
GRANT SELECT ON mysql.proxies_priv TO 'maxscale'@'%';
GRANT SELECT ON mysql.roles_mapping TO 'maxscale'@'%';
GRANT SHOW DATABASES ON *.* TO 'maxscale'@'%';

Now, restart MaxScale on every slave node with sudo systemctl restart maxscale.service. Then, repeat the slave provisioning process, but instead run this CHANGE MASTER command:

CHANGE MASTER TO
   MASTER_HOST="127.0.0.1",
   MASTER_PORT=3307,
   MASTER_USER="replication",
   MASTER_PASSWORD="hunter2",
   MASTER_USE_GTID=slave_pos;

There’s probably a way to avoid rebuilding the entire slave, but given that I have scripts to provision them, I just reprovisioned. Doing it without a full reprovisioning is left as an exercise for the reader.

Either way, once you START SLAVE;, you should check SHOW SLAVE STATUS\G and verify that replication is working. If it is, great!
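If you would rather script that check than eyeball the full output, something along these lines (run on a slave) surfaces the fields that matter; both Slave_IO_Running and Slave_SQL_Running should report Yes:

```shell
sudo mariadb -e 'SHOW SLAVE STATUS\G' \
    | grep -E 'Slave_(IO|SQL)_Running:|Seconds_Behind_Master:|Last_(IO_|SQL_)?Error:'
```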

I am also quite impressed by MaxScale’s memory efficiency, using around 15 MiB of memory at startup, and that goes down to around 2-3 MiB when just holding a connection open. It’s basically a drop in the bucket for this replication setup.

Deploying additional poweradmin instances

Now, having the Galera cluster isn’t super helpful when the master node with the PowerDNS API and poweradmin goes down, as you have no way of modifying DNS records short of running direct queries against the surviving masters. This is why you need to deploy PowerDNS and poweradmin on all masters.

In both cases, the state is stored in the database, so simply duplicating all the configuration files and running them on each master will work. For poweradmin, this means copying /srv/poweradmin/inc/config.inc.php instead of rerunning the installer.

However, there’s a slight problem: poweradmin uses PHP sessions, which by default are stored on disk. This means that when you hit a different node, you are logged out—clearly not ideal.

I implemented a database-powered backend for PHP sessions3, which you can store as /srv/poweradmin/inc/pdo_sessions.php:

<?php
class PDOBackedSession implements SessionHandlerInterface {
    // Declared explicitly, since dynamic properties are deprecated in PHP 8.2.
    private $dsn;
    private $user;
    private $pass;
    private $options;
    private $pdo;

    public function __construct($dsn, $user = null, $pass = null, $options = []) {
        $this->dsn = $dsn;
        $this->user = $user;
        $this->pass = $pass;
        $this->options = array_merge([
            PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
        ], $options);
    }

    public function register() {
        session_set_save_handler($this, true);
    }

    public function open($path, $name) {
        try {
            $this->pdo = new PDO($this->dsn, $this->user, $this->pass, $this->options);
            return true;
        } catch (PDOException $e) {
            return false;
        }
    }

    public function close() {
        $this->pdo = null;
        return true;
    }

    public function read($sid) {
        $stmt = $this->pdo->prepare('SELECT data FROM php_sessions WHERE id = ?');
        $stmt->execute([$sid]);
        $result = $stmt->fetch(PDO::FETCH_ASSOC);
        return $result ? $result['data'] : '';
    }

    public function write($sid, $data) {
        $stmt = $this->pdo->prepare('REPLACE INTO php_sessions VALUES (?, ?, ?)');
        $stmt->execute([$sid, time(), $data]);
        return true;
    }

    public function destroy($sid) {
        $stmt = $this->pdo->prepare('DELETE FROM php_sessions WHERE id = ?');
        $stmt->execute([$sid]);
        return true;
    }

    public function gc($lifetime) {
        $stmt = $this->pdo->prepare('DELETE FROM php_sessions WHERE access < ?');
        $stmt->execute([time() - $lifetime]);
        return true;
    }
}

While this class is intended to be extensible to support non-MySQL/MariaDB databases, I don’t have the energy to implement it. Contributions are welcome. It’s currently available as a gist, so check there for an updated version if one becomes available.

You will need to run sudo mariadb powerdns on a Galera node and create this table:

CREATE TABLE php_sessions (
    id VARCHAR(32) NOT NULL PRIMARY KEY,
    access BIGINT NOT NULL,
    data BLOB NOT NULL
);

Then, you’ll need to patch /srv/poweradmin/index.php:

--- a/index.php
+++ b/index.php
@@ -39,6 +39,9 @@ session_set_cookie_params([
     'httponly' => true,
 ]);

+include __DIR__ . '/inc/pdo_sessions.php';
+include_once __DIR__ . '/inc/config.inc.php';
+(new PDOBackedSession($db_type . ':host=' . $db_host . ';dbname=' . $db_name, $db_user, $db_pass))->register();
 session_start();

 $router = new BasicRouter($_REQUEST);

Now, poweradmin should be storing PHP sessions in the database.
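One caveat: the gc() method of the session handler only runs when PHP triggers session garbage collection, and Debian ships with session.gc_probability = 0, relying instead on a cron job that sweeps only the default file-based session store. To make sure stale rows in php_sessions actually get purged, you may want to re-enable probabilistic GC for the poweradmin php-fpm pool (the values below are illustrative):

```
; Debian disables PHP's probabilistic session GC in favour of a cron job
; that only cleans /var/lib/php/sessions, so gc() above would never run.
php_admin_value[session.gc_probability] = 1
php_admin_value[session.gc_divisor] = 100
```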

High availability for poweradmin

Now we just need a way to always send users to an available instance of poweradmin. We can, of course, just use PowerDNS Lua records for this. Assuming you are running example.com with the PowerDNS cluster, you would need to move poweradmin to poweradmin.example.com.

We’ll assume that the public IPv4 addresses for the master nodes are 198.51.100.1, 198.51.100.2, 198.51.100.3 and the IPv6 addresses are 2001:db8::1, 2001:db8::2, 2001:db8::3.

Then, simply create the following records:

poweradmin.example.com.   300   IN  LUA   A    "ifurlup('https://poweradmin.example.com/index.php?page=login', { {'198.51.100.1', '198.51.100.2', '198.51.100.3'} }, {stringmatch='Please provide a username', selector='pickclosest'})"
poweradmin.example.com.   300   IN  LUA   AAAA "ifurlup('https://poweradmin.example.com/index.php?page=login', { {'2001:db8::1', '2001:db8::2', '2001:db8::3'} }, {stringmatch='Please provide a username', selector='pickclosest'})"

Updating the nginx configuration is left as an exercise for the reader.

Now, poweradmin.example.com should automatically point to the nearest available instance of poweradmin, and each instance of poweradmin connects to the local Galera node. DNS updates are possible even if a single node fails. In fact, you won’t even be logged out of poweradmin!

You can also do something similar for the PowerDNS API if you have scripts that rely on it, but that’s also left as an exercise for the reader.

Conclusion

Last time, we successfully constructed a highly available PowerDNS cluster with availability checks and geolocation to steer users towards nearby available instances of services, enabling high availability. This time, we leveraged it—along with Galera—to deploy a highly available application: poweradmin, which is used to manage this cluster.

I hope this demonstrates the power of this approach to high availability. Similar concepts can be applied to most other web applications, although a new mechanism becomes necessary if applications depend on storing data as files instead of a database.

In either case, I hope you found this post useful. See you next time!

Notes

  1. For various reasons, people have been trying to replace the terms used for the different node types in replication. For simplicity and to reduce confusion, we are going to stick with the traditional terminology since they are the only ones that work in every context with MariaDB.

    The newer terms are inconsistently applied at the time of writing, with the MariaDB documentation freely mixing different terminology, making it hard to understand and harder to know which commands to use. Even worse, the new commands they introduced are also internally inconsistent, leading to confusing situations like the following interaction, which doesn’t happen with the old commands:

    MariaDB [(none)]> SHOW REPLICA STATUS\G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
    ...
    

    Needless to say, this should not be construed as any endorsement of the practice of enslaving humans. 

  2. If you are using multicast, you may also need UDP port 4567. However, setting up multicast is left as an exercise for the reader. 

  3. I hate PHP. The language is super unintuitive, naming conventions are non-existent, and the documentation is horrible. You know something is wrong when you have to rely on corrections in the comments on the docs to successfully code a simple class. 

]]>
Quantum
Building highly available services: global anycast PowerDNS cluster (2025-08-04, https://quantum5.ca/2025/08/04/building-highly-available-services-global-anycast-powerdns-cluster)

As I’ve written about before, this blog has multiple geographically distributed backend servers serving the content, with my anycast PowerDNS cluster selecting the geographically closest backend server that’s up and returning it to the user.

Due to various outages I’ve experienced recently, I’ve been thinking a lot more about making my self-hosted services highly available (HA), staying up even if a few servers go down. This is mostly for the sake of my sanity, so that I could just shrug if a server goes down and wait for the provider to bring it back up, instead of panicking. Of course, the added availability also helps, but it’s probably a bigger concern in the enterprise space than it is for hobbyists. As a bonus, if you have nodes spread out across multiple locations, you can also route the user to the geographically closest one for lower latency and faster response times.

Either way, I thought it was time to start a series about building highly available services. We begin with the most important building block—DNS, which is basically required to make any other service highly available.

The stack I’ve chosen for this is MariaDB and PowerDNS, mostly because these are fairly easy to set up and I already have experience with them. Many alternative tech stacks are probably equally viable, but those are left as an exercise for the reader; the general idea should apply anyway. Note that anycast isn’t strictly required, since you can still follow along and deploy two unicast DNS servers for redundancy.

Without further ado, let’s dive in.

Table of contents

  1. Why make DNS highly available?
  2. Choosing a tech stack
  3. MariaDB replication
  4. Choosing a replication topology
  5. Setting up your first MariaDB node
  6. Setting up PowerDNS
  7. Setting up poweradmin
  8. Setting up replication
  9. Regular unicast DNS
  10. Deploying anycast
  11. Using Lua records to select backends

Why make DNS highly available?

Perhaps the first question you might have is why a DNS server is necessary for HA, when I am using anycast to make the DNS highly available and could theoretically just use anycast for everything else too. There are three main reasons:

  1. You need DNS anyway if you want your domain to resolve, so you need HA for DNS anyway. I suppose you could just use any free or cheap DNS hosting provider with anycast instead of hosting your own if that was the only reason though.

  2. Most services (including HTTP) use TCP. With anycast, multiple devices share the same IP address, with the traffic routed to the “closest” device (as BGP understands it). Notably, this happens on OSI layer 3—the network layer—which is one layer below TCP on the transport layer (layer 4). This means that anycast doesn’t understand anything about TCP connections. Routers will happily send traffic to whichever device they think is the closest at any given time, even if you have a TCP connection with some other device. The new device will get very confused and send a “connection reset” packet back, forcing the connection to be re-established. While this may not matter so much for HTTP due to the very short-lived connections, for other services, it would prove disastrous.

  3. BGP’s idea of “closest” may be very wrong. As mentioned in the post on anycast, BGP prefers the shortest AS path instead of the shortest round-trip time, and ASes prefer routes from their own customers over routes from peers. As such, it’s quite easy for it to get into pathological situations where it chooses to route traffic to another continent. I have to regularly run traces on my anycast from around the world to fix the pathological routing, which is very annoying. While IP geolocation is not an exact science either1, it’s less likely to have such pathological behaviour2.

Choosing a tech stack

For the DNS server, I chose PowerDNS since it’s popular, in the Debian repository3, and extensible with Lua, enabling complex logic to generate DNS records, such as based on server availability and geolocation. It also supports letting the database take care of the replication, instead of building something bespoke with AXFRs, so I can reuse the underlying database replication mechanism for other services, which will become important later.

Other contenders included:

  1. bind9, which is immediately ruled out because it can’t do health checks. It also has a very cursed way of doing other things. For example, instead of automatically finding which server is the closest based on geographical distance, you are supposed to use ACLs to match against individual countries, and then use different views serving different zone files based on that. That just doesn’t seem very ergonomic… The main way of defining zones is also through plain text zone files, which are sort of a pain to work with too, especially with automation.
  2. gdnsd, which can do the availability and geolocation thing very well, but only those. I used to use it, and then the built-in HTTP health check turned out to be burning a ton of CPU cycles due to a bug, so I had to replace it with a curl script… It also relies on cursed bind-style zone files to define records, which is annoying, and it doesn’t support DNSSEC4 either. For replication, it requires copying the zone files around.
  3. CoreDNS, which is a DNS server written in Go that makes everything a plugin. On the surface, there is a geoip plugin, but there’s no way to use the data it generates short of writing my own plugins in Go. There is another external plugin, gslb, which has availability checks and geolocation, but requires manually declaring locations instead of finding the nearest one automatically. I’d also need to deploy my own Go application, since it’s not packaged.

For the database storing zone data in PowerDNS, I chose to use MariaDB because I am familiar with it, it’s easy to work with, it has really nice built-in replication, and there’s the Galera cluster extension that supports building highly available multi-master clusters. We will not be using Galera in this post, but the next one in the series will use it to address some of the shortcomings in this setup.

I know the PostgreSQL crowd is going to inevitably say that PostgreSQL is the only correct choice for a database (just like how the ZFS crowd insists that I am doing it wrong with LVM and btrfs), but I have PostgreSQL deployed for applications that insist on it, and it always seemed rather painful to deal with for various reasons:

  1. The need to deploy a connection pooler for good performance;
  2. The pain of major version upgrades, since physical replication doesn’t work across major versions, and logical replication doesn’t replicate any schema changes, so I can’t just use that; and
  3. The lack of a simple, built-in multi-master solution like Galera. There are various third-party solutions that handle failover, but they are all quite complex compared to Galera, which has first-party support and just works out of the box.

Furthermore, none of the advanced PostgreSQL features matter for this use case. When replicating to 512 MiB VMs5 on the other side of the world, MariaDB gets the job done and is just a lot easier to work with.

MariaDB replication

Before we go any further, we should understand how MariaDB replication works.

The form of replication we are interested in is called standard replication, which is the replication from a master (also called primary or source) node to a “slave” (also called replica, secondary, or standby) node.6

With standard replication, only master nodes can handle write requests, since any writes to a slave would not be replicated to any other node. Slave nodes can only handle read requests as a result. On the other hand, if a slave is disconnected from the master, it can and will serve reads with potentially stale data while it keeps trying to reconnect.

Furthermore, any given table can only be replicated from a single master, even with multi-source replication. This means that the master node is a single point of failure by default. A failover solution can promote a slave to master and reconfigure all other slaves to replicate from it to ensure availability, but this is quite annoying to do in practice. There are products, such as MariaDB MaxScale, that handle failovers by reconfiguring replication.

Choosing a replication topology

I have 10+ anycast nodes around the world. All these nodes will need read-only access to the PowerDNS database. For simplicity’s sake, I decided to use standard replication, running a replication slave on each PowerDNS node, since that’s the most reliable setup and tolerates network disconnection events.

I first started out with a single master node, figuring it was enough because the anycast nodes would still stay up even if the master node is down. While this is true, it does render me completely unable to update the DNS while the master node is down, which is annoying. I’ll cover how to eliminate this single point of failure next time.

This is what the topology looks like, with the arrow showing the flow of data when a change is made on the master node:

Diagram showing the data flowing from PowerDNS master → MariaDB master → MariaDB slaves → PowerDNS slaves → Internet → user

Note that if you don’t intend to use anycast, you can just start with one master node for your DNS without any slaves. The next post in the series will discuss how to create multiple masters, at which point you can just use the master nodes as your nameservers.

Setting up your first MariaDB node

To start, we need to deploy the master node for the database. We’ll use Debian as an example here, but the procedure is similar on other distros.

First, we set up MariaDB’s repositories. You can follow the instructions on the MariaDB documentation, or use the packages provided by your distro.

Alternatively, just run the following script as root to install MariaDB 11.4, which is the latest version when I started deploying this cluster (the latest version at the time of writing is 11.8):

apt install apt-transport-https curl
mkdir -p /etc/apt/keyrings
curl -o /etc/apt/keyrings/mariadb-keyring.pgp 'https://mariadb.org/mariadb_release_signing_key.pgp'
cat > /etc/apt/sources.list.d/mariadb.sources <<'EOF'
# MariaDB 11.4 repository list
# https://mariadb.org/download/
X-Repolib-Name: MariaDB
Types: deb
URIs: https://deb.mariadb.org/11.4/debian
Suites: bookworm
Components: main
Signed-By: /etc/apt/keyrings/mariadb-keyring.pgp
EOF
apt update
apt install -y mariadb-server mariadb-backup

Note that mariadb-backup will become very important later.

The default settings are insecure, so you should run sudo mariadb-secure-installation. You are highly advised to use unix_socket authentication for root, disable anonymous users, disallow remote root logins, and remove the test database.

Now, your master node is ready.

Setting up PowerDNS

For PowerDNS, we will be using the generic MySQL/MariaDB backend (gmysql). Since this is a generic backend that supports various different schemas, PowerDNS will not create a database or any tables for you. Instead, you’ll need to do this yourself.

Run sudo mariadb to enter the MariaDB shell, then run the following SQL queries:

CREATE DATABASE powerdns;
CREATE USER 'powerdns-reader'@'%' IDENTIFIED BY 'hunter2';
CREATE USER 'powerdns-writer'@'%' IDENTIFIED BY 'hunter2';
GRANT SELECT ON powerdns.* TO 'powerdns-reader'@'%';
GRANT SELECT, INSERT, UPDATE, DELETE ON powerdns.* TO 'powerdns-writer'@'%';

You can replace @'%' to restrict which hostname or IP address can connect. For example, if you only expect PowerDNS to connect from localhost, you could restrict it to @'localhost'. Since I am running this exclusively on a VPN, it doesn’t really matter. Also, for obvious reasons, do not use hunter2 as the password, and use a different password for each user.

Next, you want to create the schema. Run USE powerdns to switch to the newly created database, and paste in the schema from the PowerDNS documentation.

Then, you want to install PowerDNS Authoritative Server:

sudo apt install pdns-server pdns-backend-mysql pdns-backend-lua2 pdns-backend-geoip pdns-backend-bind-

This command will install the MySQL/MariaDB backend, the GeoIP and Lua plugins, and avoid installing the useless bind zone file backend that we aren’t going to use.

We’ll create the following config file /etc/powerdns/pdns.d/common.conf for configuration shared between writable nodes and read-only nodes in the cluster:

# MariaDB
launch += gmysql
gmysql-dbname = powerdns
gmysql-user = powerdns-reader
gmysql-password = hunter2

# Lua records
enable-lua-records = yes

# Enable geolocation capability
launch += geoip
geoip-database-files = /var/lib/GeoIP/GeoLite2-City.mmdb
edns-subnet-processing = yes

You can obtain the GeoLite2-City.mmdb file through various means, such as the geoipupdate tool from MaxMind, which you can install on Debian with sudo apt install geoipupdate if the contrib repository is enabled.
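For reference, geoipupdate reads /etc/GeoIP.conf; a minimal configuration looks something like this, where the account ID and licence key are placeholders for the credentials from your (free) MaxMind account:

```
AccountID 123456
LicenseKey 0123456789abcdef
EditionIDs GeoLite2-City
```

Debian’s geoipupdate writes to /var/lib/GeoIP by default, matching the geoip-database-files path above. You’ll probably want a cron job or systemd timer to keep the database fresh.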

We then create the following config file /etc/powerdns/pdns.d/master.conf specifically for the master nodes:

gmysql-user = powerdns-writer
gmysql-password = hunter2

# Enable API
api = yes
api-key = hunter2
# Remember to change the API key!

Now, we can restart PowerDNS with sudo systemctl restart pdns.service, and it should be querying from MariaDB. There currently is nothing to serve, but we’ll rectify that soon enough.

Setting up poweradmin

Poweradmin is the most popular web frontend for PowerDNS, and the only one that’s actively being maintained at the time of writing. It’s somewhat unfortunate that it’s written in PHP, but oh well.

There are many ways to deploy PHP applications. Since I use nginx everywhere, I decided to deploy it with nginx + php-fpm. Basically, we’ll deploy poweradmin’s code into /srv/poweradmin and set up a php-fpm pool for it, and run poweradmin’s installer.

First, we install nginx. There are many ways of doing it, but here’s one way:

sudo apt -y install curl gnupg2 ca-certificates lsb-release debian-archive-keyring
curl https://nginx.org/keys/nginx_signing.key | gpg --dearmor | sudo tee /usr/share/keyrings/nginx-archive-keyring.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/nginx-archive-keyring.gpg] http://nginx.org/packages/debian $(lsb_release -cs) nginx" | sudo tee /etc/apt/sources.list.d/nginx.list > /dev/null
sudo apt update
sudo apt -y install nginx

We then install php-fpm and the required PHP extensions:

apt install -y php-fpm php-intl php-mysql php-mcrypt

Now, we create a new php-fpm pool specifically for poweradmin to encourage isolation in /etc/php/8.2/fpm/pool.d/poweradmin.conf with the following contents:

[poweradmin]
user = poweradmin
group = poweradmin
listen = /run/php/poweradmin.sock
listen.owner = poweradmin
listen.group = nginx
listen.mode = 660
pm = dynamic
pm.max_children = 3
pm.start_servers = 1
pm.min_spare_servers = 1
pm.max_spare_servers = 2
pm.max_requests = 1000
env[PATH] = /usr/local/bin:/usr/bin:/bin
catch_workers_output = yes
php_admin_flag[log_errors] = on

You may need to change the PHP version in the path to whatever you have installed. Debian bookworm uses 8.2. You may also need to change listen.group to whatever group nginx is running under, which is either nginx or www-data, depending on whose package you installed.

You will naturally need to create the poweradmin user and group for php-fpm:

adduser --system --group poweradmin

You’ll also need to create a poweradmin user in MariaDB by running this in sudo mariadb:

CREATE USER 'poweradmin'@'%' IDENTIFIED BY 'hunter2';
GRANT ALL ON powerdns.* TO 'poweradmin'@'%';

Finally, we are going to install poweradmin 3.9.5, which is the latest version at the time of writing:

mkdir -p /srv/poweradmin
cd /srv/poweradmin
curl -L https://github.com/poweradmin/poweradmin/archive/refs/tags/v3.9.5.tar.gz | tar -xz --strip-components=1

We now configure nginx to serve poweradmin by passing it to php-fpm. Create /etc/nginx/conf.d/poweradmin.conf:

server {
    listen       80;
    listen       [::]:80;
    server_name  poweradmin.example.net;

    root /srv/poweradmin;

    location / {
        try_files $uri $uri/ /index.php?$query_string;
        index index.php;
    }

    location /inc/ { return 404; }

    location ~ [^/]\.php(/|$) {
        fastcgi_pass unix:/run/php/poweradmin.sock;
        fastcgi_index index.php;
        fastcgi_split_path_info ^(.+?\.php)(/.*)$;

        if (!-f $document_root$fastcgi_script_name) {
            return 404;
        }

        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include fastcgi_params;
    }
}

Setting up HTTPS is left as an exercise for the reader, but it is highly recommended, especially if you are allowing access outside of your LAN without a VPN.

Reload your configuration with sudo systemctl reload nginx.service php8.2-fpm.service (you may need to systemctl start nginx.service if it’s not already running), and you should now be able to install poweradmin at https://poweradmin.example.net/install/.

Simply follow the instructions there, installing into the same database with the specifically created poweradmin database user, and then once you are done, delete the installer:

rm -r /srv/poweradmin/install

You should now be able to go to https://poweradmin.example.net, log in, and create your zones.

Note that if you plan to use DNSSEC, you should edit /srv/poweradmin/inc/config.inc.php and append the following snippet to allow poweradmin to invoke the PowerDNS API to rectify the zone:

$pdnssec_use = true;
$pdns_api_url = 'http://localhost:8081';
$pdns_api_key = 'hunter2'; // change to whatever you set api-key to in PowerDNS

Once you have created your zone, you should be able to query your local PowerDNS instance with dig. For example, if you created an A record for example.com, you should be able to see it with dig A example.com @localhost on the master node.

Setting up replication

Now for the more interesting part—setting up MariaDB replication. We’ll assume all servers are on the VPN 192.0.2.0/24, with the master node at 192.0.2.1 and an example slave node at 192.0.2.10. Setting up this VPN is out of scope for this post and will be left as an exercise for the reader.

Additionally, we assume they have public IPv4 addresses 198.51.100.1 and 198.51.100.10, respectively, and IPv6 addresses 2001:db8::1 and 2001:db8::10, respectively. Note that by default, MariaDB replication is unencrypted, so you should run it on a LAN or VPN. There are ways to encrypt the traffic, and setting those up when replicating over the public Internet is left as an exercise for the reader.

First, install MariaDB on every slave node.

On every node, create /etc/mysql/mariadb.conf.d/70-powerdns.cnf (we avoid modifying existing configuration files to avoid conflicts when upgrading):

[mariadbd]
skip_name_resolve = 1
performance_schema = 1
read_only = 1

Then, on the master node, create /etc/mysql/mariadb.conf.d/72-master.cnf (note the .cnf extension—MariaDB only reads *.cnf files from this directory):

[mariadbd]
server_id = 1
bind_address = 127.0.0.1,192.0.2.1

log_bin = powerdns
binlog_format = mixed
read_only = 0

On each slave node, create /etc/mysql/mariadb.conf.d/72-slave.cnf:

[mariadbd]
server_id = 10

Remember to set the server_id to something different on each node! Also, since we are running PowerDNS on the same node as MariaDB, we don’t need to adjust bind_address on the slaves. The default localhost binds are sufficient.
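If you manage several nodes, one purely optional convention (my suggestion, not anything MariaDB mandates) is to derive server_id from the last octet of the node's VPN address, which keeps the IDs unique without any bookkeeping and happens to match the values used in this post:

```shell
# Purely optional convention: derive server_id from the last octet of
# the node's VPN address (192.0.2.1 -> 1, 192.0.2.10 -> 10).
ip=192.0.2.10
echo "server_id = ${ip##*.}"   # prints: server_id = 10
```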

We also need to create a replication user on the master node. Launch sudo mariadb there and run the following queries:

CREATE USER 'replication'@'%' IDENTIFIED BY 'hunter2';
GRANT REPLICATION SLAVE ON *.* TO 'replication'@'%';

Feel free to change the username and password, along with adding any host-based restrictions. Note that if you have a firewall, you need to allow TCP port 3306 on the master from any IPs that the slaves might use to connect.

We then use mariadb-backup to copy the database over to the slaves. The easiest way to use mariadb-backup here is to export as xbstream and pipe it over SSH. For example:

ssh root@slave mkdir /root/snapshot
ssh root@master 'mariadb-backup --backup --stream=xbstream | zstd' | ssh root@slave 'unzstd | mbstream -x -C /root/snapshot'

Then, as root on the slave, you can run the following command to load the snapshot into MariaDB:

systemctl stop mysql.service
rm -rf /var/lib/mysql
mariadb-backup --prepare --target-dir=/root/snapshot
mariadb-backup --move-back --target-dir=/root/snapshot
chown -R mysql:mysql /var/lib/mysql/
systemctl start mysql.service

We now need to tell MariaDB to replicate from the master node. First, we read /root/snapshot/mariadb_backup_binlog_info, which records the exact binlog position of the snapshot. It should look something like:

powerdns.000000 1234 0-1-5678

We are interested in the third whitespace-separated value, e.g. 0-1-5678, which is the GTID (Global Transaction ID). This is the new and recommended way of setting up replication at the time of writing.
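If you would rather not eyeball the file, a small sketch for pulling out that third field (in practice the file path would be /root/snapshot/mariadb_backup_binlog_info, as above):

```shell
# Pull the GTID (third whitespace-separated field) out of a
# mariadb_backup_binlog_info file passed as the first argument.
extract_gtid() { awk '{print $3; exit}' "$1"; }

# Demonstration with the example contents from above:
tmp=$(mktemp)
printf 'powerdns.000000 1234 0-1-5678\n' > "$tmp"
extract_gtid "$tmp"   # prints 0-1-5678
rm -f "$tmp"
```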

We then launch sudo mariadb and run the following queries:

STOP SLAVE;
SET GLOBAL gtid_slave_pos = "[the GTID from mariadb_backup_binlog_info]";
CHANGE MASTER TO
   MASTER_HOST="192.0.2.1",
   MASTER_PORT=3306,
   MASTER_USER="replication",
   MASTER_PASSWORD="hunter2",
   MASTER_USE_GTID=slave_pos;
START SLAVE;

Wait a bit, then run SHOW SLAVE STATUS\G. If all goes well, you should see lines like:

Slave_IO_State: Waiting for master to send event
Slave_IO_Running: Yes
Slave_SQL_Running: Yes

This means it’s working. You should probably monitor Slave_IO_Running and Slave_SQL_Running and generate a notification if either value ever becomes No, such as with mysqld_exporter and Prometheus, but doing so is left as an exercise for the reader.
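As a minimal sketch of such a check, the following function parses SHOW SLAVE STATUS output and fails when either replication thread has stopped:

```shell
# Minimal replication check suitable for cron: parse SHOW SLAVE STATUS
# output and fail if either replication thread is not running.
check_replication() {
    status=$1
    for key in Slave_IO_Running Slave_SQL_Running; do
        value=$(printf '%s\n' "$status" | awk -v k="$key:" '$1 == k {print $2}')
        if [ "$value" != "Yes" ]; then
            echo "replication broken: $key=$value" >&2
            return 1
        fi
    done
}
```

In production, you would feed it real output, e.g. check_replication "$(mariadb -e 'SHOW SLAVE STATUS\G')" || some-alerting-command, where the alerting command is up to you.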

You can follow the same procedure as above to set up PowerDNS on the slave node, but only create /etc/powerdns/pdns.d/common.conf and not /etc/powerdns/pdns.d/master.conf. Once you make a change to your zones on poweradmin, you should be able to see it by running dig against 198.51.100.10.
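One quick way to confirm that a slave has caught up is to compare SOA serials between the master and the slave. Here is a small sketch of the parsing; the dig invocations are shown as comments, using the example addresses from this post:

```shell
# Extract the serial (third field) from a `dig +short SOA` answer so
# the master and a slave can be compared.
soa_serial() { awk '{print $3; exit}'; }

# In practice:
#   dig +short SOA example.com @198.51.100.1  | soa_serial
#   dig +short SOA example.com @198.51.100.10 | soa_serial
# The two outputs should match once replication has caught up.
printf 'ns1.example.net. hostmaster.example.net. 2024010101 10800 3600 604800 3600\n' | soa_serial
```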

Repeat the same procedure for any other slave node you have.

Regular unicast DNS

At this point, you should have two instances of PowerDNS. That’s enough for a traditional unicast DNS setup. You could just use them to serve your zone, say example.com. There are two ways of doing this.

If you own a domain example.net and have the DNS hosted elsewhere, you can create the following records:

ns1.example.net.    86400   IN    A     198.51.100.1
ns1.example.net.    86400   IN    AAAA  2001:db8::1
ns2.example.net.    86400   IN    A     198.51.100.10
ns2.example.net.    86400   IN    AAAA  2001:db8::10

(This is in bind-style zone file format. The fields are as follows: domain name, TTL, class7, record type, value.)

Then you can set the DNS servers as ns1.example.net and ns2.example.net.

Alternatively, if your registrar supports glue records, you can make the DNS resolution faster by directly storing the nameserver IPs for example.com and eliminating the dependency on example.net. To do this, first create the following records in poweradmin for example.com:

ns1.example.com.    86400   IN    A     198.51.100.1
ns1.example.com.    86400   IN    AAAA  2001:db8::1
ns2.example.com.    86400   IN    A     198.51.100.10
ns2.example.com.    86400   IN    AAAA  2001:db8::10

Then, on the registrar, create glue records for the DNS for ns1.example.com and ns2.example.com, specifying the same IP addresses for each nameserver, and set them as the DNS server for example.com.

Deploying anycast

Let’s assume you are using 203.0.113.0/24 and 2001:db8:1000::/48 for anycast. You can add the IPs 203.0.113.1, 203.0.113.2, 2001:db8:1000::1, and 2001:db8:1000::2 on every DNS server you want participating in anycast. In my case, I did this on all the slave nodes, since I am reserving the master node for writes.

Then, announce the prefixes 203.0.113.0/24 and 2001:db8:1000::/48 from every server to your upstream, following instructions in the previous post to tune your announcements to avoid pathological routing. Since this strongly depends on your upstreams and the type of equipment you have, the details are left as an exercise for the reader.

You can then follow a very similar procedure as unicast, except using 203.0.113.1, 203.0.113.2, 2001:db8:1000::1, and 2001:db8:1000::2 instead.

Once this is done, DNS requests for example.com should go to the “closest” server as determined by BGP.

Using Lua records to select backends

Once you have the PowerDNS master node, you can use Lua records to perform uptime checks and select nodes via geolocation.

For example, say you have two backend instances serving example.com:

  • node 1 has addresses 198.51.100.101 and 2001:db8:2000::1; and
  • node 2 has addresses 198.51.100.102 and 2001:db8:2000::2.

You can check if the backend returns 200 when https://example.com is requested, and always prefer node 1 if it’s up, by creating the following Lua records:

example.com.  300   IN  LUA   A    "ifurlup('https://example.com', { {'198.51.100.101'}, {'198.51.100.102'} })"
example.com.  300   IN  LUA   AAAA "ifurlup('https://example.com', { {'2001:db8:2000::1'}, {'2001:db8:2000::2'} })"

Note that in poweradmin, select LUA as the record type and enter A "ifurlup(...)" as the content of the record.

You can also tell PowerDNS to instead choose the nearest node, and also check that the response contains the string Example:

example.com.  300   IN  LUA   A    "ifurlup('https://example.com', { {'198.51.100.101', '198.51.100.102'} }, {stringmatch='Example', selector='pickclosest'})"
example.com.  300   IN  LUA   AAAA "ifurlup('https://example.com', { {'2001:db8:2000::1', '2001:db8:2000::2'} }, {stringmatch='Example', selector='pickclosest'})"

pickclosest works by looking up the IP geolocation for each backend server IP in the list and also the user’s IP, then checking which backend server is the closest to the user geographically. When doing this, make sure the IP geolocation for your backend servers is correct!

You can also override the result for certain countries by using more complex Lua scripts. For example, to send all users from China to node 1, you can use the following Lua script:

;if country('CN') then return '198.51.100.101' else return ifurlup('https://example.com', { {'198.51.100.101', '198.51.100.102'} }, {stringmatch='Example', selector='pickclosest'}) end

Note that if the Lua code starts with ;, PowerDNS will treat it as a full script; otherwise, it treats it as a simple Lua expression. In this case, we use a script. Like before, wrap the script content in A "..." or AAAA "...", and place the resulting string into the “content” field in poweradmin.

Also note that on the first request, PowerDNS will not have the availability information and will randomly return one of the backends. This is somewhat annoying but unavoidable due to the dynamic nature of Lua records. A hack is using a cron job to query example.com for A and AAAA every hour to ensure PowerDNS is always performing health checks.
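As a sketch of that cron hack, a cron.d file (the path is hypothetical; adjust the record name, and run it on each node serving the zone against its local instance) could look like:

```shell
# /etc/cron.d/warm-lua-records (hypothetical): query the Lua records
# hourly so that PowerDNS keeps performing its health checks.
0 * * * * root dig +short A example.com @127.0.0.1 >/dev/null 2>&1
0 * * * * root dig +short AAAA example.com @127.0.0.1 >/dev/null 2>&1
```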

For more details, consult the PowerDNS documentation.

Conclusion

Since this post has gone on for long enough, I’ll end it here. At this point, we’ve constructed a highly available PowerDNS cluster that uses availability checks and geolocation to steer users towards nearby, available instances of services. We’ve demonstrated how to do this with a simple service that has multiple backends, which you can use to make any static website highly available.

The current setup has one flaw: there is a single master node, which is the single point of failure for writes. When it goes down, it becomes impossible to make changes to the DNS, even though the slave nodes in the anycast will continue to serve the zone, including performing availability checks.

I hope you found this post useful. Next time, we’ll look into how to eliminate the dependency on the single master node, enabling highly available DNS modifications, and also demonstrate how to make dynamic web applications highly available.

Notes

  1. The thing that always shocks non-networking people about IP geolocation is how much it relies on people submitting corrections, and also how much it relies on random CSV files maintained by networks. I might write a post about how it all works one day. 

  2. Networks are typically incentivized to improve connectivity, which should improve latencies to any given IP. On the other hand, improving connectivity by adding a new upstream has the potential to make downstream anycast worse due to the new upstream always preferring customer routes, even if it originated on the other side of the world and there are other routes that came from closer locations. 

  3. I really like it when software is in the Debian repository, because then security patches are the Debian security team’s problem, not mine. I also trust them to patch stuff on time, unlike random third-party repositories. There are only a few vendors whose Debian repositories I trust, such as MariaDB and nginx. The whole point of this exercise is to reduce my stress levels, and unattended-upgrades really helps. 

  4. Yes, I know many people don’t like DNSSEC and the standard has many problems, but I still feel like having DNSSEC is better than letting anyone inject fake records. I refuse to build my infrastructure on something that locks me out of DNSSEC. 

  5. Yes, DNS isn’t some super heavy application, especially given the size of my zones and the amount of queries I am getting. I’d rather have something simple. If my blog somehow becomes super popular, I am sure I can just upgrade the smallest nodes. 

  6. For various reasons, people have been trying to replace the terms used for the different node types in replication. For simplicity and to reduce confusion, we are going to stick with the traditional terminology since they are the only ones that work in every context with MariaDB.

    The newer terms are inconsistently applied at the time of writing, with the MariaDB documentation freely mixing different terminology, making it hard to understand and harder to know which commands to use. Even worse, the new commands they introduced are also internally inconsistent, leading to confusing situations like the following interaction, which doesn’t happen with the old commands:

    MariaDB [(none)]> SHOW REPLICA STATUS\G
    *************************** 1. row ***************************
                   Slave_IO_State: Waiting for master to send event
    ...
    

    Needless to say, this should not be construed as any endorsement of the practice of enslaving humans. 

  7. DNS has the concept of classes for different types of networks, though it’s basically always IN for Internet. The only other class you might see these days is CH for Chaosnet, but it’s really just being used as a way to query information about the DNS server itself, not anything to do with the real Chaosnet. 

Fast and cheap bulk storage: using LVM to cache HDDs on SSDs
2025-05-11

Since the inception of solid-state drives (SSDs), there has been a choice to make—either use SSDs for vastly superior speeds, especially with non-sequential reads and writes (“random I/O”), or use legacy spinning rust hard disk drives (HDDs) for cheaper storage that’s a bit slow for sequential I/O1 and painfully slow for random I/O.

The idea of caching frequently used data on SSDs and storing the rest on HDDs is nothing new—solid-state hybrid drives (SSHDs) embodied this idea in hardware form, while filesystems like ZFS support using SSDs as L2ARC. However, with the falling price of SSDs, this no longer makes sense outside of niche scenarios with very large amounts of storage. For example, I have not needed to use HDDs in my PC for many years at this point, since all my data easily fits on an SSD.

One of the scenarios in which this makes sense is for the mirrors I host at home. Oftentimes, a project will require hundreds of gigabytes of data to be mirrored just in case anyone needs it, but only a few files are frequently accessed and could be cached on SSDs for fast access2. Similarly, I run many LLMs locally with Ollama, but there are only a few I use very frequently. The frequently used ones can be cached while the rest can be loaded slowly from HDD when needed.

While ZFS may seem like the obvious option here, due to Linux compatibility issues with ZFS mentioned previously, I decided to use Linux’s Logical Volume Manager (LVM) instead for this task to save myself some headache. To ensure reliable storage in the event of HDD failures, I am running the HDDs in RAID 1 with Linux’s mdadm software RAID.

This post documents how to build such a cached RAID array and explores some considerations when building reliable and fast storage.

Table of contents

  1. Why use LVM cache?
  2. A quick introduction to LVM
  3. The hardware setup
  4. Why use RAID 1 on HDDs?
  5. Setting up RAID 1 with mdadm
  6. Creating the SSD cache partition
  7. Creating a new volume group
  8. Creating the cached LV
  9. Creating a filesystem
  10. Mounting the new filesystem
  11. Monitoring
  12. Conclusion

Why use LVM cache?

There are several alternative block device caching solutions on Linux, such as:

  • bcache: a built-in Linux kernel module that provides caching similar to LVM’s. I don’t like its setup, which takes over the entire block device and relies on non-persistent sysfs configuration (whereas LVM remembers all the configuration options), nor do I enjoy hearing about all the reports of bcache corrupting data; and
  • EnhanceIO: an old kernel module that does something similar to bcache and LVM cache, but hasn’t been maintained for over a decade.

Since I am very familiar with LVM and have already used it for other reasons, I opted to use LVM for this exercise as well.

A quick introduction to LVM

If you aren’t familiar with LVM, we’ll need to first introduce some concepts, or none of the LVM portions of this post will make any sense.

First, we’ll need to introduce block devices, which are just devices with a fixed number of blocks that can be read at any offset. HDDs and SSDs show up as block devices, such as /dev/sda. They can be partitioned into multiple pieces, showing up as smaller block devices such as /dev/sda1, the first partition on /dev/sda. Filesystems can be created directly on block devices, but these block devices can also be used with more advanced things like RAID and LVM.

LVM is a volume manager that allows you to create logical volumes that can be expanded much more easily than regular partitions. In LVM, there are three major entity types:

  • Physical volumes (PVs): block devices that are used as the underlying storage for LVM;
  • Logical volumes (LVs): block devices that are presented by LVM, stored on one or more PVs; and
  • Volume groups (VGs): a group of PVs on which LVs can be created.

LVs can be used just like partitions to store files, with the flexibility of being able to expand them at will while they are actively being accessed, without having to be contiguous like real partitions.

There are more advanced LV types, such as thin pools, which don’t allocate space for LVs until it’s actually needed to store data, and cached volumes, which this post is about.

The hardware setup

For the purposes of this post, we will assume that there are two SATA HDDs (4 TB each in my case), available as block devices /dev/sda and /dev/sdb:

$ lsblk
NAME               MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                  8:0    0   3.6T  0 disk
sdb                  8:16   0   3.6T  0 disk
...

Warning: before copying any commands, ensure that you are operating on the correct device. There is no undo button for most of the commands in this post, so be very careful lest you destroy your precious data! When in doubt, run lsblk to double check!

We’ll also assume that the SSD is /dev/nvme0n13 (2 TB in my case), and we will allocate 100 GiB of it as the cache by creating a partition.

Effectively, the setup looks like this:

Diagram of the LVM cache setup

Why use RAID 1 on HDDs?

Mechanical HDDs, like everything mechanical, fail. It’s an inevitable fact of life. There are two choices here:

  1. Treat your data as ephemeral and replace it when the drive fails, accepting the inevitable downtime this causes; or
  2. Store your data in a redundant fashion (i.e. with RAID), so that it continues to be available despite drive failures4.

If your data is really that unimportant, I suppose you could store it on a single drive, or even use RAID 0 to stripe it across multiple drives such that it’s lost if any one drive fails, but benefit from being able to pool all the drives together.

However, as I learned the hard way, even easily replaceable data still requires effort to replace them. I once deployed this exact setup with RAID 0 and one of the constituent drives suffered a failure, causing a few files to become unreadable. While I could easily download them again, it created a lot of downtime due to having to destroy the entire array and start over after replacing the failed drive.

This may not matter for your use case, but I would rather that my mirror experience minimal downtime in the event of a drive failure. For this reason, I chose to run the drives together in RAID 1.

Setting up RAID 1 with mdadm

One thing worth noting before we start setting up RAID is that all block devices (either whole drives or partitions) in a RAID must be identical in size5. This presents some interesting challenges, since a 4 TB HDD isn’t always exactly the same size. Normally, for a drive to be sold as “4 TB,” it has to have at least 4 000 000 000 000 bytes (that’s 4 trillion bytes). This is around 3.638 TiB using power-of-two IEC units. Typically, drives have slightly more, though the exact amount varies by manufacturer or even model.

This poses a problem when using non-identical drive models, which you are encouraged to do: drives produced in the same batch and subjected to the same operations tend to fail at similar times, so mixing models is a good precaution against simultaneous failures. A similar problem occurs when replacing a failed drive, especially if you can’t source an identical model.

To avoid this problem, we will partition the drive and cut the data partition off exactly at the 4 TB mark. This will ensure that any “4 TB” HDD could be similarly partitioned and used as a replacement. Another reason to partition is to avoid the drive being treated as uninitialized on operating systems that don’t understand Linux’s mdadm RAID, such as Windows.

Partitioning the drives

We’ll need to do some math to figure out which 512-byte logical sector to end the partition on. For a 4 TB drive, we want to end it at the exact 4 TB mark:

>>> 4e12/512 - 1
7812499999.0

Since partition tools typically ask for the offset of the last sector to be included in the partition, we’ll need to subtract 1.
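As a sketch, the same arithmetic works for any advertised drive size:

```shell
# Last 512-byte sector of a partition that ends exactly at the
# advertised drive size: divide by the sector size to get the sector
# count, then subtract 1 to get the last *included* sector.
size_bytes=4000000000000   # a "4 TB" drive
echo $(( size_bytes / 512 - 1 ))   # prints 7812499999
```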

To partition the drive, we first need to wipe everything on it:

$ sudo wipefs -a /dev/sda
...
$ sudo wipefs -a /dev/sdb
...

(You can skip this if you are using a brand new drive.)

Then, create the partition with gdisk:

$ sudo gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.9

Partition table scan:
  MBR: not present
  BSD: not present
  APM: not present
  GPT: not present

Creating new GPT entries in memory.

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-7814037134, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-7814037134, default = 7814035455) or {+-}size{KMGTP}: 7812499999
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): fd00
Changed type of partition to 'Linux RAID'

Command (? for help): c
Using 1
Enter name: cached_raid1_a

Command (? for help): p
Disk /dev/sda: 7814037168 sectors, 3.6 TiB
Model: ST4000VN008-2DR1
Sector size (logical/physical): 512/4096 bytes
Disk identifier (GUID): [redacted]
Partition table holds up to 128 entries
Main partition table begins at sector 2 and ends at sector 33
First usable sector is 34, last usable sector is 7814037134
Partitions will be aligned on 2048-sector boundaries
Total free space is 1539149 sectors (751.5 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048      7812499999   3.6 TiB     FD00  cached_raid1_a

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sda.
The operation has completed successfully.

Now repeat this for /dev/sdb. Note that you don’t have to name the partitions with the c command, but it makes it easier to identify which partition is which if you have a lot of drives.

The partitions /dev/sda1 and /dev/sdb1 should now be available. If not, run partprobe to reload the partition table.

Creating the mdadm RAID array

Now we can create the array on /dev/md0 by running mdadm:

$ sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sda1 /dev/sdb1
mdadm: Note: this array has metadata at the start and
    may not be suitable as a boot device.  If you plan to
    store '/boot' on this device please ensure that
    your boot-loader understands md/v1.x metadata, or use
    --metadata=0.90
mdadm: size set to 3906116864K
mdadm: automatically enabling write-intent bitmap on large array
Continue creating array? y
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

To avoid having to assemble this array on every boot, you should declare it in /etc/mdadm/mdadm.conf. To do this, first run a command to get the definition:

$ sudo mdadm --detail --scan
ARRAY /dev/md0 metadata=1.2 name=example:0 UUID=6d539f5d:5b37:4bf0:b2d9:2af5efc99e6a

Now, append the output to /etc/mdadm/mdadm.conf.

Then, make sure that this configuration is updated in the initrd for all kernels:

$ sudo update-initramfs -u -k all
update-initramfs: Generating /boot/initrd.img-6.1.0-34-amd64
update-initramfs: Generating /boot/initrd.img-6.1.0-33-amd64
...

The RAID 1 array on /dev/md0 is now ready to be used as a PV containing the HDD storage.

Background operations

In the background, Linux’s MD RAID driver is working hard to synchronize the two drives so that they store identical data:

$ cat /proc/mdstat 
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10] 
md0 : active raid1 sde1[1] sdd1[0]
      3906116864 blocks super 1.2 [2/2] [UU]
      [=>...................]  resync =  9.7% (379125696/3906116864) finish=402.8min speed=145930K/sec
      bitmap: 29/30 pages [116KB], 65536KB chunk

unused devices: <none>

We can safely ignore this and continue. It will finish eventually.

Creating the SSD cache partition

You’ll need a partition on an SSD to serve as cache. This needs to be a real partition, not an LVM LV, as that would involve nested LVM, which never works reliably in my experience, and I’ve given up trying. This is especially nasty because I also use LVM to hold virtual machine disks, and if I allowed nested LVM across the board, the host machine could access all the LVM volumes inside all the VMs, which can cause data corruption.

If you don’t have unpartitioned space lying around, you’ll need to shrink a partition and reallocate its space as a separate partition.

Calculating the size

In my case, I had two partitions on my SSD, one EFI system partition (ESP) for the bootloader, and an LVM PV covering the rest of the disk. It looks something like this:

$ sudo gdisk -l /dev/nvme0n1
...
Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          206847   100.0 MiB   EF00  EFI system partition
   2          206848      3907029134   1.8 TiB     8E00  main_lvm_pv

For a 100 GiB cache, we’ll need to shrink the LVM PV by 100 GiB, and then edit the partition table. To avoid off-by-one errors, we’ll shrink the PV by roughly 200 GiB first, fix up the partition table, and then expand the PV afterwards.

Effectively, we want to end the LVM PV at sector 3697313934, which is exactly 100 GiB worth of 512-byte sectors before the current last sector:

>>> 3907029134 - 100*1024*1024*2
3697313934

Note that we multiply by 1024 once to convert from GiB to MiB, then a second time to convert from MiB to KiB, and there are two sectors per KiB.
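The same computation as a shell sketch, for other cache sizes:

```shell
# End sector for the shrunk PV partition: the current last sector
# minus the cache size in 512-byte sectors (GiB * 1024 * 1024 * 2).
last_sector=3907029134
cache_gib=100
echo $(( last_sector - cache_gib * 1024 * 1024 * 2 ))   # prints 3697313934
```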

Shrink existing partition data

First, shrinking the PV:

$ sudo pvresize --setphysicalvolumesize 1600G /dev/nvme0n1p2
/dev/nvme0n1p2: Requested size 1.56 TiB is less than real size <1.82 TiB. Proceed?  [y/n]: y
  WARNING: /dev/nvme0n1p2: Pretending size is 3355443200 not 3906822287 sectors.
  Physical volume "/dev/nvme0n1p2" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized

If you aren’t using LVM, but instead a regular ext4 filesystem, you can try using resize2fs, passing the size as the second positional argument. This would require you to unmount the partition first, since ext4 doesn’t have online shrinking, unlike LVM.

Editing the partition table

Then, we edit the partition table to shrink the partition for the PV and create a new one in the freed space:

$ sudo gdisk /dev/nvme0n1
...
Command (? for help): d
Partition number (1-2): 2

Command (? for help): n
Partition number (2-128, default 2):
First sector (34-3907029134, default = 206848) or {+-}size{KMGTP}:
Last sector (206848-3907029134, default = 3907028991) or {+-}size{KMGTP}: 3697313934
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 8e00
Changed type of partition to 'Linux LVM'

Command (? for help): n
Partition number (3-128, default 3):
First sector (34-3907029134, default = 3697315840) or {+-}size{KMGTP}:
Last sector (3697315840-3907029134, default = 3907028991) or {+-}size{KMGTP}:
Current type is 8300 (Linux filesystem)
Hex code or GUID (L to show codes, Enter = 8300): 8e00
Changed type of partition to 'Linux LVM'

Command (? for help): c
Partition number (1-3): 3
Enter name: cached_cache_pv

Command (? for help): p
...

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048          206847   100.0 MiB   EF00  EFI system partition
   2          206848      3697313934   1.7 TiB     8E00  main_lvm_pv
   3      3697315840      3907028991   100.0 GiB   8E00  cached_cache_pv

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/nvme0n1.
Warning: The kernel is still using the old partition table.
The new table will be used at the next reboot or after you
run partprobe(8) or kpartx(8)
The operation has completed successfully.

Note that with gdisk, changing the size of a partition requires deleting it and recreating it with the same partition number at the same starting offset. The data in the partition is unaffected.

Now, we need to notify the kernel that the partition has shrunk:

$ sudo partprobe /dev/nvme0n1

Expand shrunk partition to fit new space

Then, we can expand the PV to fit all the available space:

$ sudo pvresize /dev/nvme0n1p2
  Physical volume "/dev/nvme0n1p2" changed
  1 physical volume(s) resized or updated / 0 physical volume(s) not resized
$ sudo pvdisplay /dev/nvme0n1p2
  --- Physical volume ---
  PV Name               /dev/nvme0n1p2
  PV Size               1.72 TiB / not usable <3.07 MiB
...

As we can see, the PV size now matches the reduced size of the partition. With that done, we can use /dev/nvme0n1p3 as a PV containing our SSD cache.

Creating a new volume group

Now that we have the partitions to serve as our PVs, we can create a volume group called cached:

$ sudo vgcreate cached /dev/md0 /dev/nvme0n1p3
  WARNING: Devices have inconsistent physical block sizes (4096 and 512).
  Physical volume "/dev/md0" successfully created.
  Physical volume "/dev/nvme0n1p3" successfully created.
  Volume group "cached" successfully created

Creating the cached LV

Creating a cached LV is, somewhat surprisingly, a multistep process that requires a fair amount of math.

Creating an LV on the HDD

First, you’ll need to create an LV containing the underlying data. Let’s put it on /dev/md0, using up all available space. You can obviously use less space if you want and expand it later. This is the command:

$ sudo lvcreate -n example -l 100%FREE cached /dev/md0
  Logical volume "example" created.

Creating the cache metadata LV

Next, we need a cache metadata volume on the SSD. 1 GiB should be plenty:

$ sudo lvcreate -n example_meta -L 1G cached /dev/nvme0n1p3
  Logical volume "example_meta" created.

Creating the cache LV

Now, we’ll need to use all remaining space on the /dev/nvme0n1p3 PV to serve as our cache. However, -l 100%FREE will not work, because creating a cache pool requires some free space for a spare pool metadata LV (used for repair operations) of the exact same size as the metadata LV. Since our metadata is 256 extents long, we’ll need to identify how much space we have available and reduce it by 256 (adjust if your metadata size is different):

$ sudo pvdisplay /dev/nvme0n1p3
  --- Physical volume ---
  PV Name               /dev/nvme0n1p3
  VG Name               cached
  PV Size               <100.00 GiB / not usable 3.00 MiB
  Allocatable           yes 
  PE Size               4.00 MiB
  Total PE              25599
  Free PE               25343
  Allocated PE          256

As you can see, we have 25343 extents left. We’ll need to subtract 256:

>>> 25343-256
25087
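
If you’d rather not do this arithmetic by hand, it’s easy to script. The following is my own sketch, not part of the original workflow—the numbers are plugged in from the pvdisplay output above, and the commented-out pvs invocation (field names per pvs(8)) shows how they could be fetched programmatically:

```shell
# The extent counts could be fetched with something like (not run here):
#   sudo pvs --noheadings -o pv_pe_count,pv_pe_alloc_count /dev/nvme0n1p3
# Using the values from the pvdisplay output above:
total=25599   # Total PE
alloc=256     # Allocated PE (the 1 GiB metadata LV)
spare=256     # spare pool metadata LV, same size as the metadata LV
echo $(( total - alloc - spare ))   # extents to pass to lvcreate -l; prints 25087
```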

We can now create the actual cache LV:

$ sudo lvcreate -n example_cache -l 25087 cached /dev/nvme0n1p3
  Logical volume "example_cache" created.

Creating a cache pool

We can now merge the cache metadata and actual cache LV into a cache pool LV:

$ sudo lvconvert --type cache-pool --poolmetadata cached/example_meta cached/example_cache
  Using 128.00 KiB chunk size instead of default 64.00 KiB, so cache pool has less than 1000000 chunks.
  WARNING: Converting cached/example_cache and cached/example_meta to cache pool's data and metadata volumes with metadata wiping.
  THIS WILL DESTROY CONTENT OF LOGICAL VOLUME (filesystem etc.)
Do you really want to convert cached/example_cache and cached/example_meta? [y/n]: y
  Converted cached/example_cache and cached/example_meta to cache pool.

Here, we used the default chunk size chosen by LVM, but depending on the size of your files, you might benefit from a different chunk size. The lvmcache(7) man page has this to say:

The value must be a multiple of 32 KiB between 32 KiB and 1 GiB. Cache chunks bigger than 512 KiB shall be only used when necessary.

Using a chunk size that is too large can result in wasteful use of the cache, in which small reads and writes cause large sections of an LV to be stored in the cache. It can also require increasing migration threshold which defaults to 2048 sectors (1 MiB). Lvm2 ensures migration threshold is at least 8 chunks in size. This may in some cases result in very high bandwidth load of transferring data between the cache LV and its cache origin LV. However, choosing a chunk size that is too small can result in more overhead trying to manage the numerous chunks that become mapped into the cache. Overhead can include both excessive CPU time searching for chunks, and excessive memory tracking chunks.
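
If the default doesn’t suit your file sizes, the chunk size can be set explicitly when creating the cache pool. Here’s a sketch of what that would look like (not a command I ran—per lvconvert(8), the value must respect the 32 KiB-multiple constraint quoted above):

```shell
$ sudo lvconvert --type cache-pool --chunksize 256k \
    --poolmetadata cached/example_meta cached/example_cache
```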

Attach the cache pool to the HDD LV

Once that’s done, we can now attach the cache pool to the underlying storage to create a cached LV:

$ sudo lvconvert --type cache --cachepool cached/example_cache cached/example
Do you want wipe existing metadata of cache pool cached/example_cache? [y/n]: y
  Logical volume cached/example is now cached.

We can now see this LV:

$ sudo lvs
  LV             VG             Attr       LSize   Pool                  Origin          Data%  Meta%  Move Log Cpy%Sync Convert
  example        cached         Cwi-a-C---  <3.64t [example_cache_cpool] [example_corig] 0.01   0.62            0.00
...            

Cache modes

Note that there are several cache modes in LVM:

  • writethrough (the default): any data written to the cached LV is stored in both the cache and the underlying block device. This means that if the SSD fails for some reason, you don’t lose your data, but it also means writes are slower; and
  • writeback: data is written to cache, and after some unspecified delay, is written to the underlying block device. This means that cache drive failure can result in data loss.

Basically, use writethrough if you want your data to survive an SSD failure, or writeback if you don’t care.

Since I am using RAID 1 for reliability, it’d be pretty silly to then use writeback and risk losing the data and creating an outage, so I kept the default of writethrough.

To use writeback, you can specify --cachemode writeback during the initial lvconvert, or use sudo lvchange --cachemode writeback cached/example afterwards.

Creating a filesystem

Now that the cached LV is created, we just have to create a filesystem on it and mount it. For this exercise, we’ll use ext4, since that’s the traditional Linux filesystem and the most well-supported. I wouldn’t recommend using something like btrfs or ZFS since they are designed to access raw drives.

Creating an ext4 filesystem is simple:

$ sudo mkfs.ext4 /dev/cached/example
mke2fs 1.47.0 (5-Feb-2023)
Discarding device blocks: done                            
Creating filesystem with 976528384 4k blocks and 244137984 inodes
Filesystem UUID: bb93c359-1915-4f09-b23f-2f3a5e8b8663
Superblock backups stored on blocks: 
	32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
	4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 
	102400000, 214990848, 512000000, 550731776, 644972544

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done

Mounting the new filesystem

Now, we need to mount it. We could just run mount, but it makes more sense to define a permanent place for it in /etc/fstab. For this exercise, let’s mount it at /example.

First, we create /example:

$ sudo mkdir /example

Then, we add the following line to /etc/fstab:

/dev/cached/example /example ext4 rw,noatime 0 2

Now, let’s mount it:

$ sudo systemctl daemon-reload
$ sudo mount /example
$ ls /example
lost+found

And there we have it. Our new cached LV is mounted on /example, and the default ext4 lost+found directory is visible. Now you can store anything you want in /example.

Monitoring

You can find most cache metrics by running lvdisplay on the cached LV:

$ sudo lvdisplay /dev/cached/example
  --- Logical volume ---
  LV Path                /dev/cached/example
  LV Name                example
  VG Name                cached
...
  LV Size                <3.64 TiB
  Cache used blocks      8.40%
  Cache metadata blocks  0.62%
  Cache dirty blocks     0.00%
  Cache read hits/misses 84786 / 40435
  Cache wrt hits/misses  222496 / 1883192
  Cache demotions        0
  Cache promotions       67420
  Current LE             953641
...
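
Those hit/miss counters are cumulative, so it can be handy to turn them into a ratio. This is a sketch of my own—the counters below are hard-coded from the lvdisplay output above, though they could presumably also be pulled via lvs reporting fields such as cache_read_hits and cache_read_misses (check lvs -o help on your version):

```shell
# Compute the read hit ratio from the cumulative counters shown above
hits=84786
misses=40435
awk -v h="$hits" -v m="$misses" \
    'BEGIN { printf "read hit ratio: %.1f%%\n", 100 * h / (h + m) }'
```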

Conclusion

In the previous iteration of this setup, before the drive failure, I was able to hit over 95% cache hits on reads while storing a mix of mirrors and LLMs, with most of the files very infrequently read. If you have a similar workload, LVM caching is probably highly beneficial.

Note that this technique doesn’t have to be used to cache HDDs. Another possible application lies in the cloud, where you frequently have access to very large but slow block storage over the network and fast but small local storage. In this scenario, you can likewise use LVM caching to cache the slower networked block device with the local storage.

I hope this was helpful and you learned something about LVM. See you next time!

Notes

  1. After getting spoiled by 3+ GB/s NVMe SSDs, the paltry 200 MB/s you can get on HDDs feels slow, but it’s probably fine in most situations. 

  2. Some smaller, higher traffic projects can easily justify being hosted completely on SSDs, which is what I do. The rest are hosted on the cached HDD array. 

  3. NVMe devices are a bit confusing due to NVMe namespaces. The drive is /dev/nvme0 while the first namespace is /dev/nvme0n1. On most consumer drives, only a single NVMe namespace is supported, but enterprise drives may support dividing into multiple namespaces, like partitioning except at a lower level. Namespace-level features may include encryption and write protection. 

  4. RAID is not a backup! RAID ensures data availability in the event of drive failures, but it doesn’t protect you from accidental deletion, ransomware, or corruption. You should always back up important data! 

  5. Those of you who have read the previous post on btrfs know that this isn’t the case when using btrfs’s raid1 profile. However, since btrfs doesn’t support SSD caching, I am forced to run ext4 on cached LVM instead. 

]]>
Quantum
On reusing old cases for NAS applications2025-04-25T00:48:13-04:002025-04-25T00:48:13-04:00https://quantum5.ca/2025/04/25/on-reusing-old-cases-for-nas-applicationsMy home server—the one that acts as my router and NAS, while hosting a multitude of services at home, such as mirror.quantum5.ca—had a problem: it was using a nameless case that was at least 20 years old, and it wasn’t doing the job well. The ancient case was from an era when computers were much smaller and emitted a lot less heat. With a modern air cooler, I couldn’t even close the side panel.

However, buying a modern case has significant drawbacks. The design philosophy for cases in the 2020s is completely focused on displaying all the internals with as much glass as possible, offering as much cooling as possible for power-hungry components, or both. Given that spinning hard drives (HDDs) have gone completely out of fashion in the PC market, drive bays are sacrificed to improve cooling and aesthetics. Whereas my 20+-year-old case had six 3.5” HDD bays and four more 5.25” bays for optical drives that could be repurposed to house more HDDs, most modern cases, if they still had 3.5” HDD bays, could host at most three. This was perfectly fine for building PCs, but it was far from ideal for building a NAS.

What I really wanted was a full ATX case with good cooling and as many drive bays as possible. There was effectively only one case on the market that fulfilled these requirements—Fractal Design’s Meshify 2 or its XL variant—and they came at prices of ~$200 CAD and ~$270 CAD, respectively, which always felt a bit too expensive for this hobby. So instead, I kept using the crappy old case. That was until I found an old Antec 1200, which ticked all my requirements, for free.

This post documents my experience of repurposing the 17-year-old Antec 1200 to fit a modern computer acting as a NAS, and my thoughts on the endeavour after doing it.

How come these ancient cases work?

A common question might be: why do these old cases still fit modern components just fine?

This is because ever since the late 1990s, the vast majority of computer cases, motherboards, and power supplies have followed the ATX standard, which was introduced by Intel in 1995. Even 30 years later in 2025, the latest motherboard could still be installed into the earliest computer case, as long as they are both ATX1. For this reason, computer cases are one of the most reusable components in a computer—that is, if you are willing to deal with the quirks.

Despite ostensibly following the same ATX standard, computer cases have evolved significantly over the years, aesthetically from the beige boxes to the modern fad of using as much glass as possible, and functionally from sealed boxes to complete mesh fronts.

You’ll notice several things that have changed over the years:

  • Front panel headers: Perhaps the most obvious difference is the kind of ports available on the front panel. The oldest cases may just have a power and reset button. Then, USB 2.0 and audio jacks started showing up, before getting replaced with USB 3.0 and USB-C ports;
  • Cooling: Computers have grown a lot more power-hungry, and since all the power eventually gets emitted as heat, this has required beefier and beefier cooling. The old solution of basically sealed cases would quickly overheat and either damage hardware or cause it to thermal throttle to snail speed;
  • Air intake: Older cases would have no fan slots or just one or two exhaust fans for cooling. While this got the job done by getting rid of the hot air, it came at a cost—due to the negative pressure generated inside the case, air would be sucked in from every hole in the case, causing dust to accumulate everywhere inside. Modern cases typically have a bunch of filtered intakes with washable filters to catch most of the dust and are configured with positive pressure, leaving the insides of the case a lot cleaner;
  • PSU mounting: The traditional location of the power supply is above the motherboard. However, since heat rises, cases benefit from having the top open to let the heat flow out and having top exhaust fans to aid the process. Top-mounted PSUs got in the way of increasing cooling requirements, and almost all cases since the 2010s have the PSU at the bottom. The new layout also allows large radiators to be mounted at the top instead of just at the front, which is beneficial for GPU cooling;
  • PSU air circulation: Older cases typically use the fan on the PSU as an additional fan to exhaust hot air from the case, while modern cases typically give the PSU fan a separate filtered intake at the bottom of the case to completely isolate it from the rest of the system. The new design also allowed the PSU to be housed in a “basement,” which covered up the cables coming out of the PSU for better aesthetics;
  • Transparency: Ever since the rise of RGB, computer internals became fashionable to look at, which motivated transparent side panels to let the RGB shine through. Acrylic used to be a popular choice, but now basically all cases have tempered glass side panels; and
  • Drive bays: Now that SSDs have become completely mainstream, HDDs are basically obsolete except in NAS applications, resulting in a massive reduction in 3.5” bays. Similarly, with the death of optical media, there wasn’t much point in keeping the 5.25” bays either. A lot of modern cases have no way to mount optical drives and can mount only a very limited number of 3.5” HDDs.

What’s in the server?

As I’ve mentioned before on this blog, the hardware specs are as follows:

  • CPU: AMD Ryzen 9 3900X
  • CPU Cooler: Deepcool AK620
  • Motherboard: ASUS Prime X570-P
  • RAM: 4×Crucial 16 GB 2666 MT/s ECC (CT16G4WFD8266)
  • GPU: ASUS Turbo GeForce GTX 1060 6 GB
  • Storage:
    • 2×WD Red Pro 16 TB NAS HDD (WD161KFGX) for the main storage array
    • 2×Seagate IronWolf 4 TB NAS drive (ST4000VN008) for the experimental SSD-cached array2

Note that while I only have four drives at the moment, my storage is quite close to being full anyway and I will likely need to add more drives soon, so extensibility is quite important to me.

However, this time around, what’s more important is probably the dimensions of the components rather than the exact hardware:

  • The length is constrained by the GPU, which is 267 mm long. Given current GPU trends, if I had to upgrade the GPU down the line, it is quite likely I might end up with something even longer. While the GPU technically fit inside the ancient case, it completely blocked off three drive bays from being used. In the Antec 1200, a drive could barely squeeze in next to the GPU with a right-angled SATA adapter, but the case could theoretically fit in a 330 mm GPU, at the cost of blocking the drive bay it’s next to;
  • The width is constrained by the height of the CPU air cooler, since it’s mounted sideways. The AK620 is 160 mm tall. Since the ATX standard requires a motherboard I/O shield height of 44.45 mm, any case that places a 120 mm fan next to the I/O shield should fit the AK620. The nameless ancient case, on the other hand, only had a 90 mm fan slot next to the I/O shield, preventing the side panel from being closed; and
  • The height of the case is constrained by the size of an ATX motherboard, though cases are broadly available in two sizes that fit an ATX motherboard, commonly called ATX mid-towers and full towers. A mid-tower basically barely fits an ATX motherboard, whereas a full tower has a lot more space, enabling it to fit more drives, for example.

Introducing the Antec 1200

The Antec 1200 is an ATX full tower case first released in 2008. It contained room for twelve 5.25” bays, which probably explains the name. It’s the cousin of the Antec 902, sharing the exact same top design, though that case only had nine bays. The Antec 1200 came with three drive cages, each holding exactly three 3.5” drives and taking up three 5.25” bays, for a total of nine drives across the cages. Each drive cage contained a 120 mm fan and a washable filter.

On top of the case is a front panel containing two USB 2.0 ports and an eSATA3 port (which were replaced by three USB 3.0 ports in the V3 version, but I have the original), and then a huge 200 mm fan inside a circular honeycomb mesh. At the back, there were two 120 mm fans next to the motherboard rear I/O, and the standard 7 full-height expansion card slots. Next to the expansion card slots were two rubber grommets to run tubing for a custom water cooling loop, if you were so inclined. At the very bottom was a spot for the PSU.

The Antec 1200 I got was actually part of a full system that I managed to acquire. It was a high-end system back in the day, containing a first-generation Intel Core i7 CPU (Nehalem architecture), first released in 2008. It was the kind of system I dreamed of owning during its heyday, though in 2025, it was laughably obsolete and likely performs worse than a new phone.

The system I got wasn’t the stock configuration for the Antec 1200, but somewhat modified. The harsh blue LEDs on all the fans were removed, resulting in a more sane look. The lower of the two rear fans was replaced with a 120 mm AIO cooler for the CPU. The bottom drive bay also lacked a fan that was supposed to be included in the stock configuration. The side panel was also not the stock version with a fan mount, but rather a full acrylic piece that showcased the entire interior of the computer.

For fun, I powered it on and was greeted with a BIOS setup screen straight out of the 90s. Instead of a modern GUI with mouse support on so many motherboards these days, it looked like this:

        CMOS Setup Utility - Copyright (C) 1984-2010 Award Software
╔══════════════════════════════════════╤═══════════════════════════════════════╗
║                                      │                                       ║
║                                      │                                       ║
║  ► MB Intelligent Tweaker(M.I.T.)    │    Load Fail-Safe Defaults            ║
║                                      │                                       ║
║  ► Standard CMOS Features            │    Load Optimized Defaults            ║
║                                      │                                       ║
║  ► Advanced BIOS Features            │    Set Supervisor Password            ║
║                                      │                                       ║
║  ► Integrated Peripherals            │    Set User Password                  ║
║                                      │                                       ║
║  ► Power Management Setup            │    Save & Exit Setup                  ║
║                                      │                                       ║
║  ► PC Health Status                  │    Exit Without Saving                ║
║                                      │                                       ║
║                                      │                                       ║
╟──────────────────────────────────────┴───────────────────────────────────────╢
║ Esc : Quit                ↑↓→←: Select Item        F11 : Save CMOS to BIOS   ║
║ F8  : Q-Flash             F10 : Save & Exit Setup  F12 : Load CMOS from BIOS ║
╟──────────────────────────────────────────────────────────────────────────────╢
║                                                                              ║
║                       Change CPU's Clock & Voltage                           ║
║                                                                              ║
╚══════════════════════════════════════════════════════════════════════════════╝

Ah, the memories it brought back…

Either way, the entire system inside had to go out, along with all the dust that accumulated from the previous owner. I took apart the machine completely and removed every removable part of the case and vacuumed up all the dust inside. That took quite the effort.

Fan replacement

The biggest modification I had to make to the case was replacing all the fans. After 17 years, they were quite noisy and no longer in good shape. The Antec 1200 was also designed in an era before PWM fan control. Instead, all the fans were powered through Molex connectors and controlled by potentiometers, with knobs sticking out of the front mesh that you could turn to tune the front fans manually. The rear and top fans were controlled with hardware switches at the back. This was a mess. I’d much rather have the motherboard automatically control the fans based on the temperature and workload.

Rather than waiting for the fans to inevitably fail and tolerate the noise in the meantime, I decided to just replace them when it wouldn’t incur downtime. So I bought a set of five Thermalright TL-C12C fans to replace all the 120 mm fans and install another fan in the bottom HDD cage to better cool the drives. Replacing the 120 mm fans was a fairly simple operation, though the case did show its age in an unpleasant way.

You see, the drive bay fan mounts had these plastic pieces that stuck into the mounting holes of the 120 mm fan, each with a hole inside to accept a screw. This allowed you to mount the fan to the front of the drive bay before attaching the front portion to the rest of the drive bay. Unfortunately, those thin pieces of plastic proved easily damaged, and the mere act of screwing in the fan caused all the little plastic pieces to shear off. At least the fans were held in place tightly once the drive bay was assembled, so it was no great loss.

The top 200 mm fan proved to be a lot trickier due to it being a custom part. The fan had a super thick rim that was exactly the same size as the top fan mesh, and there were several random screw holes that attached the fan to the case. I bought a Thermaltake CL-F015-PL20BL-A, which looked like it might fit, but it ultimately didn’t. I had to saw off the corners on one side to get it to fit inside the mesh, which created microplastic dust that probably wasn’t very good for my health. Then, none of the screw holes ended up aligning, so I had to hot glue the fan into place. In retrospect, it may have been wiser to keep the top as it was.

Also, if any case designers are reading this: please don’t have custom fan shapes in your case. It’s clear that Antec wanted a circular fan grill on top for aesthetic reasons, which necessitated the custom shape, instead of a square one that could have fit standard 200 mm fans.

Fan placement

An interesting consideration was which fan slots to populate. The instinct might be to populate all fan slots, but that would also create a problem given the Antec 1200’s design.

As previously mentioned, to keep dust to a minimum, maintaining positive pressure is essential. Given that there were two rear 120 mm exhausts and a huge 200 mm top exhaust, but only three filtered front 120 mm intakes, the case would have negative pressure if I had populated all the fans.

To solve this problem, I decided to run only a single 120 mm fan at the back instead of two, while running the front intakes at a much higher speed than the other fans for positive pressure inside the case. Given that the server isn’t running a power-intensive GPU, the cooling was more than sufficient.

Missing accessories

There were several accessories missing when I got my Antec 1200:

  1. The long screws used to attach 3.5” HDDs to the drive cage. I was able to identify them as 6-32×1” machine screws and buy some replacements;
  2. Some of the drive bay thumbscrews, which I replaced with regular M3 screws;
  3. A 3.5” external drive adapter that allows 3.5” devices with a front panel to be installed, e.g. a floppy drive, a Zip drive, or a card reader, which I don’t plan to use;
  4. The two 5.25” drive bay covers, which forced me to install some old optical drives to cover up the front; and
  5. A fan bracket that allowed a 120 mm fan to be mounted inside a drive cage where the drive would be for additional cooling, which I found unnecessary.

Rebuilding the server

Now that the Antec 1200 was modernized with new fans and cleaned, it was time to move the server over from the nameless case. So I picked a night when there was very low traffic on my mirror and powered down the server. I apologize for the downtime if you noticed it.

Due to a lack of space inside the old case, I opted to move all the hard drives over first. It was there that I ran into the first snag: while all the Seagate IronWolfs were moved over without issue, the WD Red Pros were missing the middle set of side screws4, resulting in only the front of the drives being screwed into the drive cages. Fortunately, because the drive cages have metal flaps for the drive to sit on, the drive is resting stably inside the case anyway, though the back could pivot up if a force is applied.

After mounting the drives, I moved over the PSU. Due to the compatible nature of ATX, the PSU fit perfectly. However, due to the lack of a bottom intake, I had to mount the PSU with the fan facing up. Also, due to the lack of a PSU basement and the non-modular nature of my PSU, there’s a bunch of cable clutter, but I don’t really care.

I then moved over the motherboard, which also fit without issue thanks to ATX. However, I did have some trouble plugging in the EPS 12 V cable from the PSU, which couldn’t reach due to the height of the case. It appeared that the original system suffered from the same problem, and I reused the extension cable I found in it.

With the motherboard installed, I plugged in the front panel audio5, USB headers, and all the fans, which was a surprisingly easy task. Inside ATX mid-towers or smaller cases, that was always a struggle due to how close the headers on the motherboard were to the edge of the case, making it super difficult to get my hands in there. Due to the additional height of the Antec 1200, there was a lot more space and this ended up being the smoothest experience I’ve had in ages.

Due to wire length issues, I ended up plugging the rear fan into the AIO pump header. I had to override it in the UEFI to configure that as a PWM fan and set up a fan curve manually, but it worked just like a regular fan after that.

Interestingly, the Antec 1200 didn’t have a power LED, because the obnoxious blue lights on the fans were supposed to serve as power indicators, according to the manual. Since none of the replacement fans had lighting, I’ll just have to settle for looking at whether the fans are spinning or not to see if the system is powered on. I also didn’t bother plugging in the eSATA port, since eSATA hasn’t been relevant for a decade at this point.

With the system all wired up, I turned it on and it booted normally. I set up the fan curves in the UEFI configuration and the server was all good to go and booted back up as if nothing happened.

Result

The inside of the Antec 1200
The front of the Antec 1200

I apologize for the cable management.

Conclusions

The Antec 1200 is still a great case for a modern NAS with its drive bays and cooling capacity. After replacing all the fans, the system was surprisingly quiet—even at full load—while still cooling well. When the system was mostly idle, I could barely hear it, and it was only slightly audible at full load, still sounding pleasant and not distracting.

There are just several small downsides due to the age and design of the case:

  1. The little plastic pieces for mounting the drive bay fans have begun to fail;
  2. The lack of a bottom intake for the PSU forces it to pull hot air from inside the case instead of cold air outside, which hurts cooling;
  3. The top 200 mm fan was very hard to replace due to being non-standard;
  4. The front I/O is outdated, notably missing USB 3.06;
  5. The drive cages are not fully compatible with modern 3.5” HDDs; and
  6. The case will not fit the biggest GPUs currently on the market.

Overall, I am pretty happy with this. It certainly saved me a lot of money compared to buying a new Meshify 2 for $200 CAD, given that the fan replacement only cost around $56. Given that it was Earth Day recently, I’ll also say that it’s better for the environment to reuse a perfectly good case instead of buying a new one.

Notes

  1. For simplicity’s sake, I’ll ignore all the smaller ATX variants, like microATX, mini-ITX, etc. This post will only concern itself with full-sized ATX motherboards and cases. The same logic applies to the smaller sizes, with the caveat that the case has to be at least as large as the motherboard for it to fit. 

  2. This is the array that I am using to mirror larger projects with a lot of cold files, as well as storing LLMs locally. The commonly accessed files are cached in the SSD via lvmcache, while the remaining data gets stored on slower HDDs. In practice, it performed reasonably well. There was just one problem—since I wasn’t holding “important data” on it, I decided to use RAID 0. Well, I had a SMART warning on one of the drives and I replaced it while replacing the case. I then had to rebuild the whole RAID array, which was deeply unpleasant. I opted to use RAID 1 on the newly built array just to save myself the trouble of rebuilding it again. 

  3. eSATA was an old way to plug an external drive into the system that was a lot faster than USB 2.0. Unfortunately, USB 3.0 soon showed up afterwards, was just as fast, and could also send power, which quickly rendered eSATA obsolete. 

  4. After checking the relevant standard, which is SFF-8301 for 3.5” drive dimensions, it appears that only the front and back side screws are standard, whereas the middle screws are not. Apparently, to fit more platters inside the drive, manufacturers decided to sacrifice the non-standard middle hole. Unfortunately, a lot of cases and drive cages used the front and middle screws… This was irritating, but wasn’t a dealbreaker. 

  5. The Antec 1200 was from a transitional period between two front panel audio standards, the older AC’97 and the modern Intel HD Audio (HDA), and as such, had a connector for each standard. Since I am using a modern system, I obviously connected the HD Audio. The nameless ancient case only had AC’97 and the front panel audio probably never worked right, but I wouldn’t know—I don’t use front panel audio on a server. 

  6. I could get a 5.25” bay front panel with a bunch of USB ports and maybe even some card readers, but I decided that wasn’t worth it. 

]]>
Quantum
Building a multi-network ADS-B feeder with a $20 dongle2025-04-06T01:36:28-04:002025-04-06T01:36:28-04:00https://quantum5.ca/2025/04/06/building-multi-network-ads-b-feeder-with-20-dollar-dongleFor a while now, I’ve wondered what to do with my old Raspberry Pi 3 Model B from 2017, which has basically been doing nothing ever since I replaced it with the Atomic Pi in 2019 and an old PC in 2022. I’ve considered building a stratum 1 NTP server, but ultimately did it with a serial port on my server instead.

Recently, I’ve discovered a new and interesting use for a Raspberry Pi—using it to receive Automatic Dependent Surveillance–Broadcast (ADS-B) signals. These signals are used by planes to broadcast positions and information about themselves, and are what websites like Flightradar24 and FlightAware use to track planes. In fact, these websites rely on volunteers around the world running ADS-B receivers and feeding the data to them to track planes worldwide.

Since I love running public services (e.g. mirror.quantum5.ca), I thought I might run one of these receivers myself and feed the data to anyone who wants it. I quickly looked at the requirements for Flightradar24 and found they weren’t demanding at all—all you needed was a Raspberry Pi, a view of the sky, and a cheap DVB-T TV tuner, such as the very cheap and popular RTL2832U/R820T dongle, which has a software-defined radio (SDR) that could be used to receive ADS-B signals.

I have enough open sky out of my window to run a stratum 1 NTP server with a GPS receiver, so I figured that was also sufficient for ADS-B1. Since I found an RTL2832U/R820T combo unit with an antenna for around US$20 on AliExpress, I just decided on a whim to buy one. Today, it arrived, and I set out to build my own ADS-B receiver.

Choice of software

Once I had the dongle, I did some research on what software I should install to feed as many networks as possible, since I didn’t want to limit myself to just one. As it turns out, every network has its own feeder software, either conveniently packaged into its own OS image or distributed as random shell scripts that you are expected to curl and run, which will “magically” set everything up. Even though the networks with scripts all claim to play nicely with each other, there’s something deeply disturbing about setting things up this way.

So I took a look at what these scripts were actually doing, and it turned out they were all just running a variant of dump1090 (for 1090 MHz ADS-B signals), either dump1090-mutability or FlightAware’s dump1090-fa, both of which are compatible with the various feeder software. In theory, the feeders should all be able to share a single instance of dump1090, and sometimes the scripts do attempt this, but the active instance ultimately belongs to one of the feeders, resulting in a tangled mess of dependencies. Instead, I decided to install dump1090 myself and configure all the feeders to use it.
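All of these programs expose the same handful of TCP output formats, which is what makes them interchangeable. The human-readable one (port 30002 by default) frames each message as *&lt;hex&gt;; on its own line. As a rough illustration of what’s inside such a frame—and emphatically not any feeder’s actual code—here’s a minimal Python sketch that pulls the ICAO address, plus the callsign for identification messages, out of a textbook example frame (not one captured by my receiver):

```python
# Six-bit character set used by ADS-B identification messages:
# index 1-26 = A-Z, 48-57 = 0-9, 32 = space, '#' = unused
AIS_CHARSET = "#ABCDEFGHIJKLMNOPQRSTUVWXYZ##### ###############0123456789######"

def decode_avr(frame: str):
    """Decode one AVR-framed message like '*8D...;' from dump1090's raw port."""
    data = bytes.fromhex(frame.strip("*;\r\n"))
    df = data[0] >> 3               # downlink format: top 5 bits of first byte
    icao = data[1:4].hex()          # 24-bit ICAO aircraft address
    callsign = None
    if df == 17 and 1 <= data[4] >> 3 <= 4:        # type codes 1-4: identification
        bits = int.from_bytes(data[5:11], "big")   # 48 bits -> eight 6-bit characters
        callsign = "".join(AIS_CHARSET[(bits >> (42 - 6 * i)) & 0x3F] for i in range(8))
    return icao, callsign

icao, callsign = decode_avr("*8D4840D6202CC371C32CE0576098;")
print(icao, callsign)  # -> 4840d6 KLM1023 (padded with a trailing space)
```

Real decoders handle dozens of message types, CRC checks, and error correction; this only scratches the surface, but it shows why any feeder can consume any dump1090’s output.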

There is also dump978 for 978 MHz ADS-B signals, but since those aren’t used in Canada, I didn’t bother setting it up. If you live in the US, setting up dump978 is left as an exercise for the reader.

Setting up dump1090

I don’t like using software outside official repositories if I can avoid it, so I decided to go for dump1090-mutability, which is part of Debian’s official repositories. Installing it was trivial:

$ sudo apt install --no-install-recommends dump1090-mutability
...
The following NEW packages will be installed:
  dump1090-mutability libjs-excanvas libjs-jquery-ui libjs-jquery-ui-theme-smoothness librtlsdr0
...

Note that I passed --no-install-recommends because dump1090-mutability recommends lighttpd, which it configures to display a local map of all the planes it detects. If you like lighttpd and don’t have an existing web server, feel free to omit that flag. I wanted to use nginx, though.

At this point, I tried to start the daemon, but it didn’t work:

$ sudo systemctl start dump1090-mutability.service
$ tail /var/log/dump1090-mutability.log
Sat Apr  5 21:18:18 2025 UTC  EB_SOURCE EB_VERSION starting up.
Using sample converter: UC8, integer/table path
Found 1 device(s):
0: unable to read device details
usb_open error -3
Please fix the device permissions, e.g. by installing the udev rules file rtl-sdr.rules
Error opening the RTLSDR device: Permission denied

A quick examination revealed that dump1090-mutability runs as the dump1090 user, and that user needs access to the device files. As the error message suggests, we should create rtl-sdr.rules. Unfortunately, the message doesn’t include a link, and there are multiple variants of the file floating around, but this one worked:

$ sudo wget -O /etc/udev/rules.d/rtl-sdr.rules https://raw.githubusercontent.com/osmocom/rtl-sdr/refs/heads/master/rtl-sdr.rules
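For reference, the line that matters for this particular dongle looks roughly like the following (one common variant of the rule; the exact contents differ between versions of the rules file):

```
# RTL2832U-based DVB-T dongles (generic 0bda:2838 USB ID)
SUBSYSTEM=="usb", ATTRS{idVendor}=="0bda", ATTRS{idProduct}=="2838", GROUP="plugdev", MODE="0660"
```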

You can now unplug and replug the dongle to update the permissions on the device files, or simply run:

$ sudo udevadm trigger

However, these udev rules only granted access to the plugdev group, so we’ll need to ensure dump1090 is a member of that group:

$ sudo adduser dump1090 plugdev
Adding user `dump1090' to group `plugdev' ...
Done.

Now, it should start cleanly:

$ sudo systemctl restart dump1090-mutability.service
$ tail /var/log/dump1090-mutability.log
Sat Apr  5 21:27:01 2025 UTC  EB_SOURCE EB_VERSION starting up.
Using sample converter: UC8, integer/table path
Found 1 device(s):
0: Realtek, RTL2838UHIDIR, SN: 00000001 (currently selected)
Detached kernel driver
Found Rafael Micro R820T tuner
Max available gain is: 49.60 dB
Setting gain to: 49.60 dB
Gain reported by device: 49.60 dB
Allocating 15 zero-copy buffers

If it doesn’t work for you, the log will hopefully tell you why.

Isolation

There’s something super yucky about setting stuff up with random bash scripts, especially on this Raspberry Pi, which took significant effort to get Debian running on; if a script broke the system, I’d waste a large amount of time fixing it. So instead, I opted to run all the feeders in a systemd-nspawn container. Now, theoretically, I could run this container anywhere, since it’d be talking to dump1090 over TCP, but I thought it’d be fun to run it on the Raspberry Pi itself. If you are interested in setting it up this way, feel free to consult my post on systemd-nspawn.

For the rest of the post, we’ll assume that the Raspberry Pi 3 is at 192.0.2.1 and that the ADS-B feeders run in the container at 192.0.2.2. To avoid massive pain with USB passthrough, dump1090 should run on real hardware (i.e. the Pi). This requires dump1090 and the various feeders to be configured with the correct IPs, since they all default to localhost, i.e. they assume everything runs on the same device.
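If you go the systemd-nspawn route, the container needs its own address on the network. As a rough sketch (the container name adsb and the bridge br0 are hypothetical choices of mine, not anything the feeders require), the networking part of the container configuration could look like:

```ini
# /etc/systemd/nspawn/adsb.nspawn (sketch): attach the container's veth
# link to an existing host bridge so it can have its own LAN address
[Network]
Bridge=br0
```

Inside the container, you’d then assign 192.0.2.2 to the host0 interface however you normally manage addresses.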

Of course, you could just run the feeders on the Raspberry Pi directly if you so wish, saving all the trouble, at the risk of the scripts doing something crazy to your system.

Setting up dump1090 web UI

Before feeding your data to the various networks, you should probably check that everything works locally. dump1090 comes with a web UI. If you kept the recommended lighttpd setup, it should just work at http://192.0.2.1/dump1090/gmap.html.

You probably want to change the default starting location of the map to where you live, or you’d have to move the map around to your location every time. To do this, edit /etc/dump1090-mutability/config.js and change DefaultCenterLat and DefaultCenterLon to your position. You may also want to change SiteShow to true and SiteLat and SiteLon to your position so you can see where the planes are relative to you.
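The relevant part of /etc/dump1090-mutability/config.js would then look something like this (the variable names are from the shipped file; the coordinates are placeholders—use your own):

```javascript
// Centre the map on the receiver and draw a marker at its site.
DefaultCenterLat = 45.4215;
DefaultCenterLon = -75.6972;
SiteShow = true;
SiteLat  = 45.4215;
SiteLon  = -75.6972;
```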

Since I wanted to use nginx, I installed it on the Raspberry Pi and wrote the following configuration:

server {
    listen 80;
    server_name adsb.example.com;

    root /usr/share/dump1090-mutability/html;
    index gmap.html;

    location /data/ {
        alias /run/dump1090-mutability/;
    }
}

Now, you should be able to see your feed at http://adsb.example.com once the DNS is pointed correctly. Remember to change the domain! Setting up HTTPS is left as an exercise for the reader.

You should see a map with a list of aircraft and the current time on the side. For privacy reasons, I will not share a screenshot.

Making dump1090 available to the container

dump1090 by default listens on localhost. If you are running the feeders in a separate container, that obviously won’t let you connect. To fix this, edit /etc/default/dump1090-mutability on the Pi and change:

NET_BIND_ADDRESS="192.0.2.1"

Then restart dump1090 with sudo systemctl restart dump1090-mutability.service.

Flightradar24

The first network I fed my data to was Flightradar24. They have an install script that installs their repository and their feeder daemon, fr24feed. It’s as simple as:

$ wget -qO- https://fr24.com/install.sh | sudo bash -s
...
Welcome to the FR24 Decoder/Feeder sign up wizard!

Before you continue please make sure that:

 1 - Your ADS-B receiver is connected to this computer or is accessible over network
 2 - You know your antenna's latitude/longitude up to 4 decimal points and the altitude in feet
 3 - You have a working email address that will be used to contact you
 4 - fr24feed service is stopped. If not, please run: sudo systemctl stop fr24feed

To terminate - press Ctrl+C at any point


Step 1.1 - Enter your email address ([email protected])
$:[redacted]

Step 1.2 - If you used to feed FR24 with ADS-B data before, enter your sharing key.
If you don't remember your sharing key, you can find it in your account on the website under "My data sharing".
https://www.flightradar24.com/account/data-sharing

Enter your sharing key or press ENTER/RETURN to continue.
$:

Since I had never fed data before, I left this field blank. They generated a new key for me and emailed it to me. If I were setting it up again, I’d enter the previous key.

Step 1.3 - Would you like to participate in MLAT calculations? (yes/no)$:no

Since their website explicitly asks you not to enable MLAT when sharing with other networks, I said no. It’s still not clear to me what the problem is, since several other networks happily use my MLAT data.

Step 4.1 - Receiver selection:

 1 - DVBT Stick (USB)
 -----------------------------------------------------
 2 - SBS1/SBS1er (USB/Network)
 3 - SBS3 (USB/Network)
 4 - ModeS Beast (USB/Network)
 5 - AVR Compatible (DVBT over network, etc)
 6 - microADSB (USB/Network)

Enter your receiver type (1-6)$:5

Step 4.2 - Please select connection type:

 1 - Network connection
 2 - USB directly to this computer

Enter your connection type (1-2)$:1

Step 4.3A - Please enter your receiver's IP address/hostname
$:192.0.2.1

Step 4.3B - Please enter your receiver's data port number
$:30002

Step 5.1 - Would you like to enable RAW data feed on port 30334 (yes/*no*)$:no

Step 5.2 - Would you like to enable Basestation data feed on port 30003 (yes/no)$:no

Saving settings to /etc/fr24feed.ini...OK
Settings saved, please restart the application by running the command(s) below to use new configuration!
sudo systemctl restart fr24feed

I opted for DVB-T over network, since that’s my setup. If you are running fr24feed on the same machine as dump1090, I think you can still select the network option and just enter localhost for the hostname. This should allow you to use your own instance of dump1090 rather than having fr24feed run its own.

Now I just restarted the daemon:

$ sudo systemctl restart fr24feed

And soon, my radar showed up on “my data sharing” on Flightradar24 and I was able to filter for planes seen by my receiver.

If it doesn’t work, check the logs with sudo journalctl -u fr24feed.

FlightAware

FlightAware has their own feeder called piaware, which you can install from their APT repository. For up-to-date instructions, consult FlightAware’s website, but this is what I ran:

$ wget https://www.flightaware.com/adsb/piaware/files/packages/pool/piaware/f/flightaware-apt-repository/flightaware-apt-repository_1.2_all.deb
...
$ sudo dpkg -i flightaware-apt-repository_1.2_all.deb
...
flightaware-apt-repository: regenerated APT configuration /etc/apt/sources.list.d/flightaware-apt-repository.list
flightaware-apt-repository: please run 'sudo apt update' to use the new configuration
$ sudo apt update
...
$ sudo apt install piaware
The following NEW packages will be installed:
  itcl3 libboost-program-options1.74.0 libboost-regex1.74.0 libestr0 libfastjson4 liblognorm5 libtcl8.6 piaware rsyslog tcl tcl-tls tcl8.6 tcllib tclx8.4
...

At this point, if you are running piaware directly on the Raspberry Pi, things should just work. However, since I chose to run it in a container, I had to do some extra configuration:

$ sudo piaware-config receiver-type other
Set receiver-type to other in /etc/piaware.conf:7
$ sudo piaware-config receiver-host 192.0.2.1
Set receiver-host to 192.0.2.1 in /etc/piaware.conf:8
$ sudo piaware-config mlat-results-format 'beast,connect,192.0.2.1:30104 beast,listen,30105 ext_basestation,listen,30106'
Set mlat-results-format to beast,connect,192.0.2.1:30104 beast,listen,30105 ext_basestation,listen,30106 in /etc/piaware.conf:9
$ sudo systemctl restart piaware.service

If it’s not working, check sudo journalctl -u piaware.

ADS-B Exchange

Next, I tried feeding my data to ADS-B Exchange. They have their own shell script, which did some irritating things, like installing a compiler to build their own MLAT code and shoving a bunch of Git repos under /usr/local/share/adsbexchange.

The script also required pkg-config without installing it first. This is a transcript of the correct installation process:

$ sudo apt install pkgconf
...
The following NEW packages will be installed:
  libpkgconf3 pkgconf pkgconf-bin
...
$ curl -L -o /tmp/axfeed.sh https://www.adsbexchange.com/feed.sh
$ sudo bash /tmp/axfeed.sh
...
The following NEW packages will be installed:
  binutils binutils-aarch64-linux-gnu binutils-common build-essential bzip2 cpp cpp-12 dpkg-dev g++ g++-12 gcc gcc-12 libasan8 libatomic1 libbinutils libc-dev-bin libc6-dev libcc1-0 libcrypt-dev libctf-nobfd0 libctf0 libdpkg-perl libexpat1-dev libgcc-12-dev libgomp1
  libgprofng0 libhwasan0 libisl23 libitm1 libjs-jquery libjs-sphinxdoc libjs-underscore liblsan0 libmpc3 libmpfr6 libncurses-dev libncurses6 libnsl-dev libpython3-dev libpython3.11 libpython3.11-dev libstdc++-12-dev libtirpc-dev libtsan2 libubsan1 libzstd-dev
  linux-libc-dev make patch python3-dev python3-distutils python3-lib2to3 python3-pip-whl python3-setuptools-whl python3-venv python3.11-dev python3.11-venv rpcsvc-proto socat uuid-runtime zlib1g-dev
...
Installing mlat-client to virtual environment
...
Compiling / installing the readsb based feed client
...

It then showed a wizard in the console, asking for information like GPS coordinates. Once that’s done, things should work if you are running it on the Raspberry Pi directly.

Since I didn’t, I instead got this:

---------------------
No data available from IP 127.0.0.1 on port 30005!
---------------------
If your data source is another device / receiver, see the advice here:
https://github.com/adsbexchange/wiki/wiki/Datasource-other-device

I had to edit /etc/default/adsbexchange, making the following changes:

  • INPUT="192.0.2.1:30005"
  • RESULTS="--results beast,connect,192.0.2.1:30104"

And then sudo systemctl restart adsbexchange-{feed,mlat}.service.

Once that’s done, you should be able to see your feed on ADS-B Exchange’s feeder status if you visit it from the same IP as your feeder.

If it doesn’t work for some reason, look at sudo journalctl -u adsbexchange-feed for ADS-B feed issues and sudo journalctl -u adsbexchange-mlat for MLAT issues.

There is also a stats daemon that helps them display the planes received by you, which you can install like this:

$ curl -L -o /tmp/axstats.sh https://www.adsbexchange.com/stats.sh 
$ sudo bash /tmp/axstats.sh
...
Cloning into '/tmp/adsbexchange-stats-git'...
...
The following NEW packages will be installed:
  bash-builtins bind9-host bind9-libs jq libfstrm0 libjemalloc2 libjq1 liblmdb0 libmaxminddb0 libonig5 libprotobuf-c1 libuv1
...

Additional information should now be available on the feeder status page.

adsb.fi

adsb.fi is basically a clone of ADS-B Exchange, created after some drama over ADS-B Exchange being sold. The installation process is basically the same, except that you use the shell script from https://adsb.fi/feed.sh, edit /etc/default/adsbfi, and restart the adsbfi-feed.service and adsbfi-mlat.service systemd units; there is no stats daemon. For the latest instructions, see adsb.fi’s website.

OpenSky

OpenSky is another network, though one more focused on research than on aviation enthusiasts. They have an APT repository, but it was broken at the time of writing, so I installed the latest deb instead:

$ wget https://opensky-network.org/files/firmware/opensky-feeder_latest_arm64.deb
...
$ sudo apt install ./opensky-feeder_latest_arm64.deb
Unpacking opensky-feeder (2.1.7-1) ...
Setting up opensky-feeder (2.1.7-1) ... 

At this point, a wizard popped up asking for my location and dump1090 connection information. I filled it in, and the feeder was installed and started.

Once again, if you run into issues, look at sudo journalctl -u opensky-feeder.

Monitoring with Prometheus

I thought it’d be fun to monitor it with the Prometheus instance I set up recently, so I installed dump1090-exporter on the Raspberry Pi (it should go on the same machine as dump1090):

$ sudo apt install python3-venv python3-dev
...
$ python3 -m venv /opt/dump1090-exporter
$ /opt/dump1090-exporter/bin/pip install dump1090exporter aiohttp==3.10.10

Note that I had to pin a newer version of aiohttp (3.10.10), since the version pulled in by default didn’t compile on Python 3.11, which ships with Debian bookworm.

And then it’s just a matter of creating a new systemd unit /etc/systemd/system/prometheus-dump1090-exporter.service:

[Unit]
Description=Prometheus exporter for dump1090, an ADS-B receiver

[Service]
Restart=on-failure
DynamicUser=true
ExecStart=/opt/dump1090-exporter/bin/dump1090exporter --resource-path=/run/dump1090-mutability --port=9105 --latitude=[redacted] --longitude=[redacted]

[Install]
WantedBy=multi-user.target

Then it was just a matter of running sudo systemctl enable --now prometheus-dump1090-exporter.service and telling Prometheus to scrape metrics from http://192.0.2.1:9105/metrics.
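That last step is just an ordinary scrape job in prometheus.yml; something like this (the job name is my own choice):

```yaml
scrape_configs:
  - job_name: dump1090
    static_configs:
      - targets: ["192.0.2.1:9105"]
```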

Conclusion

This was a fun weekend project. Setting up an ADS-B receiver was surprisingly easy, even when doing it in a network-agnostic way. I was quite impressed with what a $20 TV tuner managed to accomplish.

My only regret is not having enough open sky, resulting in a limited field of view.

Notes

  1. As it turns out, for the stratum 1 NTP server, it didn’t really matter as long as you could see enough GPS satellites. However, for ADS-B, the more sky you can see, the more planes you can track. So ideally, if you can, you should mount the antenna on the roof of a relatively tall building. It’s probably fine to feed it as long as you have a reasonable view of a quadrant of the sky, but it might not be worth it if your view is almost completely blocked by another building. 

]]>
Quantum