Into Alpine APK v3 format: the binary perspective

2026-03-03

Background

From OpenWrt 25.12 onwards (and earlier in release candidates and development branches), OpenWrt has swapped its package manager and package format from opkg + .ipk (almost a .deb) to apk-tools v3.0 from Alpine Linux and its newly introduced APK v3 format. Alpine itself has only just upgraded apk-tools to v3.0 as of Alpine 3.23, and has not fully switched to the apk v3 format yet.

The format is a totally new one designed in-house by Alpine with some collaboration from OpenWrt. But let's first recap the older formats mentioned above:

  • deb is an ar-like archive with pre-defined members debian-binary, control.tar(.gz/.xxx) and data.tar(.gz/.xxx); the files live in data.tar.xxx and the metadata in control.tar.xxx
  • ipk is pretty much a stripped-down deb, in most cases with only gzip-compressed data and control members, so fewer dependencies are needed, which is useful in embedded scenarios
  • apk v2 is two (without signature) or three (with signature) gzipped tars: one optionally for the signature, one for metadata (kinda like control.tar in deb) and one for the actual files (kinda like data.tar in deb)

apk v3, however, is a supposedly schema-based binary format called adb, loosely defined by its official manual (source / online), yet most details are determined by the C source code of apk-tools, the only reference implementation.

I recently wrote a tool, adumpk, that parses a v3 .apk, prints useful info about it, optionally converts it to .tar, and writes the metainfo into .json. While writing the tool I had to jump around in apk-tools's source code quite often. After finishing adumpk, I decided to write this easier-to-follow single blog post so others can avoid the hassle.

Format

Unless explicitly mentioned, all integral data types in the format are little-endian.

The text below uses openwrt/25.12.0-rc5/crowdsec-1.6.2-r1.apk as the example package. Although it's not strictly needed, it's recommended you grab a copy so you can examine the binary yourself and follow along with the actual binary format.

File Header

The file header is 4 bytes long: the first 3 bytes are the magic "ADB", i.e. big-endian 0x414442, and the last byte is one of the following, telling the file's compression method:

  • ., i.e. 0x2e, means the content is not compressed
  • d, i.e. 0x64, means the content is compressed with Deflate level 0
  • c, i.e. 0x63, means the content is compressed with a custom compression method recorded in the next 2 bytes
    • the first byte is a u8 recording the ID of compression method:
      • 0 for not compressed
      • 1 for Deflate
      • 2 for Zstandard, which is optionally supported only when apk-tools was compiled with HAVE_ZSTD
    • the second byte is a u8 containing compression level, the level allowed by each of the above method is:
      • not compressed: 0
      • Deflate: 0 to 9
      • Zstandard: 0 to 22
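The header logic above can be sketched in Python (a sketch of my own based on the description; parse_header is my name, not an apk-tools one):

```python
# Sketch: parse the 4-byte ADB file header, plus the 2 extra bytes
# used by the 'c' (custom compression) marker.
# Returns (method_id, level, body_offset); names are mine, not apk-tools'.
def parse_header(data: bytes):
    assert data[0:3] == b'ADB'
    marker = data[3:4]
    if marker == b'.':          # content is not compressed
        return 0, 0, 4
    if marker == b'd':          # content is Deflate level 0
        return 1, 0, 4
    if marker == b'c':          # custom: method ID + level in the next 2 bytes
        method, level = data[4], data[5]
        assert method in (0, 1, 2)   # none / Deflate / Zstandard
        return method, level, 6
    raise ValueError('unknown compression marker %r' % marker)
```

For the example package, parse_header would report Deflate with the body starting at file[4:].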

Compressed Body

The compressed ADB data body starts either at the 7th byte (file[6:]) if the compression method is c and 2 additional bytes were taken to record the actual compression method, or at the 5th byte (file[4:]) otherwise.

The compressed ADB data body shall be a plain stream without the magic header you would expect from a standalone compressed file: a Deflate body does not begin with the gzip magic 0x1F8B followed by metadata, and a zstd body does not begin with the frame magic bytes 0x28 0xB5 0x2F 0xFD.

Note that while Deflate streams can be concatenated, which is used in apk v2, the apk v3 body should be one whole continuous Deflate stream if it was compressed with that algorithm. I didn't test this, but I think the official tool would happily accept a manually prepared apk v3 with multiple concatenated Deflate streams. Zstandard, on the other hand, does not support concatenated streams.

The body, when decompressed, shall begin with exactly ADB., the same as an uncompressed apk v3: the 3-byte magic plus a 1-byte marker for no compression.

With the example apk the body shall be decompressed with:

import zlib
with open('crowdsec-1.6.2-r1.apk', 'rb') as f:
    assert(f.read(4) == b'ADBd')
    body = zlib.decompress(f.read(), wbits=-zlib.MAX_WBITS)
assert(body[0:8] == b'ADB.pckg')

Top-level schema

After the leading ADB. in either the original uncompressed file or the decompressed body, a 4-byte magic (u32(body[4:8]), called the schema) marks the inner data as one of the following:

  • 0x676B6370 or big-endian 0x70636B67 or literal pckg for package
  • 0x78646E69 or big-endian 0x696E6478 or literal indx for index

We would only talk about package.

ADB blocks stream

In the package case, a series of ADB blocks follow one after another, each starting at an 8-byte boundary (the first ADB block naturally starts at such a boundary, as it comes right after the 8-byte "ADB.pckg").

Each block starts with a u32 recording the type and optionally the size, acting either as a complete 4-byte header on its own, or as the first member of a bigger 16-byte header:

  • If the highest 2 bits are not all 1 (so either 0b00, 0b01, 0b10, but not 0b11), then this is a simple 4-byte header
    • Type is these 2 bits extracted, i.e. v >> 30
    • Raw size for this block (4-byte header included) is the low 30 bits, i.e. v & 0x3fffffff
  • If the highest 2 bits are all 1, then this is an extended 16-byte header, consisting of the 4-byte u32 v itself as the type_size field, a 4-byte u32 reserved field for alignment and future expansion, and an 8-byte u64 x_size field, defined in C as:
    struct adb_block {
        uint32_t type_size;
        uint32_t reserved;
        uint64_t x_size;
    };
    
    • Type is the low 30 bits, i.e. u32 & 0x3fffffff
    • Raw size for this block (16-byte header included) is the x_size field

The payload size of what follows the header can be calculated as raw size - header size, and it must be non-negative.

The actual type of a block must be one of the following:

ID NAME Usage
0 ADB_BLOCK_ADB essential, contains metadata info and file infos
1 ADB_BLOCK_SIG optional, carries signature data
2 ADB_BLOCK_DATA technically also optional, carries file content (without name or path) and the like

The order of these blocks is restricted. A sane ADB must contain these blocks, head to tail:

  • 1 ADB_BLOCK_ADB for metadata
  • 0 to N ADB_BLOCK_SIG for signature
  • 0 to N ADB_BLOCK_DATA for data

If the order is not respected (e.g. SIG after DATA, or ADB after SIG/DATA, etc.), then the file would be rejected.
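The block walking described above can be sketched as follows (a sketch of mine, demonstrated on a tiny synthetic two-block stream rather than a real package; function and variable names are my own):

```python
# Sketch: walk the ADB block stream that follows the 8-byte "ADB.pckg"
# prefix of a decompressed body. Names are mine, not apk-tools'.
def iter_blocks(body: bytes):
    off = 8                                    # skip b'ADB.pckg'
    while off + 4 <= len(body):
        v = int.from_bytes(body[off:off+4], 'little')
        if (v >> 30) != 0b11:                  # simple 4-byte header
            btype, raw_size, hdr = v >> 30, v & 0x3fffffff, 4
        else:                                  # extended 16-byte header
            btype = v & 0x3fffffff
            raw_size = int.from_bytes(body[off+8:off+16], 'little')
            hdr = 16
        assert raw_size >= hdr                 # payload size must be non-negative
        yield btype, body[off+hdr:off+raw_size]
        off = (off + raw_size + 7) & ~7        # next block is 8-byte aligned

# A tiny synthetic stream: one ADB block (type 0), then one DATA block (type 2)
fake = (b'ADB.pckg'
        + (12).to_bytes(4, 'little') + b'\x00' * 8               # ADB, raw 12
        + b'\x00' * 4                                            # pad to 8-byte boundary
        + ((2 << 30) | 13).to_bytes(4, 'little') + b'\x01' * 9)  # DATA, raw 13
blocks = list(iter_blocks(fake))
```

A full reader would additionally reject streams that violate the ADB → SIG → DATA ordering above.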

The meta block ADB_BLOCK_ADB

Such block must be the first block in one package’s blocks stream.

The block begins with an 8-byte header, including a u8 adb_compat_ver field (currently must be 0), a u8 adb_ver field (currently also must be 0), a u16 reserved field for alignment and future expansion (all 0), then a u32 root field declaring the data that follows, type-aliased as adb_val_t, defined in C as:

struct adb_hdr {
    uint8_t adb_compat_ver;
    uint8_t adb_ver;
    uint16_t reserved;
    adb_val_t root;
};

An adb_val_t val carries its type info in its highest 4 bits (val & 0xf0000000), and its value in its lowest 28 bits (val & 0x0fffffff).
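This split is worth a tiny helper (the function name is mine):

```python
# Sketch: split an adb_val_t into its 4-bit type tag and 28-bit value.
def adb_val(v: int):
    return v & 0xf0000000, v & 0x0fffffff

# e.g. 0xe0001600 decodes to type 0xe0000000 (OBJECT) at offset 0x1600
t, low = adb_val(0xe0001600)
```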

Let’s focus on what root means here.

The type shall be ADB_TYPE_OBJECT (0xe0000000 == root & 0xf0000000) for a series of elements each with their own types, which makes sense for a metadata block.

And the value, i.e. root & 0x0fffffff, marks the offset of the target inside the current payload. Note the current payload includes the above adb_hdr but not the block header. Working through the example:

  • u32(body[8:12]) == 0x1614 == 5652 is a simple 4-byte header of type ADB, so the whole block is body[8:8+5652] == body[8:5660]
  • the payload, without the block header, is body[8+4:5660] == body[12:12+5648], which includes the adb_hdr at body[12:12+8] == body[12:20]
  • with root == 0xe0001600 the offset is 0x1600 == 5632, and therefore we need to go to body[12+5632:]

ADB Object for root

The pointed-to ADB object for root starts with a u32 recording how many adb_val_ts follow it, including the count itself, then come the actual adb_val_ts. So, continuing the example, with u32(body[5644:5648]) being 4, the 3 adb_val_ts are expected at body[5648:5652], body[5652:5656] and body[5656:5660].

Note the last adb_val_t sits right at the end of this payload / ADB_BLOCK_ADB; you can easily tell that the root value was written last during ADB_BLOCK_ADB creation.

The above 3 values with IDs starting at 1, alongside the count u32 (4) with ID 0 (remember this convention: APK prefers to use ID 0 for the num/count and 1 onwards for the actual slots), can be considered a 4-length adb_val_t array, where each subsequent member is used for a different purpose:

ID NAME Purpose
1 ADBI_PKG_PKGINFO package info metadata
2 ADBI_PKG_PATHS package file / folder paths (more compact than a plain list of texts)
3 ADBI_PKG_SCRIPTS package postrm / preinst / etc scripts
4 ADBI_PKG_TRIGGERS package triggers

If a slot is not needed, it can be set to 0 when slots after it are still needed, or be absent entirely when no slot after it is needed. In this example, with the count being 4, there are only 3 slots, so no TRIGGERS was stored.

We will only focus on PKGINFO and PATHS, as SCRIPTS and TRIGGERS pretty much follow the idea of PATHS + ADB_BLOCK_DATA.

ADBI_PKG_PKGINFO: Package Info

ID 1 in the root object, which marks the head of the package info, is yet another object. E.g. root_obj[1] == body[5648:5652] == 0xe00012b8 means an object (0xe00012b8 & 0xf0000000 == 0xe0000000) at offset 4792 (0xe00012b8 & 0x0fffffff == 0x12b8 == 4792), which points to another u32 for the number of elements including itself. So u32(body[12+4792:+4]) == u32(body[4804:4808]) == 17 means there are 16 adb_val_ts after the count u32, the 1st at body[4804+4*1:+4] == body[4808:4812] and the 16th at body[4804+4*16:+4] == body[4868:4872].

Now is a good time to list all possible data types, which can all be possibly used in these fields:

Type Magic Note
ADB_TYPE_SPECIAL 0x00000000 Currently just alias to INT
ADB_TYPE_INT 0x10000000 Single u32 (max 0x0fffffff) value in low
ADB_TYPE_INT_32 0x20000000 Single u32 at low-as-off
ADB_TYPE_INT_64 0x30000000 Single u64 at low-as-off
ADB_TYPE_BLOB_8 0x80000000 Series of u8, length (u8) + data (u8s) at low-as-off
ADB_TYPE_BLOB_16 0x90000000 Series of u8, length (u16) + data (u8s) at low-as-off
ADB_TYPE_BLOB_32 0xa0000000 Series of u8, length (u32) + data (u8s) at low-as-off
ADB_TYPE_ARRAY 0xd0000000 Series of same type, length (u32) + data at low-as-off
ADB_TYPE_OBJECT 0xe0000000 Series of different type, length (u32) + data at low-as-off
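Under these definitions, a single adb_val_t can be resolved against its containing payload roughly like this (a sketch of mine; ARRAY and OBJECT need recursive handling and are left out):

```python
import struct

# Sketch: decode one adb_val_t against the containing ADB payload
# (the payload that starts with struct adb_hdr). Names are mine.
def decode(payload: bytes, val: int):
    t, low = val & 0xf0000000, val & 0x0fffffff
    if t in (0x00000000, 0x10000000):     # SPECIAL / INT: value embedded in low
        return low
    if t == 0x20000000:                   # INT_32: u32 at low-as-offset
        return struct.unpack_from('<I', payload, low)[0]
    if t == 0x30000000:                   # INT_64: u64 at low-as-offset
        return struct.unpack_from('<Q', payload, low)[0]
    if t == 0x80000000:                   # BLOB_8: u8 length + data
        n = payload[low]
        return payload[low+1:low+1+n]
    if t == 0x90000000:                   # BLOB_16: u16 length + data
        n = struct.unpack_from('<H', payload, low)[0]
        return payload[low+2:low+2+n]
    if t == 0xa0000000:                   # BLOB_32: u32 length + data
        n = struct.unpack_from('<I', payload, low)[0]
        return payload[low+4:low+4+n]
    raise ValueError('ARRAY / OBJECT need recursive handling')
```

With the example's package-info payload, decode(payload, 0x80000008) would yield the 8-byte name blob discussed below.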

As we briefly mentioned earlier, each element does not record its own ID, yet the ID has special meaning; to mark an empty, skipped element, the element shall be the special value 0. E.g. if a package has only an ID 6 field to declare, then its first 5 slots would all be set to 0 so they're skipped, and there would be no ID 7 field onwards.

These slots are numbered as follows:

ID Name Data Type
1 NAME BLOB, usually BLOB_8, as string
2 VERSION BLOB, usually BLOB_8, as string
3 HASHES BLOB, usually BLOB_8, as hex-string
4 DESCRIPTION BLOB, usually BLOB_8, as string
5 ARCH BLOB, usually BLOB_8, as string
6 LICENSE BLOB, usually BLOB_8, as string
7 ORIGIN BLOB, usually BLOB_8, as string
8 MAINTAINER BLOB, usually BLOB_8, as string
9 URL BLOB, usually BLOB_8, as string
10 REPO_COMMIT BLOB, usually BLOB_8, as hex-string
11 BUILD_TIME INT, usually embedded
12 INSTALLED_SIZE INT, usually embedded
13 FILE_SIZE INT, usually embedded
14 PROVIDER_PRIORITY INT, usually embedded
15 DEPENDS OBJECT of dependency (see below)
16 PROVIDES OBJECT of dependency (see below)
17 REPLACES OBJECT of dependency (see below)
18 INSTALL_IF OBJECT of dependency (see below)
19 RECOMMENDS OBJECT of dependency (see below)
20 LAYER INT, usually embedded
21 TAGS OBJECT of BLOB, usually BLOB_8, as string array

So e.g. the first element, u32(body[4808:4812]) == 0x80000008, is not zero, meaning the package has an actual name; the high 0x8 means this is a BLOB_8 item, and the low 0x8 means the item's length is at offset 8 with content starting at offset 9. So length == u8(body[12+8]) == 8, and the content is at body[12+9:12+9+8] == body[21:29]; for the example package this is crowdsec.

And e.g. the last element, ID 16, u32(body[4868:4872]) == 0xe0000184, is not zero, meaning the package has an actual provides OBJECT (0xe0...) at offset 0x184 == 388 in the payload. So u32(body[12+388:12+388+4]) == u32(body[400:404]) == 2 records the number of sub-elements including the count itself, therefore 1 actual element; adb_val_t(body[404:408]) thus records the info of the only sub-element, here 0xe000017c, meaning it's yet another OBJECT starting at offset 0x17c. Then u32(body[392:396]) == 2, so there's again 1 real sub-element; adb_val_t(body[396:400]) == 0x8000016c means this is a BLOB_8 starting at 0x16c; we get the length from u8(body[12+0x16c]) == 12, and finally the provide item at body[12+0x16c+1:+12] == body[377:389], being crowdsec-any.

While provides may look like a list of lists of BLOB_8s, recall that OBJECT elements can have different types: each element in provides is actually strongly-typed dependency info, containing the NAME slot (BLOB_8, ID 1), VERSION slot (BLOB_8, ID 2), and MATCH slot (INT, ID 3, for vercmp operations). In this example there's simply no VERSION or MATCH.

All dependency-like elements can have these 3 slots:

ID Name Data Type
1 NAME BLOB, usually BLOB_8, string
2 VERSION BLOB, usually BLOB_8, string
3 MATCH INT, usually embedded

The MATCH field is a bitwise OR of the following base bits:

NAME VALUE bit
EQUAL 1 0b00001
LESS 2 0b00010
GREATER 4 0b00100
FUZZY 8 0b01000
CONFLICT 16 0b10000

The implementation would pre-calculate all valid combinations of these; these are (excluding CONFLICT which can be freely appended):

Sign Meaning Bits
< Less than 0b0010
<= Less than or equal to 0b0011
<~ less than or equal to, fuzzy 0b1011
~ Equal to, fuzzy 0b1001
= Equal to 0b0001
>= Greater than or equal to 0b0101
>~ Greater than or equal to, fuzzy 0b1101
> Greater than 0b0100
>< Special, checksum 0b0110
_ Special, any 0b0111

So e.g. a match field with value 0x10000003 would mean INT with value 0x3 == 0b11, therefore <=
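Decoding a MATCH value can be sketched from the tables above (the operator map is built from the combination table; rendering CONFLICT as a leading ! is my own guess, not something I confirmed from apk-tools):

```python
# Base bits of the MATCH field, per the table above
EQUAL, LESS, GREATER, FUZZY, CONFLICT = 1, 2, 4, 8, 16

# Valid combinations (CONFLICT excluded, it can be freely appended)
OPS = {
    LESS: '<', LESS | EQUAL: '<=', LESS | EQUAL | FUZZY: '<~',
    EQUAL | FUZZY: '~', EQUAL: '=', GREATER | EQUAL: '>=',
    GREATER | EQUAL | FUZZY: '>~', GREATER: '>',
    LESS | GREATER: '><', EQUAL | LESS | GREATER: '_',
}

def match_op(val: int) -> str:
    bits = val & 0x0fffffff            # low 28 bits of the INT adb_val_t
    neg = '!' if bits & CONFLICT else ''   # '!' prefix for CONFLICT is my guess
    return neg + OPS[bits & ~CONFLICT]

assert match_op(0x10000003) == '<='
```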

ADBI_PKG_PATHS: Paths

Files and folders are stored in ADB_BLOCK_ADB in a compact way, ahead of any file data that may appear later in ADB_BLOCK_DATA blocks. Each of these path elements stores a folder path without the leading / (an empty path for the root folder), then any number of direct file entries. While most file entries do need an ADB_BLOCK_DATA block for their actual data, others can exist purely in the header.

ID 2 in ADB_BLOCK_ADB's root object, which marks the head of the paths, is yet another object. E.g. root_obj[2] == body[5652:5656] == 0xe00012fc means an object at offset 0x12fc, and u32(body[12+0x12fc:+4]) == u32(body[4872:4876]) == 22 means there are 21 adb_val_ts after the count u32, the 1st at body[4872+4*1:+4] == body[4876:4880] and the 21st at body[4872+4*21:+4] == body[4956:4960].

Each one of these 21 "path"s is actually called ADBI_DI by apk-tools, and is also an OBJECT with the following slots (again, some are optional):

ID NAME Data Type
1 NAME BLOB, usually BLOB_8, string
2 ACL OBJECT being ACL info (see below)
3 FILES OBJECT of File info (see below)

In the example the last "path", adb_val_t(body[4956:4960]) == 0xe0001290, is an OBJECT starting from 12 + 0x1290 == 4764; as u32(body[4764:4768]) == 4 there are 3 slots after it.

For ID 1, NAME: adb_val_t(body[4768:4772]) == 0x80001258, a BLOB_8 starting at offset 0x1258; u8(body[12+0x1258]) == u8(body[4708]) == 7 says this is a 7-length string, with content at body[4708+1:+7] == body[4709:4716] == b"usr/bin", so the folder name/path is usr/bin.

For ID 2, ACL: adb_val_t(body[4772:4776]) == 0xe0000194, an OBJECT starting at offset 0x194; the count u32(body[12+0x194:+4]) == u32(body[416:420]) == 4, so there are 3 slots after it.

The ACL info OBJECT could have the following slots:

ID NAME Data Type
1 MODE INT, usually embedded
2 USER BLOB, usually BLOB_8, string
3 GROUP BLOB, usually BLOB_8, string
4 XATTRS OBJECT of BLOB, usually BLOB_8, each BLOB_8 with \0 as sep for name and value

In the example:

  • SLOT1 reads 0x100001ed, so it's an INT with value 0x1ed == 0o755
  • SLOT2 reads 0x8000018c, so it's a BLOB_8 starting at offset 0x18c; the length at body[408] reads 4 and the content at body[409:413] reads root
  • SLOT3 reads the same value, so it reuses root from USER
  • There's no SLOT4

For ID 3, FILES: adb_val_t(body[4776:4780]) == 0xe0001260, an OBJECT starting at offset 0x1260; the count u32(body[12+0x1260:+4]) == u32(body[4716:4720]) == 12, so there are 11 file entries after it, the first at body[4720:4724] and the last at body[4760:4764].

The first file entry, adb_val_t(body[4720:4724]) == 0xe0000f4c, is an OBJECT starting at offset 0xf4c; the count u32(body[12+0xf4c:+4]) == u32(body[3928:3932]) == 6, so there are 5 slots after it.

The File info OBJECT could have the following slots:

ID NAME Data Type
1 NAME BLOB, usually BLOB_8, string
2 ACL OBJECT being ACL info (see above)
3 SIZE INT, usually embedded
4 MTIME INT, usually INT32
5 HASHES BLOB, usually BLOB_8, hex-string
6 TARGET BLOB, usually BLOB_8, string

In the example:

  • SLOT1 reads 0x80000008, so it's a BLOB_8 with length at offset 8, u8(body[12+8]) == 8, and content body[12+8+1:+8] == b"crowdsec"
  • SLOT2 reads 0xe0000194, so it's again 0o755 owned by root:root
  • SLOT3 reads 0x133cd7e8, so it's an INT with value 0x33cd7e8 == 54319080
  • SLOT4 reads 0x200001e4, so it's an INT_32 at offset 0x1e4, and u32(body[12+0x1e4:+4]) == 1772344484, so the mtime is Sun Mar 1 05:54:44 UTC 2026
  • SLOT5 reads 0x80000f28, so it's a BLOB_8 with length at offset 0xf28, u8(body[12+0xf28]) == 32, and content body[12+0xf28+1:+32] reading a hex-string, which is the SHA256 checksum of the file
  • There's no SLOT6, as this is a regular file

A file can have its SIZE set to 0, making it an empty file, and on top of that have TARGET set, so that it serves either as a symlink or hardlink to the given target, or as a special device node (char/block/fifo).

Let's use the third file entry under the same last path entry to examine what TARGET does. It is adb_val_t(body[4728:4732]) == 0xe0000fdc, an OBJECT at offset 0xfdc; we read u32(body[12+0xfdc:+4]) == u32(body[4072:4076]) == 7, so it does have a 6th slot for TARGET. We read the name adb_val_t(body[4076:4080]) == 0x80000fac, so the name starts at offset 0xfac with len == u8(body[12+0xfac]) == u8(body[4024]) == 5 and content body[4024+1:+5] == body[4025:4030] == b"cscli", so the link itself is usr/bin/cscli. We then skip to SLOT 6 for TARGET, which is adb_val_t(body[12 + 0xfdc + 4 * 6:+4]) == adb_val_t(body[4096:4100]) == 0x80000fb2, a BLOB at offset 0xfb2; we read the length u8(body[12+0xfb2]) == u8(body[4030]) == 23, so the content is body[4030+1:+23] == body[4031:4054] == b"\x00\xa0/usr/bin/crowdsec-cli"

The first two bytes of the TARGET determine the data type and shall be handled as one u16: u16(body[4031:4033]) == 40960 == 0o120000. This is basically the same thing as the st_mode field of a struct stat already masked with S_IFMT. The following file types are supported:

Type Mask Content at target[2:]
S_IFBLK 0o060000 8-byte, as u64 for dev ID (major:minor combined)
S_IFCHR 0o020000 8-byte, as u64 for dev ID (major:minor combined)
S_IFIFO 0o010000 8-byte, as u64 for dev ID (major:minor combined)
S_IFLNK 0o120000 any-length, for symlink target
S_IFREG 0o100000 any-length, for hardlink target

The real target for a symlink is therefore target[2:], so we know this symlink is usr/bin/cscli -> /usr/bin/crowdsec-cli.
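The TARGET handling can be sketched as follows (a sketch of mine; the 'symlink'/'hardlink'/'device' labels are my own, not apk-tools'):

```python
import struct

# Sketch: interpret a file entry's TARGET blob per the type table above.
def decode_target(target: bytes):
    fmt = struct.unpack_from('<H', target, 0)[0]   # leading u16, little-endian
    if fmt == 0o120000:                            # S_IFLNK: symlink target
        return 'symlink', target[2:].decode()
    if fmt == 0o100000:                            # S_IFREG: hardlink target
        return 'hardlink', target[2:].decode()
    if fmt in (0o060000, 0o020000, 0o010000):      # blk/chr/fifo: u64 dev ID
        return 'device', struct.unpack_from('<Q', target, 2)[0]
    raise ValueError('unknown target type %o' % fmt)

# The example entry from above decodes to the cscli symlink target
kind, dest = decode_target(b'\x00\xa0/usr/bin/crowdsec-cli')
```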

When reading through the PATHS info it's recommended to store it for later lookup, as the ADB_BLOCK_DATA blocks only carry the file content, not the names, paths, ownership, etc.

ADBI_PKG_SCRIPTS: scripts

ID 3 in the ADB_BLOCK_ADB is ADBI_PKG_SCRIPTS, an OBJECT with multiple BLOBs for the package's pre/post scripts.

E.g. root_obj[3] == body[5656:5660] == 0xe00015e0 means an object at offset 0x15e0, and u32(body[12+0x15e0:+4]) == u32(body[5612:5616]) == 8 means there are 7 adb_val_ts after the count u32, the 1st at body[5612+4*1:+4] == body[5616:5620] and the 7th at body[5612+4*7:+4] == body[5640:5644]

The scripts OBJECT could have the following slots

ID NAME
1 TRIGGER
2 PREINST
3 POSTINST
4 PREDEINST
5 POSTDEINST
6 PREUPGRADE
7 POSTUPGRADE

All names except TRIGGER should be self-explanatory. The TRIGGER one is special, as it is run on changes to the paths listed in the latter ADBI_PKG_TRIGGERS.

The example package has only 3/POSTINST, 4/PREDEINST and 7/POSTUPGRADE. Take the last slot for example: u32(body[5640:5644]) reads 0x800014e4, so it's a BLOB_8 at offset 0x14e4; u8(body[12+0x14e4]) == 251, so the length is 251 and therefore the content is body[12+0x14e4+1:+251] == body[5361:5612], b'#!/bin/sh\nexport PKG_UPGRADE=1\n[ "${IPKG_NO_SCRIPT}" = "1" ] && exit 0\n[ -s ${IPKG_INSTROOT}/lib/functions.sh ] || exit 0\n. ${IPKG_INSTROOT}/lib/functions.sh\nexport root="${IPKG_INSTROOT}"\nexport pkgname="crowdsec"\nadd_group_and_user\ndefault_postinst\n', which prints as:

#!/bin/sh
export PKG_UPGRADE=1
[ "${IPKG_NO_SCRIPT}" = "1" ] && exit 0
[ -s ${IPKG_INSTROOT}/lib/functions.sh ] || exit 0
. ${IPKG_INSTROOT}/lib/functions.sh
export root="${IPKG_INSTROOT}"
export pkgname="crowdsec"
add_group_and_user
default_postinst

ADBI_PKG_TRIGGERS: triggers

ID 4 in the ADB_BLOCK_ADB is ADBI_PKG_TRIGGERS, an OBJECT with multiple BLOBs listing the paths that shall trigger the TRIGGER script to run.

Note it is totally valid for a package to have no TRIGGER script yet multiple TRIGGERS paths.

The signature block ADB_BLOCK_SIG

Such block must be after ADB_BLOCK_ADB and before ADB_BLOCK_DATA.

The block begins with a 2-byte header: a u8 sign_ver field for the signature version (currently must be 0), and a u8 hash_alg field for the hash algorithm ID, defined in C as:

struct adb_sign_hdr {
    uint8_t sign_ver, hash_alg;
};

The hash algorithm could be one of the following:

ID NAME LENGTH
0 NONE -
2 SHA1 20
3 SHA256 32
4 SHA512 64
5 SHA256_160 (actually SHA2 160 variant) 20

The missing ID1 was MD5 whose support was dropped in apk-tools.

And currently apk-tools would only use SHA512 for both signing and verifying.

If this has a valid, non-NONE hash_alg, then the 2-byte header shall be followed by a 16-byte key ID and the signature, defined in C as:

struct adb_sign_v0 {
    struct adb_sign_hdr hdr;
    uint8_t id[16];
    uint8_t sig[];
};
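Splitting such a payload can be sketched as follows (my own sketch, demonstrated on fabricated placeholder bytes, not a real signature):

```python
# Sketch: split an ADB_BLOCK_SIG payload per struct adb_sign_v0 above.
# Function name is mine; no verification is attempted here.
def parse_sig(payload: bytes):
    sign_ver, hash_alg = payload[0], payload[1]
    assert sign_ver == 0                  # only version 0 exists currently
    key_id, sig = payload[2:18], payload[18:]
    return hash_alg, key_id, sig

# Fake payload: hash_alg 4 (SHA512) plus a 512-byte RSA-4096-sized signature
alg, kid, sig = parse_sig(bytes([0, 4]) + b'\x01' * 16 + b'\x02' * 512)
```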

In the example file there’s no SIG block.

When testing signing with apk-tools (which can re-sign an unsigned v3 apk), the private key file passed to --sign-key shall be an OpenSSL private key, which can be generated via e.g. openssl genrsa -aes256 -out /tmp/private.pem 4096, where -aes256 can be omitted if you don't want a password. However, as this is a temporary key not in the trusted pool, the command should look like apk adbsign --allow-untrusted --sign-key /tmp/private.pem crowdsec-1.6.2-r1.apk.resigned

The size of the sig[] part follows what the key specifies, e.g. for the above rsa4096 key the signature shall be 512 bytes. It might be a PKCS#1 message, but as this is only a test with a temporary key I can't confirm the official repo signing method.

The following is output from adumpk:

DEBUG... AdbBlock(type_block=<AdbBlockType.SIG: 1>, size_raw=534, size_payload=530)
INFO.... Hash sha512, 64 bytes, ID 290fb2a94d29dda681301285226e604d: CQT7OfNPmxt6XtW3s1iV5N6DtGlfkVYKYsjKn4LKsRmYW0RjTXhZ12bexzmcx7zIQqs9VMZYyN9ovCobYhnUDikR5an2FoUYIJ9oJAEm3FdS1Q5L0m7mSqssO6SP/Y8dK7G1wgnlvTLgKOQ4gWjVogOLCDFk2j/B15NmGMS3rS7hcYNPhn7SuDTBMzNM6jMNoe0ElYznFCZYEUw89Ow1rD602/sIhO6eZwuTrgsFBq6dBLLiOZ863ufiKnUVNW1PijmdPh730L8aqnlm1Jdro+eN4A5Af5zDsqobPaRlE1Rs/7UzTBozDAIcoPWTjtVkBUqEw8SWMdeAQnlBKkiOGmq5uGsM/KvgZb+NthME5YcsbWJLineVCuZ/iVZCAtSbKvFlPKRpwk385YnA/LMfdIuR7dsZQLjpzEdgYC5/57O/CWOs7WvBI4jXi0wiTqbEHKKSHhlmnJI7DdTwAesE86G5lgxqamnxIuG9xjD6Cm6l9fPYR3dcVAFl76FuLSLzDuT4J51o48F4MvlyfJIt5a+Thoknvhcg4OXEAJMg5tOc5uWU+TV1cllLqkeyAh1qxUCbol4mU5ZLctgMYGsSnCxISuDXNDy6k6D/m3ilz+9BOIrfKM2C6z7SBvCzmoezCMkr2oBdGHbgguSj9vkwwLXHzMbY7AZXRb0UQ3fIml4=

The data block ADB_BLOCK_DATA

Such block must come after ADB_BLOCK_ADB and cannot come before ADB_BLOCK_SIG.

The block begins with an 8-byte header, including a u32 path_idx field for the 1-based ID of the corresponding PATH element, and a u32 file_idx field for the 1-based ID of the corresponding FILE element in that PATH, defined in C as:

struct adb_data_package {
    uint32_t path_idx;
    uint32_t file_idx;
};

The file data follows directly after the header. Remember that the raw size includes the block header, and each block starts at an 8-byte boundary, aligning up to the next 8-byte boundary.

E.g. in the example file, right after ADB_BLOCK_ADB at body[8:5660], pad to the 8-byte boundary 5664 and read the 4-byte type_size u32(body[5664:5668]) == 0x8000007f: the type is 0x8000007f >> 30 == 2, an ADB_BLOCK_DATA, and the raw size including the header is 0x7f == 127, so the whole block is body[5664:+127] == body[5664:5791]. In it, path_idx is u32(body[5668:5672]) == 3, file_idx is u32(body[5672:5676]) == 1, and the actual data length is 127 (whole block) - 4 (block header) - 8 (data header) = 115, which we can confirm is body[5676:5791] == body[5676:+115].
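The payload split (with the 4-byte block header already stripped) can be sketched as follows (names are mine):

```python
import struct

# Sketch: split an ADB_BLOCK_DATA payload into its 1-based path/file
# indices and the raw file content that follows the 8-byte data header.
def parse_data(payload: bytes):
    path_idx, file_idx = struct.unpack_from('<II', payload, 0)
    return path_idx, file_idx, payload[8:]

# Mirror the worked example: a 123-byte payload yields 115 bytes of file data
payload = struct.pack('<II', 3, 1) + b'x' * 115
p, f, data = parse_data(payload)
```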

The file content reads as below:

config crowdsec 'crowdsec'
    option data_dir '/srv/crowdsec/data'
    option db_path '/srv/crowdsec/data/crowdsec.db'

And if we go back to look at the PATH block, we would know this is for folder etc/config and file crowdsec, perm 0o600 owned by root:root, with SHA256 checksum.

Multi-architecture multi-distro in one root partition

2025-11-14

Recently I needed to do offline OS maintenance work on quite a few of my devices, for which I used Ventoy + archiso on x86_64 for Debian 13 / Arch, and an ALARM OS drive on aarch64 for Debian 13 / ALARM. I found archiso more and more annoying, as I had to re-do a lot of initial setup every time.

While I know this could be improved with a dedicated persistent fs for configs, cloud-init scripts, or archiso boot parameters, I don't really want an immutable live system for this work any more. So I decided: what if I had a single drive on which all of the following systems boot from the same root partition?

  • Arch Linux x86_64 bootable via both UEFI and legacy
  • Debian 13 x86_64 bootable via both UEFI and legacy
  • Debian 13 aarch64 bootable via UEFI (and U-boot distroboot)
  • Arch Linux ARM aarch64 bootable via UEFI (and U-Boot distroboot)
  • And of course more!

Background knowledge

Before the actual installation, I'll explain the background knowledge first. If you can't be bothered, skip to the next chapter.

Booting on x86_64 UEFI, without CSM

On x86_64 UEFI, without CSM, the booting process is quite simple:

  1. UEFI powers on and does preparation until the booting logic is ready
  2. If quick boot is enabled, load only the necessary drivers and execute the first available (destination exists and binary exists) BootEntry in the current BootOrder; if this succeeds, no remaining steps are executed
  3. If external boot sources are available (e.g. network adapters with a UEFI ROM) and not disabled, let them scan and register BootEntrys as needed; on most devices this step does not execute at all
  4. The UEFI firmware loads various drivers and scans all drives for an EFI partition, that is, MSDOS type "EFI (FAT-12/16/32)" or GPT type "C12A7328-F81F-11D2-BA4B-00A0C93EC93B"; for each FAT fs with such a type, it opens and searches for the EFI binary at the removable path "EFI/BOOT/BOOTX64.EFI" and registers a BootEntry for it with a generated name like "UEFI OS"; whether different UEFI vendors register these earlier or later than existing entries is undetermined
  5. Proceed as in step 2

A note about MBR on UEFI: as the specification only requires support for GPT, whether MBR is supported cannot be determined before you get your hands on the actual machine. Windows and systemd-boot simply refuse to install on MBR under UEFI. While all of my devices support it and I use it for local system installation on small drives, I will only focus on GPT on UEFI, given that we want the resulting drive bootable on various machines.

Most UEFI-compatible boot managers support being (or even have to be) installed at the removable path. E.g. for grub (expecting the EFI partition mounted at /efi):

grub-install --target x86_64-efi --removable

On Debian 13 this installs the following files to /efi/EFI/BOOT:

  • BOOTX64.CSV: this contains the entry to be registered once the shim at BOOTX64.EFI has successfully booted
  • BOOTX64.EFI: Debian's signed shim; it loads the signed grubx64.efi and registers entries according to BOOTX64.CSV; this is the one the UEFI firmware picks as the removable EFI binary
  • grub.cfg: Grub’s config, it just tells grub to scan for real root and look up configs there. An example content:
      search.fs_uuid 91c69930-b508-4a42-b510-d63544d7eae0 root hd1,gpt2
      set prefix=($root)'/boot/grub'
      configfile $prefix/grub.cfg
    

    Grub's config files look like shell scripts, and you can think of configfile as source in shell. So in this case the grub.cfg in the EFI partition just records where to find the root partition (search for the fs with uuid 91c69930-b508-4a42-b510-d63544d7eae0 and record the result in the variable root; if that fails, use the default hd1,gpt2), sets another variable prefix to the path of the folder /boot/grub under that root fs, then "sources" another grub.cfg from there.

  • grubx64.efi: Grub’s core EFI binary, signed by Debian, loads grub.cfg
    • The file carries a built-in $prefix variable equalling /EFI/debian to instruct where to look for the grub folder; in the removable case it instead tries [ESP]/EFI/BOOT, so it looks up grub.cfg there
  • mmx64.efi: Machine owner key manager, only needed for secure boot, not needed for fully removable use cases, can be safely deleted

On Arch Linux this installs only BOOTX64.EFI; even the config needs to be created manually.

For a removable drive, on which (some) kernels would be unsigned, we certainly don't want strict Secure Boot, nor permissive Secure Boot with a Machine Owner Key managed by ourselves. And we don't want it to register non-removable UEFI BootEntrys if possible.

The dependency tree of Debian's Grub split packages makes it really hard to install without Secure Boot under UEFI, as grub-efi-amd64-bin, a soft dependency making grub-install --target x86_64-efi possible, hard-depends on grub-efi-amd64-signed. The whole dependency tree becomes locked in and almost impossible to uninstall because the packages are considered "essential". And a dpkg hook would "helpfully" re-install (update) Grub on version changes, not respecting the existing layout. So in later steps we will install and manage the boot-loader part of Grub from Arch Linux, and install only the boot-configuration-generation part on Debian.

Don't worry about "Debian not having its own Grub". It is $prefix/grub.cfg that grub-mkconfig / update-grub normally updates; these are managed by the systems themselves, and the configuration tool would still be installed.

Other than the per-system grub.cfg, we want an "outer grub.cfg", used directly by Grub, containing menuentrys to select which sub grub.cfg to redirect to.
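Such an outer grub.cfg could look roughly like this (a hypothetical sketch: the entry names and per-system config paths are placeholders of mine; only the UUID is reused from the earlier example):

```
# Hypothetical outer grub.cfg at [ESP]/EFI/BOOT/grub.cfg;
# entry names and paths below are placeholders, not from an actual install.
search.fs_uuid 91c69930-b508-4a42-b510-d63544d7eae0 root

menuentry 'Arch Linux x86_64' {
    configfile ($root)/boot/grub-arch/grub.cfg
}
menuentry 'Debian 13 x86_64' {
    configfile ($root)/boot/grub-debian/grub.cfg
}
```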

The reason we choose Grub rather than other booting methods:

  • While systemd-boot is also a candidate on x86_64 UEFI, the straightforward installation tool bootctl does not natively support a removable option (so it must write a BootEntry at installation, which I find quite annoying); although you could manually copy the binary to the removable EFI path, no maintenance can then be done easily with bootctl update; also, systemd-boot does not support legacy BIOS, so we would have to maintain more booting configs, which is a hassle in itself
  • Placing a unified kernel image at the removable path is OK if you only want a single distro, but impossible if we want multiple distros (well, technically you can share a kernel image cross-distro, but good luck updating it)

In summary, for the multi-boot logic we would need: GPT + EFI partition + one single removable Grub EFI binary + one single grub.cfg as menuentry selector + one grub.cfg per system maintained by the system itself

Booting on x86 legacy BIOS or on x86_64 UEFI with CSM

This is still “simple” but not straightforward from a modern perspective. Still, let’s write the main ideas down:

  1. BIOS powers on and does preparation until the booting logic is ready
  2. BIOS registers newly found drives into its pool, not necessarily at the last positions
  3. For each target in the boot-order configuration: if it is not a drive, delegate to an external source (e.g. network boot via an option ROM); otherwise load the Master Boot Record (MBR) at sector 0 and try to execute it; in most cases this would not return even if the MBR is not technically bootable (some partition tools would place a binary here to print “unbootable device”, and for some this means a hang, i.e. soft-locked)
  4. If all drives in the boot order failed, print “no bootable drives found” and hang

Note that while the MBR was mentioned above, the partition table on the drive does not have to be MBR / msdos. The BIOS actually knows nothing about the partitions; it just reads stuff from fixed offsets (think of that as “as-if MBR”; in fact the whole drive can be a “super floppy”, i.e. a filesystem on the whole drive, and as long as sector 0 is available it does not matter).
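To make the “fixed offsets” point concrete, here is a small self-contained sketch (not part of the setup itself): the only thing the firmware checks on sector 0 is the 55 aa boot signature at byte offsets 510-511. A throwaway image stands in for a real drive such as /dev/sda:

```shell
# The BIOS only cares that bytes 510-511 of sector 0 read 55 aa.
# Stamp a throwaway 512-byte image and read the signature back:
img=$(mktemp)
truncate -s 512 "$img"
printf '\x55\xaa' | dd of="$img" bs=1 seek=510 conv=notrunc status=none
dd if="$img" bs=1 skip=510 count=2 status=none | od -An -tx1 | tr -d ' '
# prints: 55aa
rm -f "$img"
```

On a real drive the same read would be dd if=/dev/sda bs=1 skip=510 count=2 piped through od.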

So most legacy-BIOS-compatible boot loaders need to be installed to the MBR (or, technically, the whole drive), e.g. for Grub:

grub-install --target i386-pc /dev/sda

The binary that Grub installs into the MBR / sector 0 is called boot.img by Grub itself; its functionality is similar to grubx64.efi in the UEFI case: load the drivers necessary to look up the real grub.cfg. But as MBR sector 0 is too small (512 bytes), the tiny boot.img cannot do this by itself. Instead, another binary called core.img needs to be looked up and executed. Grub stores this differently depending on whether you’re booting from MBR or GPT:

  • on MBR, core.img is stored from sector 1 onward, before the first partition; the space is 1 MiB - 512 B and in the real world would never be fully utilized.
  • on GPT, core.img is stored in the partition with type BIOS boot (21686148-6449-6E6F-744E-656564454649); for the same reason this needs to be only 1 MiB

Of course core.img itself carries metadata, including how large the actual data is, filesystem drivers so the actual root partition can be opened, and optionally a built-in config.

The logic is simply boot.img -> core.img -> $prefix/grub.cfg; after that, the later steps are similar to the UEFI case.

Similar to UEFI, we would want a single outer Grub, with a config managed by ourselves, to function as the selector, and each system managing its own inner grub.cfg (but not having its own Grub boot manager) ready to be picked up by ourselves.

The reason we choose Grub rather than other booting methods:

  • While syslinux can also provide a menu, and is pretty KISS, it would need a separate config, instead of reusing the same config as UEFI

In summary, for the multi-boot logic we would need: GPT + BIOS boot partition + one single Grub binary in MBR sector 0 and the BIOS boot partition + one single grub.cfg as menuentry selector + one grub.cfg per system maintained by the system itself

Booting on AArch64, UEFI or U-Boot distroboot

Some AArch64 devices support UEFI; many others don’t and use U-Boot. In not-too-old U-Boot builds, the “distroboot” concept scans for various bootable targets, and the removable EFI binary EFI/BOOT/BOOTAA64.EFI is one of them, besides extlinux.conf and boot.scr(.uimg). I build U-Boot myself for all of the SBCs and TV boxes on my hands and they all support this without it being explicitly enabled. As we want multi-boot on a single drive on both AArch64 and x86_64, we would only focus on UEFI and distroboot’s UEFI chainloading.

On AArch64 UEFI, the steps are similar as x86_64, except that the fallback EFI binary name is BOOTAA64.EFI

On AArch64 U-Boot that chainloads removable EFI in distroboot logic:

  1. U-Boot powers on and does preparation until the booting logic is ready
  2. Loads environment variables, either from persistent storage or from built-ins, in both cases into memory
  3. Runs the environment variable bootcmd; in modern cases, bootcmd='bootflow scan -lb'
  4. So, runs bootflow: the argument scan means scan all possible sources, in most cases all block devices first, then network; the argument -l means print each scanned bootable target; the argument -b means for each target scanned, try to boot it immediately.
  5. Let’s only focus on the removable/fallback EFI binary, and assume it is the only possible target and gets scanned
  6. U-Boot prepares some “UEFI” environments and “UEFI services”, loads the EFI binary in, then executes it.

Note that, specifically for the per-device DTB to be applied correctly in the U-Boot case, if /boot is in the root fs the job cannot be done by Grub (you cannot expect a partition to be readable before you can even tell there’s a block device); rather, the DTB has to be loaded by U-Boot

Drive preparation

Boot archiso or do this in a device already running Linux.

Run your preferred partition tool to partition the drive with the following partitions:

  • 100 MiB EFI system partition
  • 1 MiB BIOS boot partition
  • Remaining as a single root partition
  • Others as you like

Or simply save the following info in a temporary file, e.g. parts.info:

label: gpt
unit: sectors
sector-size: 512

size=204800, type=C12A7328-F81F-11D2-BA4B-00A0C93EC93B
size=2048, type=21686148-6449-6E6F-744E-656564454649
type=0FC63DAF-8483-4772-8E79-3D69D8477DE4

Then run sfdisk to apply the layout:

sfdisk /dev/[drive] < parts.info

The resulting partitions shall look like the following ([drive] stands for your drive name, e.g. vda):

Checking that no-one is using this disk right now ... OK

Disk /dev/[drive]: 64 GiB, 68719476736 bytes, 134217728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> Script header accepted.
>>> Script header accepted.
>>> Script header accepted.
>>> Created a new GPT disklabel (GUID: D8997F52-AFD6-4A0F-845F-FF12CDD718F6).
/dev/[drive]1: Created a new partition 1 of type 'EFI System' and of size 100 MiB.
/dev/[drive]2: Created a new partition 2 of type 'BIOS boot' and of size 1 MiB.
/dev/[drive]3: Created a new partition 3 of type 'Linux filesystem' and of size 63.9 GiB.
/dev/[drive]4: Done.

New situation:
Disklabel type: gpt
Disk identifier: D8997F52-AFD6-4A0F-845F-FF12CDD718F6

Device      Start       End   Sectors  Size Type
/dev/[drive]1    2048    206847    204800  100M EFI System
/dev/[drive]2  206848    208895      2048    1M BIOS boot
/dev/[drive]3  208896 134215679 134006784 63.9G Linux filesystem

The partition table has been altered.

Create a FAT fs on the ESP:

mkfs.vfat /dev/[drive]1

Create a Btrfs on the root partition:

mkfs.btrfs /dev/[drive]3

Note about Btrfs compression: if you want it, you could set it at fs-creation time, e.g.:

mkfs.btrfs --compress zstd:15 /dev/[drive]3

But the later steps assume this is not set, and we would write the compression level manually at mount time and in fstab. You can omit those if you set compression at fs-creation time.

Now mount the Btrfs root partition somewhere; we need to create some subvolumes:

mount --mkdir /dev/[drive]3 /mnt/manyos
cd /mnt/manyos

Let’s create subvolumes. The main focus is that we want separate root volumes for each system, and a shared home. (@ or + is not strictly needed in subvolume names, but it helps to tell them apart from plain folders)

  • The generic style, one subvolume per system plus a shared home (very simple for later mounting and fstab):
    btrfs subvolume create @arch @debian @home
    
  • My style:
    mkdir shared arch-x86_64 debian-x86_64 alarm-aarch64 debian-aarch64
    btrfs subvolume create shared/@{home{,_.snapshots},etc_ssh} {arch-x86_64,debian-x86_64,alarm-aarch64,debian-aarch64}/{@{,.snapshots},+nocow}
    chattr +C {arch-x86_64,debian-x86_64,alarm-aarch64,debian-aarch64}/+nocow
    mkdir -p arch-x86_64/+nocow/var/{cache,log,spool,tmp}
    chmod 1777 arch-x86_64/+nocow/var/tmp
    mkdir template
    tar -f template/nocow.tar -C arch-x86_64/+nocow -cv .
    for i in arch-x86_64/@ {debian-x86_64,alarm-aarch64,debian-aarch64}/{@,+nocow}; do tar -f template/nocow.tar -C $i -xv; done
    

    The layout shall look like the following:

    > tree
    .
    ├── arch-x86_64
    │   ├── @
    │   │   └── var
    │   │       ├── cache
    │   │       ├── log
    │   │       ├── spool
    │   │       └── tmp
    │   ├── +nocow
    │   │   └── var
    │   │       ├── cache
    │   │       ├── log
    │   │       ├── spool
    │   │       └── tmp
    │   └── @.snapshots
    ├── ...
    ├── shared
    │   ├── @home
    │   └── @home_.snapshots
    └── template
        └── nocow.tar
    
    34 directories, 1 file
    

    The benefit of the above layout is that stuff not needing snapshots and compression is enclosed in a single +nocow subvolume, with bind-mounting used instead of many subvolumes; system-specific stuff is enclosed in a single top-level folder while shared stuff is not; and the .snapshots subvolume for snapper is pre-created.

In later steps I would follow only my own style.
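If the brace expansions in the subvolume-creation commands above look dense, they can be previewed harmlessly with echo (bash is invoked explicitly here, since brace expansion is a bash feature):

```shell
# Preview what the brace expansion produces, without creating anything
bash -c 'echo {arch-x86_64,debian-x86_64}/{@{,.snapshots},+nocow}'
# prints (one line): arch-x86_64/@ arch-x86_64/@.snapshots arch-x86_64/+nocow debian-x86_64/@ debian-x86_64/@.snapshots debian-x86_64/+nocow
```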

Installing Arch Linux x86_64

As discussed earlier, we would use Arch Linux as the one to install and maintain Grub for x86_64 (since on Debian, Grub almost HAS TO be installed Secure-Boot-style with shim, which we don’t need at all), so let’s install Arch Linux first.

You shall mostly follow the official Installation Guide: boot archiso on x86_64 and install in the UEFI style; of course you could also do this from a device already running Linux. I would only cover the archiso case.

Let’s focus on things that shall go differently from the official way:

  1. Follow the official guide, until before “Partition the disks”
  2. Skip “Partition the disks”
  3. Skip “Format the partitions”
  4. To mount my layout, mount root and nocow subvol first:
     mount -o compress=zstd:15,subvol=arch-x86_64/@ --mkdir /dev/[drive]3 /mnt/root
     mount -o subvol=arch-x86_64/+nocow --mkdir /dev/[drive]3 /mnt/arch-x86_64+nocow
    

    Then pre-create fstab and edit it:

     mkdir /mnt/root/etc
     cp /etc/fstab /mnt/root/etc/
     vim /mnt/root/etc/fstab
    

    Remember to use vim’s multi-line edit functionality (shift + v, ctrl + v, etc.) and its ability to pipe content into an external command (select in multi-line visual, then :, then !column -t to force a table look)

    The content shall look like the following (remember to use blkid to acquire your real UUID for root partition and ESP)

     # Static information about the filesystems.
     # See fstab(5) for details.
    
     # <file system> <dir> <type> <options> <dump> <pass>
     UUID=6894094c-a75e-4f1a-b228-283faf7bf003  /                       btrfs  rw,compress=zstd:15,subvol=arch-x86_64/@            0  0
     UUID=6894094c-a75e-4f1a-b228-283faf7bf003  /.snapshots             btrfs  rw,compress=zstd:15,subvol=arch-x86_64/@.snapshots  0  0
     UUID=6894094c-a75e-4f1a-b228-283faf7bf003  /home                   btrfs  rw,compress=zstd:15,subvol=shared/@home             0  0
     UUID=6894094c-a75e-4f1a-b228-283faf7bf003  /home/.snapshots        btrfs  rw,compress=zstd:15,subvol=shared/@home_.snapshots  0  0
     UUID=6894094c-a75e-4f1a-b228-283faf7bf003  /mnt/arch-x86_64+nocow  btrfs  rw,compress=zstd:15,subvol=arch-x86_64/+nocow       0  0
     /mnt/arch-x86_64+nocow/var/cache           /var/cache              none   bind,private                                        0  0
     /mnt/arch-x86_64+nocow/var/log             /var/log                none   bind,private                                        0  0
     /mnt/arch-x86_64+nocow/var/spool           /var/spool              none   bind,private                                        0  0
     /mnt/arch-x86_64+nocow/var/tmp             /var/tmp                none   bind,private                                        0  0
     UUID=5D03-ED7A                             /efi                    vfat   rw,noatime                                          0  2
    

    Then mount everything remaining up:

     mount --all --fstab /mnt/root/etc/fstab --target-prefix /mnt/root --mkdir
    

     The output of lsblk shall look like the following now:

     # lsblk
     NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
     vda    254:0    0   128G  0 disk
     ├─vda1 254:1    0   100M  0 part /mnt/root/efi
     ├─vda2 254:2    0     1M  0 part
     └─vda3 254:3    0 127.9G  0 part /mnt/root/var/tmp
                                      /mnt/root/var/spool
                                      /mnt/root/var/log
                                      /mnt/root/var/cache
                                      /mnt/root/mnt/arch-x86_64+nocow
                                      /mnt/root/home/.snapshots
                                      /mnt/root/home
                                      /mnt/root/.snapshots
                                      /mnt/arch-x86_64+nocow
                                      /mnt/root
    
  5. Continue from “Installation”. I’d recommend choosing the following bootstrap packages (pre-configuring booster first so the initramfs is only generated once, as the universal one, skipping the non-universal one):
     echo 'universal: true' > /mnt/root/etc/booster.yaml
     pacstrap -K /mnt/root base booster linux intel-ucode amd-ucode linux-firmware btrfs-progs dosfstools grub vim sudo
    
  6. Skip “Configure the system / Fstab”; the fstab generated on our Btrfs layout is pretty messy, just use our own
  7. Continue from “Chroot”, arch-chroot /mnt/root and do the remaining parts, until before “Network configuration”
  8. For “Network configuration”, the hostname can be unique for each system, or the same, depending on your needs; for the network manager, I’d recommend just using systemd-networkd, so:
    • Enable networkd and resolved: systemctl enable systemd-{network,resolve}d
    • Quit from chroot
    • Copy archiso’s network config files: cp -rva /etc/systemd/network/* /mnt/root/etc/systemd/network/
    • Re-link resolv.conf ln -sf /run/systemd/resolve/stub-resolv.conf /mnt/root/etc/resolv.conf
    • Re-enter chroot
  9. Skip “Initramfs”, we’re using booster instead of the default mkinitcpio, and the universal initramfs was already created
  10. For “Boot Loader”, we would install Grub; the package was already installed into the root in the earlier bootstrap step, we only need to install it somewhere bootable:
    1. Install as removable EFI; note we also specify --boot-directory /efi, so Grub modules and the first-stage config are saved to and loaded from there. We only want each system’s /boot to store its own boot config
       grub-install --removable --efi-directory /efi --boot-directory /efi
      
    2. Install to MBR, similarly note we also specify --boot-directory /efi
       grub-install --target i386-pc --boot-directory /efi /dev/[drive]
      
    3. Hack/fix /etc/grub.d/10_linux so it would prefer the booster initramfs (without this, the booster initramfs would be hidden in a submenu) and ro root; if you’re not using only booster, or you don’t require ro root on boot, you can skip this:
       From 9ad850b2b8842bb673313be08f6a5af66cdf12ea Mon Sep 17 00:00:00 2001
       From: Guoxin Pu <[email protected]>
       Date: Thu, 13 Nov 2025 15:52:27 +0800
       Subject: [PATCH] use booster as main initramfs and prefer ro
      
       ---
       10_linux | 26 ++------------------------
       1 file changed, 2 insertions(+), 24 deletions(-)
      
       diff --git a/10_linux b/10_linux
       index e16cea8..3ce3a9d 100755
       --- a/10_linux
       +++ b/10_linux
       @@ -147,7 +147,7 @@ linux_entry ()
         message="$(gettext_printf "Loading Linux %s ..." ${version})"
         sed "s/^/$submenu_indentation/" << EOF
               echo    '$(echo "$message" | grub_quote)'
       -       linux   ${rel_dirname}/${basename} root=${linux_root_device_thisversion} rw ${args}
       +       linux   ${rel_dirname}/${basename} root=${linux_root_device_thisversion} ro ${args}
       EOF
         if test -n "${initrd}" ; then
           # TRANSLATORS: ramdisk isn't identifier. Should be translated.
       @@ -227,7 +227,7 @@ for linux in ${reverse_sorted_list}; do
         done
      
         initrd_real=
       -  for i in "initrd.img-${version}" "initrd-${version}.img" \
       +  for i in "booster-${version}.img" "initrd.img-${version}" "initrd-${version}.img" \
                 "initrd-${alt_version}.img.old" "initrd-${version}.gz" \
                 "initrd-${alt_version}.gz.old" "initrd-${version}" \
                 "initramfs-${version}.img" "initramfs-${alt_version}.img.old" \
       @@ -304,28 +304,6 @@ for linux in ${reverse_sorted_list}; do
         linux_entry "${OS}" "${version}" advanced \
                     "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_DEFAULT}"
      
       -  if test -e "${dirname}/initramfs-${version}-fallback.img" ; then
       -    initrd="${initrd_early} initramfs-${version}-fallback.img"
       -
       -    if test -n "${initrd}" ; then
       -      gettext_printf "Found fallback initrd image(s) in %s:%s\n" "${dirname}" "${initrd_extra} ${initrd}" >&2
       -    fi
       -
       -    linux_entry "${OS}" "${version}" fallback \
       -                "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_DEFAULT}"
       -  fi
       -
       -  if test -e "${dirname}/booster-${version}.img" ; then
       -    initrd="${initrd_early} booster-${version}.img"
       -
       -    if test -n "${initrd}" ; then
       -      gettext_printf "Found booster initrd image(s) in %s:%s\n" "${dirname}" "${initrd_extra} ${initrd}" >&2
       -    fi
       -
       -    linux_entry "${OS}" "${version}" booster \
       -                "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_DEFAULT}"
       -  fi
       -
         if [ "x${GRUB_DISABLE_RECOVERY}" != "xtrue" ]; then
           linux_entry "${OS}" "${version}" recovery \
                       "${GRUB_CMDLINE_LINUX_RECOVERY} ${GRUB_CMDLINE_LINUX}"
       --
       2.51.2
      
    4. Generate grub.cfg for Arch Linux as how you do it on a normal installation:
       mkdir /boot/grub
       grub-mkconfig -o /boot/grub/grub.cfg
      

      Note this would not be the first config Grub loads, but rather just a “to-be-included” config.

    5. Let’s write the real, outer config for Grub to achieve the menu logic, via vim /efi/grub/grub.cfg, with content like the following:
       search.fs_uuid 6894094c-a75e-4f1a-b228-283faf7bf003 root hd0,gpt2
       terminal_input console
       terminal_output console
       set suffix='@/boot/grub/grub.cfg'
       menuentry 'Arch Linux (x86_64)' {
               configfile ($root)/arch-x86_64/$suffix
       }
      
  11. Finalize the installation, umount everything and poweroff
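As a sanity check before powering off, the ESP should now look roughly like the following (a sketch; exact module directories and extra files depend on the Grub version):

```
/efi
├── EFI
│   └── BOOT
│       └── BOOTX64.EFI   <- the removable Grub EFI binary
└── grub
    ├── grub.cfg          <- our hand-written outer selector config
    ├── fonts
    ├── i386-pc           <- Grub modules for legacy BIOS
    └── x86_64-efi        <- Grub modules for UEFI
```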

After the above steps we shall have a UEFI + legacy bootable drive; you can boot it on different machines to validate. The boot menu shall look like this on both UEFI and legacy:


                         GNU GRUB  version 2:2.14rc1-2

 ┌────────────────────────────────────────────────────────────────────────────┐
 │*Arch Linux (x86_64)                                                        │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │ 
 └────────────────────────────────────────────────────────────────────────────┘

      Use the ▲ and ▼ keys to select which entry is highlighted.          
      Press enter to boot the selected OS, `e' to edit the commands       
      before booting or `c' for a command-line.                           
                                                                             

And after pressing Enter it’s the same old Arch Linux Grub menu as always.

While we’re at it, you can of course add timeout, decoration, etc to the outer menu. I’ll stick with the simple look and continue.

After-installation for Arch Linux x86_64

This part is mostly about quality of life improvement and can be skipped. If you want to follow, boot into the installed Arch Linux.

sshd config + key

pacman -S openssh
systemctl enable --now sshd

Verify that sshd pubkeys are generated successfully:

# ls /etc/ssh
moduli	ssh_config  ssh_config.d  sshd_config  sshd_config.d  ssh_host_ecdsa_key  ssh_host_ecdsa_key.pub  ssh_host_ed25519_key	ssh_host_ed25519_key.pub  ssh_host_rsa_key  ssh_host_rsa_key.pub
# systemctl status sshdgenkeys
○ sshdgenkeys.service - SSH Key Generation
     Loaded: loaded (/usr/lib/systemd/system/sshdgenkeys.service; disabled; preset: disabled)
     Active: inactive (dead) since Thu 2025-11-13 16:26:26 CST; 1min 51s ago
 Invocation: ef2323bd8d4e4007be02a0e3208f3c31
    Process: 578 ExecStart=/usr/bin/ssh-keygen -A (code=exited, status=0/SUCCESS)
   Main PID: 578 (code=exited, status=0/SUCCESS)
   Mem peak: 1.7M
        CPU: 84ms

Nov 13 16:26:25 dud systemd[1]: Starting SSH Key Generation...
Nov 13 16:26:26 dud ssh-keygen[578]: ssh-keygen: generating new host keys: RSA ECDSA ED25519
Nov 13 16:26:26 dud systemd[1]: sshdgenkeys.service: Deactivated successfully.
Nov 13 16:26:26 dud systemd[1]: Finished SSH Key Generation.

Do the required sshd_config modifications as needed:

vim /etc/ssh/sshd_config
  • I would replace AuthorizedKeysFile .ssh/authorized_keys -> AuthorizedKeysFile /etc/ssh/authorized_keys/%u, so no one can add their SSH pubkey except root (me); and I would prepare this folder as needed
  • I would uncomment #PasswordAuthentication and set PasswordAuthentication no so password login is disabled
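The same two edits can be scripted; the sketch below runs on a throwaway copy so it can be tried anywhere (on the real system, point cfg at /etc/ssh/sshd_config, whose default lines may differ slightly, and review the result before restarting sshd):

```shell
# Demonstrate the two sshd_config edits on a throwaway copy
cfg=$(mktemp)
printf '%s\n' 'AuthorizedKeysFile .ssh/authorized_keys' '#PasswordAuthentication yes' > "$cfg"
sed -i \
  -e 's|^#\?AuthorizedKeysFile.*|AuthorizedKeysFile /etc/ssh/authorized_keys/%u|' \
  -e 's|^#\?PasswordAuthentication.*|PasswordAuthentication no|' \
  "$cfg"
cat "$cfg"
# AuthorizedKeysFile /etc/ssh/authorized_keys/%u
# PasswordAuthentication no
rm -f "$cfg"
```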

After the modification restart sshd so the changes take effect:

systemctl restart sshd

snapshot on boot

For the setup I would want after-boot snapshots to be taken, so install snapper:

pacman -S snapper

Create configs and enable the needed services:

vim /etc/snapper/configs/root
vim /etc/snapper/configs/home

The contents of the two files shall look like the following:

/etc/snapper/configs/root:

SUBVOLUME="/"
FSTYPE="btrfs"
QGROUP=""
SPACE_LIMIT="0.5"
FREE_LIMIT="0.2"
ALLOW_USERS=""
ALLOW_GROUPS=""
SYNC_ACL="no"
BACKGROUND_COMPARISON="yes"
NUMBER_CLEANUP="yes"
NUMBER_MIN_AGE="1800"
NUMBER_LIMIT="50"
NUMBER_LIMIT_IMPORTANT="10"
EMPTY_PRE_POST_CLEANUP="yes"
EMPTY_PRE_POST_MIN_AGE="1800"

/etc/snapper/configs/home:

SUBVOLUME="/home"
FSTYPE="btrfs"
QGROUP=""
SPACE_LIMIT="0.5"
FREE_LIMIT="0.2"
ALLOW_USERS=""
ALLOW_GROUPS=""
SYNC_ACL="no"
BACKGROUND_COMPARISON="yes"
NUMBER_CLEANUP="yes"
NUMBER_MIN_AGE="1800"
NUMBER_LIMIT="50"
NUMBER_LIMIT_IMPORTANT="10"
EMPTY_PRE_POST_CLEANUP="yes"
EMPTY_PRE_POST_MIN_AGE="1800"

And edit the global config to enable these profiles:

vim /etc/conf.d/snapper

With the following line:

SNAPPER_CONFIGS="root home"

By default snapper-boot only snapshots root, so modify the unit:

systemctl edit snapper-boot

The result shall look like this (the ExecStart line is appended after the original ExecStart):

> cat /etc/systemd/system/snapper-boot.service.d/override.conf
[Service]
ExecStart=/usr/bin/snapper --config home create --cleanup-algorithm number --description "boot"

Then let’s enable needed units:

systemctl enable --now snapper-{boot,cleanup}.timer

Installing Debian x86_64

Let’s do this on Arch Linux x86_64 with debootstrap and arch-chroot

Of course the tools shall be installed first:

pacman -S debootstrap arch-install-scripts

The installation goes similarly as Arch, but keep the following points in mind:

  1. After chroot, export a PATH that includes the missing /sbin parts; the whole command is given in the later steps
  2. Do not install any boot manager! The only system here that has a boot manager installed is Arch x86_64.

The steps are as follows:

  1. Similarly, mount root and nocow subvol first:
     mount -o subvol=debian-x86_64/@ --mkdir /dev/[drive]3 /mnt/root
     mount -o subvol=debian-x86_64/+nocow --mkdir /dev/[drive]3 /mnt/debian-x86_64+nocow
    

    Then duplicate fstab:

     mkdir /mnt/root/etc
     sed 's/arch/debian/g' /etc/fstab > /mnt/root/etc/fstab
    

    Then mount everything remaining up:

     mount --all --fstab /mnt/root/etc/fstab --target-prefix /mnt/root --mkdir
    

     The output of lsblk shall look like the following now:

     # lsblk
     NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
     vda    254:0    0   128G  0 disk
     ├─vda1 254:1    0   100M  0 part /mnt/root/efi
     │                                /efi
     ├─vda2 254:2    0     1M  0 part
     └─vda3 254:3    0 127.9G  0 part /mnt/root/var/tmp
                                      /mnt/root/var/spool
                                      /mnt/root/var/log
                                      /mnt/root/var/cache
                                      /mnt/root/mnt/debian-x86_64+nocow
                                      /mnt/root/home/.snapshots
                                      /mnt/root/home
                                      /mnt/root/.snapshots
                                      /mnt/debian-x86_64+nocow
                                      /mnt/root
                                      /var/tmp
                                      /var/spool
                                      /var/log
                                      /home/.snapshots
                                      /var/cache
                                      /mnt/arch-x86_64+nocow
                                      /home
                                      /.snapshots
                                      /
    
  2. Do debootstrap into the root
     debootstrap trixie /mnt/root http://[mirror_link]
    
  3. chroot into /mnt/root, and set PATH
     arch-chroot /mnt/root
    
     export PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    

    The PATH needs to be set because arch-chroot keeps the PATH from Arch, which lacks the sbin entries; on Debian sbin is a separate folder from bin

  4. Install a few missing packages
     apt update
     apt install vim locales systemd-timesyncd btrfs-progs dosfstools linux-image-amd64 amd64-microcode intel-microcode firmware-linux
    
  5. Do timezone, locale, hostname setup just like how you did it in Arch
  6. For the network manager, similarly, I’d recommend the systemd-networkd + systemd-resolved pair, but on Debian resolved needs to be installed separately:
     apt install systemd-resolved
    
  7. Exit from chroot and borrow host Network configuration and re-link resolv.conf just like how we did for Arch x86_64 above, then re-enter chroot
  8. Now, Grub: we only need Debian to generate grub.cfg, but definitely not to install and maintain the actual Grub boot manager, so install only the system-integration part and prepare the folder manually:
     apt install grub2-common
     mkdir /boot/grub
     vim /etc/default/grub
    

    With this setup there would be no pre-configured grub config, so use the following as a starting point:

     GRUB_DEFAULT=0
     GRUB_TIMEOUT=1
     GRUB_DISTRIBUTOR=`( . /etc/os-release && echo ${NAME} )`
     GRUB_CMDLINE_LINUX_DEFAULT="audit=0"
     GRUB_CMDLINE_LINUX=""
     GRUB_TERMINAL=console
    

    Remember to re-generate the one included by our outer grub

     update-grub
    
  9. Exit from chroot and update our outer Grub config /efi/grub/grub.cfg to include a new menuentry:
     menuentry 'Debian (x86_64)' {
         configfile ($root)/debian-x86_64/$suffix
     }
    
  10. Finalize the installation, umount everything and poweroff

After the above steps we shall have a UEFI + legacy bootable Arch Linux + Debian installation; you can boot it on different machines to validate. The boot menu shall look like this on both UEFI and legacy:


                         GNU GRUB  version 2:2.14rc1-2

 ┌────────────────────────────────────────────────────────────────────────────┐
 │*Arch Linux (x86_64)                                                        │
 │ Debian (x86_64)                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │ 
 └────────────────────────────────────────────────────────────────────────────┘

      Use the ▲ and ▼ keys to select which entry is highlighted.          
      Press enter to boot the selected OS, `e' to edit the commands       
      before booting or `c' for a command-line.                           
                                                                             

And after selecting Debian and pressing Enter it’s the same old Debian Grub menu as always.

After-installation for Debian x86_64

This part is mostly about quality of life improvement and can be skipped. If you want to follow, boot into the installed Debian and do similar things as in After-installation for Arch Linux x86_64, but with the following differences:

  1. If you want to share a hostname, it’s better to also share the host keys, i.e. /etc/ssh/ssh_host_*_key{,.pub}, so clients would not complain about the host key differing
  2. For snapper, Debian comes with all units pre-enabled; disable the timeline timer as we want only on-boot snapshots: systemctl disable --now snapper-timeline.timer. Also note the snapper config folder is the same, but the global config is at /etc/default/snapper
  3. While debootstrap still prepares an old-style APT sources.list, I recommend migrating to the new APT config style:
     rm /etc/apt/sources.list
     vim /etc/apt/sources.list.d/debian.sources
    

    With content like the following:

     Types: deb
     URIs: http://[mirror]/debian/
     Suites: trixie trixie-updates
     Components: main contrib non-free non-free-firmware
     Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
    
     Types: deb
     URIs: http://[mirror]/debian-security/
     Suites: trixie-security
     Components: main contrib non-free non-free-firmware
     Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
    
     Types: deb
     URIs: http://[mirror]/debian/
     Suites: trixie-backports
     Components: main contrib non-free non-free-firmware
     Signed-By: /usr/share/keyrings/debian-archive-keyring.gpg
    
  4. Recall that we did not configure initramfs-tools to create a generic initramfs. In most cases it comes pre-configured with MODULES=most and you would not need to modify that; only an installation done by the Debian installer might have set this to MODULES=dep. To verify, vim /etc/initramfs-tools/initramfs.conf and check MODULES. If it’s MODULES=dep then modify it, save, and run update-initramfs -u
  5. There are many firmware packages not installed by the firmware-linux meta package. In most cases these are not needed, but if you really, really need all of them installed:
     apt install $(for i in $(apt-cache search '^firmware-' | cut -d ' ' -f 1); do dpkg-query -W -f='${Status}' $i &>/dev/null || echo $i; done | grep -v installer)
    
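The MODULES check-and-fix in step 4 can be scripted; a minimal sketch, assuming the stock config path (ensure_generic_initramfs is a made-up helper name, and the update-initramfs call is left commented out for safety):

```shell
# Ensure initramfs-tools builds a generic initramfs (MODULES=most).
# ensure_generic_initramfs is a hypothetical helper; the default path
# is the real /etc/initramfs-tools/initramfs.conf.
ensure_generic_initramfs() {
    conf="${1:-/etc/initramfs-tools/initramfs.conf}"
    if grep -q '^MODULES=dep' "$conf"; then
        # flip dep(s) to most so the initramfs stays generic
        sed -i 's/^MODULES=dep.*/MODULES=most/' "$conf"
        # update-initramfs -u   # rebuild; uncomment on a real system
    fi
    grep '^MODULES=' "$conf"   # print the effective policy
}
```

Running `ensure_generic_initramfs` with no argument checks (and if needed fixes) the real config and prints the resulting MODULES line.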

Installing Arch Linux ARM aarch64

Let’s do this on Arch Linux x86_64

  1. Install dependencies first
     pacman -S qemu-user-static-binfmt arch-install-scripts
    
  2. Duplicate the host pacman config and make the necessary modifications
     cp /etc/pacman.conf pacman-alarm.conf
     vim pacman-alarm.conf
    
    1. Set Architecture = aarch64 instead of auto
    2. Set repo Include = mirrorlist-alarm instead of /etc/pacman.d/mirrorlist
    3. Temporarily set SigLevel = Never as we don’t have the ALARM keyring on Arch, and adding those keys to the Arch host would mess up the host; packages can later be re-verified once we have installed ALARM
    4. Add my repo; we would use linux-aarch64-7ji as the kernel instead of ALARM’s official linux-aarch64: the latter misses some built-in drivers and would not boot on some of my SBCs, it has max CPUs set to a small number so it would not work nicely in a VM either, and worst of all it has some naive hooks that always expect mkinitcpio instead of any other initramfs maker.
       [7Ji]
       Include = mirrorlist-3rdparty
      
  3. Create mirrorlist for alarm
     vim mirrorlist-alarm
    

    With server:

     Server = http://[mirror]/archlinuxarm/$arch/$repo
    
  4. Create mirrorlist for 3rdparty repo
     vim mirrorlist-3rdparty
    

    With server:

     Server = http://[mirror]/$repo/$arch
    
  5. Mount the root tree
     mount -o subvol=alarm-aarch64/@ --mkdir /dev/[drive]3 /mnt/root
     mount -o subvol=alarm-aarch64/+nocow --mkdir /dev/[drive]3 /mnt/alarm-aarch64+nocow
    

    Then duplicate fstab:

     mkdir /mnt/root/etc
     sed 's/arch-x86_64/alarm-aarch64/g' /etc/fstab > /mnt/root/etc/fstab
    

    Then mount everything remaining up:

     mount --all --fstab /mnt/root/etc/fstab --target-prefix /mnt/root --mkdir
    

     An lsblk shall look like the following now:

     # lsblk
     NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
     vda    254:0    0   128G  0 disk
     ├─vda1 254:1    0   100M  0 part /mnt/root/efi
     │                                /efi
     ├─vda2 254:2    0     1M  0 part
     └─vda3 254:3    0 127.9G  0 part /mnt/root/var/tmp
                                      /mnt/root/var/spool
                                      /mnt/root/var/log
                                      /mnt/root/var/cache
                                      /mnt/root/mnt/alarm-aarch64+nocow
                                      /mnt/root/home/.snapshots
                                      /mnt/root/home
                                      /mnt/root/.snapshots
                                      /mnt/alarm-aarch64+nocow
                                      /mnt/root
                                      /var/tmp
                                      /var/spool
                                      /var/log
                                      /home/.snapshots
                                      /var/cache
                                      /mnt/arch-x86_64+nocow
                                      /home
                                      /.snapshots
                                      /
    
  6. Do pacstrap into the root
     echo 'universal: true' > /mnt/root/etc/booster.yaml
     pacstrap -C pacman-alarm.conf -K -M /mnt/root base booster linux-aarch64-7ji linux-firmware btrfs-progs dosfstools grub vim sudo archlinuxarm-keyring 7ji-keyring
    
  7. Re-edit (or duplicate from CWD) the pacman configs and mirrorlists under /mnt/root/etc, as now they come from the official package; remember to add the 7Ji repo
     vim /mnt/root/etc/pacman.conf
     vim /mnt/root/etc/pacman.d/mirrorlist
     vim /mnt/root/etc/pacman.d/mirrorlist-3rdparty
    
  8. Chroot into the target and confirm we’re running as aarch64
     arch-chroot /mnt/root
    
     uname -m
    

    Note: as we’re using qemu-user-static, the stdin/stdout is technically not directly attached to our terminal, so text editing could be a pain. If text editing is needed, do it from another SSH session on the host under /mnt/root instead, so Vim and Nano can correctly write to the terminal.

  9. Initialize the keyring
     pacman-key --init
     pacman-key --populate
    
  10. Let’s actually verify the packages, now that we’ve initialized the keyring
    pacman -Sy --downloadonly $(sed -n '/^%NAME%/{n;p}' /var/lib/pacman/local/*/desc)
    
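The sed expression in step 10 pulls the line that follows each %NAME% header out of the local pacman database; here is a small self-contained illustration against a mock desc file (the demo-db layout and package data are made up for the demo):

```shell
# Recreate the layout of /var/lib/pacman/local/<pkg>/desc and run the same
# extraction: on a '%NAME%' line, 'n' loads the next line and 'p' prints it.
mkdir -p demo-db/vim-9.0-1
printf '%%NAME%%\nvim\n\n%%VERSION%%\n9.0-1\n' > demo-db/vim-9.0-1/desc
sed -n '/^%NAME%/{n;p}' demo-db/*/desc   # prints: vim
rm -r demo-db
```

Against the real database this yields one package name per installed package, which is exactly the list fed to pacman for re-download and signature verification.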
  11. Continue and finish the setup just like we did for Arch Linux, up until the boot manager step
  12. Similar to Arch Linux x86_64, let’s install and hack Grub
    1. Install as removable EFI; note we also specify --boot-directory /efi, so grub modules and the first-stage config are saved and loaded from there. We only want each system’s /boot to store its own boot config
       grub-install --removable --efi-directory /efi --boot-directory /efi
      
    2. Hack/fix /etc/grub.d/10_linux so it would prefer booster initramfs (without this, booster initramfs would be hidden in a submenu) and ro root; if you’re not using booster only or you don’t require ro root on boot, you can skip this:
       From 9ad850b2b8842bb673313be08f6a5af66cdf12ea Mon Sep 17 00:00:00 2001
       From: Guoxin Pu <[email protected]>
       Date: Thu, 13 Nov 2025 15:52:27 +0800
       Subject: [PATCH] use booster as main initramfs and prefer ro
      
       ---
       10_linux | 26 ++------------------------
       1 file changed, 2 insertions(+), 24 deletions(-)
      
       diff --git a/10_linux b/10_linux
       index e16cea8..3ce3a9d 100755
       --- a/10_linux
       +++ b/10_linux
       @@ -147,7 +147,7 @@ linux_entry ()
         message="$(gettext_printf "Loading Linux %s ..." ${version})"
         sed "s/^/$submenu_indentation/" << EOF
               echo    '$(echo "$message" | grub_quote)'
       -       linux   ${rel_dirname}/${basename} root=${linux_root_device_thisversion} rw ${args}
       +       linux   ${rel_dirname}/${basename} root=${linux_root_device_thisversion} ro ${args}
       EOF
         if test -n "${initrd}" ; then
           # TRANSLATORS: ramdisk isn't identifier. Should be translated.
       @@ -227,7 +227,7 @@ for linux in ${reverse_sorted_list}; do
         done
      
         initrd_real=
       -  for i in "initrd.img-${version}" "initrd-${version}.img" \
       +  for i in "booster-${version}.img" "initrd.img-${version}" "initrd-${version}.img" \
                 "initrd-${alt_version}.img.old" "initrd-${version}.gz" \
                 "initrd-${alt_version}.gz.old" "initrd-${version}" \
                 "initramfs-${version}.img" "initramfs-${alt_version}.img.old" \
       @@ -304,28 +304,6 @@ for linux in ${reverse_sorted_list}; do
         linux_entry "${OS}" "${version}" advanced \
                     "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_DEFAULT}"
      
       -  if test -e "${dirname}/initramfs-${version}-fallback.img" ; then
       -    initrd="${initrd_early} initramfs-${version}-fallback.img"
       -
       -    if test -n "${initrd}" ; then
       -      gettext_printf "Found fallback initrd image(s) in %s:%s\n" "${dirname}" "${initrd_extra} ${initrd}" >&2
       -    fi
       -
       -    linux_entry "${OS}" "${version}" fallback \
       -                "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_DEFAULT}"
       -  fi
       -
       -  if test -e "${dirname}/booster-${version}.img" ; then
       -    initrd="${initrd_early} booster-${version}.img"
       -
       -    if test -n "${initrd}" ; then
       -      gettext_printf "Found booster initrd image(s) in %s:%s\n" "${dirname}" "${initrd_extra} ${initrd}" >&2
       -    fi
       -
       -    linux_entry "${OS}" "${version}" booster \
       -                "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_DEFAULT}"
       -  fi
       -
         if [ "x${GRUB_DISABLE_RECOVERY}" != "xtrue" ]; then
           linux_entry "${OS}" "${version}" recovery \
                       "${GRUB_CMDLINE_LINUX_RECOVERY} ${GRUB_CMDLINE_LINUX}"
       --
       2.51.2
      
    3. Generate grub.cfg for Arch Linux ALARM as how you do it on a normal installation:
       mkdir /boot/grub
       grub-mkconfig -o /boot/grub/grub.cfg
      

      Note this would not be the first config Grub loads, but rather just a “to-be-included” config.

    4. Add a menuentry in outer Grub config:
       menuentry 'Arch Linux ARM (aarch64)' {
           configfile ($root)/alarm-aarch64/$suffix
       }
      
  13. Finalize the installation, umount everything and poweroff

The drive should now work on an aarch64 VM (pure UEFI), but not on real hardware (U-Boot faking UEFI), due to the missing DTB. Verify it on a VM first before trying real hardware. (Remember to disable Secure Boot first)

The Grub menu shall look like the following in VM:


                         GNU GRUB  version 2:2.14rc1-2

 ┌────────────────────────────────────────────────────────────────────────────┐
 │*Arch Linux (x86_64)                                                        │
 │ Debian (x86_64)                                                            │
 │ Arch Linux ARM (aarch64)                                                   │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │ 
 └────────────────────────────────────────────────────────────────────────────┘

      Use the ▲ and ▼ keys to select which entry is highlighted.          
      Press enter to boot the selected OS, `e' to edit the commands       
      before booting or `c' for a command-line.                           
                                                                             

And after selecting Arch Linux ARM and pressing Enter it’s the same old Arch Grub menu (well ALARM didn’t bother to modify the look) as always.

Now, for U-Boot to work correctly, U-Boot itself needs to load the device-specific DTB. We don’t want Grub to load the DTB from root AFTER U-Boot loads Grub: the device must already be fully functional by the time Grub wants to open the Btrfs root, and that’s too late.

In modern-day U-Boot, there would be these built-in variables:

  • efi_dtb_prefixes=/ /dtb/ /dtb/current/ -> Same across builds
  • fdtfile=amlogic/meson-sm1-bananapi-m5.dtb -> Unique for each board
  • scan_dev_for_efi=setenv efi_fdtfile ${fdtfile}; for prefix in ${efi_dtb_prefixes}; do if test -e ${devtype} ${devnum}:${distro_bootpart} ${prefix}${efi_fdtfile}; then run load_efi_dtb; fi;done;run boot_efi_bootmgr;if test -e ${devtype} ${devnum}:${distro_bootpart} efi/boot/bootaa64.efi; then echo Found EFI removable media binary efi/boot/bootaa64.efi; run boot_efi_binary; echo EFI LOAD FAILED: continuing...; fi; setenv efi_fdtfile -> Same across builds

So take my BPI-M5 for example, the place for the DTB could be:

  • ESP/amlogic/meson-sm1-bananapi-m5.dtb
  • ESP/dtb/amlogic/meson-sm1-bananapi-m5.dtb
  • ESP/dtb/current/amlogic/meson-sm1-bananapi-m5.dtb

I’ll pick the second one
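U-Boot’s prefix scan can be replayed from Linux against the mounted ESP; a rough sketch (find_dtb is a made-up helper, the prefixes are the efi_dtb_prefixes listed above, and the example paths match the BPI-M5 case):

```shell
# Try each efi_dtb_prefixes entry in order and print the first DTB found,
# mirroring what scan_dev_for_efi does before running load_efi_dtb.
find_dtb() {
    esp="$1" fdtfile="$2"
    for prefix in / /dtb/ /dtb/current/; do   # efi_dtb_prefixes
        if [ -e "${esp}${prefix}${fdtfile}" ]; then
            echo "${prefix}${fdtfile}"
            return 0
        fi
    done
    return 1   # no DTB on the ESP: U-Boot falls back to its built-in one
}
# e.g.: find_dtb /efi amlogic/meson-sm1-bananapi-m5.dtb
```

This also makes it obvious why placing the DTB under /dtb/ works: it is the second prefix tried, and nothing usually occupies the first.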

For this, we need to copy some DTBs onto the “ESP”, under one of U-Boot’s efi_dtb_prefixes (/ /dtb/ /dtb/current/); let’s pick /dtb/ so they can be loaded by U-Boot as early as possible

cp -rva /mnt/root/boot/dtbs/linux-aarch64-7ji /mnt/root/efi/dtb

And booting on real hardware should now be OK. I’ve tested this on my BananaPi BPi-M5, OrangePi 5, Orange Pi 5 Plus and they all work seamlessly.

Note: as you may have tried and realized, even if you did not place the DTB, boards could still boot, but in those cases it is the U-Boot’s built-in DTB that’s used, and for newer kernels this could bring some problems.

Note also: by doing this we’ve locked the DTB to the one provided by the linux-aarch64-7ji kernel (a stable-as-new-as-possible kernel), not only for Arch Linux ARM but also for the later Debian installation. Using new DTBs on old kernels generally wouldn’t bring much trouble, unlike the other way around.

Installing Debian aarch64

Do this in Arch Linux ARM aarch64, similar to how we installed Debian x86_64 from Arch Linux x86_64. Just remember that we still would not want Grub the boot manager installed here; rather, Debian should only install grub2-common to auto-update its sub grub.cfg

The final Grub menu shall look like this:


                         GNU GRUB  version 2:2.14rc1-2

 ┌────────────────────────────────────────────────────────────────────────────┐
 │*Arch Linux (x86_64)                                                        │
 │ Debian (x86_64)                                                            │
 │ Arch Linux ARM (aarch64)                                                   │
 │ Debian (aarch64)                                                           │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │
 │                                                                            │ 
 └────────────────────────────────────────────────────────────────────────────┘

      Use the ▲ and ▼ keys to select which entry is highlighted.          
      Press enter to boot the selected OS, `e' to edit the commands       
      before booting or `c' for a command-line.                           
                                                                             

After all installation

Now it’s a good time to do modifications, e.g. decorate the outer menu, reorder the menu, add more entries, etc.

Also note that I did not set a timeout, because booting the default x86_64 entry on aarch64 would be plainly wrong, and this blog was written to be as simple to follow as possible. This could be improved in either of the following ways:

  • Use separate --boot-directory values, e.g. /efi/x86_64 and /efi/aarch64, and maintain a separate outer grub.cfg for each architecture
  • Use $grub_platform variable and $grub_cpu to determine the current platform. If you’d want to try, the syntax shall look like this:
      set timeout=1
      if [ "${grub_cpu}" = 'x86_64' -a "${grub_platform}" = 'efi' ]; then
          # x86_64 UEFI Secure Boot only
          menuentry ...
      elif [ "${grub_cpu}" = 'x86_64' -o "${grub_cpu}" = 'i386' ]; then
          # x86_64 UEFI plain, or legacy
          menuentry ...
      elif [ "${grub_cpu}" = 'arm64' ]; then
          if [ "${grub_platform}" = 'efi' ]; then
              # arm64 UEFI Secure Boot only
              menuentry ...
          fi
          # arm64 plain
          menuentry ...
      fi
    
]]>
Booster’s multi-device Btrfs race condition2025-06-30T03:50:00+00:002025-06-30T03:50:00+00:00https://7ji.github.io/booting/2025/06/30/booster-mutli-device-btrfs-race-conditionRepost of booster Pull Request #299 created by myself:

In a recent change to the kernel package the module for Btrfs became built-in instead of as-module: diff

This reveals a race condition (in both kernel and booster) which seemed to have been “worked around” in the past but in reality was not, for multi-device Btrfs:

  • Booster did not check the return status of BTRFS_IOC_DEVICES_READY, and it mounts every Btrfs sub-device whether it is actually ready or not.
  • The kernel considers a Btrfs filesystem “used” while it is being mounted from sub-device A and would refuse to mount it from sub-device B, even though the mount from A will fail after only a short time window.
  • The logic was:
    • Goroutine A:
      1. “booster sends IOCTL to register sub-device A”
      2. “booster always think sub-device A ready” (definitely not ready)
      3. “booster mounts sub-device A”
    • Goroutine B:
      1. “booster sends IOCTL to register sub-device B”
      2. “booster always think sub-device B ready” (maybe ready)
      3. “booster mounts sub-device B”
  • When btrfs is a module, by the time “booster always think sub-device B ready” it really is ready in most cases; when btrfs is built-in instead of a module, all of the above happens too fast: the failed mount from A has not finished yet, so the step “booster sends IOCTL to register sub-device B” ends up rejected by the btrfs kernel module with EBUSY

Example logs of failed boots, from which you can tell that booster tries to register vdc immediately after vdb, while vdb’s failed mount has not finished yet, so it is rejected by the kernel:

[    0.875882] BTRFS error: device /dev/vdc (254:32) belongs to fsid f1872b6c-c2b3-44fe-930e-bb85fd35d669, and the fs is already mounted, scanned by init (1)
[    0.876896] booster: ioctl(0x90009427): device or resource busy
ioctl(0x90009427): device or resource busy
[    0.877787] BTRFS error (device vdb): devid 2 uuid 1f95bd95-1c49-498f-abf9-a39143b970ce is missing
[    0.878568] BTRFS error (device vdb): failed to read the system array: -2
[    0.880541] BTRFS error (device vdb): open_ctree failed: -2
mount(/dev/vdb): no such file or directory
found a new device /dev/vda
blkinfo for /dev/vda: type=mbr UUID=dfd9de13 LABEL=
found a new device /dev/vda1
blkinfo for /dev/vda1: type=fat UUID=8b5649c8 LABEL=NO NAME
found a new device /dev/vdb
blkinfo for /dev/vdb: type=btrfs UUID=f1872b6c-c2b3-44fe-930e-bb85fd35d669 LABEL=
mounting /dev/vdb->/booster.root, fs=btrfs, flags=0x0, options=
found a new device /dev/vdc
blkinfo for /dev/vd[    0.833072] BTRFS error: device /dev/vdc (254:32) belongs to fsid f1872b6c-c2b3-44fe-930e-bb85fd35d669, and the fs is already mounted, scanned by init (136)
c: type=btrfs UU[    0.834106] booster: ioctl(0x90009427): device or resource busy
ID=f1872b6c-c2b3[    0.834346] BTRFS error (device vdb): devid 2 uuid 1f95bd95-1c49-498f-abf9-a39143b970ce is missing
-44fe-930e-bb85f[    0.835248] BTRFS error (device vdb): failed to read the system array: -2
d35d669 LABEL=
ioctl(0x90009427): device or res[    0.836022] BTRFS error (device vdb): open_ctree failed: -2
ource busy
mount(/dev/vdb): no such file or directory

To fix this, we need to check the return status of BTRFS_IOC_DEVICES_READY and only really consider the FS ready when it returns 0

Now the logic is:

  • Goroutine A:
    1. “booster sends IOCTL to register sub-device A”
    2. “booster knows sub-device A not ready and waits” -> back to step 1
    3. No bad mounting of sub-device A is performed
  • Goroutine B:
    1. “booster sends IOCTL to register sub-device B”
    2. “booster knows sub-device B ready” (definitely ready)
    3. “booster mounts sub-device B”
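The same readiness check is reachable from userspace: btrfs-progs’ `btrfs device ready` issues BTRFS_IOC_DEVICES_READY and exits 0 only once all members of the filesystem have been scanned. A hedged shell analogue of the fixed wait loop (booster itself does this in Go through the raw ioctl; wait_btrfs_ready and the mount target are illustrative):

```shell
# Poll BTRFS_IOC_DEVICES_READY (via btrfs-progs) until the multi-device
# filesystem is fully assembled, and only then mount -- never before.
wait_btrfs_ready() {
    dev="$1"; waited=0
    until btrfs device ready "$dev"; do
        echo "waiting for multi-device btrfs at $dev, waited ${waited}s"
        sleep 1
        waited=$((waited + 1))
    done
    mount "$dev" /booster.root   # safe only after the ioctl returned 0
}
```

The key difference from the buggy behaviour is the `until`: the mount is gated on the ioctl’s return value instead of being attempted unconditionally.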

Tested on a VM with three virtual disks, vda as boot, vdb + vdc as btrfs (profiles single, raid0 and raid1 all tested):

found a new device /dev/vda
blkinfo for /dev/vda: type=mbr UUID=dfd9de13 LABEL=
found a new device /dev/vdc
found a new device /dev/vda1
blkinfo for /dev/vdc: type=btrfs UUID=f1872b6c-c2b3-44fe-930e-bb85fd35d669 LABEL=
blkinfo for /dev/vda1: type=fat UUID=8b5649c8 LABEL=NO NAME
found a new device /dev/vdb
Waiting for multi-device btrfs at /dev/vdc to become fullly assembled, waited 0 seconds
blkinfo for /dev/vdb: type=btrfs UUID=f1872b6c-c2b3-44fe-930e-bb85fd35d669 LABEL=
mounting /dev/vdb->/booster.root, fs=btrfs, flags=0x0, options=
Switching to the new userspace now. Да пабачэння!

References:

  • btrfs kernel module returns 0 for ready and 1 for not ready to BTRFS_IOC_DEVICES_READY: source code
  • btrfs-progs checks the return value of BTRFS_IOC_DEVICES_READY with type int to determine whether the FS is ready: source code
]]>
Why OpenWrt DDNS does not start on boot2025-06-05T07:50:00+00:002025-06-05T07:50:00+00:00https://7ji.github.io/networking/2025/06/05/why-openwrt-ddns-does-not-start-on-bootEdit on 2025-06-08: My patch to fix this was merged into upstream OpenWrt LuCI master and backported to 24.10, the issue should be gone since 24.10.2.

On OpenWrt, DDNS functionality is provided by the opt-in ddns-scripts package (and optionally ddns-scripts-[provider] packages), which provides both an rc-init script /etc/init.d/ddns and a hotplug.d hook /etc/hotplug.d/iface/95-ddns to start it automatically:

  • The rc-init script, instead of starting a “daemon” and maintaining it like other scripts that use procd, mainly just calls /usr/lib/ddns/dynamic_dns_updater.sh, which without explicit interface names after start/stop/reload just double-forks the workers and quits. Note it has an empty boot() function which shadows start() on boot, i.e. /usr/lib/ddns/dynamic_dns_updater.sh -- start would not be run on boot.

      cat /etc/init.d/ddns
    
      #!/bin/sh /etc/rc.common
      START=95
      STOP=10
    
      boot() {
              return 0
      }
    
      reload() {
              /usr/lib/ddns/dynamic_dns_updater.sh -- reload
              return 0
      }
    
      restart() {
              /usr/lib/ddns/dynamic_dns_updater.sh -- stop
              sleep 1 # give time to shutdown
              /usr/lib/ddns/dynamic_dns_updater.sh -- start
      }
    
      start() {
              /usr/lib/ddns/dynamic_dns_updater.sh -- start
      }
    
      stop() {
              /usr/lib/ddns/dynamic_dns_updater.sh -- stop
              return 0
      }
    
    
  • The hotplug.d hook starts instances for “interfaces” when they are brought up by netifd and a hotplug event is triggered (e.g. when you ifup manually, reconnect an interface from LuCI, or interfaces come up automatically on boot after netifd is up and running):

      cat /etc/hotplug.d/iface/95-ddns
    
      #!/bin/sh
    
      # there are other ACTIONs like ifupdate we don't need
      case "$ACTION" in
              ifup)                                   # OpenWrt is giving a network not phys. Interface
                      /etc/init.d/ddns enabled && /usr/lib/ddns/dynamic_dns_updater.sh -n "$INTERFACE" -- start
                      ;;
              ifdown)
                      /usr/lib/ddns/dynamic_dns_updater.sh -n "$INTERFACE" -- stop
                      ;;
      esac
    

Both the rc-init script and the hotplug.d maintain nothing: they just spawn workers for interfaces, either fork and run /usr/lib/ddns/dynamic_dns_updater.sh -n "$INTERFACE" -- start by itself, or from a convenient shortcut provided by /usr/lib/ddns/dynamic_dns_updater.sh -- start which iterates uci config ddns internally to do the work.

So on boot, the intended sequence by which dynamic_dns_updater workers shall be spawned for interfaces is as follows:

  • The early init stage
  • The procd exec-ed by early init becomes the new PID 1
  • The ubusd becomes ready
  • The /etc/init.d/network starts, and spawns netifd in procd
  • The /etc/init.d/ddns starts, and due to empty boot() it does nothing
  • The wan interface becomes ready in netifd
  • The /etc/hotplug.d/iface/95-ddns hook triggers on interface(s) that you have configured ddns on, and the corresponding worker(s) would be spawned.

Note that the hotplug.d hook uses the internal name used by netifd. That is, a “physical” “interface” might e.g. be called br-lan in the scope of Linux, but would be called lan in the scope of netifd, uci, LuCI, etc., and of course hotplug.d.

Now let’s discuss an “issue”: many with a PPPoE wan might find a strange phenomenon: even though they have “enabled” the ddns service and configured it on the pppoe-wan “interface”, the ddns worker does not correctly start on boot on their PPPoE wan interface. This issue happens due to the combination of the following factors:

  • In OpenWrt, software-based “interface”s are named in the style of [protocol]-[network], e.g. for PPPoE-based “wan” interface/network, the actual Linux interface name that’s created would be pppoe-wan
    config interface 'wan'
            option device 'eth5'
            option proto 'pppoe'
            option username 'xxxxxxxx'
            option password 'yyyyyy'
            option keepalive '10 60'
            option ipv6 'auto'
    
    > ip l | grep wan
    19: pppoe-wan: <POINTOPOINT,MULTICAST,NOARP,UP,LOWER_UP> mtu 1492 qdisc fq_codel state UNKNOWN mode DEFAULT group default qlen 3
    
  • In luci-app-ddns, the “interface” attribute is derived from the current source network/interface, e.g. when you configure “network” “wan”, this would be wan; when you configure “interface” “pppoe-wan”, this would be pppoe-wan:
    o = s.taboption('advanced', form.DummyValue, '_interface',
                          _("Event Network"),
                          _("Network on which the ddns-updater scripts will be started"));
    o.depends("ip_source", "interface");
    o.depends("ip_source", "network");
    o.forcewrite = true;
    o.modalonly = true;
    o.cfgvalue = function(section_id) {
      return uci.get('ddns', section_id, 'interface') || _('This will be autoset to the selected interface');
    };
    o.write = function(section_id) {
      var opt = this.section.formvalue(section_id, 'ip_source');
      var val = this.section.formvalue(section_id, 'ip_'+opt);
      return uci.set('ddns', section_id, 'interface', val);
    };
    
  • In dynamic_dns_updater.sh, i.e. the actual updater worker, the uci attribute interface needs to be the OpenWrt/netifd internal name of the “network” / OpenWrt “interface”, not the Linux “interface”:
    # interface 	network interface used by hotplug.d i.e. 'wan' or 'wan6'
    
  • In dynamic_dns_functions.sh, the start_daemon_for_all_ddns_sections takes the “network” name as argument and tries to get one ddns section with interface equalling it (note wan is the fallback name):
    # starts updater script for all given sections or only for the one given
    # $1 = interface (Optional: when given only scripts are started
    # configured for that interface)
    # used by /etc/hotplug.d/iface/95-ddns on IFUP
    # and by /etc/init.d/ddns start
    start_daemon_for_all_ddns_sections()
    {
        local event_if sections section_id configured_if
        event_if="$1"
    
        load_all_service_sections sections
        for section_id in $sections; do
          config_get configured_if "$section_id" interface "wan"
          [ -z "$event_if" ] || [ "$configured_if" = "$event_if" ] || continue
          /usr/lib/ddns/dynamic_dns_updater.sh -v "$VERBOSE" -S "$section_id" -- start &
        done
    }
    
  • When retrieving network information from netifd, the “interface” must be the OpenWrt “interface” / network, not the Linux “interface”. There’s no internal fallback logic to get the info from a Linux “interface”.
    > ubus call network.interface status '{"interface":"wan"}' | jsonfilter -e '@["ipv4-address"][0].address'
    xxx.xxx.xxx.xxx
    > ubus call network.interface status '{"interface":"pppoe-wan"}' | jsonfilter -e '@["ipv4-address"][0].address'
    Command failed: Not found
    Failed to parse json data: unexpected end of data
    
  • Likewise, the hotplug event only triggers on wan, not on pppoe-wan
  • So, the hotplug event actually triggers and it runs /usr/lib/ddns/dynamic_dns_updater.sh -n wan -- start to start the worker for interface wan, but as it could not find any config section in /etc/config/ddns with interface=wan (which in reality is interface=pppoe-wan), it just quits and never spawns the actual /usr/lib/ddns/dynamic_dns_updater.sh -S SECTION -- start worker.
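The failing match can be reduced to a self-contained sketch of the loop in start_daemon_for_all_ddns_sections quoted above (the section name cfv4 and the plain-variable mock of config_get are mine):

```shell
# Sketch of the section-matching loop, with the uci lookup mocked:
# the hypothetical section "cfv4" carries interface=pppoe-wan,
# reproducing the misconfiguration described above.
match_sections() {
    event_if="$1"
    for section_id in cfv4; do
        configured_if='pppoe-wan'  # mock of: config_get configured_if "$section_id" interface "wan"
        [ -z "$event_if" ] || [ "$configured_if" = "$event_if" ] || continue
        echo "would start worker for $section_id"
    done
}
match_sections wan        # hotplug on boot passes "wan": prints nothing
match_sections pppoe-wan  # would match, but netifd never emits this event name
match_sections            # /etc/init.d/ddns start passes nothing: prints the section
```

The three calls at the bottom show exactly why the hotplug path silently does nothing while a manual /etc/init.d/ddns start appears to work.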

Note that while hotplug.d logic fails, you can still run /etc/init.d/ddns start to effectively run /usr/lib/ddns/dynamic_dns_updater.sh -- start, which just iterates the whole /etc/config/ddns config and would start all workers for all sections (as -n NETWORK is skipped and -S SECTION is run directly).

This of course does not only affect PPPoE wan, but in general affects any interface that’s named differently from the corresponding network name.
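That naming convention can be condensed into a tiny helper; a sketch under the assumption that the other ppp-family protocols behave like pppoe (linux_ifname is a made-up name, not part of OpenWrt):

```shell
# Expected Linux interface name for a netifd logical network, per the
# "[protocol]-[network]" convention for software interfaces described above.
linux_ifname() {
    network="$1" proto="$2" device="$3"
    case "$proto" in
        pppoe|pptp|l2tp) echo "${proto}-${network}" ;;  # software interface
        *) echo "$device" ;;  # e.g. dhcp/static on eth5 stays eth5
    esac
}
linux_ifname wan pppoe eth5   # prints: pppoe-wan
```

Whenever the two names differ, a literal interface= value copied from the Linux side will never match the netifd event name, which is the whole bug in a nutshell.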

There are two correct ways to fix the issue: one is LuCI-only, and the other needs some uci (or manual config editing) but does not touch any logic code:

  • The simple way is, without touching any of the above code, to configure your DDNS instance with source as “network” “wan” instead of “interface” “pppoe-wan”, so you have interface=wan in your /etc/config/ddns; this way the hotplug event would correctly start the worker for “network” “wan”
  • Another way is, to modify the interface value (you can also edit /etc/config/ddns manually)
    uci set ddns.cfv4.interface=wan
    uci commit ddns
    

Now with this knowledge you shall see why the following “band-aid” “hacks” seem to “fix” the “problem” yet are very unreliable.

  • By removing the boot() function in /etc/init.d/ddns, you can force /usr/lib/ddns/dynamic_dns_updater.sh -- start to run on boot, which would spawn workers for each configured section. The workers are there, but if another ifdown & ifup occurs they could break, as the intended way with hotplug.d is that a worker is brought up after ifup and brought down before ifdown.
  • By putting /etc/init.d/ddns restart in your /etc/rc.local, you’re basically doing the same thing as removing boot()
  • By putting both /etc/init.d/ddns restart in your /etc/rc.local and a sleep before it, you have the additional hope that pppoe-wan definitely becomes online within that timeout; however, it’s not guaranteed.
  • By removing boot() function in /etc/init.d/ddns, putting sleep and /etc/init.d/ddns restart in your /etc/rc.local. You’re combining “band-aid”s which makes your device more and more non-reproducible.
  • Things can still fail after the above “band-aids” if your pppoe-wan connection is not there and you have configured retry_max_count for DDNS sections. If you use hotplug.d then the worker is guaranteed to be started on pppoe-wan creation and stopped on pppoe-wan destruction.
]]>
Gotchas when booting from virtiofs root2025-05-23T10:00:00+00:002025-05-23T10:00:00+00:00https://7ji.github.io/booting/2025/05/23/gotchas-when-booting-from-virtiofs-rootvirtiofs is a nice host-passthrough “fs” with which a virtual machine can use the host fs almost directly, saving the overhead of the “guest fs -> guest disk -> host file -> host fs” chain. A virtiofs root is especially useful when access to files in the guest root fs from the host (or vice versa) is needed, e.g. when doing frequent debugging.

It might seem enough to just set the kernel cmdline to:

root=root rootfstype=virtiofs

However to make a virtual machine actually boot from such root you need to go through a few gotchas:

Common

  • In libvirt, you need to allocate a dir-type storage pool and then mkdir a root-owned subfolder inside the pool; the pool can be reused while the subfolder needs to be VM-specific. The corresponding storage config file looks like this:
      > cat /etc/libvirt/storage/filesystems.xml
    
      <pool type='dir'>
        <name>filesystems</name>
        <uuid>b36d710f-7a61-49f7-935b-061b12f24c8a</uuid>
        <capacity unit='bytes'>0</capacity>
        <allocation unit='bytes'>0</allocation>
        <available unit='bytes'>0</available>
        <source>
        </source>
        <target>
          <path>/var/lib/libvirt/filesystems</path>
        </target>
      </pool>
    

    The subfolder:

      > ls -ldh /var/lib/libvirt/filesystems/root.debian12
      drwxr-xr-x 18 root root 4.0K May 23 17:55 /var/lib/libvirt/filesystems/root.debian12/
    
  • The guest needs to have a “filesystem” “device” that points to the subfolder, i.e. the guest config shall have the following snippet:
      > sed -n '/<filesystem/,/<\/filesystem/p' /etc/libvirt/qemu/debian12.xml
    
      <filesystem type='mount' accessmode='passthrough'>
        <driver type='virtiofs'/>
        <source dir='/var/lib/libvirt/filesystems/root.debian12'/>
        <target dir='root'/>
        <address type='pci' domain='0x0000' bus='0x08' slot='0x00' function='0x0'/>
      </filesystem>
    
  • The guest fstab needs only the following line then:
      root / virtiofs defaults 0 0
    
  • If you need to use overlayfs with upperdir pointing to a path inside the virtual root, virtiofsd needs additional arguments to allow xattr and to keep CAP_SYS_ADMIN:
      <filesystem type='mount' accessmode='passthrough'>
        <driver type='virtiofs' queue='1024'/>
        <binary path='/usr/lib/virtiofsd+sys_admin' xattr='on'/>
        <source dir='/var/lib/libvirt/filesystems/root.arb-x64-builder'/>
        <target dir='root'/>
        <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
      </filesystem>
    
      > cat /usr/lib/virtiofsd+sys_admin
    
      #!/bin/sh
      exec /usr/lib/virtiofsd -o modcaps=+sys_admin "$@"
    
  • If you want to share something between VMs, it’s not recommended to re-use the root subfolder (which is as bad as mounting a virtual disk that is already in use); just create another shared subfolder.

Debian guest

  • You must either install a system to a virtual disk first then extract the files, or debootstrap manually
  • If the system rootfs folder was extracted from a locally installed disk image, then /etc/initramfs-tools/conf.d/resume must be purged before a rerun of update-initramfs -u, otherwise the initrd would try to resume from the missing block device and break
  • You must include virtiofs in /etc/initramfs-tools/modules, otherwise the virtiofs module would not be included (unfortunately initramfs-tools’ hook only picks the fs module for a root fs mounted from a block device or nfs). Additionally you could set FSTYPE=virtiofs in /etc/initramfs-tools/initramfs.conf to omit the fsck hook (not recommended, as this 1. does not include virtiofs even though it looks like it should and 2. would render the system unusable if you forget to revert it in the future when you want to discard the virtiofs root).
  • A run of update-initramfs -u is certainly needed after the above gotchas are sorted out.
  • The guest kernel needs to be booted directly by hypervisor, without a bootloader, and it’s recommended to use the vmlinuz and initrd.img symlinks instead of the real files:

      > sed -n '/<os/,/<\/os/p' /etc/libvirt/qemu/debian12.xml
    
      <os>
          <type arch='x86_64' machine='pc-q35-10.0'>hvm</type>
          <kernel>/var/lib/libvirt/filesystems/root.debian12/vmlinuz</kernel>
          <initrd>/var/lib/libvirt/filesystems/root.debian12/initrd.img</initrd>
          <cmdline>root=root rootfstype=virtiofs</cmdline>
          <boot dev='hd'/>
          <bootmenu enable='no'/>
      </os>
    
  • If you’ve prepared the root from a virtual-disk bootable image, GRUB leftovers need to be purged later:
    apt purge --autoremove grub2 grub-common
    rm -rf /boot/grub
    

Arch Linux guest

  • Only the mkinitcpio and dracut initrd generators support booting from virtiofs natively; booster needs my patchset. dracut is not recommended, as its initramfs paths are non-deterministic, containing the kernel version (see below).
  • The target rootfs can be prepared by either pacstrapping from the host using arch-install-scripts or pacstrapping from the target using archiso; it’s not recommended to extract the rootfs from an installed system that runs on either a virtual disk or a physical machine.
  • The guest kernel needs to be booted directly by the hypervisor, without a bootloader; the kernel path is deterministic, and so is the initrd path unless you’re using dracut (so better use booster or mkinitcpio):

      > sed -n '/<os/,/<\/os/p' /etc/libvirt/qemu/archlinux.xml
    
      <os>
          <type arch='x86_64' machine='pc-q35-10.0'>hvm</type>
          <kernel>/var/lib/libvirt/filesystems/root.archlinux/boot/vmlinuz-linux</kernel>
          <initrd>/var/lib/libvirt/filesystems/root.archlinux/boot/booster-linux.img</initrd>
          <cmdline>root=root rootfstype=virtiofs</cmdline>
          <boot dev='hd'/>
          <bootmenu enable='no'/>
      </os>
    
]]>
RTL8373-based cheap 2.5 Gbps switches are troublemakers2025-04-16T02:00:00+00:002025-04-16T02:00:00+00:00https://7ji.github.io/networking/2025/04/16/rtl8373-based-cheap-switches-are-troublemakersRTL8373 is great: it’s the cheapest L2 network switching IC that provides 8x 2.5 Gbps ethernet ports (4 with its built-in PHY, 4 with external PHYs from RTL8224), 1 10 Gbps SFP port, and on top of them fancy L2 management features like VLAN, Link aggregation, etc.

From around March 2024 I’ve bought 5 of these switches, three of them are from Sirivision, and the remaining two from Hellotek. The price kept dropping as I bought more of them, going from around 290 CNY in March 2024 to around 210 CNY in March 2025. It’s almost a steal!

These are all located at different places: two were installed at my new house in my hometown to provide in-home 2.5 Gbps networking, two were installed at my rented house in the city where I work to also provide in-home 2.5 Gbps networking in addition to L3 10 Gbps core switching, and the last one is installed in my office for 2.5 Gbps switching among my wired devices.

As they came from almost unknown brands I didn’t expect much from their stock firmware, but the Sirivision switches turned out to be very feature-full: features range from VLAN tagging, LAG and IGMP snooping to DHCP spoofing protection, almost as if they cut nothing from what Realtek left in their reference BSP and exposed everything; their web UI looks very simple and I appreciate that. The Hellotek ones, on the other hand, provide only limited features: just basic VLAN tagging, and LAG with a limitation (one group must be on the 4 native ports and the other group on the 4 “external” ports); they have no IGMP snooping and no DHCP spoofing protection, and while they have a nicer web UI it just adds to the shame that they chose to limit the exposed features.

They both seem nice and sound, considering their price, providing cheap high speed switching with supposedly little to no hassle. However, in reality they’re really troublemakers, and I’ll summarize the problems below:

  1. Configuration is applied only a while after booting (only affecting Sirivision)

    The VLAN, LAG, etc. settings are not applied immediately after booting. There’s a short time window (1-2 seconds) during which the switch acts as a dumb switch.

    If you use the switch just as a simple dumb switch without VLAN, LAG, etc. then you are safe. However, if you have untagged VLANs your untagged traffic would go rogue and escape the VLAN filtering; if you have LAG the LAG ports would forward traffic to each other, resulting in loops:

     Apr 05 23:57:18 wn1 kernel: igc 0000:02:00.0 enp2s0: NIC Link is Down
     Apr 05 23:57:18 wn1 kernel: igc 0000:03:00.0 enp3s0: NIC Link is Down
     Apr 05 23:57:19 wn1 kernel: bond0: (slave enp2s0): link status definitely down, disabling slave
     Apr 05 23:57:19 wn1 kernel: bond0: (slave enp3s0): link status definitely down, disabling slave
     Apr 05 23:57:19 wn1 kernel: bond0: now running without any active interface!
     Apr 05 23:57:19 wn1 kernel: bridge0: port 1(bond0) entered disabled state
     Apr 05 23:57:40 wn1 kernel: igc 0000:03:00.0 enp3s0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
     Apr 05 23:57:40 wn1 kernel: igc 0000:02:00.0 enp2s0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
     Apr 05 23:57:40 wn1 kernel: bond0: (slave enp2s0): link status definitely up, 2500 Mbps full duplex
     Apr 05 23:57:40 wn1 kernel: bond0: (slave enp3s0): link status definitely up, 2500 Mbps full duplex
     Apr 05 23:57:40 wn1 kernel: bond0: active interface up!
     Apr 05 23:57:40 wn1 kernel: bridge0: port 1(bond0) entered blocking state
     Apr 05 23:57:40 wn1 kernel: bridge0: port 1(bond0) entered forwarding state
     Apr 05 23:57:40 wn1 kernel: bridge0: received packet on bond0 with own address as source address (addr:f6:fe:da:c5:1d:90, vlan:0)
     Apr 05 23:57:40 wn1 kernel: bridge0: received packet on bond0 with own address as source address (addr:f6:fe:da:c5:1d:90, vlan:0)
     Apr 05 23:57:40 wn1 kernel: bridge0: received packet on bond0 with own address as source address (addr:f6:fe:da:c5:1d:90, vlan:0)
     Apr 05 23:57:41 wn1 kernel: igc 0000:02:00.0 enp2s0: NIC Link is Down
     Apr 05 23:57:41 wn1 kernel: igc 0000:03:00.0 enp3s0: NIC Link is Down
     Apr 05 23:57:41 wn1 kernel: bond0: (slave enp2s0): link status definitely down, disabling slave
     Apr 05 23:57:41 wn1 kernel: bond0: (slave enp3s0): link status definitely down, disabling slave
     Apr 05 23:57:41 wn1 kernel: bond0: now running without any active interface!
     Apr 05 23:57:41 wn1 kernel: bridge0: port 1(bond0) entered disabled state
     Apr 05 23:57:44 wn1 kernel: igc 0000:02:00.0 enp2s0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
     Apr 05 23:57:44 wn1 kernel: igc 0000:03:00.0 enp3s0: NIC Link is Up 2500 Mbps Full Duplex, Flow Control: RX/TX
     Apr 05 23:57:45 wn1 kernel: bond0: (slave enp2s0): link status definitely up, 2500 Mbps full duplex
     Apr 05 23:57:45 wn1 kernel: bond0: (slave enp3s0): link status definitely up, 2500 Mbps full duplex
     Apr 05 23:57:45 wn1 kernel: bond0: active interface up!
     Apr 05 23:57:45 wn1 kernel: bridge0: port 1(bond0) entered blocking state
     Apr 05 23:57:45 wn1 kernel: bridge0: port 1(bond0) entered forwarding state
    

    These are bad things and you definitely would not like them in your network.

    Luckily things are not too bad, as you would not reboot the switch very frequently. And the issue seems to affect only Sirivision, not Hellotek.

  2. VLAN 1 cannot be safely deleted

    On Hellotek VLAN 1 cannot be deleted at all: the button is greyed out; on Sirivision you seem to be able to delete it, but deleting it results in a frozen interface with glitched characters as an indication of memory corruption.

    Both of these seem to be using VLAN 1 for their internal logic, and while it still appears available for modification, you’d better leave it in place and just remove ports from it if you want some modification.

  3. Management VLAN cannot be set

    The Sirivision ones do not provide the option to modify the management VLAN, while the Hellotek ones do provide the option but it never takes effect. The management VLAN can only be 1, yet it’s actually not 1 either; in fact there’s just no “management VLAN” at all, which brings us to the next issue.

    And you know what? It seems Hellotek “has” a management VLAN, but it only works for external traffic! I.e. when going through a stacked switch, you can now only access the other switch via untagged traffic. How secure and convenient!

  4. Management packets are captured from all VLANs

    The switch captures any traffic targeting its IP instead of only traffic from a specific management VLAN. To make things worse, the Sirivision ones capture such traffic even from external traffic forwarded by other switches (i.e. a stacked config), while the Hellotek ones only capture it on ports directly connected to them physically.

    This is a bad thing. E.g. let’s assume you left the switch with its default 192.168.1.199 management IP, and you have a VLAN 123 running the 192.168.1.0/24 network: your access to 192.168.1.199 in VLAN 123 from a port connected to the switch always leads you to the switch, not the actual target.

  5. VLAN configuration cannot be saved when under load (only affecting Sirivision)

    If there’s already traffic in a VLAN that’s not fully configured, you can never get it fully configured. The web UI just freezes and the config is never saved. You can only configure things in one go.

    This makes it very inconvenient if you want your main VLAN to not be VLAN 1, and to make things worse, recall that “VLAN 1 cannot be safely deleted” and that “both of these seem to be using VLAN 1 for their internal logic”, so in general you’d better not use VLAN 1 and thus have to go through this. To work around it and keep your VLANs separate from the default VLAN 1, you have to configure ports in batches that exclude the port you’re currently using, then swap ports around and set the remaining port. And sometimes you cannot set the remaining ports at all and have to delete the VLAN and re-create it.

  6. VLAN and port naming has no memory boundary check

    While it appears the VLAN IDs and ports can be named for easier lookup, setting a name sometimes succeeds but most of the time results in glitched characters. And when they’re glitched, some other settings might get flushed. This is most likely a missing memory boundary check, and when it happens your only reliable fix is to revert everything, most likely with a factory reset.

  7. Under specific workload the switch would power cycle (only affecting Hellotek)

    I’m using a single port on one of the Hellotek switches as an upstream trunk port to connect to the ISP modem, carrying both the PPPoE Internet upstream in one VLAN and the IPoE IPTV upstream in another. The IPTV VLAN is then trunked through another Hellotek switch, then trunked through a K2P running OpenWrt acting as both switch and AP. Strangely, the main switch, i.e. the one connected directly to the modem, resets itself when the IPTV stream starts to run. E.g. if I don’t watch TV then everything seems fine; if I start watching TV and the IPTV set-top box initiates its IPoE stream, then after a couple of minutes the main switch resets, cutting both my Internet and IPTV connections. If they keep running for a few minutes after the switch comes back online then it does not reset any more. Very, very strange.

    To get away from this issue I had to replace the main switch with a 10 Gbps L3 switch and add a 2.5 Gbps base-T SFP+ module.

So while these switches seem cheap, be prepared for various issues if you decide to use them for fancy network setups.

]]>
Btrfs backup: snapper + snasync2025-01-09T03:15:00+00:002025-01-09T03:15:00+00:00https://7ji.github.io/designdoc/2025/01/09/btrfs-backup-snapper-with-snasyncUsually I don’t like to document my software/script projects on the blog, as I prefer to document them in-tree. But as I’ve been a Btrfs user for multiple years and haven’t really documented the details I’d like to share, I think it’s a good time to share them together here.

Background: Btrfs

I’ll assume you already know what Btrfs is: a copy-on-write next-gen Linux filesystem that entered the kernel a long time ago and has many shiny features, of which I appreciate snapshotting and transparent compression the most.

Thanks to Btrfs’s copy-on-write nature, to its helpful userspace toolset btrfs-progs, and to the fact that Btrfs subvolumes live in the filesystem namespace and, from the user perspective, look just like folders, it is trivial to snapshot a Btrfs filesystem or part of it.

With the added benefit that a Btrfs subvolume can be mounted directly with a subvol= or subvolid= mount argument, and that its content cannot be snapshotted as part of the parent folder / subvolume it lives in, you can compose your filesystem tree freely to decide which parts to take snapshots of.

However, although it is trivial to take a snapshot, it is hard to do it cleanly and regularly with handwritten scripts and crontab jobs / systemd.timer units that are easy to forget about and prone to fail. After all, if you’ve accidentally deleted a file you created just a few hours ago yet your last snapshot was taken a week ago, you’re doomed just as if you had no snapshots / backups at all.

And of course you would want snapshots to be cleaned up regularly, as few would really need a snapshot taken 3 months and 1 hour ago more than a snapshot taken exactly 3 months ago.

Background: snapper

I’ll also assume you already know what snapper is: a tool written by openSUSE to manage filesystem snapshots and allow undoing system modifications.

To describe it simply: with snapper installed and its bundled timer units enabled, you can define a few configs that each specify a subvolume to take snapshots of regularly, and snapper would create and clean up snapshots, managing them in a centralized way under the corresponding .snapshots submount.

Let me take a few lines from my fstab to demonstrate how it works (only the mountpoints of the subvolumes to take snapshots of and their corresponding snapshot-storage subvolumes are listed; others, like /var/cache, are left out):

# <file system> <dir> <type> <options> <dump> <pass>
## backpane m.2 2280 (2T 4.0 x4 downgraded to 3.0 x4)
ID=nvme-HYV2TBX4_GR__24113WJHA0000072-part3 /                      btrfs rw,compress=zstd:3,subvol=@                       0 0
ID=nvme-HYV2TBX4_GR__24113WJHA0000072-part3 /.snapshots            btrfs rw,compress=zstd:3,subvol=@.snapshots             0 0
ID=nvme-HYV2TBX4_GR__24113WJHA0000072-part3 /home                  btrfs rw,compress=zstd:3,subvol=@home                   0 0
ID=nvme-HYV2TBX4_GR__24113WJHA0000072-part3 /home/.snapshots       btrfs rw,compress=zstd:3,subvol=@home_.snapshots        0 0
## expansion lower m.2 22110 (1T 3.0 x4)
ID=nvme-MZ1LB960HBJR-000FB_S5XBNA0R330322   /srv                   btrfs rw,compress=zstd:15,subvol=@srv                   0 0
ID=nvme-MZ1LB960HBJR-000FB_S5XBNA0R330322   /srv/.snapshots        btrfs rw,compress=zstd:15,subvol=@srv_.snapshots        0 0
## frontpane m.2 2280 (2T 3.0 x4)
ID=nvme-HYV2TBX3_HXY__00000000000000001116  /srv/backup            btrfs rw,compress=zstd:15,subvol=@srv_backup            0 0
ID=nvme-HYV2TBX3_HXY__00000000000000001116  /srv/backup/.snapshots btrfs rw,compress=zstd:15,subvol=@srv_backup_.snapshots 0 0

Recall that a Btrfs subvolume’s content cannot be snapshotted as part of the parent folder / subvolume it lives in. The reason we have /, /home, /srv, /srv/backup as separate mountpoints is to have each of them snapshotted individually into their .snapshots subfolder / child subvolume, and */.snapshots are also separate subvolumes so their content won’t be snapshotted as part of their parent.

As Btrfs subvolumes can totally just exist in the FS tree, accessible to users as if they were plain folders that merely cannot be snapshotted as part of their parent subvolume, it is also possible to omit all these mountpoints and have a single root subvolume mountpoint. But in that case it would be hard to tell which folder is in fact a subvolume without btrfs-progs and the corresponding commands. As always I prefer explicitness over implicitness, so I always create Btrfs subvolumes directly in the FS root and mount each of them individually into the actual root. You can freely decide which way to follow.

I have the following in my /etc/conf.d/snapper:

SNAPPER_CONFIGS="root home srv srv-backup"

And I have the following snapper configs under /etc/snapper/configs/:

> ls -lh /etc/snapper/configs/
total 20K
-rw------- 1 root root 1.2K Jul  3  2024 home
-rw------- 1 root root 1.2K Jul  3  2024 root
-rw------- 1 root root 1.2K Jul 29 16:41 srv
-rw------- 1 root root 1.2K Jan  3 14:22 srv-backup

An example config, home, is defined as follows:

# subvolume to snapshot
SUBVOLUME="/home"

# filesystem type
FSTYPE="btrfs"


# btrfs qgroup for space aware cleanup algorithms
QGROUP=""


# fraction or absolute size of the filesystems space the snapshots may use
SPACE_LIMIT="0.5"

# fraction or absolute size of the filesystems space that should be free
FREE_LIMIT="0.2"


# users and groups allowed to work with config
ALLOW_USERS=""
ALLOW_GROUPS=""

# sync users and groups from ALLOW_USERS and ALLOW_GROUPS to .snapshots
# directory
SYNC_ACL="no"


# start comparing pre- and post-snapshot in background after creating
# post-snapshot
BACKGROUND_COMPARISON="yes"


# run daily number cleanup
NUMBER_CLEANUP="yes"

# limit for number cleanup
NUMBER_MIN_AGE="1800"
NUMBER_LIMIT="50"
NUMBER_LIMIT_IMPORTANT="10"


# create hourly snapshots
TIMELINE_CREATE="yes"

# cleanup hourly snapshots after some time
TIMELINE_CLEANUP="yes"

# limits for timeline cleanup
TIMELINE_MIN_AGE="1800"
TIMELINE_LIMIT_HOURLY="12"
TIMELINE_LIMIT_DAILY="3"
TIMELINE_LIMIT_WEEKLY="2"
TIMELINE_LIMIT_MONTHLY="6"
TIMELINE_LIMIT_YEARLY="1"


# cleanup empty pre-post-pairs
EMPTY_PRE_POST_CLEANUP="yes"

# limits for empty pre-post-pair cleanup
EMPTY_PRE_POST_MIN_AGE="1800"

With the above config, and the bundled snapper-timeline.timer and snapper-cleanup.timer systemd units enabled, snapper handles the /home subvolume as follows:

  • Every hour, run snapper-timeline.service to take a snapshot of /home and store it under /home/.snapshots/
    • A new snapshot ID would be generated incrementally, larger than all existing snapshot IDs, e.g. if there are 1, 2, 135, 555, 705, then the new ID would be 706
    • A new folder e.g. /home/.snapshots/706 would be created as the snapper snapshot container (“container” is just my own naming, I don’t know what snapper calls them)
    • A Btrfs snapshot of subvolume /home would be created under the container folder, e.g. /home/.snapshots/706/snapshot
    • The corresponding metadata would be stored under the container folder, e.g. /home/.snapshots/706/info.xml, with content like the following:

      <?xml version="1.0"?>
      <snapshot>
        <type>single</type>
        <num>706</num>
        <date>2024-12-22 16:00:05</date>
        <description>timeline</description>
        <cleanup>timeline</cleanup>
      </snapshot>
      
  • Every hour, run snapper-cleanup.service to clean up snapshots under /home/.snapshots/
    • As TIMELINE_LIMIT_HOURLY="12", only keep the 12 most recently taken hourly snapshots (current, current - 1, … current - 11; but skip the one taken at 00:00 as it would be considered daily)
    • Likewise, only keep the 3 most recently taken daily snapshots (skip weekly)
    • Likewise, only keep the 2 most recently taken weekly snapshots (skip monthly)
    • Remove all snapshots that shall not be kept
  • If explicitly required, snapshots can be manually created by snapper create (-c [config]), e.g. snapper create -c home, and manually removed by snapper delete (-c [config]) [snapshot ID], e.g. snapper delete -c home 1
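The tiered keep/skip logic above boils down to repeatedly keeping the N most recent snapshots of each kind. Here is a toy sketch of that core step (a drastic simplification of snapper's real algorithm, which also handles the hourly/daily/weekly tiers, minimum ages, and number cleanup; keep_most_recent is a made-up helper):

```shell
# Keep only the last N entries of an already-sorted ID list, printing the
# survivors; everything else would be removed by the cleanup service.
keep_most_recent() {
    n=$1; shift
    drop=$(($# - n))                      # how many leading (oldest) IDs to drop
    [ "$drop" -lt 0 ] && drop=0
    i=0
    for id in "$@"; do
        i=$((i + 1))
        [ "$i" -gt "$drop" ] && echo "$id"
    done
    return 0
}

keep_most_recent 3 199 555 706 874 922    # prints 706, 874, 922 (one per line)
```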

With the above setup, the folder /home/.snapshots would look like the following after running snapper for a long time:

> ls -lh /home/.snapshots/
total 100K
drwxr-xr-x 1 root root 32 Nov 22 18:00 1/
drwxr-xr-x 1 root root 32 Jan  5 00:00 1003/
drwxr-xr-x 1 root root 32 Jan  6 00:00 1027/
drwxr-xr-x 1 root root 32 Jan  7 00:00 1051/
drwxr-xr-x 1 root root 32 Jan  7 12:00 1063/
drwxr-xr-x 1 root root 32 Jan  7 13:00 1064/
drwxr-xr-x 1 root root 32 Jan  7 14:00 1065/
drwxr-xr-x 1 root root 32 Jan  7 15:00 1066/
drwxr-xr-x 1 root root 32 Jan  7 16:00 1067/
drwxr-xr-x 1 root root 32 Jan  7 17:00 1068/
drwxr-xr-x 1 root root 32 Jan  7 18:00 1069/
drwxr-xr-x 1 root root 32 Jan  7 19:00 1070/
drwxr-xr-x 1 root root 32 Jan  7 20:00 1071/
drwxr-xr-x 1 root root 32 Jan  7 21:00 1072/
drwxr-xr-x 1 root root 32 Jan  7 22:00 1073/
drwxr-xr-x 1 root root 32 Jan  7 23:00 1074/
drwxr-xr-x 1 root root 32 Jan  8 00:00 1075/
drwxr-xr-x 1 root root 32 Jan  8 01:00 1076/
drwxr-xr-x 1 root root 32 Jan  8 02:00 1077/
drwxr-xr-x 1 root root 32 Jan  8 03:00 1078/
drwxr-xr-x 1 root root 32 Jan  8 04:00 1079/
drwxr-xr-x 1 root root 32 Jan  8 05:00 1080/
drwxr-xr-x 1 root root 32 Jan  8 06:00 1081/
drwxr-xr-x 1 root root 32 Jan  8 07:00 1082/
drwxr-xr-x 1 root root 32 Jan  8 08:00 1083/
drwxr-xr-x 1 root root 32 Jan  8 09:00 1084/
drwxr-xr-x 1 root root 32 Jan  8 10:00 1085/
drwxr-xr-x 1 root root 32 Jan  8 11:00 1086/
drwxr-xr-x 1 root root 32 Dec  1 00:00 199/
drwxr-xr-x 1 root root 32 Dec 16 17:32 555/
drwxr-xr-x 1 root root 32 Dec 23 00:00 706/
drwxr-xr-x 1 root root 32 Dec 30 00:00 874/
drwxr-xr-x 1 root root 32 Jan  1 00:00 922/
drwxr-xr-x 1 root root 32 Jan  3 10:00 965/
drwxr-xr-x 1 root root 32 Jan  4 00:00 979/

An example snapper snapshot container folder /home/.snapshots/706/ would look like the following:

> ls -lh /home/.snapshots/706
total 4.0K
-rw-r--r-- 1 root root 187 Dec 23 00:00 info.xml
drwxr-xr-x 1 root root  36 Aug 26 15:56 snapshot/

We can confirm /home/.snapshots/706/snapshot is indeed a Btrfs snapshot / read-only subvolume:

> btrfs subvolume show /home/.snapshots/706/snapshot/
@home_.snapshots/706/snapshot
        Name:                   snapshot
        UUID:                   b54462aa-3427-c745-8fcd-ac248143a039
        Parent UUID:            0ce1faab-e8db-e94f-a3c8-be85743e5859
        Received UUID:          -
        Creation time:          2024-12-23 00:00:05 +0800
        Subvolume ID:           2265
        Generation:             167279
        Gen at creation:        167278
        Parent ID:              259
        Top level ID:           259
        Flags:                  readonly
        Send transid:           0
        Send time:              2024-12-23 00:00:05 +0800
        Receive transid:        0
        Receive time:           -
        Snapshot(s):
        Quota group:            n/a

and its content is really what was in /home at that point:

> ls -alh /home/.snapshots/706/snapshot
total 0
drwxr-xr-x 1 root     root       36 Aug 26 15:56 ./
drwxr-xr-x 1 root     root       32 Dec 23 00:00 ../
drwxr-xr-x 1 root     root        0 Apr  8  2024 .snapshots/
drwx------ 1 nomad7ji nomad7ji 1.3K Dec 22 22:30 nomad7ji/

and, confirming what we said above, .snapshots as a mountpoint is not recursively snapshotted:

> ls -alh /home/.snapshots/706/snapshot/.snapshots
total 0
drwxr-xr-x 1 root root  0 Apr  8  2024 ./
drwxr-xr-x 1 root root 36 Aug 26 15:56 ../

A minor thing to note here is that snapper does not keep a “database” living outside of the subvolume and the container collection; it just calculates from what has already been created to decide new things such as new IDs. This is a thing I appreciate, as everything lives inside the snapshot container itself and you do not need extra data, or even snapper itself, to recover them. But this also has a “side-effect”: the IDs are not strictly incremental; if you remove e.g. 565, 567, 900, 901, then the new ID could be 565, and if you remove all snapshots then they restart from ID 1 again.
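The next-ID rule can be sketched in a few lines of shell (an assumption-level simplification of snapper's behaviour, not its actual code — next_snapper_id is a made-up helper):

```shell
# Next snapper ID = highest existing ID + 1; with no snapshots left it
# restarts from 1.
next_snapper_id() {
    max=0
    for id in "$@"; do
        [ "$id" -gt "$max" ] && max=$id
    done
    echo $((max + 1))
}

next_snapper_id 1 2 135 555 705    # prints 706
next_snapper_id                    # prints 1
```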

Data robustness

Btrfs itself provides you with online data robustness: as long as it runs with the CoW feature not turned off, it can regularly scrub for and detect data corruption like bitrot, reporting the corruption right away. It’s also not easy to brick unless you play with RAID56, which is discouraged by the Btrfs developers. However, unless you set it up in a Btrfs-native RAID1 or similar setup, the corruption cannot be fixed unless you have a backup, as the unique data has only one copy.

snapper enhances the online data robustness by providing the ability to roll back: it avoids the data loss caused by accidental, harmful, or then-intentional-now-regretted operations. It has saved my ass multiple times when I accidentally deleted my work files, and thankfully I could use the last hourly snapshot to at least restore some of my work. However, as it only snapshots a subvolume of a single filesystem into subvolumes on the same exact filesystem, it is only a hot backup solution and won’t do magic if the FS itself breaks.

Both Btrfs itself and snapper live on the hot, currently active storage, and if the underlying drives die, they have no magic to repair themselves or the data that dies along with them.

For complete data robustness you need layered backup, and a solution to do warm/cold backup, which should in most cases be offline. One famous backup strategy that I follow is 3-2-1: to maintain 3 copies of data, to use 2 different types of media, and to keep at least 1 copy off-site.

With the plain Btrfs + snapper setup we already have 2 copies of the data: 1 real-time and more than 1 copy in snapshots; I count the “more than 1 copy in snapshots” part as 1, as the marginal benefit of additional snapshots decreases fast. What we miss is the last 1 off-site copy of the data, stored on a different type of media.

snasync

When someone has already defined a good design, it’s better to follow it and improve it, rather than throw it away and start all over. As we already have snapper that does the online snapshots creation and cleaning, the best way to have an off-site backup is simply to backup the snapshots to another Btrfs storage so we can have the same robustness as snapper snapshots.

Introducing snasync, a snapper Btrfs snapshots syncer, to back up your Btrfs snapshots created by snapper to warm or cold, and in most cases remote, storage.

snasync does one thing and one thing only: it syncs the snapper containers (my own naming: the folders that live under */.snapshots and contain info.xml and snapshot) one way to “targets”, either local or remote.

As the snapper container ID is only useful for online snapper operations, is not strictly incremental, and is not meaningful from a timeline perspective, snasync syncs local snapper containers to remote in the [prefix]-[timestamp] naming style. An example syncing map is as follows:

/home/.snapshots        -> [email protected]:/srv/backup/snapshots
  - /home/.snapshots/1  ->  - [email protected]:/srv/backup/snapshots/pc_home-20241101030001
  - /home/.snapshots/2  ->  - [email protected]:/srv/backup/snapshots/pc_home-20241102030001
  - /home/.snapshots/13 ->  - [email protected]:/srv/backup/snapshots/pc_home-20241106030001
  ...

In which only 1 would be sent as a whole subvolume; 2 would be sent with its parent defined as 1, 13 would be sent with its parent defined as 2, etc.
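In btrfs terms this is a plain incremental send chain. The helper below is a hypothetical sketch of how such commands are assembled — build_send_cmd is not part of snasync, and in a real sync the output of btrfs send would be piped into btrfs receive on the target:

```shell
# Build the btrfs send command for a snapshot, incremental when a parent
# from an earlier sync is available.
build_send_cmd() {
    snapshot=$1
    parent=$2    # may be empty for the first, full send
    if [ -n "$parent" ]; then
        echo "btrfs send -p $parent $snapshot"
    else
        echo "btrfs send $snapshot"
    fi
}

build_send_cmd /home/.snapshots/1/snapshot
# → btrfs send /home/.snapshots/1/snapshot
build_send_cmd /home/.snapshots/2/snapshot /home/.snapshots/1/snapshot
# → btrfs send -p /home/.snapshots/1/snapshot /home/.snapshots/2/snapshot
```

With -p, only the delta between parent and child crosses the wire, which is what makes syncing the whole chain cheap after the first full send.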

Don’t worry about the snapper metadata as they’re still there:

> ls -lh /srv/backup/snapshots/rz5_root-20241122100006/
total 4.0K
-rw-r--r-- 1 root root 185 Jan  3 14:56 info.xml
drwxr-xr-x 1 root root 150 Jan  3 14:56 snapshot/
> cat /srv/backup/snapshots/rz5_root-20241122100006/info.xml
<?xml version="1.0"?>
<snapshot>
  <type>single</type>
  <num>1</num>
  <date>2024-11-22 10:00:06</date>
  <description>timeline</description>
  <cleanup>timeline</cleanup>
</snapshot>

The timestamp is extracted from the metadata and explicitly included in the name so it’s easier to look up without the help of snapper.
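Deriving such a name is straightforward; here is a hedged sketch of the idea (sync_name is a made-up helper, and snasync's actual parsing may differ):

```shell
# Extract the <date> from a snapper info.xml and squash it into the
# [prefix]-[timestamp] style used for synced containers.
sync_name() {
    prefix=$1
    xml=$2
    ts=$(sed -n 's/.*<date>\([0-9 :-]*\)<\/date>.*/\1/p' "$xml" | tr -d ' :-')
    echo "$prefix-$ts"
}
```

Fed the info.xml shown above (date 2024-11-22 10:00:06) with prefix rz5_root, this yields rz5_root-20241122100006.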

An example snasync run is as follows:

targets=(
    --target /srv/backup/snapshots # a simple path refers to local target, this is my online warm backup
    --target [email protected]:/srv/backup/snapshots # a scp-style target means to sync to remote, this is my off-site cool backup (not entirely cold as not off-line)
)
snasync \
    --source / --prefix rz5_root "${targets[@]}" \
    --source /home --prefix rz5_home "${targets[@]}" \
    --source /srv --prefix rz5_srv "${targets[@]}"

Basically you define a source to sync from (either the subvolume containing .snapshots, or .snapshots itself; snasync figures it out), a prefix to name the synced snapshots with (otherwise derived from the source path), and a few targets to sync them to. Each source has its own targets, but in this case they’re all the same.

The whole operating logic of snasync is as follows:

  • If snapper-cleanup.timer is running, stop it and register an on-exit trigger to start it again.
  • If snapper-cleanup.service is running, wait for it to finish.
  • Scan all sources to get a list of snapper containers to sync; the list would not be updated again during this run and is essentially read-only.
    • If a snapshot is read-write, skip it
    • If info.xml is missing, skip it
    • Timestamp is extracted from info.xml and used as key
    • A timestamp-name map, a timestamp-path map are each created with timestamp with key
    • Timestamps are sorted so later operations go from the minimum to the maximum
  • Iterate through all targets to extract the list of remotes, and make a simple SSH connection to each of them with a control master specified, then quit, to warm up the connection.
    • If a remote is out of reach, mark it as bad, and it would not be used in later syncing
  • For each source, fork out to sync it, so we’re syncing with multiple processes
  • In the source syncer, for each target, fork out to sync it, so we’re syncing with multiple processes
  • Iterate through the corresponding snapper containers, for each of them:
    • If a snapshot was already synced, define parent as the last snapshot, otherwise keep it empty
    • If a container exists in target:
      • If snapshot is read-only or contains a received UUID:
        • If info.xml does not exist, copy the source one to it
        • Consider this container already synced and skip to the next one
      • If snapshot is missing, not read-only, or does not contain a received UUID:
        • Delete the snapshot and info.xml
    • If a container does not exist in target, create the container
    • Sync the source snapshot to target
      • Sender has argument --compressed-data to keep the compressed data to save bandwidth, even if it’s running a local sync
      • Sender has argument --parent [snapshot] if we have already seen a synced snapshot, to do only an incremental sync; otherwise it does not have such an argument
      • Receiver has argument --force-decompress to decompress the data first and then compress with the target compression options
  • Iterate through target containers that start with the prefix and named in expected format
    • If it’s seen at source, keep it untouched
    • If it’s not seen at source, rename it to add .orphan suffix. It’s up to users whether to delete or keep them.
  • In the source syncer, collect target syncers
  • In the main worker, collect source syncers
  • Bring back snapper-cleanup.timer if this was registered
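The sync step in the logic above boils down to a btrfs send | btrfs receive pipeline. Here is a hedged dry-run sketch that builds (but does not run) such a command line; the host name and paths are made-up examples, and this is not snasync’s actual code:

```shell
# Build the send/receive pipeline described above as strings, for inspection.
src=/home/.snapshots/2/snapshot
parent=/home/.snapshots/1/snapshot   # empty when sending the first snapshot whole
remote=backup-host                   # hypothetical SSH host
dir=/srv/backup/snapshots/pc_home-20241102030001

send=(btrfs send --compressed-data)  # keep compressed extents on the wire
[[ "$parent" ]] && send+=(-p "$parent")
send+=("$src")
recv=(btrfs receive --force-decompress "$dir")

echo "${send[*]} | ssh $remote ${recv[*]}"
```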

On my remote backup server the layout looks like the following:

> ls -l /srv/backup/snapshots/
total 0
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241126130020/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241224130016/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241224140003/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241225160003/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241225170003/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241226130022/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241226140022/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241227160003/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241227170003/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241227180003/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241228150001/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241228160001/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_home-20241228170003/
drwxr-xr-x 1 root root 32 Dec 29 17:55 dsk_root-20241126130020/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241224130016/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241224140003/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241225160003/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241225170003/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241226130022/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241226140022/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241227160003/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241227170003/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241227180003/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241228150001/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241228160001/
drwxr-xr-x 1 root root 32 Dec 29 17:57 dsk_root-20241228170003/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20240920170000/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20240930160000/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241031160001/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241130160020/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241208160007/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241215160017/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241217160023.orphan/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241218160025.orphan/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241219160025/
drwxr-xr-x 1 root root 32 Dec 23 21:36 fuo_home-20241220160005/
......
drwxr-xr-x 1 root root 32 Dec 24 14:48 wtr_root-20241224040006/
drwxr-xr-x 1 root root 32 Dec 24 14:48 wtr_root-20241224050016/
drwxr-xr-x 1 root root 32 Dec 24 14:48 wtr_root-20241224060000/

As the snapshots are sent and received with a parent where possible, and compressed on the target, which shall be mounted with a higher compression level than the source, they don’t take much disk space:

> sudo btrfs filesystem du -s /srv/backup/snapshots/*
     Total   Exclusive  Set shared  Filename
  25.81GiB    28.26MiB    18.96GiB  /srv/backup/snapshots/dsk_home-20241126130020
  25.80GiB     4.57MiB    18.97GiB  /srv/backup/snapshots/dsk_home-20241224130016
  25.79GiB     7.00MiB    18.96GiB  /srv/backup/snapshots/dsk_home-20241224140003
  25.79GiB     1.47MiB    18.96GiB  /srv/backup/snapshots/dsk_home-20241225160003
  25.79GiB     1.57MiB    18.96GiB  /srv/backup/snapshots/dsk_home-20241225170003
  25.79GiB     3.44MiB    18.96GiB  /srv/backup/snapshots/dsk_home-20241226130022
  25.79GiB     4.17MiB    18.96GiB  /srv/backup/snapshots/dsk_home-20241226140022
  ...

Note that snasync only renames containers that do not exist locally to have a .orphan suffix, but does not delete them; their content, snapshot and info.xml, is still there:

> ls -lh /srv/backup/snapshots/fuo_home-20241217160023.orphan/
total 4.0K
-rw-r--r-- 1 root root 188 Dec 23 21:43 info.xml
drwxr-xr-x 1 root root  36 Jul  9  2024 snapshot/

One can remove all orphaned containers and the snapshots in them to free up space:

> sudo btrfs subvolume delete /srv/backup/snapshots/*.orphan/snapshot
> sudo rm -rf /srv/backup/snapshots/*.orphan

When backed up data is needed, one can simply navigate through the containers and snapshots:

nomad7ji@wtr /s/b/snapshots> cd rz5_home-20241122100006/
nomad7ji@wtr /s/b/s/rz5_home-20241122100006> ls
info.xml  snapshot/
nomad7ji@wtr /s/b/s/rz5_home-20241122100006> cd snapshot/
nomad7ji@wtr /s/b/s/r/snapshot> ls
nomad7ji/
nomad7ji@wtr /s/b/s/r/snapshot> cd nomad7ji/
nomad7ji@wtr /s/b/s/r/s/nomad7ji> ls
Android/  Building/  Desktop/  Development/  Documents/  Downloads/  go/  Music/  opt/  Pictures/  Public/  Security/  Templates/  Videos/

Offline cold backup is still needed

While the Btrfs + snapper + snasync setup provides enough robustness, all of it relies on the robustness of the Btrfs filesystem, which is not impossible to fail, and you cannot trust the underlying storage 100%.

When possible, please do regular offline cold backups. I do yearly backups onto 25GB HTL BD-R discs with a 10% parity volume, and hoard the discs at my parents’ and the parity volumes on a network drive. It does not matter if the data is hard to retrieve, so it’s even OK to store it on e.g. AWS S3. They’re only needed when everything else fails, and at that point, every part of the work to retrieve them is worth it.

]]>
Bash logging, an improved way2024-12-27T09:15:00+00:002024-12-27T09:15:00+00:00https://7ji.github.io/scripting/2024/12/27/bash-logging-improvedThree months ago I wrote a blog post documenting how to print logs in Bash with function names and line numbers just like in C; however, the method documented there relied on eval and was not very clean. A few weeks ago I found a better way to do this, and I finally had time to write it down.

Let’s start this with a full demo:

log_inner() {
    if [[ "${log_enabled[$1]}" ]]; then
        echo "[${BASH_SOURCE##*/}:${1^^}] ${FUNCNAME[2]}@${BASH_LINENO[1]}: ${*:2}"
    fi
}

log_debug() {
    log_inner debug "$@"
}

log_info() {
    log_inner info "$@"
}

log_warn() {
    log_inner warn "$@"
}

log_error() {
    log_inner error "$@"
}

log_fatal() {
    log_inner fatal "$@"
}

initialize() {
    set -euo pipefail
    declare -gA log_enabled=(
        [debug]='y'
        [info]='y'
        [warn]='y'
        [error]='y'
        [fatal]='y'
    )
    local AIMAGER_LOG_LEVEL="${AIMAGER_LOG_LEVEL:-info}"
    case "${AIMAGER_LOG_LEVEL,,}" in
    'debug')
        :
        ;;
    'info')
        log_enabled[debug]=''
        ;;
    'warn')
        log_enabled[debug]=''
        log_enabled[info]=''
        ;;
    'error')
        log_enabled[debug]=''
        log_enabled[info]=''
        log_enabled[warn]=''
        ;;
    'fatal')
        log_enabled[debug]=''
        log_enabled[info]=''
        log_enabled[warn]=''
        log_enabled[error]=''
        ;;
    *)
        log_fatal "Unknown log level ${AIMAGER_LOG_LEVEL}, shall be one of the"\
            "following (case-insensitive): debug, info, warn, error, fatal"
        return 1
        ;;
    esac
}

work() {
    log_warn "Started working..."
    log_fatal "A fatal error occured!"
}

main() {
    initialize
    log_info "Started running..."
    work
    log_info "Ended"
}

main "$@"

Save the above to /tmp/scripter and run it with bash /tmp/scripter and you’ll have the following output:

[scripter:INFO] main@73: Started running...
[scripter:WARN] work@67: Started working...
[scripter:FATAL] work@68: A fatal error occured!
[scripter:INFO] main@75: Ended

As you can see, we have the script name, log level, function name, line number, and the original log message.

Let me break it down and tell you how it works, let’s focus first on the essential inner logging function:

log_inner() {
    if [[ "${log_enabled[$1]}" ]]; then
        echo "[${BASH_SOURCE##*/}:${1^^}] ${FUNCNAME[2]}@${BASH_LINENO[1]}: ${*:2}"
    fi
}

The function’s if-condition does a simple thing: check whether the log level from arg1 ($1) was enabled, and only print everything from arg2 ($2) onwards with a log prefix when it is enabled.

In the printing line, the used variables and their definitions are:

  • $BASH_SOURCE is a Bash built-in variable, storing the path of the Bash file being interpreted, here /tmp/scripter; as ##*/ removes everything up to and including the last /, ${BASH_SOURCE##*/} would be scripter;
  • $1 is the log level, ${1^^} converts the log level to upper case, so if $1 is info then ${1^^} is INFO;
  • $FUNCNAME is a Bash built-in array containing all of the function names in the stack, backwards in the order they were called, so the full array here would be FUNCNAME=(log_inner log_[level] main), the one we want is thus ${FUNCNAME[2]}, the actual function that calls our logging wrappers;
  • $BASH_LINENO is a Bash built-in array containing all of the places where the functions are called in the stack, backwards in the order they were called, so the full array here would be BASH_LINENO=([line No. where log_inner was called] [line No. where log_[level] was called]), the one we want is thus ${BASH_LINENO[1]}, the line number where our logging wrapper was called.
  • $* is a Bash magic variable that contains all arguments to the current context (here log_inner) joined by the first character in $IFS (here space) into a single string; we want ${*:2}, which contains the arguments from the second to the last.
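To see the stack arrays in action, here’s a tiny standalone snippet (separate from the demo above):

```shell
# Minimal illustration of the FUNCNAME call stack: calling inner from outer
# records the stack frames from innermost to outermost; when run as a plain
# bash script the outermost frame shows up as "main".
inner() { stack="${FUNCNAME[*]}"; }
outer() { inner; }
outer
echo "$stack"   # inner outer main (when run as a plain script)
```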

A log wrapper then is defined as follows:

log_info() {
    log_inner info "$@"
}

It just passes its built-in log level and all remaining arguments to the inner printing function. The reason I use dedicated log_info, log_warn, etc. instead of calling the inner log_inner directly is that I want the logging behaviour to be strictly explicit (remember that we have set -u, so using an undefined variable results in an error, and set -e, so an error results in Bash exiting). Callers can only call log_info and friends, which avoids the mistake of passing a wrong logging level.
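This safety net is easy to observe in a standalone snippet: under set -u, indexing the log_enabled map with a level that doesn’t exist aborts instead of silently mis-logging (assuming Bash >= 4.4, where set -u applies to unset array elements; this is an illustration, not part of the demo):

```shell
# Probe a bogus log level inside a subshell so the failure doesn't kill us.
if ( set -u
     declare -A log_enabled=([info]='y' [warn]='y')
     : "${log_enabled[bogus]}" ) 2>/dev/null; then
    result="no error"
else
    result="aborted on unknown level"
fi
echo "$result"   # aborted on unknown level
```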

When one wants to log something they can use one of the logging wrapper in their function:

work() {
    ...
    log_fatal "A fatal error occured!"
    ...
}

This prints the following log, which is very helpful when debugging:

[scripter:FATAL] work@68: A fatal error occured!

Note also we have a helper struct to record whether a log level was enabled, instead of figuring it out every time logging is triggered:

declare -gA log_enabled=(
    [debug]='y'
    [info]='y'
    [warn]='y'
    [error]='y'
    [fatal]='y'
)

To disable a logging level just set the corresponding value to empty:

local AIMAGER_LOG_LEVEL="${AIMAGER_LOG_LEVEL:-info}"
case "${AIMAGER_LOG_LEVEL,,}" in
...
'error')
    log_enabled[debug]=''
    log_enabled[info]=''
    log_enabled[warn]=''
    ;;
...
esac

This saves the work needed to figure out whether a log level shall be enabled: getting a value from an associative Bash array is much faster than deriving it from the environment every time.

Like the following:

if [[ "${log_enabled[info]}" ]]; then
    do_complex_logging_when_info_level_is_enabled
fi

is of course lighter than the following:

case "${AIMAGER_LOG_LEVEL}" in
info|warn|error|fatal)
    do_complex_logging_when_info_level_is_enabled
    ;;
esac

If there’s no need to figure out the log level again during runtime, then the demo can be simplified to the following:

log_inner() {
    echo "[${BASH_SOURCE##*/}:$1] ${FUNCNAME[2]}@${BASH_LINENO[1]}: ${*:2}"
}

log_debug() {
    log_inner DEBUG "$@"
}

log_info() {
    log_inner INFO "$@"
}

log_warn() {
    log_inner WARN "$@"
}

log_error() {
    log_inner ERROR "$@"
}

log_fatal() {
    log_inner FATAL "$@"
}

initialize() {
    set -euo pipefail
    local AIMAGER_LOG_LEVEL="${AIMAGER_LOG_LEVEL:-info}"
    case "${AIMAGER_LOG_LEVEL,,}" in
    'debug')
        :
        ;;
    'info')
        log_debug() { :; }
        ;;
    'warn')
        log_debug() { :; }
        log_info() { :; }
        ;;
    'error')
        log_debug() { :; }
        log_info() { :; }
        log_warn() { :; }
        ;;
    'fatal')
        log_debug() { :; }
        log_info() { :; }
        log_warn() { :; }
        log_error() { :; }
        ;;
    *)
        log_fatal "Unknown log level ${AIMAGER_LOG_LEVEL}, shall be one of the"\
            "following (case-insensitive): debug, info, warn, error, fatal"
        return 1
        ;;
    esac
}

work() {
    log_warn "Started working..."
    log_fatal "A fatal error occured!"
}

main() {
    initialize
    log_info "Started running..."
    work
    log_info "Ended"
}

main "$@"
]]>
Bash logging with Function name and Line No.2024-09-29T09:00:00+00:002024-09-29T09:00:00+00:00https://7ji.github.io/scripting/2024/09/29/bash-logging-with-funcname-linenoUpdated on 2024-12-27: I’ve written an improved way of logging without the need for eval calls, and it’s recommended to read the new blog post

When writing some lengthy bash script, one might want the script to log with function name and line number so it would be easy to trace some error-prone logics in the future, like how you would use __FUNCTION__ and __LINE__ macros in C projects compiled with GCC.

Luckily there’re $FUNCNAME and $LINENO built-in variables. So you could write your logging statement like this:

myfunc() {
    echo "[DEBUG] ${FUNCNAME}@${LINENO}: Starting some work..."
    if work; then
        echo "[INFO] ${FUNCNAME}@${LINENO}: Successfully finished the work..."
    else
        echo "[ERROR] ${FUNCNAME}@${LINENO}: Failed to do the work !"
        return 1
    fi
}

However, writing these lengthy prefixes is both annoying and error-prone, and it reduces the information density, which is unhelpful when you go back to improve the code.

It’s of course not possible to replace these with functions, as $FUNCNAME and $LINENO would then not trace the place where functions are actually called, but only their inner state. And it would be even more tedious if you want to conditionally log depending on the log level.
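The limitation is easy to demonstrate with a standalone snippet (the exact line number reported depends on where the snippet lives in the file):

```shell
# A naive logging function reports its own name and line, not the caller's.
log() { where="${FUNCNAME[0]}@${LINENO}"; }
myfunc() { log "doing work"; }
myfunc
echo "$where"   # log@<line of the assignment>, not myfunc@<call site>
```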

To simplify the typing work while keeping $FUNCNAME and $LINENO evaluated where they’re called, and to have some conditional log levels, you can define some “macros” that would be expanded by eval:

log_common_start='echo -n "['
log_common_end='] ${FUNCNAME}@${LINENO}: " && false'
log_info="${log_common_start}INFO${log_common_end}"
log_warn="${log_common_start}WARN${log_common_end}"
log_error="${log_common_start}ERROR${log_common_end}"
log_fatal="${log_common_start}FATAL${log_common_end}"

# Debugging-only definitions
if [[ "${aimager_debug}" ]]; then
log_debug="${log_common_start}DEBUG${log_common_end}"
else
log_debug='true'
fi

With the above definition you can write that function instead like this:

myfunc() {
    eval "$log_debug" || echo 'Starting some work...'
    if work; then
        eval "$log_info" || echo 'Successfully finished the work...'
    else
        eval "$log_error" || echo 'Failed to do the work !'
        return 1
    fi
}

The way this works is that, for logging levels enabled, the logging lines are basically expanded to:

echo "prefix" && false || echo 'content'

and both echoes would be executed in this case;

and for logging levels disabled, the logging lines are basically expanded to:

true || echo 'content'

and no echo would be executed in this case.

As a bonus point, you can execute some logics dynamically depending on the log level, e.g.

if ! eval "$log_info"; then
    echo 'Running specific logic when logging level INFO is enabled'
    some_logic_when_info_is_enabled
else
    other_logic_when_info_is_disabled
fi
]]>
Yet Another Arch Linux Router2024-08-02T12:00:00+00:002024-08-02T12:00:00+00:00https://7ji.github.io/networking/2024/08/02/yet-another-arch-router-setupBackground

Recently I got a used enterprise-level edge router, VMWare SD-WAN Edge 620, with two 10G-SFP+ ports and six 1G-RJ45 ports, powered by a 4-core Intel C3558 with 8G of DDR4-ECC memory.

With such powerful hardware I decided to replace my current apartment router (BananaPi BPi-R4, with MT7988A (4-core A73 @1.8GHz) + 4G DDR4 + 8G eMMC, running OpenWrt snapshot) with it. So I could get an unplugged BPI-R4 to tinker with (a new device to add Arch Linux ARM support to!).

As I didn’t want VMs nor containers (router in VM/container results in circular network dependency that’s hard to fix once broken, a.k.a. all-in-boom; VM/container on router results in security holes and tainted firewall), and wanted to have some cutting edge caching proxy running (pacoloco as an Arch repo caching proxy + wireguard + some tproxy services, to be precise). No ESXi, no ProxmoxVE: I don’t want any hypervisors and only wanted to choose a generic Linux distro as base. And no pre-defined configuration: I want to set every possible component up by myself.

So, Arch router again. I had been using Arch Linux on servers for five years, and I used it on a router four years ago for almost a year. But I gave up then due to Network Manager breaking, the broken router cutting off Internet access from time to time, and, of course, some family members complaining. This time I decided to use saner components for the whole router, and embrace a more gentle management style.

The following are the essential components I chose for the previous Arch router four years ago, what I chose for the new router now, and, for reference, what OpenWrt uses:

                    | OpenWrt<=21.02 | OpenWrt>=22.03 | Retired Arch Router | New Arch Router
Network Manager     | UCI + netifd   | UCI + netifd   | Network Manager     | systemd-networkd
Firewall Frontend   | firewall3      | firewall4      | -                   | -
Firewall Backend    | iptables       | nftables       | nftables            | nftables
DNS Server          | dnsmasq        | dnsmasq        | dnsmasq             | bind9 / named
DHCP v4 Server      | dnsmasq        | dnsmasq        | dnsmasq             | kea
DHCP v6 Server      | odhcpd         | odhcpd         | -                   | part of systemd-networkd, SLAAC only
SSH Server          | dropbear       | dropbear       | openssh             | openssh

The new setup has been running stably for a month, and I have finished some projects with the BPI-R4, so I think it’s time to document how I did the setup.

Setup

System installation

Do the base system installation following the Arch Installation Guide; don’t do the installation with a guided installer, as each maintainer of such an installer has their own preference for components. We only want the bare-minimum parts needed to make the system bootable, without even configuring a network manager.

Network

After the first boot and logging in you should have no network. There shouldn’t be any network manager running. If your network is already up and running then you’re on your own to figure out how to remove the pre-configured network stuff.

Figuring out port layout

Try to ip link up all ports shown in ip link that start with en or eth, with no network cable connected, e.g.

sudo ip link set enp3s0f0u2u3 up

After this all ports shall still be shown as state DOWN in ip link

Then connect cable to ports one by one and check which port has state UP after being plugged in.

After all ports are figured out, ip link down all ports, e.g.

sudo ip link set enp3s0f0u2u3 down

The below layout is what I had on my Edge 620.

     10 GbE   |           GbE         
--------------+-----------------------
              |[ens2f2] [ens2f0] [eno5] 
 [eno7] [eno8]|[ens2f3] [ens2f1] [eno6]

Port usage and network layout

The basic idea is that no port bandwidth shall be wasted. A 1G port should be connected to another 1G port, 2.5G to 2.5G, 10G to 10G. 100M devices are banned from the network. Cross-bandwidth bridging is done by hardware.

  • ens2f3 (1G) would be used as the WAN port, connecting to landlord’s modem + router with 300M down link speed and 30M up link speed.
  • all other ports would be joined to a bridge, they can communicate with each other freely, but most traffic happen under switches and most devices connected directly to the router don’t access other LAN devices through the software bridge
    • eno7 (10G) would be connected to a 8x2.5G + 10G switch
      • all 2.5G devices would be connected to the 2.5G ports on switch, some doing bonding
      • a wireless router in AP mode with 3x1G + 2.5G would be connected to the switch to function as 2.5G-1G bridge
        • a 1G switch would be connected to the 1G port on AP
        • high-traffic 1G devices would be connected to either the AP or the 1G switch depending on their in-LAN inter-traffic
    • eno8 (10G) would be connected to my home desktop
    • all other 1G ports would be connected to low-traffic 1G devices (consoles, set-top boxes, etc)

WAN

As I’m renting a room (in a three-room apartment) I have to use my landlord’s ISP subscription. The ISP fiber modem + router has no public IPv4, and as I’m not the actual subscriber I have no privilege to ask the ISP for a public IPv4 address. Luckily, the ISP still delegates an IPv6 /60 prefix, and I could at least get a /64 prefix without breaking the network for my room-mates. So on my WAN, I would have:

  • a private IPv4 address (private to the apartment LAN)
  • a public IPv6 address (in the /64 for the apartment LAN)
  • a public IPv6 /64 prefix (different from the /64 for the apartment LAN, within the /60 delegated by the ISP).

For these I need to configure the WAN to do DHCPv4 (to get a private IPv4 address to do IPv4 routing), IPv6 SLAAC (to get a public IPv6 address to do IPv6 routing) and DHCPv6-PD (to get an IPv6 /64 prefix to assign to the devices in my own LAN), so let’s create a .network file for the interface.

/etc/systemd/network/20-GbE-Down-Left-as-WAN.network

[Match]
Name=ens2f3

[Network]
DHCP=yes
IPv6AcceptRA=yes
IPv4Forwarding=yes
IPv6Forwarding=yes
IPv6PrivacyExtensions=no

[Link]
RequiredForOnline=no

[DHCPv6]
WithoutRA=solicit

In which:

  • Network.DHCP=yes ensures we do both DHCPv4 and DHCPv6 to gain both an IPv4 address and an IPv6 address for the interface
  • Network.IPv4Forwarding=yes and Network.IPv6Forwarding=yes ensure we allow forwarding on this interface (unfortunately for IPv6 the option does not work as expected; additional sysctl setup for IPv6 is needed, read on)
  • Network.IPv6AcceptRA=yes ensures we always do SLAAC for IPv6
  • Network.IPv6PrivacyExtensions=no ensures we keep a stable, consistent SLAAC address for IPv6 instead of rotating temporary privacy addresses
  • Link.RequiredForOnline=no ensures startup of DNS and DHCP servers won’t be delayed if WAN connection is down at boot
  • DHCPv6.WithoutRA=solicit ensures we always start a DHCPv6 client to obtain IPv6 Prefix Delegation, even when there’s no Router Advertisement (like when outer LAN only does DHCPv6 for PD but not SLAAC)

LAN Bridge

All LAN interfaces need to be joined into a single bridge, let’s create a .netdev file for the bridge to init it,

/etc/systemd/network/10-Bridge.netdev

[NetDev]
Name=bridge0
Kind=bridge
# MACAddress= 

Note we didn’t explicitly set the MAC address. As this is a virtual interface it won’t have a persistent MAC address by itself. Some clients won’t be happy about this (e.g. Windows clients consider it a new network each time the router’s MAC address changes, as they associate a “network” with the router’s MAC address). On activation a MAC address is generated for the bridge, and after that we want to make it persistent by uncommenting the MACAddress line and setting it.

Then let’s create a .network file for all LAN interfaces to configure them as slaves of the above bridge,

/etc/systemd/network/20-10GbE-GbE-Others-as-Bridge-Slave.network

[Match]
Name=ens2f0 ens2f1 ens2f2 eno5 eno6 eno7 eno8

[Network]
Bridge=bridge0

[Link]
RequiredForOnline=no

Here Link.RequiredForOnline=no is needed as not all LAN ports are connected at all times. Without it, systemd-networkd-wait-online.service would hang for minutes waiting for these ports. If, however, you’re sure that all these ports will be online 100% of the time, then you can of course remove the [Link] section.

Then let’s create a .network file to configure L-3 network on the bridge interface

/etc/systemd/network/30-Bridge.network

[Match]
Name=bridge0

[Network]
Address=192.168.67.1/24
DHCPPrefixDelegation=yes
IPMasquerade=ipv4
IPv6SendRA=yes
IPv6AcceptRA=no
IPv6Forwarding=yes

[DHCPPrefixDelegation]
UplinkInterface=ens2f3
SubnetId=0
Announce=yes

[IPv6SendRA]
OtherInformation=yes

In which:

  • Network.DHCPPrefixDelegation=yes ensures we assign the IPv6 prefix we got from another interface (in this case, WAN, ens2f3) to devices connected to this interface.
  • Network.IPMasquerade=ipv4 ensures we do NATv4 for devices connected to this interface. This also implicitly sets Network.IPv4Forwarding=yes so traffic forwarded to this interface is allowed.
  • Network.IPv6SendRA=yes and DHCPPrefixDelegation.Announce=yes ensure we send Router Advertisement to announce the delegated /64 prefix, i.e. do SLAAC for devices connected to this interface to let themselves configure an IPv6 address.
  • Network.IPv6AcceptRA=no ensures we won’t do SLAAC for ourself on this interface, this is not necessarily needed but it would prevent a bad-behaving fake router under the LAN doing dirty stuffs breaking the network of the router itself.
  • Network.IPv6Forwarding=yes ensures we allow IPv6 forwarding on this interface (unfortunately for IPv6 the option does not work as expected; additional sysctl setup for IPv6 is needed, read on). We didn’t set Network.IPv4Forwarding=yes as it’s implicitly set by Network.IPMasquerade=ipv4.
  • DHCPPrefixDelegation.UplinkInterface=ens2f3 ensures we get IPv6 prefix from the WAN interface, ens2f3
  • DHCPPrefixDelegation.SubnetId=0 ensures we assign the first possible /64 from the prefix, in this case all of the /64 prefix we get. This is still needed even when we only have a single /64 prefix.
  • IPv6SendRA.OtherInformation=yes ensures we tell LAN devices we have a DHCPv6 server, but we only ever send out additional info like routing, DNS, etc., never a DHCPv6 address. We don’t want a single device to have multiple global (GUA) IPv6 addresses.
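As a concrete illustration of SubnetId (with a made-up documentation prefix, not my real delegation): given a delegated 2001:db8:0:10::/60, the SubnetId picks one of its sixteen /64s by adding the ID to the prefix bits:

```shell
# Enumerate which /64 a given SubnetId would select inside an example /60.
# 2001:db8:0:10::/60 is a made-up documentation prefix for illustration.
pd=0x10   # fourth hextet of the example /60
subnets=()
for id in 0 1 15; do
    subnets+=("$(printf '2001:db8:0:%x::/64' $((pd + id)))")
    echo "SubnetId=$id -> ${subnets[-1]}"
done
```

With only a single /64 delegated, as in my case, SubnetId=0 simply selects that whole /64.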

Wireguard

I also want to connect this apartment network to my wireguard network, partially documented in a previous blog post.

So let’s create a .netdev and a .network for the wireguard interface.

The configs are generated by wireguard-deployer, which I recently wrote, from a minimal config like the following (the layout has changed a lot since my previous post, due to hk1 losing its IPv6 connection):

psk: false
netdev: 30-wireguard
network: 40-wireguard
peers:
  ali:
    ip: 192.168.77.1
    endpoint: ali.fuckblizzard.com # all other can connect to
    direct: [fuo, pdh, hk1, t16]
  fuo:
    netdev: 40-Wireguard
    network: 50-Wireguard
    ip: 192.168.77.2
    endpoint: 
      pdh: fuo.fuckblizzard.com
    forward:
      - 192.168.67.0/24
    keep: [ali]
    direct: [ali, pdh]
  pdh:
    ip: 192.168.77.3
    endpoint: 
      ^neighbor: pd4.fuckblizzard.com
      fuo: pd6.fuckblizzard.com
    forward:
      - 192.168.7.0/24
      - 192.168.15.0/24
      - 192.168.17.0/24
    keep: [ali]
    direct: [ali, fuo, t16]
  hk1:
    ip: 192.168.77.96
    endpoint:
      ^child: hk1.lan
    direct: [ali, pdh]
    keep: [ali, pdh]
    children:
      rz5:
        ip: 192.168.77.97
        endpoint: rz5.lan
      a7j:
        ip: 192.168.77.98
        endpoint: a7j.lan
      v7j:
        ip: 192.168.77.99
        endpoint: v7j.lan
  t16:
    ip: 192.168.77.128
    direct: [ali, pdh]

/etc/systemd/network/40-Wireguard.netdev

[NetDev]
Name=wg0
Kind=wireguard

[WireGuard]
ListenPort=51820
PrivateKeyFile=/etc/systemd/network/keys/wg/private-fuo

# ali
[WireGuardPeer]
PublicKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Endpoint=ali.fuckblizzard.com:51820
AllowedIPs=192.168.77.1
AllowedIPs=192.168.77.128
AllowedIPs=192.168.77.96
AllowedIPs=192.168.77.97
AllowedIPs=192.168.77.98
AllowedIPs=192.168.77.99
PersistentKeepalive=25

# pdh
[WireGuardPeer]
PublicKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Endpoint=pd6.fuckblizzard.com:51820
AllowedIPs=192.168.15.0/24
AllowedIPs=192.168.17.0/24
AllowedIPs=192.168.7.0/24
AllowedIPs=192.168.77.3

/etc/systemd/network/50-Wireguard.network

[Match]
Name=wg0

[Network]
Address=192.168.77.2/24
IPv4Forwarding=yes

[Link]
RequiredForOnline=no

[Route]
Destination=192.168.15.0/24
Scope=link

[Route]
Destination=192.168.17.0/24
Scope=link

[Route]
Destination=192.168.7.0/24
Scope=link

In which:

  • Network.IPv4Forwarding=yes ensures IPv4 forwarding to this interface is allowed
  • Link.RequiredForOnline=no ensures failed wireguard connection won’t result in systemd-networkd-wait-online being blocked

Global networkd config

The option Network.IPv6Forwarding in a .network file sets net.ipv6.conf.[interface].forwarding to 1, similar to how it configures IPv4.

However, for IPv6, the kernel explicitly checks net.ipv6.conf.all.forwarding to decide whether to do IPv6 forwarding, and only does so when it’s set to 1.

A per-interface forwarding on/off option is not a thing, and net.ipv6.conf.[interface].forwarding actually controls the interface-specific host/router behaviour (telling neighbors we’re a router in Neighbour Advertisements with IsRouter=1). So instead of “we would do IPv6 forwarding”, it’s really “(telling others) we can do IPv6 forwarding”.

Due to this, we need to set Network.IPv6Forwarding=yes in /etc/systemd/networkd.conf so networkd would set sysctl net.ipv6.conf.all.forwarding to 1.

/etc/systemd/networkd.conf

[Network]
IPv6Forwarding=yes

I’d recommend against setting it in /etc/sysctl.d: with networkd.conf, all network settings stay tracked in one place, whereas /etc/sysctl.d would scatter network sysctls across two places and make them hard to track.

Reference: systemd issue #33414

Starting network

Now everything’s configured, start the network by doing

sudo systemctl enable --now systemd-networkd
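With networkd up, it’s worth confirming the forwarding sysctls discussed earlier actually took effect. A minimal sketch (bridge0 is this setup’s bridge name; entries missing on a given machine are skipped):

```shell
# Every knob below should read 1 on the router once networkd has applied
# IPv4Forwarding=/IPv6Forwarding=; the dots-to-slashes mapping is how
# sysctl keys map onto /proc/sys paths.
for key in net.ipv4.conf.bridge0.forwarding \
           net.ipv6.conf.bridge0.forwarding \
           net.ipv6.conf.all.forwarding; do
    f="/proc/sys/$(printf %s "$key" | tr . /)"
    if [ -e "$f" ]; then
        printf '%s = %s\n' "$key" "$(cat "$f")"
    fi
done
```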

Let’s also start a temporary DNS server; we need one until our own actual DNS server is set up. systemd ships resolved, so let’s use it for the temp job.

sudo systemctl start systemd-resolved
sudo ln -sf /run/systemd/resolve/stub-resolv.conf /etc/resolv.conf

The network should come up like this (all public addresses obfuscated):

> ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute 
       valid_lft forever preferred_lft forever
2: bridge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether aa:bb:cc:dd:ee:ff brd ff:ff:ff:ff:ff:ff
    inet 192.168.67.1/24 brd 192.168.67.255 scope global bridge0
       valid_lft forever preferred_lft forever
    inet6 1111:2222:3333:4444:xxxx:xxxx:xxxx:xxxx/64 metric 256 scope global dynamic mngtmpaddr 
       valid_lft 174254sec preferred_lft 87854sec
    inet6 fe80::xxxx:xxxx:xxxx:xxxx/64 scope link proto kernel_ll 
       valid_lft forever preferred_lft forever
3: ens2f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master bridge0 state DOWN group default qlen 1000
    link/ether bb:cc:dd:ee:ff:aa brd ff:ff:ff:ff:ff:ff
    altname enp2s0f0
4: ens2f1: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master bridge0 state DOWN group default qlen 1000
    link/ether cc:dd:ee:ff:aa:bb brd ff:ff:ff:ff:ff:ff
    altname enp2s0f1
5: ens2f2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master bridge0 state DOWN group default qlen 1000
    link/ether dd:ee:ff:aa:bb:cc brd ff:ff:ff:ff:ff:ff
    altname enp2s0f2
6: ens2f3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether ee:ff:aa:bb:cc:dd brd ff:ff:ff:ff:ff:ff
    altname enp2s0f3
    inet 192.168.1.2/24 metric 1024 brd 192.168.1.255 scope global dynamic ens2f3
       valid_lft 244740sec preferred_lft 244740sec
    inet6 1111:2222:3333:5555:xxxx:xxxx:xxxx:xxxx/64 scope global dynamic mngtmpaddr noprefixroute 
       valid_lft 174254sec preferred_lft 87854sec
    inet6 fe80::xxxx:xxxx:xxxx:xxxx/64 scope link proto kernel_ll 
       valid_lft forever preferred_lft forever
7: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 192.168.77.2/24 scope global wg0
       valid_lft forever preferred_lft forever
8: eno8: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master bridge0 state DOWN group default qlen 1000
    link/ether ff:aa:bb:cc:dd:ee brd ff:ff:ff:ff:ff:ff
    altname enp5s0f0
9: wlp4s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether ab:cd:ef:ab:cd:ef brd ff:ff:ff:ff:ff:ff
10: eno7: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master bridge0 state UP group default qlen 1000
    link/ether cd:ef:ab:cd:ef:ab brd ff:ff:ff:ff:ff:ff
    altname enp5s0f1
11: eno6: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master bridge0 state DOWN group default qlen 1000
    link/ether ef:ab:cd:ef:ab:cd brd ff:ff:ff:ff:ff:ff
    altname enp7s0f0
12: eno5: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq master bridge0 state DOWN group default qlen 1000
    link/ether df:ab:cd:ef:ab:cd brd ff:ff:ff:ff:ff:ff
    altname enp7s0f1

Remember to go back and set the MAC address for the bridge interface now that one has been generated (here aa:bb:cc:dd:ee:ff on bridge0).

Verify the network by doing a simple pacman -Syu first; reboot if necessary.

Check the network connection: the router should now have a fully working connection, while LAN devices have a partially working one.

LAN devices currently should be able to:

  • get an IPv6 address by SLAAC
  • connect to the network if we configure the IPv4 address and DNS manually

So let’s go further and configure a DHCPv4 server and a DNS server to assign LAN IPs and resolve DNS queries as a caching DNS server.

DHCP server

Devices on the LAN can currently get IPv6 addresses by SLAAC, but they can’t get IPv4 addresses or DNS info. We need a DHCPv4 server to assign IPv4 addresses to devices on the LAN and announce DNS servers, domain search suffixes, etc. in the DHCP lease.

Let’s install kea, ISC’s new reference DHCP server implementation, the successor to ISC DHCP

sudo pacman -S kea

kea provides DHCPv4, DHCPv6, DDNS and a control daemon; we only need DHCPv4 for now.

Config kea DHCPv4 by modifying /etc/kea/kea-dhcp4.conf

My config looks like the following:

{
"Dhcp4": {
    "interfaces-config": {
        "interfaces": [ "bridge0/192.168.67.1" ]
    },
    "control-socket": {
        "socket-type": "unix",
        "socket-name": "/tmp/kea4-ctrl-socket"
    },
    "lease-database": {
        "type": "memfile",
        "lfc-interval": 60
    },
    "expired-leases-processing": {
        "reclaim-timer-wait-time": 10,
        "flush-reclaimed-timer-wait-time": 25,
        "hold-reclaimed-time": 3600,
        "max-reclaim-leases": 100,
        "max-reclaim-time": 250,
        "unwarned-reclaim-cycles": 5
    },
    "renew-timer": 90,
    "rebind-timer": 180,
    "valid-lifetime": 600,
    "option-data": [
        {
            "name": "domain-name-servers",
            "data": "192.168.67.1, 192.168.67.1"
        },
        {
            "name": "domain-name",
            "data": "fuo.lan"
        },
        {
            "name": "domain-search",
            "data": "fuo.lan, lan"
        }
    ],
    "subnet4": [
        {
            "id": 1,
            "subnet": "192.168.67.0/24",
            "pools": [ { "pool": "192.168.67.192 - 192.168.67.254" } ],
            "option-data": [
                {
                    "name": "routers",
                    "data": "192.168.67.1"
                }
            ],
            "reservations": [
                {
                    "hw-address": "AA:BB:CC:DD:EE:FF",
                    "ip-address": "192.168.67.2",
                    "hostname": "switch-sirivision-2500m"
                },
                {
                    "hw-address": "BB:CC:DD:EE:FF:00",
                    "ip-address": "192.168.67.3",
                    "hostname": "ap-tplink-ax6000"
                },
                ......
            ]
        }
    ],
    "loggers": [
    {
        "name": "kea-dhcp4",
        "output-options": [
            {
                "output": "/var/log/kea-dhcp4.log"
            }
        ],
        "severity": "INFO"
    }
  ]
}
}

In which (the Dhcp4. prefix is omitted throughout):

  • interfaces-config.interfaces = [ "bridge0/192.168.67.1" ] tells kea to do DHCPv4 on bridge0, listening on 192.168.67.1
  • lease-database.type = memfile and lease-database.lfc-interval = 60 tell kea to use an in-memory database and flush it to an on-disk file every minute
  • renew-timer = 90 tells clients to start renewing their DHCP lease after 90 seconds
  • rebind-timer = 180 tells clients to start rebinding their lease (contacting any available server) after 3 minutes
  • valid-lifetime = 600 tells kea to hand out DHCP leases with a 10-minute valid lifetime
  • option-data tells kea to send additional info in the DHCP lease
    • domain-name-servers configures DNS servers for clients, in this case the router itself
    • domain-name sets the clients’ domain name (DHCP option 15), here fuo.lan
    • domain-search configures the local domain search suffixes for clients, in this case both fuo.lan and lan, so a dot-less name is looked up as-is first and then with each suffix, e.g. o5p as o5p, then o5p.fuo.lan, then o5p.lan
  • subnet4 defines a list of IPv4 subnets to do DHCP in
    • subnet defines the actual subnet, in this case 192.168.67.0/24
    • pool defines a list of address ranges to use, in this case 192.168.67.192 - 192.168.67.254
    • option-data defines additional options on top of the global option-data, in this case routers to configure the clients’ gateway and default route
    • reservations defines reserved leases for certain clients

Start kea’s DHCPv4 server

sudo systemctl enable --now kea-dhcp4

LAN devices should now be able to get an IPv4 address, routing info and DNS servers by DHCPv4. They should be able to ping Internet IPv4 addresses, e.g. ping 8.8.8.8, but they can’t resolve domain names yet unless they configure an Internet DNS server manually, as we’ve announced the router itself as the DNS server in DHCPv4 and it isn’t serving DNS yet.

DNS server

We would need a DNS server that functions both as caching server (for generic domains and other .lan domains in my wireguard network) and authoritative server (for our fuo.lan zone).

Let’s install bind9, ISC’s reference DNS server implementation

sudo pacman -S bind

Config bind9 named by modifying /etc/named.conf

My config looks like the following:

options {
    directory "/var/named";
    pid-file "/run/named/named.pid";

    listen-on-v6  { none; };
    listen-on { 192.168.67.1; 192.168.77.2; 127.0.0.1; };

    allow-recursion { 192.168.67.0/24; 192.168.77.0/24; 127.0.0.1; };
    allow-transfer { none; };
    allow-update { none; };

    recursion yes;
    auth-nxdomain no;
    dnssec-validation no;

    forwarders { 114.114.114.114; 114.114.115.115; };

    version none;
    hostname none;
    server-id none;
};

include "rndc.conf";

zone "localhost" IN {
    type master;
    file "localhost.zone";
};

zone "0.0.127.in-addr.arpa" IN {
    type master;
    file "127.0.0.zone";
};

zone "1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa" {
    type master;
    file "localhost.ip6.zone";
};


zone "7ji.lan" IN {
    type master;
    file "lan.7ji.zone";
};

include "lan.fuo.dhcp.key";

zone "fuo.lan" IN {
    type master;
    file "lan.fuo.zone";
    update-policy {
        grant dhcp-fuo-lan-key wildcard *.fuo.lan A DHCID;
    };
};

zone "67.168.192.in-addr.arpa" {
    type master;
    file "lan.fuo.rdns.zone";
    update-policy {
        grant dhcp-fuo-lan-key wildcard *.67.168.192.in-addr.arpa PTR DHCID;
    };
};

zone "wg7.lan" {
    type forward;
    forward only;
    forwarders {
        192.168.77.1;
    };
};

zone "pdh.lan" {
    type forward;
    forward only;
    forwarders {
        192.168.77.3;
    };
};

zone "lks.lan" {
    type forward;
    forward only;
    forwarders {
        192.168.77.96;
    };
};

In the options section:

  • listen-on-v6 { none; }; disables DNS service over IPv6, so named won’t bind to or listen on any IPv6 address
  • listen-on { 192.168.67.1; 192.168.77.2; 127.0.0.1; }; limits the IPv4 addresses we bind to and listen on: only LAN IPv4, wireguard IPv4, and localhost
  • allow-transfer { none; }; and allow-update { none; }; ensure we stay the sole master DNS server for the zones we manage
  • recursion yes; enables the server to function as a DNS caching server
  • auth-nxdomain no; ensures we won’t touch AA bit in NXDOMAIN response, i.e. we won’t pretend to be authoritative for zones we don’t own
  • dnssec-validation no; disables DNSSEC
  • forwarders { 114.114.114.114; 114.114.115.115; }; sets upstream DNS servers to forward queries to for zones we don’t control
  • version none; ensures we won’t return the server version for a query of the name version.bind with type TXT and class CHAOS, so we’re mostly transparent for clients
  • hostname none; ensures we won’t return the server hostname for a query of the name hostname.bind with type TXT and class CHAOS, so we’re mostly transparent for clients
  • server-id none; ensures we won’t return a server ID for a Name Server Identifier (NSID) query, or a query of the name ID.SERVER with type TXT and class CHAOS, so we’re mostly transparent for clients

The two include sections are generated and stored privately for security (tee /dev/stderr is only for demonstration; you don’t actually need it when running):

> tsig-keygen dhcp-fuo-lan-key | tee /dev/stderr | sudo install --mode 640 --group named /dev/stdin /var/named/lan.fuo.dhcp.key
key "dhcp-fuo-lan-key" {
        algorithm hmac-sha256;
        secret "xnk+ZYUhCQCEl19hIsNLgqswMBzsJZf62vrlaxuwTEU=";
};
> printf '%s\n' '{' '	"name": "dhcp-fuo-lan-key",' '	"algorithm": "hmac-sha256",' '	"secret": "'$(sudo sed -n 's/^.\+secret "\(.\+\)";$/\1/p' /var/named/lan.fuo.dhcp.key)'"' '}' | tee /dev/stderr | sudo install --mode 600 /dev/stdin /etc/kea/kea-dhcp-fuo-lan.key
{
        "name": "dhcp-fuo-lan-key",
        "algorithm": "hmac-sha256",
        "secret": "xnk+ZYUhCQCEl19hIsNLgqswMBzsJZf62vrlaxuwTEU="
}
> rndc-confgen | tee /dev/stderr | sudo install --mode 400 /dev/stdin /etc/rndc.conf.temp
# Start of rndc.conf
key "rndc-key" {
        algorithm hmac-sha256;
        secret "E9vh+qxVIitnSrEcvBWbbciTsf2kquLil4V5XNgRgR4=";
};
    
options {
        default-key "rndc-key";
        default-server 127.0.0.1;
        default-port 953;
};
# End of rndc.conf

# Use with the following in named.conf, adjusting the allow list as needed:
# key "rndc-key" {
#       algorithm hmac-sha256;
#       secret "E9vh+qxVIitnSrEcvBWbbciTsf2kquLil4V5XNgRgR4=";
# };
# 
# controls {
#       inet 127.0.0.1 port 953
#               allow { 127.0.0.1; } keys { "rndc-key"; };
# };
# End of named.conf
> sudo grep -v '^#' /etc/rndc.conf.temp | grep -v '^$' | tee /dev/stderr | sudo install --mode 600 /dev/stdin /etc/rndc.conf
key "rndc-key" {
        algorithm hmac-sha256;
        secret "E9vh+qxVIitnSrEcvBWbbciTsf2kquLil4V5XNgRgR4=";
};
options {
        default-key "rndc-key";
        default-server 127.0.0.1;
        default-port 953;
};
> sudo grep '^#' /etc/rndc.conf.temp | grep -v '\.conf' | cut -c 3- | tee /dev/stderr | sudo install --mode 640 --group named /dev/stdin /var/named/rndc.conf
key "rndc-key" {
        algorithm hmac-sha256;
        secret "E9vh+qxVIitnSrEcvBWbbciTsf2kquLil4V5XNgRgR4=";
};

controls {
        inet 127.0.0.1 port 953
                allow { 127.0.0.1; } keys { "rndc-key"; };
};
> sudo rm /etc/rndc.conf.temp

Various zones have different definitions:

  • zone localhost, zone 0.0.127.in-addr.arpa and zone 1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.ip6.arpa come from the default config, keep them as-is
  • zone wg7.lan, zone pdh.lan, zone lks.lan are forwarded to other routers in the wireguard network, the sole forwarder for each zone is their WireGuard IP.
  • zone 7ji.lan is the local CNAME-only zone that resolves some service domains to CNAME fuo.lan domains, so I can use e.g. repo.7ji.lan in different places resolved to different LAN domains, repo.fuo.lan at my apartment, repo.pdh.lan at my parents’ house, etc. File /var/named/lan.7ji.zone shall be created with owner:group set to root:named and permission mode 640 so named can only read it but not update it.

      echo '@                       SOA     @ root (
                                      1          ; serial
                                      0          ; refresh (immediately)
                                      0          ; retry (immediately)
                                      604800     ; expire (1 week)
                                      0          ; minimum (immediately)
                                      )
                              NS      @
                              A       192.168.67.1
      *                       CNAME   fuo.lan.' | sudo install --mode 640 --group named /dev/stdin /var/named/lan.7ji.zone
    

    We’ve not set TTL for any record. TTL would always default to SOA minimum, here 0. This means we don’t want any client to cache multi-place 7ji.lan lookup results.

  • zone fuo.lan is the apartment local domain zone, we would define some CNAME records to resolve them to DDNS domains. File /var/named/lan.fuo.zone shall be created with owner:group set to named:named and permission mode 660 so named can both read it and update it:

      echo '@                       SOA     @ root (
                                      1          ; serial
                                      0          ; refresh (immediately)
                                      0          ; retry (immediately)
                                      604800     ; expire (1 week)
                                      0          ; minimum (immediately)
                                      )
                              NS      @
                              A       192.168.67.1
      fuo                     CNAME   @
      xray                    CNAME   @
      repo                    CNAME   @
      git                     CNAME   wtr
      gmr                     CNAME   opi
      bpi                     CNAME   server-bpi-m5
      o5p                     CNAME   server-opi-5plus
      opi                     CNAME   server-opi-5
      wtr                     CNAME   server-aoostar-wtr-pro' | sudo install --mode 660 --owner named --group named /dev/stdin /var/named/lan.fuo.zone
    

    We’ve not set TTL for any record. TTL would always default to SOA minimum, here 0. This means we don’t want any client to cache LAN fuo.lan lookup results.

  • zone 67.168.192.in-addr.arpa is the apartment reverse DNS domain zone. File /var/named/lan.fuo.rdns.zone shall be created with owner:group set to named:named and permission mode 660 so named can both read it and update it.

      echo '@                       SOA     @ root (
                                      1          ; serial
                                      0          ; refresh (immediately)
                                      0          ; retry (immediately)
                                      604800     ; expire (1 week)
                                      0          ; minimum (immediately)
                                      )
                              NS      fuo.lan.
      1                       PTR     fuo.lan.' | sudo install --mode 660 --owner named --group named /dev/stdin /var/named/lan.fuo.rdns.zone
    

    We’ve not set TTL for any record. TTL would always default to SOA minimum, here 0. This means we don’t want any client to cache LAN 192.168.67.y reverse lookup results.
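Before starting named, bind’s bundled checkers can validate the whole setup: named-checkconf parses named.conf plus everything it includes, and named-checkzone checks a zone file against its origin. A sketch (skipped where bind isn’t installed; the cd is needed because the config uses relative include paths):

```shell
if command -v named-checkconf >/dev/null && sudo -n true 2>/dev/null; then
    # relative includes ("rndc.conf", the key file) resolve against the
    # working directory, so run the config check from /var/named
    (cd /var/named && sudo named-checkconf /etc/named.conf)
    # check each local zone file against its declared origin
    sudo named-checkzone 7ji.lan                 /var/named/lan.7ji.zone
    sudo named-checkzone fuo.lan                 /var/named/lan.fuo.zone
    sudo named-checkzone 67.168.192.in-addr.arpa /var/named/lan.fuo.rdns.zone
fi
```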

With bind9 configured up it’s time to start the named server:

sudo systemctl enable --now named

With bind9 named running, we should be able to (by using dig to query 127.0.0.1):

  • Resolve public domains, e.g. dig github.com @127.0.0.1 -> A 20.205.243.166
  • Resolve set LAN domains, e.g. dig xray.fuo.lan @127.0.0.1 -> CNAME fuo.lan -> A 192.168.67.1
  • Resolve set multi-place LAN domains, e.g. dig repo.7ji.lan @127.0.0.1 -> CNAME fuo.lan -> A 192.168.67.1

Now let’s say goodbye to the temporary systemd-resolved server

sudo systemctl stop systemd-resolved
sudo rm /etc/resolv.conf

The router should use itself as the DNS server.

echo 'nameserver 127.0.0.1
search fuo.lan' | sudo tee /etc/resolv.conf

Local DDNS (DHCP + DNS integration)

kea and bind9 can be configured to do DDNS for zones; in this case we want [host].fuo.lan resolved to LAN addresses, and the corresponding rDNS zone resolving those addresses back to the domains.

Modify kea’s DHCPv4 server config to add DDNS options:

{
"Dhcp4": {
    ......
    "dhcp-ddns": {
        "enable-updates": true
    },
    "ddns-qualifying-suffix": "fuo.lan",
    "ddns-override-client-update": true,
    ......
}
}

Restart kea’s DHCPv4 server

sudo systemctl restart kea-dhcp4

Config kea’s DHCP DDNS server by modifying /etc/kea/kea-dhcp-ddns.conf

My config looks like the following:

{
"DhcpDdns":
{
  "ip-address": "127.0.0.1",
  "port": 53001,
  "control-socket": {
      "socket-type": "unix",
      "socket-name": "/tmp/kea-ddns-ctrl-socket"
  },
  "tsig-keys": [
    <?include "/etc/kea/kea-dhcp-fuo-lan.key"?>
  ],
  "forward-ddns" : {
      "ddns-domains": [{
          "name": "fuo.lan.",
          "key-name": "dhcp-fuo-lan-key",
          "dns-servers": [{
              "ip-address": "127.0.0.1"
          }]
      }]
  },
  "reverse-ddns" : {
      "ddns-domains": [{
          "name": "67.168.192.in-addr.arpa.",
          "key-name": "dhcp-fuo-lan-key",
          "dns-servers": [{
              "ip-address": "127.0.0.1"
          }]
      }]
  },
  "loggers": [
    {
        "name": "kea-dhcp-ddns",
        "output-options": [
            {
                "output": "/var/log/kea-ddns.log"
            }
        ],
        "severity": "INFO",
        "debuglevel": 0
    }
  ]
}
}

Start kea’s DDNS server

sudo systemctl enable --now kea-dhcp-ddns

Test whether DDNS works by running tail -f /var/log/kea-ddns.log on the router and restarting the network on a LAN client. When a client requests and gets a DHCP lease, an A record [hostname].fuo.lan. pointing to the leased IP and a PTR record [suffix].67.168.192.in-addr.arpa. pointing back to the A record’s domain should be added automatically.
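The update channel can also be exercised by hand with the same TSIG key kea uses. A sketch using nsupdate from bind; the record name ddns-test and address 192.168.67.200 are throwaways invented purely for this test:

```shell
if command -v nsupdate >/dev/null && sudo -n true 2>/dev/null; then
    # push a signed dynamic update, exactly like kea-dhcp-ddns would
    sudo nsupdate -k /var/named/lan.fuo.dhcp.key <<'EOF'
server 127.0.0.1
zone fuo.lan
update add ddns-test.fuo.lan. 0 A 192.168.67.200
send
EOF
    # the record should now resolve...
    dig +short ddns-test.fuo.lan @127.0.0.1
    # ...then clean up the throwaway record the same way
    sudo nsupdate -k /var/named/lan.fuo.dhcp.key <<'EOF'
server 127.0.0.1
zone fuo.lan
update delete ddns-test.fuo.lan. A
send
EOF
fi
```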

Pacman caching

I need this as I have a LOT of Arch clients on the LAN and want pacman -Syu to run at full LAN bandwidth (here 10G)

Install pacoloco, a pacman caching server

sudo pacman -S pacoloco

Modify its config in /etc/pacoloco.yaml

Mine looks like the following:

download_timeout: 3600
purge_files_after: 2592000
repos:
  archlinux:x86_64:
    urls: &urls_archlinux
      - http://mirrors.ustc.edu.cn/archlinux
      - http://mirrors.tuna.tsinghua.edu.cn/archlinux
  archlinuxarm:aarch64:
    urls: &urls_archlinuxarm
      - http://mirrors.ustc.edu.cn/archlinuxarm
      - http://mirrors.tuna.tsinghua.edu.cn/archlinuxarm
  archlinuxarm:armv7h:
    urls: *urls_archlinuxarm
  archlinuxcn:aarch64:
    urls: &urls_archlinuxcn
      - http://mirrors.ustc.edu.cn/archlinuxcn
      - http://mirrors.tuna.tsinghua.edu.cn/archlinuxcn
  archlinuxcn:any:
    urls: *urls_archlinuxcn
  archlinuxcn:arm:
    urls: *urls_archlinuxcn
  archlinuxcn:armv6h:
    urls: *urls_archlinuxcn
  archlinuxcn:armv7h:
    urls: *urls_archlinuxcn
  archlinuxcn:i686:
    urls: *urls_archlinuxcn
  archlinuxcn:x86_64:
    urls: *urls_archlinuxcn
  arch4edu:aarch64:
    urls: &urls_arch4edu
      - http://mirrors.ustc.edu.cn/arch4edu
      - http://mirrors.tuna.tsinghua.edu.cn/arch4edu
  arch4edu:any:
    urls: *urls_arch4edu
  arch4edu:x86_64:
    urls: *urls_arch4edu
prefetch:
  cron: 0 0 3 * * * *

Start pacoloco

sudo systemctl enable --now pacoloco
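A quick cache check: pacoloco serves /repo/&lt;name-as-in-yaml&gt;/&lt;path under that mirror&gt;, so the archlinux:x86_64 entry above is reachable as below. Fetching a database file twice, the second fetch should be noticeably faster since it comes from the cache (timings via curl’s -w; a sketch, skipped where curl is absent):

```shell
if command -v curl >/dev/null; then
    url=http://127.0.0.1:9129/repo/archlinux:x86_64/core/os/x86_64/core.db
    curl -so /dev/null -w 'cold: %{time_total}s\n' "$url" || true
    curl -so /dev/null -w 'warm: %{time_total}s\n' "$url" || true
fi
```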

Clients in the LAN, including the router itself, could now use e.g. Server = http://fuo.lan:9129/repo/archlinux:x86_64/$repo/os/$arch in their pacman.conf, but I don’t quite like ports in repo URLs.

Let’s go further: listen on repo.7ji.lan so a device moved to a different LAN still uses that LAN’s caching server, and sanitize the URLs a bit so multiple repos can share a single mirrorlist.

Install nginx, an HTTP server

sudo pacman -S nginx

Modify /etc/nginx/nginx.conf to include site configs

http {
    ....
    include sites-enabled/*.conf;
    ...
}

Create a repo config /etc/nginx/sites-available/repo.conf:

server {
    listen 80;
    charset UTF-8;
    server_name repo.7ji.lan;

    rewrite ^/(archlinuxarm|archlinuxcn|arch4edu)/([^/]+)/(.+)$ http://repo.fuo.lan:9129/repo/$1:$2/$2/$3 permanent;
    rewrite ^/archlinux/([^/]+)/os/([^/]+)/(.+)$ http://repo.fuo.lan:9129/repo/archlinux:$2/$1/os/$2/$3 permanent;

    location / {
        autoindex on;
        autoindex_exact_size off;
        autoindex_localtime on;

        root /srv/http/repo;
    }
}

The location / section can be omitted if you don’t have local-only repos. I have a repo 7Ji that I store locally, so I additionally did:

sudo mkdir -p /srv/http/repo/7Ji

The config needs to be linked into sites-enabled before nginx picks it up:

sudo ln -s ../sites-available/repo.conf /etc/nginx/sites-enabled/

Start nginx:

sudo systemctl enable --now nginx

With this config, clients can use the following /etc/pacman.d/mirrorlist:

Server = http://repo.7ji.lan/archlinux/$repo/os/$arch

and the following /etc/pacman.d/mirrorlist-3rdparty:

Server = http://repo.7ji.lan/$repo/$arch

and set up their /etc/pacman.conf simply like

[core]
Include = /etc/pacman.d/mirrorlist

[extra]
Include = /etc/pacman.d/mirrorlist

[multilib]
Include = /etc/pacman.d/mirrorlist

[7Ji]
Include = /etc/pacman.d/mirrorlist-3rdparty

[archlinuxcn]
Include = /etc/pacman.d/mirrorlist-3rdparty

[arch4edu]
Include = /etc/pacman.d/mirrorlist-3rdparty

Note that for ALARM, /etc/pacman.d/mirrorlist would be a little different:

Server = http://repo.7ji.lan/archlinuxarm/$arch/$repo
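The two rewrites can be dry-tested without nginx: replaying the same regexes through sed (map is a throwaway helper written for this demo) shows where each client URL lands:

```shell
# same two rewrites as repo.conf, applied in the same order
map() {
    printf '%s\n' "$1" | sed -E \
        -e 's,^/(archlinuxarm|archlinuxcn|arch4edu)/([^/]+)/(.+)$,http://repo.fuo.lan:9129/repo/\1:\2/\2/\3,' \
        -e 's,^/archlinux/([^/]+)/os/([^/]+)/(.+)$,http://repo.fuo.lan:9129/repo/archlinux:\2/\1/os/\2/\3,'
}
map /archlinux/core/os/x86_64/core.db
# -> http://repo.fuo.lan:9129/repo/archlinux:x86_64/core/os/x86_64/core.db
map /archlinuxarm/aarch64/core/core.db
# -> http://repo.fuo.lan:9129/repo/archlinuxarm:aarch64/aarch64/core/core.db
```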

Firewall

Install nftables, the modern firewall

sudo pacman -S nftables

Modify /etc/nftables.conf to your liking. Notably, you need to keep dhcpv6-client ports open on the WAN side even if you want to limit WAN access; otherwise you won’t get DHCPv6 routing info and prefix delegation.

Mine looks like the following:

define port_wireguard = 51820;
define port_transmission = 51413;
define port_qbittorrent = 60726;
define open_tcp_udp = { dhcpv6-client };
define open_tcp = { $open_tcp_udp, ssh };
define open_udp = { $open_tcp_udp, $port_wireguard };
define allow_forward_wtr_tcp_udp = { $port_transmission, $port_qbittorrent };
define allow_forward_wtr_tcp = { $allow_forward_wtr_tcp_udp, ssh };
define allow_forward_wtr_udp = { $allow_forward_wtr_tcp_udp };
destroy table inet filter
table inet filter {
  chain input {
    type filter hook input priority filter
    policy drop

    ct state invalid drop comment "early drop of invalid connections"
    ct state {established, related} accept comment "allow tracked connections"
    iifname { lo, bridge0, wg0 } accept comment "allow from loopback, lan, and wireguard"
    ip protocol icmp accept comment "allow icmp"
    meta l4proto ipv6-icmp accept comment "allow icmp v6"
    tcp dport $open_tcp accept comment "allow dhcp v6 client, sshd"
    udp dport $open_udp accept comment "allow dhcp v6 client, wireguard"
    pkttype host limit rate 5/second counter reject with icmpx type admin-prohibited
    counter
  }
  chain forward {
    type filter hook forward priority filter
    ct state established,related accept comment "allow forwarded established and related flows"
    iifname bridge0 accept comment "allow lan forwarding to wan and wireguard"
    iifname wg0 accept comment "allow wireguard forwarding to lan and wan"
    iifname ens2f3 oifname bridge0 jump forward_wan_lan comment "allow certain wan-lan forwarding"
    policy drop
  }
  chain forward_wan_lan {
    ip6 daddr & ::ffff:ffff:ffff:ffff == ::707f:3ff:feb9:ee5c jump forward_wtr
    # return to chain forward
  }
  chain forward_wtr {
    tcp dport $allow_forward_wtr_tcp accept comment "allow transmission, sshd"
    udp dport $allow_forward_wtr_udp accept comment "allow transmission"
    # return to chain forward_wan_lan
  }
}
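Before enabling the service, the ruleset can be validated without loading it: nft’s --check (-c) mode parses the whole file and exits non-zero on any error, so a typo can’t leave the router half-firewalled (a sketch; skipped where nft or passwordless sudo isn’t available):

```shell
if command -v nft >/dev/null && sudo -n test -r /etc/nftables.conf 2>/dev/null; then
    # -c parses and validates the file without touching the live ruleset
    sudo nft -c -f /etc/nftables.conf && echo 'ruleset OK'
fi
```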

Start nftables firewall

sudo systemctl enable --now nftables

End

With the above setup you should have a stable and secure network. We could also configure a transparent proxy, but let’s leave that for another post.

]]>