Garage S3 on Kubernetes

I recently upgraded the storage in my main NAS (as opposed to my all-NVMe mini NAS), which I use for longer term “cold” storage. It cost me twice as much as it would have done if I’d bought the drives back in November when I started planning this, but that is a totally separate post/rant (2 disk mirrored storage going from 2TiB to 12TiB).

There are 2 main reasons for the upgrade: first, the drives were over 10 years old, and second, to use it as a backup target for a bunch of things, mainly

  • Snapshots from the NVME NAS
  • As an off site backup target for my Dad’s NAS full of his photo collection

My Dad’s NAS is another Synology device (that we will also be upgrading from 2TiB to 8TiB, again mainly due to the age of the existing drives) and this comes with a tool called HyperBackup which supports a bunch of different targets to send data to, but the 2 most useful remote options are

  • Another Synology NAS
  • An S3 Bucket

I don’t really want to directly expose my or my Dad’s Synology device to the Internet, so an S3 bucket is the way forward.

S3 Buckets

S3 is a standard that started out as an offering from Amazon as part of AWS. It’s basically a better version of WebDAV that allows Objects (files) to be stored and accessed via HTTP.

I’ve previously run Minio as an S3 bucket server in my local Kubernetes cluster, as a place to push photos/videos from my phone and as a local restic backup target.

But they recently made some big changes to move all the front end UI tools to their Enterprise licensed, paid for product. They have also announced that they have stopped development on the Open Source version.

As I work for a company whose core product is Open Source but which also sells an Enterprise licensed version, I sort of understand the need to differentiate between the free version and the paid version. Doing that by taking away features that were previously in the Open Source version feels a little like a rug pull to me, but anyway…

As well as a web GUI for configuring buckets, Minio also supported things like AWS style policies that control what different keys can access at a fine grained level, Object Versioning and OIDC SSO login.

I’m still running an older version, only exposed to my LAN, that still has all the old features, but I thought I’d try an alternative for this remote backup target solution.

Garage

Garage is a little more basic than Minio. It doesn’t offer Object Versioning, and access keys only have read/write/owner privileges rather than finer grained access, but it can do clustering and block level de-duplication to only store data once on a given cluster node (you can configure how many copies to keep across a multi-node cluster).

If you want to expose a bucket as a Web Site then this is done on a separate subdomain rather than sharing the same hostname as the S3 API endpoint as is the case with AWS S3 or Minio.

It can be installed with a helm chart, so it was mainly a case of passing values to set up the storage backend and configuring the Ingress hostnames.

As mentioned previously I have an iSCSI backed StorageClass installed in my cluster that allows me to provision PVCs directly against my Synology NAS, so I’m using that here.

One small quirk was that even though the project has a multi-arch manifest on Docker Hub, the helm chart defaults to pointing only at the AMD64 containers, so it failed to install properly first time on my cluster because I have 2 ARM64 (Pi 4) nodes and the pod was initially scheduled on one of these. I had to override the default container image.

Full set of values passed to the helm chart here:

garage:
  replicationFactor: 1
  s3:
    api:
      region: us-east-1
      rootDomain: ".garage.example.com"
    web:
      rootDomain: ".garage-web.k8s.loc"
      index: "index.html"
image:
  repository: dxflrs/garage
  tag: v2.2.0
deployment:
  replicaCount: 1
persistence:
  meta:
    storageClass: "nfs-client"
    size: 1Gi
  data:
    storageClass: "synology-iscsi"
    size: 3Ti
ingress:
  s3:
    api:
      enabled: true
      className: public
      annotations:
        cert-manager.io/cluster-issuer: smallstep
      hosts:
      - host: garage.k8s.loc
        paths:
        - path: /
          pathType: Prefix
      - host: garage.example.com
        paths:
        - path: /
          pathType: Prefix
      - host: "*.garage.k8s.loc"
        paths:
        - path: /
          pathType: Prefix
      - host: "*.garage.example.com"
        paths:
        - path: /
          pathType: Prefix
      tls:
      - secretName: garage-s3-cert
        hosts:
        - garage.k8s.loc
        - "*.garage.k8s.loc"
    web:
      enabled: true
      className: public
      annotations:
        cert-manager.io/cluster-issuer: smallstep
      hosts:
      - host: "*.garage-web.k8s.loc"
        paths:
        - path: /
          pathType: Prefix
      tls:
      - secretName: garage-web-cert
        hosts:
        - "*.garage-web.k8s.loc"
monitoring:
  metrics:
    enabled: true

The Ingress configuration sets it up to listen on both my internal subdomain and my public one, and to issue HTTPS certificates using my internal CA for the internal names. The external hostnames will be proxied via my externally facing Nginx instance and get public HTTPS certificates from LetsEncrypt.

The metrics service is enabled because it exposes the Admin API which I’ll need later.

I also had to create the “layout” manually, using kubectl to run the required commands in the pod

kubectl exec --stdin --tty -n garage garage-0 -- ./garage status
kubectl exec -it -n garage garage-0 -- ./garage layout assign -z us-east-1 -c 3T <node_id>
kubectl exec -it -n garage garage-0 -- ./garage layout apply --version 1
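
Buckets and access keys can be created the same way from the CLI if needed. The bucket and key names below are just placeholders, and the exact sub-command syntax may vary between Garage versions (check garage --help), but it looks roughly like this:

kubectl exec -it -n garage garage-0 -- ./garage bucket create backups
kubectl exec -it -n garage garage-0 -- ./garage key create backup-key
kubectl exec -it -n garage garage-0 -- ./garage bucket allow --read --write backups --key backup-key

That said, the Web GUI covered below makes this much less of a chore.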

The Web Site domain is only exposed internally for now, and I can always expose it later using some nginx regex magic, e.g.

upstream kube {
  server kube-one.local:443;
  server kube-two.local:443;
  server kube-three.local:443;
  server kube-four.local:443;
  keepalive 20;
}
server {
  listen 80;
  listen [::]:80;

  server_name ~^(?<subdomain>[^.]+).example.com;

  location / {
     proxy_pass https://kube;
     proxy_set_header Host $subdomain.garage-web.k8s.loc;
     proxy_set_header X-Real-IP $remote_addr;
     proxy_set_header X-Forwarded-Proto $scheme;
     proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
     proxy_set_header X-NginX-Proxy true;
  }
}

Web GUI

There is a 3rd party web GUI for Garage which I’ve managed to get running in the same Kubernetes namespace. This allows me to create new buckets and keys without needing to resort to the command line.

There isn’t a helm chart for the gui, but I managed to put together a manifest to get it deployed and configured to talk to the Garage install.

I had to manually create an admin key and use htpasswd to generate an admin password hash.

kubectl exec -it -n garage garage-0 -- ./garage admin-token create web-gui-token
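
The AUTH_USER_PASS value in the manifest below is a bcrypt hash, so something along these lines should generate a suitable value (assuming the Apache htpasswd utility is installed; the user name and password here are placeholders):

htpasswd -nbB -C 10 admin 'changeme'

The output (admin:$2y$10$…) can be pasted straight into the Secret.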

I copied the admin key and password hash into the manifest:

apiVersion: v1
kind: Secret
metadata:
  name: garage-webui
type: Opaque
stringData:
  API_ADMIN_KEY: "xxxxxxxxx"
  AUTH_USER_PASS: "admin:$2y$10$xxxxxxxx"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: garage-webui
data:
  API_BASE_URL: http://garage-metrics:3903
  S3_REGION: us-east-1
  S3_ENDPOINT_URL: http://garage:3900
---
apiVersion: v1
kind: Service
metadata:
  name: garage-webui
spec:
  selector:
    app: garage-webui
  ports:
  - port: 3909
    targetPort: 3909
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: garage-webui
  labels:
    app.kubernetes.io/name: garage-webui
  annotations:
    cert-manager.io/cluster-issuer: smallstep
spec:
  rules:
  - host: garage-webui.k8s.loc
    http:
      paths:
      - pathType: Prefix
        path: "/"
        backend:
          service:
            name: garage-webui
            port: 
              number: 3909
  tls:
  - hosts:
    - garage-webui.k8s.loc
    secretName: garage-webui-cert
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: garage-webui
spec:
  selector:
    matchLabels:
      app: garage-webui
  template:
    metadata:
      labels:
        app: garage-webui
    spec:
      containers:
      - name: garage-webui
        image: khairul169/garage-webui:latest
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
          requests:
            memory: "128Mi"
            cpu: "500m"
        ports:
        - containerPort: 3909
        envFrom:
        - secretRef:
            name: garage-webui
        - configMapRef:
            name: garage-webui

The admin API of the Garage install is exposed on the metrics service I enabled earlier.

Next

Now it’s all set up and running I need to arrange time to head back up to my folks for a visit. I’ll take an external hard drive with me and have HyperBackup create an initial snapshot on that. I can then copy this into the S3 bucket when I get home. This is because Dad’s photo collection is already about 1TiB, which would take ages to push even over FTTP (now I finally have it installed).

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway

(AWS offers a version of this called Snowball, or Snowmobile depending on scale, otherwise known as sneakernet)

Once the backup snapshots are copied into the bucket I’ll be able to change the settings in HyperBackup to point to the S3 endpoint and it will then do incremental backups based on the snapshot.

Acknowledgements

A bunch of this was inspired by @jwildeboer and his blog posts about using Garage, especially the web UI.

WTF is Facebook doing?

I’ve been running my HTTP server logs through a tool called goaccess for a while. This tool generates a bunch of charts showing traffic volume and how it breaks down by the different hostnames/services that NGINX proxies for.

I also have the WordPress JetPack stats enabled on my blog (which is behind the nginx reverse proxy) and as well as generating per-post stats it also shows data on which countries visitors are coming from. I wanted to extend this geolocation to the full set of logs.

Goaccess also has geolocation support using the MaxMind GeoIP database files. So I thought I’d have a go at setting that up.

MaxMind Databases

MaxMind is a service that collates information about the real world location of IP addresses. They sell very detailed data, but they also make some slightly less accurate versions available for free (you do need to sign up for an account). They currently provide 3 versions of the databases.

  1. GeoLite2-Country, this provides Country level geolocation
  2. GeoLite2-City, this provides City level geolocation
  3. GeoLite2-ASN, this provides the ASN (which company owns the IP address range)

They update these files roughly once a week, and as part of the MaxMind account you can generate an Access Token to use with the geoipupdate tool to automate keeping them up to date.
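
The geoipupdate configuration is a single file (usually /etc/GeoIP.conf); a minimal version, with placeholder account details and the database directory matching the paths used below, looks something like this, and geoipupdate can then be run from cron:

AccountID 123456
LicenseKey xxxxxxxxxxxxxxxx
EditionIDs GeoLite2-Country GeoLite2-City GeoLite2-ASN
DatabaseDirectory /usr/local/share/geoip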

A recent update to goaccess means you can actually pass multiple databases when generating the charts, which means it can generate both location and owner data by passing both the City and the ASN databases.

I run the following command via cron every 6 hours to update the stats page.

$ goaccess /var/log/nginx/access.* -o /var/www/html/stats/index.html \
  --log-format=VCOMBINED -j 2 -a --keep-last=14 --db-path=/root/stats \
  --restore --persist --real-os \
  --geoip-database=/usr/local/share/geoip/GeoLite2-ASN.mmdb \
  --geoip-database=/usr/local/share/geoip/GeoLite2-City.mmdb

The map shows countries shaded by volume of traffic and each little circle is a city; you can filter by continent and then country to get down to city level detail.

The ASN table was the data that really surprised me, showing that Facebook is by far the biggest hitter.

NGINX GeoIP2 Plugin

Nginx can also load the MaxMind databases, and this is how basic geo blocking works (e.g. many services have started to block UK based users as a result of the OSA), but it can also be used to make a service load in the local language by default, or offer items in the correct local currency.

It is configured by placing the following in /etc/nginx/conf.d/geoip.conf

geoip2 /usr/local/share/geoip/GeoLite2-Country.mmdb {
    auto_reload 60m;
    $geoip2_data_country_iso_code country iso_code;
}

geoip2 /usr/local/share/geoip/GeoLite2-City.mmdb {
    auto_reload 60m;
    $geoip2_data_city_name          city names en;
    $geoip2_data_state_name         subdivisions 0 names en;
    $geoip2_data_location_latitude  location latitude;
    $geoip2_data_location_longitude location longitude;
    $geoip2_data_time_zone          location time_zone;
}

geoip2 /usr/local/share/geoip/GeoLite2-ASN.mmdb {
    auto_reload 60m;
    $geoip2_data_asn                autonomous_system_number;
    $geoip2_organization            autonomous_system_organization;
}

This creates a bunch of variables that can be used in other sections, for example to block all users from a given country you can add the following to the /etc/nginx/conf.d/geoip.conf file

map $geoip2_data_country_iso_code $is_blocked_country {
    default 0;
    XX      1;
}

Where XX is the 2 letter ISO code for the country and you can then add this inside the server block of the required file in /etc/nginx/sites-available

if ($is_blocked_country = 1) {
  return 451;
}

This means any client that attempts to load content from that host will receive a 451 HTTP response code.

In my case I don’t want to block Facebook (yet), I first want to see what content they are accessing and what User-Agent they are using so I can add it to my robots.txt file to tell them to stop. If they persist then I’ll think about blocking them.

To configure nginx to log the Facebook requests to a separate file I can add this to the end of the /etc/nginx/conf.d/geoip.conf file from earlier

map $geoip2_data_asn $facebook_log {
    default 0;
    32934   1;
}

access_log /var/log/nginx/facebook-access.log vhosts if=$facebook_log;
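
As with any nginx config change, a quick syntax check and reload picks this up:

sudo nginx -t && sudo systemctl reload nginx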

Facebook requests

Within seconds of turning on the logging it looks like they are trying to brute force my Mastodon instance sign up page.

bluetoot.hardill.me.uk:443 2a03:2880:f814:39:: - - [12/Mar/2026:22:42:29 +0000] "GET /auth/sign_up?accept=8c6df81b3294bde5f30b67f34bd03e3f HTTP/2.0" 200 16075 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
bluetoot.hardill.me.uk:443 2a03:2880:f814:1d:: - - [12/Mar/2026:22:42:29 +0000] "GET /auth/sign_up?accept=8a391e8467a175b510a914e02ea45dfb HTTP/2.0" 200 16075 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"
bluetoot.hardill.me.uk:443 2a03:2880:f814:2f:: - - [12/Mar/2026:22:42:30 +0000] "GET /auth/sign_up?accept=cab6b72305c3c2b75131f821c5f32da1 HTTP/2.0" 200 16075 "-" "meta-externalagent/1.1 (+https://developers.facebook.com/docs/sharing/webmasters/crawler)"

I’ve updated my robots.txt to ban the meta-externalagent/1.1 agent and I’ll give it a day or two for them to pull a new copy, but I think I’ll be straight up banning their whole ASN at the weekend.
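
For reference, the robots.txt entry is just the usual agent block (the agent name taken from the logs above):

User-agent: meta-externalagent
Disallow: /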

Update: I emailed the contact address on the linked page and got no response, the hits kept coming even with the new robots.txt file, so for now I have a blanket 301 redirect to a 100GiB gzip bomb for all traffic from that ASN.

Mounting Qemu disk images with LVM volumes

I recently had a VM that failed to boot after applying some updates. I didn’t really have time to work out how to fix it so I created a new VM but there were some files in the VM that I really wanted to keep.

So before I deleted the old VM I wanted to work out how to mount the disk image and recover them. I searched and found some instructions that worked, but because the VM had used LVM for the volumes, cleaning up afterwards was tricky and I managed to get into a state where the only way to totally clean up was to restart the machine.

The following are some quick notes with full instructions to mount and properly clean up afterwards.

Mount

# modprobe nbd max_part=8
# qemu-nbd --connect=/dev/nbd0 /home/user/.local/share/gnome-boxes/images/ubuntu
# vgscan
# vgchange -ay vgName
# mount /dev/mapper/vgName-lvName /mnt

This does the following

  • Load the nbd module
  • Connects the disk image at /home/user/.local/share/gnome-boxes/images/ubuntu to /dev/nbd0
  • Looks for new volume groups
  • Activates the discovered volume group
  • Mounts the logical volume on /mnt

Unmount

# umount /mnt
# lvchange -an /dev/mapper/vgName-lvName
# vgchange -an vgName
# qemu-nbd --disconnect /dev/nbd0
# rmmod nbd

This does the following

  • Unmounts the logical volume on /mnt
  • Deactivates the logical volume
  • Deactivates the volume group
  • Disconnects the disk image from /dev/nbd0
  • Removes the nbd module

Finally Fibre

For those of you who have been following along on Mastodon for a while, you may have noticed the saga of getting FTTP installed at home. What follows is the full story in detail; this is mainly a cathartic outpouring to help me get some closure.

A long time ago…

Work to roll out FTTP in my town started before the 2020 pandemic lockdown, mainly because one of the Altnets had targeted the more rural areas surrounding the town and had started to dig up all the winding country lanes. All was good (apart from it having an impact on my times for the local cycle loops). BT/Openreach had fibre enabled the exchange as it was mandated for all new builds.

Work obviously ground to a halt during lockdown, but the Altnet picked up again afterwards, though they never actually covered the town itself. Openreach did some work, even getting as far as installing some FTTP-OD lines to 2 houses technically on my street (but actually just round the corner). But there were long periods of time when there was no obvious progress and no way to check up.

Progress

Openreach finally pulled the fibre down my street in October 2025. It was made available for purchase on November the 1st and I had been frantically refreshing the Openreach wholesale availability checker every morning since the pull team had been and gone.

I ordered the upgrade from FTTC via my ISP that very morning and an install date was set for November 14th. I got a number of emails and SMS messages in the run up to make sure I was all prepared for the day.

The engineer arrived for the morning slot and I had to point out that my existing copper pair came to the property by a duct under my drive from a vault by the road, which in turn is fed from another vault about 8m down the road in the verge, not from the pole at the end of the street. The PON fibre splitter for my install is in the vault in the verge.

After about 30mins they came back and knocked at the door saying they could not get the fibre from the splitter to the vault at the end of my drive. The suggestion was that the duct was blocked. I took their word for it rather than going to look closer myself; this was probably my first mistake. Rather than write the whole job off, they did all the inside work, drilled the hole for the fibre into the house and hung the little grey fibre splice box on the outside. The reasoning for this was that it would mean the dig team would turn up quicker to complete the work.

I emailed my ISP to let them know the job was not complete and to ensure that my FTTC line would continue to work. That afternoon I got a SMS from Openreach acknowledging the problem and saying they would be in touch in a few days.

Paint lines marking the duct path on the road

The next morning I went off to the forest to ride and came back to find the path of the duct marked out in paint (this probably wasn’t needed as the path was reasonably obvious from the patches in the tarmac…). This set expectations that things would move quickly. How wrong could I be; mistake number 2.

So at this point I still had my FTTC line up and running and high expectations that things would be fixed quickly.

Monday morning rolls round and I get the next SMS from Openreach, saying that they had been out on the Saturday (as seen by the paint on the road) and things were a bit more complicated than expected, but they were on the case and would be getting all the permissions they needed to do the work, and again would be in touch when they knew more.

Tuesday morning and I’m woken by an SMS from my ISP saying my FTTC line was down. I have automatic 4G/LTE fail over so it had kicked in. I got up and emailed my ISP, who informed me that even though I had explicitly informed them of the failure to complete on Friday, they had not stopped the cancellation order for the copper pair and it had gone through just after midnight on Tuesday morning. When I asked if the line could be re-activated I was told “no”. I asked for this to be escalated; because this was a “new” order it was escalated to the Sales manager, who again told me it was impossible to re-activate the line.

At this point I wasn’t that worried, I had the 4G/LTE connection that was working well and things all appeared to be moving quickly so it shouldn’t be a problem. The SIM card I had for the backup link was a pay-once deal for a fixed amount of bandwidth per month (25GB) that you could use for about 18 months after purchase. This worked pretty well; normally it only got used when the FTTC line was down, which at worst had been a couple of hours. The 25GB would last me about a week if I just used it for work and laid off any Netflix marathons. But I ordered another 80GB per month version to help tide me over if it took a bit longer, probably just about 2-3 weeks.

Slight diversion

Due to the “interesting” choices made by the electrician who wired the house when it was built, the Master Socket is to the right of the front door and has no power sockets near it (the closest being the other side of the front door, which presents some interesting trip hazards). To work around this they had run a length of Cat5e down the wall cavity from the attic. A pair from this was inserted into the extension terminals in the back of the Master Socket, and on the other end was another socket faceplate on a box which, when opened, was found to hold a length of “chock-block” used to connect the Cat5e runs to each of the rooms that had a phone socket in a wall plate.

All of this was discovered when I moved in and had a FTTC line provisioned. It wouldn’t sync at a decent speed so my ISP arranged for an Openreach engineer to come and look; he removed the Master Socket by the front door, directly spliced a pair from the Cat5e to the copper pair and put a new Master Socket with a built in filter on the box in the attic. I opted to not connect any of the extensions as I didn’t need a voice line.

When the engineer spliced the copper pair to the Cat5e he removed nearly all of the spare slack that was behind the Master Socket. All of this will become relevant again later on.

Back to the main story

So having been told that there was absolutely no way to get the FTTC copper pair turned back on and the expectation that everything would be completed in 2ish weeks at most, I decided to do some work I’d been planning to do after the install had completed (3rd mistake).

BT ONT powered by PoE adapter

This was to take the old disconnected Master Socket faceplate off the wall by the front door, break the splice with the copper pair and punch the Cat5e into an RJ45 socket. The reason being that I had got the fibre ONT mounted on the wall above the old Master Socket, because I wanted to reuse the old exterior wiring ducts and didn’t really want the exposed fibre routed all across the front of the house, then up to the eaves on the side, to run it into the attic (where all the network gear is). I also didn’t think the engineer would be prepared to do the installation work in the restricted access up there.

So to solve the lack of power problem I punched the attic end of the cable into another RJ45 socket and added a PoE injector and a PoE splitter to the ends to be able to power the ONT.

Sinking feeling

It was then on the 25th of Nov that Openreach sent a 3rd SMS, saying that the planning was all done, and at the very latest it would all be complete by 30th December.

At this point I went back to my ISP and escalated to a technical manager, who decided that it was actually possible to re-activate the copper pair, but after making all the PoE changes the previous weekend there was not enough slack in the wires to reconnect them (told you it would be relevant). So that wasn’t actually an option now.

I had nearly burnt through the second 80GB SIM by now, so had to look for something else. Also, as it was going to be a longer stint than planned, I decided to upgrade the USB cellular modem to something a bit more beefy. I’ve written about that already here.

Another follow up SMS from Openreach on the 13th December again promising it would all be done by the 30th.

While the 30th was a lot later than I expected, it wasn’t really going to cause me that much of a problem as I was heading back to my folks for the last 2.5 weeks of December for Christmas and New Year, and as all the internal work was done, if the duct got fixed sooner an engineer could complete the install by pulling the fibre and doing the splice in the grey box on the outside of the house, and the router should just automatically switch over to FTTP.

But for some reason I didn’t trust this and thought I’d look on the site one.network, which shows all the planned road works for the UK. It can show today, the next 2 weeks, the next 3 months and the next year.

It showed no work planned for the next 3 weeks, but did show an application for work on the 21st-23rd of January. I asked my ISP to chase this with Openreach.

New Year

When I got back home in the first week of January, the road was as I’d left it (except the weather had washed all the paint markings away) and I’d heard nothing back from either my ISP or Openreach.

Checking on one.network again I found work now planned for the 16th-18th of February. Again no update from Openreach about why the date had slipped.

February 16th rolls by, no sign of a dig team, and when I go to check again on the 17th the work has vanished and the new date is the 27th Feb to the 3rd March. Chasing Openreach this time we get a response that strongly implies that their dig paperwork had expired because they kept pushing it back and nobody bothered to check it until the morning of the 17th Feb. To say I was not happy by this point is an understatement.

Surprise twist

On the morning of the 27th I was still in bed when I heard a truck reversing alarm outside on the street. This is a rare occurrence, so I got up and opened the curtains to see an Openreach truck with a bunch of tools and barriers parking up. I got dressed and went to talk to them.

I managed to keep my frustration in check and had a productive chat with the dig team. I also suggested something that had occurred to me over the long wait: that the duct might not actually be blocked, it just had 2 90° bends in it (the duct runs from the verge vault into the road, turns 90° left to follow the road then 90° left again to the vault at the end of the drive).

I then left them to it and decided to walk to the local shop to get some breakfast. The whole trip took about 15mins and as I came back the dig team informed me I was correct, there was no problem with the duct at all.

This turned out to be fortunate, because some of my neighbours have taken to parking on the pavement opposite where the duct is and with the cars parked there it wasn’t going to be wide enough to actually dig up the duct.

Conclusion

So basically, I’ve had a 3.5 month wait for nothing, all because the first engineer couldn’t look at the scars in the tarmac, work out the actual duct routing, and instead decided it must be blocked.

The second engineer turned up on the 5th March and completed the install in under 30mins (pulling the fibre between the 2 vaults and under the drive to the front of the house).

What mainly pisses me off about all this, apart from all of it being unnecessary, is the setting of expectations and then totally failing to meet any of them. While 30th December was later than I would have liked, it was pretty clearly communicated and confirmed; there was then basically radio silence from Openreach after 13th December and I had to find out for myself on one.network every time the date moved, even though they clearly had the means to contact me. If they had just communicated openly things would have been a lot smoother.

I am also pretty disappointed in my ISP, as they used to have a stellar reputation for being able to hold BT/Openreach’s feet to the fire to get problems resolved and in this case that just hasn’t happened.

I have been promised a detailed description of why all the delays happened and why Openreach didn’t make their much promised 30th December date. I will chase this up on Monday.

Running a self hosted Atuin Server in Kubernetes without Postgresql

Shell history is a service offered by the shell, usually accessed by pressing the up arrow to retrieve the last executed command (with subsequent presses scrolling further back). You can also search history using ctrl-r (at least with bash). And finally the history command lists all commands made by that user, up to the limit configured with HISTSIZE or HISTFILESIZE (the man page suggests a default limit of 500).

This works well when you only ever have a single shell session open for a given user on a machine, but once you start to open multiple shells then only the last session closed gets written to the .bash_history file.

I’ve been playing with Atuin from @ellie. This is a better shell history manager and it doesn’t have the multiple session problem. It can also sync history between different machines and allow searching across machines. This is useful when you are trying to remember which machine you ran something on.

The project provides instructions for running the server component on both Docker Compose and Kubernetes, but both sets of instructions use Postgresql as the backend database which is overkill for a single user even across multiple machines.

The server can also use SQLite as the database, which is much lighter weight and only requires a volume mounting to store things, so I thought I’d work out how to run with SQLite on Kubernetes.

Kubernetes Manifest

The following manifest should provide a minimum viable install.

All the configuration is done via environment variables mounted from the ConfigMap.

It provisions a 1GiB volume for the SQLite database, which should be way more than is ever needed. This is mounted on /config in the container.

The Ingress entry is HTTP only because I have a separate NGINX proxy that handles HTTPS termination and getting HTTPS certificates from LetsEncrypt, but I’ll probably add an internal HTTPS endpoint on atuin.k8s.loc and have CertManager request a certificate for that endpoint from my internal Small Step CA.

apiVersion: v1
kind: ConfigMap
metadata:
  name: atuin-config
data:
  ATUIN_OPEN_REGISTRATION: "true"
  ATUIN_DB_URI: "sqlite:///config/atuin.db"
  ATUIN_PORT: "8888"
  ATUIN_HOST: "0.0.0.0"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atuin
spec:
  selector:
    matchLabels:
      app: atuin
  strategy:
    type: Recreate
  replicas: 1
  template:
    metadata:
      labels:
        app: atuin
    spec:
      securityContext:
        fsGroup: 1000
        runAsGroup: 1000
        runAsUser: 1000
      containers:
      - name: atuin
        image: ghcr.io/atuinsh/atuin:v18.11.0
        args:
          - server
          - start
        env:
        - name: ATUIN_OPEN_REGISTRATION
          valueFrom:
            configMapKeyRef:
              name: atuin-config
              key: ATUIN_OPEN_REGISTRATION
        - name: ATUIN_PORT
          valueFrom:
            configMapKeyRef:
              name: atuin-config
              key: ATUIN_PORT
        - name: ATUIN_HOST
          valueFrom:
            configMapKeyRef:
              name: atuin-config
              key: ATUIN_HOST
        - name: ATUIN_DB_URI
          valueFrom:
            configMapKeyRef:
              name: atuin-config
              key: ATUIN_DB_URI
        resources:
            limits:
              cpu: 250m
              memory: 1Gi
            requests:
              cpu: 250m
              memory: 1Gi
        ports:
        - containerPort: 8888
          name: web
        volumeMounts:
        - name: storage
          mountPath: /config
      volumes:
      - name: storage
        persistentVolumeClaim:
          claimName: atuin-storage
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: atuin-storage
spec:
  storageClassName: "default"
  resources:
    requests:
      storage: 1Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteOnce
---
apiVersion: v1
kind: Service
metadata:
  name: atuin
  labels:
    app: atuin
spec:
  ports:
  - port: 80
    name: web
    targetPort: 8888
    protocol: TCP
  clusterIP: None
  selector:
    app: atuin
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: atuin
spec:
  ingressClassName: public
  rules:
  - host: atuin.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: atuin
            port:
              number: 80

And this is deployed with

$ kubectl create namespace atuin
$ kubectl -n atuin apply -f atuin.yaml

Client Setup

Now the server is up and running, the client needs configuring to use it.

This requires changing the default sync_address in the ~/.config/atuin/config.toml file.

## address of the sync server
# sync_address = "https://api.atuin.sh"
sync_address = "http://atuin.k8s.loc"

Once that is done then follow the normal instructions to create an account.

$ atuin register -u <username> -e <email address>

Once this is done, if you are running an internet facing sync server you should probably edit the atuin-config ConfigMap to set the ATUIN_OPEN_REGISTRATION value to "false" and restart the Deployment so others can’t register.
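
That can be done in place with something like this (namespace and names as used in the manifest above):

$ kubectl -n atuin patch configmap atuin-config --type merge -p '{"data":{"ATUIN_OPEN_REGISTRATION":"false"}}'
$ kubectl -n atuin rollout restart deployment atuin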

You only need to register once on the first machine, on subsequent machines you can just use

$ atuin login -u <username>

This will ask for your password and the key provided when you registered (you can also recover the key from the initial machine with atuin key).

Adding a HTTP Proxy on My Alternate WAN Link

I have recently run into a couple of services that are insistent that my home broadband IP address is not located in the United Kingdom.

It is just 2 services; everything else (e.g. Netflix, Amazon or Disney+) shows me as being in the UK.

Looking both the IPv4 and IPv6 addresses up with whois, and checking all the registration data for the AS numbers, all point to them being UK registered, so this appears to be a problem with the geolocation provider these services are using, not something I can fix.

As a workaround I had been tethering my laptop to my phone when I needed to access these services, but that is a pain. I also want something I can use from devices other than my phone.

Solution

Under normal circumstances I actually have 2 different ISP backhaul connections at home. I have my normal primary FTTx connection and a 5G backup connection, over which I run an L2TP link to my primary ISP when the FTTx line is down. The 5G backup link is up all the time, it’s just the L2TP tunnel that only gets turned on when needed.

The 5G link geolocates correctly to the UK due to it being from a UK cellular provider, so if I had a way to route certain request via this link even when the FTTx line was up that would solve the problem.

As I’ve mentioned recently I have upgraded my cellular backup WAN connection to use a dedicated 5G router. This router has 2 Ethernet sockets (and can provide its own WiFi network) and a USB socket.

All of this means I can easily host a HTTP proxy on the 5G router’s LAN and then connect to it from the main house LAN when needed.

I decided to make use of a Pi Zero 2 W I had laying around, but because I don’t have the WiFi enabled on the 5G router (to prevent consuming more channels than needed) I got hold of a USB OTG Ethernet adaptor with a micro USB plug (which supports micro USB power pass through).

This means I can power it from the 5G router and plug it into the second Ethernet socket to connect it to its network.

On the Pi Zero I installed the latest 32-bit minimal Trixie build and then installed squid, which is the standard HTTP proxy software.

# apt-get install squid

I had to make a small change to the default configuration to allow connections from more than just localhost. To do this I added a file called localnet.conf to the /etc/squid/conf.d directory with the following content

http_access allow localnet

This allows any host connecting from an RFC1918 address access. This is safe because it’s protected from direct access from the Internet, as it sits behind a NAT gateway on the 5G connection and the MikroTik is NATing Home LAN traffic to the 192.168.8.0/24 address it uses to route traffic to the 5G connection.

Browser

I use the FoxyProxy plugin on both FireFox and Chrome, which lets me easily edit the proxy configuration, set rules for which domains to use the proxy for, and toggle them on and off.
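
For anything that isn’t a browser it’s just a case of pointing the client at the proxy. A quick way to check it’s working from any machine on the house LAN (the Pi’s address here is a placeholder, and 3128 is squid’s default port):

$ curl -x http://192.168.8.2:3128 https://ifconfig.co/country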

Email DNS Records

I spent a little time at the weekend going over my DNS entries around email for my domain. This was prompted by seeing mention of MTA-STS and looking to see what was needed to make it work.

SPF/DKIM/DMARC

  • SPF is a way to list which IP addresses are allowed to send email from a given domain.
  • DKIM is a cryptographic signature of emails proving that they came from the correct mail server for a given domain.
  • DMARC is the combination of SPF and DKIM along with a policy setting of what to do when email fails to comply with them. It also includes a reporting mechanism.

All 3 of these are pretty much required these days if you want your email to be accepted by the Big Boys (Google, Microsoft and similar) in the email world.

SPF

SPF is implemented as a TXT DNS record on the domain e.g.

hardill.me.uk.		3600	IN	TXT	"v=spf1 ip4:81.187.174.10 ip6:2001:8b0:2c1:4b4e:c92e:ca15:998a:4874/64 ip6:2001:8b0:2c1:4b4e:b01c:d1d6:8dd1:f3c5/64 ip6:2001:8b0:2c1:4b4e::3/64 +mx +a -all"
  • v=spf1 Version 1 of the SPF protocol
  • ip4:81.187.174.10 allowed IPv4 address
  • ip6:2001:8b0:2c1:4b4e::3/64 allowed IPv6 address range
  • +mx allow any IP addresses associated with the MX record for this domain
  • +a allow any IP address associated with the A record for this domain
  • -all disallow ALL other IP addresses

DKIM

DKIM uses a TXT record on a custom subdomain of _domainkey and has further values e.g. here we have the entry for a key called foo.

foo._domainkey.hardill.me.uk. 86400 IN TXT "v=DKIM1;k=rsa;p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQC/EGMAhXSg6YGVbpWvpJwj1MxF+NT8e1OlU+/cje6ry6umNVfyyWgpsotKd0V6MgyM3t+jhw9qDXyMLxRsMcbILjnHfKkgEOuJ+O7cuGStZLm93kkELJs0jubegGzs9OU5RblBjJia/32K7LMzHDj+jojHZzaHJm4WmxwFK2HURwIDAQAB"
  • v=DKIM1 Version 1 of the DKIM protocol
  • k=rsa The key is of RSA type
  • p=MIGfMA0GCSqGSIb3DQEBAQUAA4GNADCBiQKBgQC/EGMAhXSg6YGVbpWvpJwj1MxF+NT8e1OlU+/cje6ry6umNVfyyWgpsotKd0V6MgyM3t+jhw9qDXyMLxRsMcbILjnHfKkgEOuJ+O7cuGStZLm93kkELJs0jubegGzs9OU5RblBjJia/32K7LMzHDj+jojHZzaHJm4WmxwFK2HURwIDAQAB this is the public key used to verify the signature included in the email

There are some other flags, one I should probably look at is t=s to ensure that subdomains are also covered.

DMARC

DMARC tells 3rd party email servers what to do if incoming mail fails to meet the requirements of SPF and DKIM.

This is again a TXT record on a specific hostname e.g.

_dmarc.hardill.me.uk.	3600	IN	TXT	"v=DMARC1;p=quarantine;pct=100;rua=mailto:[email protected]"
  • v=DMARC1 Version 1 of the DMARC protocol
  • p=quarantine Tells 3rd party mail servers to quarantine mail which fails SPF and DKIM checks.
  • pct=100 Tells 3rd party mail servers to quarantine 100% of all failing mail
  • rua=mailto:[email protected] Asks the 3rd party mail server to email a daily report of how much email it has processed for this domain with a breakdown of how many passed all SPF/DMARC tests and details about mail which failed.

The reports are sent in a zipped up XML format and are not really all that human readable, but there are tools to visualise them. I had a poke round and found one called parsedmarc written in Python. I also found a pre-built docker-compose file to run it with Elastic Search and Grafana to plot the charts. The docker-compose config I found was a little out of date, so I forked it and applied a few of the pending PRs from the original repo until it worked. I pointed it at the postmaster email address, which had nearly 2 years of reports, and it munched through them 10 at a time over the course of about an hour.

A row of 3 pie charts showing
- SPF Compliance
- DKIM Compliance
- DMARC outcome

With 2 line charts underneath
- SPF results over time
- DKIM results over time

The data looks pretty good with only a very small number of spoofed emails getting sent and all of them getting quarantined.

MX Records

MX records are what map email domains to the servers that will receive mail for them.

hardill.me.uk.		86400	IN	MX	10 mail.hardill.me.uk.
  • 10 Priority, if multiple records try lowest first
  • mail.hardill.me.uk The host name of the SMTP server to deliver the mail to.

Auto Configuration

I have a whole post about this from a while ago. There are 2 main ways to have email clients automatically configure incoming and outbound email servers based on the email address.

One uses DNS SRV records for the domain.

_submission._tcp.hardill.me.uk.	3600 IN	SRV	0 1 587 mail.hardill.me.uk.
_imaps._tcp.hardill.me.uk. 3600	IN	SRV	0 1 993 mail.hardill.me.uk.
  • 0 Priority, if there are multiple records the lowest priority is tried first
  • 1 Weight, used to pick between records with the same priority (higher weights get proportionally more traffic)
  • 587 Port to use
  • mail.hardill.me.uk is the hostname

The alternative method uses HTTP to load values from the .well-known/autoconfig/mail/config.xml file.

MTA-STS

Whereas SPF/DKIM/DMARC are concerned with outbound email, MTA-STS is a way to signal to 3rd party mail servers trying to deliver email to users on the domain that they should use TLS for all contact with the MX servers.

MTA-STS is implemented as a TXT record and a file hosted on an HTTPS server

_mta-sts.hardill.me.uk.	1800	IN	TXT	"v=STSv1; id=202512140940"
  • v=STSv1 Version 1 of the MTA-STS protocol
  • id=202512140940 An id that is updated each time the file on the HTTPS server changes.

HTTPS Server

The HTTPS server must be hosted on a host with the name mta-sts on the domain it applies to, so in this case mta-sts.hardill.me.uk. The file is hosted at .well-known/mta-sts.txt and looks like this

version: STSv1
mode: testing
mx: mail.hardill.me.uk
max_age: 86400
  • version: STSv1 Version 1 of MTA-STS protocol
  • mode: testing Either testing or enforcing
  • mx: mail.hardill.me.uk Which mail servers to use (there can be multiple of these)
  • max_age: 86400 how long a 3rd party server can cache this file
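
Both halves are easy to sanity check once they are in place:

dig +short TXT _mta-sts.hardill.me.uk
curl -s https://mta-sts.hardill.me.uk/.well-known/mta-sts.txt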

There is also a reporting mode for MTA-STS similar to DMARC called TLS-Reporting

TLS-Reporting

TLS-Reporting uses a TXT record on a subdomain similar to the SRV records used for the auto configuration mentioned earlier, in this case _smtp._tls e.g.

_smtp._tls.hardill.me.uk. 3600	IN	TXT	"v=TLSRPTv1;rua=mailto:[email protected]"
  • v=TLSRPTv1 Version 1 of TLS- Reporting protocol
  • rua=mailto:[email protected] The email address to send the reports to

The tool I installed earlier to handle the DMARC reports can also parse the JSON reports. I’m going to keep an eye on the reports for a few weeks then enable enforcing mode.

DNS over TLS with LetsEncrypt

6 months ago LetsEncrypt announced that they would start issuing certificates for IP addresses. Last week I was curious whether they had actually enabled it yet for general consumption; it turned out to be not yet available for everybody, but there was a forum thread where you could ask to be added to the testing list (I’ve not linked to it as they have said no more testers are needed, it will go live RSN).

When it was announced it was made available on their Staging environment. This behaves just like their production environment, except that the root certificate is not included in the public trust stores (e.g. in your browser or shipped with your OS). But it does allow you to test things. So after the initial announcement I had a play to see if I could get a certificate to use with my DNS server to support DNS over TLS.

ACME Clients

LetsEncrypt issue certificates using a protocol called ACME. I’ve talked about ACME before, but mainly on the Server side as I run my own internal private Certificate Authority which can issue certificates using ACME. But there are also a number of different client side implementations available.

The standard one is called certbot, which is maintained by the EFF, but as of December 2025 it doesn’t support requesting certificates for IP addresses just yet.

Back in August when I was first looking, one of the clients that had already implemented support for the new shortlived profile that supports IP address based certificates was lego, which is written in Go.

LEGO

lego supports a bunch of different ACME challenge mechanisms to prove ownership of the domain/IP address that the certificate is being requested for, but the shortlived profile, when combined with a request for an IP address SAN entry, requires that either the http-01 or the tls-alpn-01 method is used (because this proves you can run a HTTP server on the IP address you are requesting a certificate for).

The following command uses an existing HTTP server running on the machine, serving static content from the /var/www/html directory, and uses the LetsEncrypt staging server.

The -d values specify the hostnames and IP addresses that should be included in the certificate.

--profile shortlived indicates which LetsEncrypt profile to use

lego -d dns.hardill.me.uk -d 81.187.174.10 \
  -d 2001:8b0:2c1:4b4e::2 --http.webroot /var/www/html \
  -m [email protected] --http \
  -s https://acme-staging-v02.api.letsencrypt.org/directory run \
  --profile shortlived

It creates/uses a LetsEncrypt account using the [email protected] email address.

The certificates are stored under the ./lego/certificates directory in the home directory of the user that runs the command.

This all worked, but because the root certificate for the staging server is not trusted by default on most people’s devices I didn’t actually deploy the certificate. Now I’m on the early testing list, the command was modified as follows:

lego -d dns.hardill.me.uk -d 81.187.174.10 \
  -d 2001:8b0:2c1:4b4e::2 --http.webroot /var/www/html \
  -m [email protected] --http run --profile shortlived

The difference is removing the -s option, which pointed the client at the staging environment rather than the default production server.

Bind9

Now I have real certificates, time to setup my DNS server.

I’m running bind v9 on a Debian derived Linux distribution so all the configuration files live in /etc/bind by default. There are 3 default files: named.conf, named.conf.options and named.conf.local.

To enable DNS over TLS first we need to load the certificate and key so I created a new file called named.conf.tls

tls local-tls {
  key-file "/etc/bind/dns.hardill.me.uk.key";
  cert-file "/etc/bind/dns.hardill.me.uk.crt";
};

I had copied the certificate and key file to the /etc/bind directory and made them readable by the bind user.

named.conf just includes the other files, so I added the new named.conf.tls file to the list.

include "/etc/bind/named.conf.tls";
include "/etc/bind/named.conf.options";
include "/etc/bind/named.conf.local";

named.conf.options is where a bunch of options are configured, and this is where we will add the TLS listener on port 853.

options {
  dnssec-validation yes;
  allow-query { any; };
  // recursive internal only
  allow-query-cache { 127.0.0.1; 192.168.1.0/24; };
  allow-recursion { 127.0.0.1; 192.168.1.0/24; };
  listen-on { any; };
  listen-on-v6 { any; };
  //TLS
  listen-on port 853 tls local-tls { any; };
  listen-on-v6 port 853 tls local-tls { any; };
  rate-limit {
    responses-per-second 40;
  };
};

The changes are the listen-on and listen-on-v6 entries with port 853, which is the default DNS over TLS port; the tls flag says to use the certificates that were loaded under the local-tls name in the earlier section.

named.conf.local is where my actual zones are configured and no changes are required there.

Renewing Certificates

I’ve mentioned a few times now that LetsEncrypt’s IP based certificates use the shortlived profile. Certificates issued under the default profile have a 90 day lifetime (this is due to come down to 45 days in 2028), but certificates issued by the shortlived profile only last 160 hours (just short of 7 days).

This means that automating renewing is pretty much required, as having to manually renew them once a week (really every 5 days) is going to get boring really quick.

The lego command to renew looks like this

lego -d dns.hardill.me.uk -d 81.187.174.10 \
  -d 2001:8b0:2c1:4b4e::2 --http.webroot /var/www/html \
  -m [email protected] --http renew --dynamic \
  --profile shortlived --renew-hook="/home/ben/bin/new-dns-cert.sh"

The change is from the run action to renew, plus the --renew-hook option, which points at a script to run after the certificate has been renewed. The --dynamic option signals to only renew the certificate if more than 50% of its life has passed, for shortlived certs.

This renew hook script copies the files to the /etc/bind directory and then triggers bind to reload.

I placed that command in a file called /home/ben/bin/renew-dns-cert.sh. Now we have the renewal scripted we need to run it every day to make sure the cert is renewed before it expires.
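
The hook script itself only needs a few lines; a minimal sketch (the certificate directory will depend on where lego was run from, and a full restart is used here to make sure bind picks up the new certificate):

#!/bin/bash
# copy the renewed certificate and key to where bind expects them
# (adjust CERT_DIR to wherever lego writes its certificates, mentioned above)
CERT_DIR=/home/ben/.lego/certificates
sudo cp "$CERT_DIR/dns.hardill.me.uk.crt" "$CERT_DIR/dns.hardill.me.uk.key" /etc/bind/
sudo chown root:bind /etc/bind/dns.hardill.me.uk.*
sudo chmod 640 /etc/bind/dns.hardill.me.uk.*
# restart bind so the new certificate is loaded
sudo systemctl restart named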

Systemd Timers

The old way to do this would be to set up a cron job, but the new approach is to set up a systemd timer.

These can be global or per user; in this case I’ll set up a user timer. To do this I need to create 2 files in the /home/ben/.config/systemd/user directory.

The first is lego.service which says to run the shell script /home/ben/bin/renew-dns-cert.sh which runs the renewal command I mentioned earlier.

[Unit]
Description=Renew DNS Certificate

[Service]
ExecStart=/home/ben/bin/renew-dns-cert.sh

The second is the actual timer.

[Unit]
Description=Renew DNS certificates

[Timer]
Persistent=true

# instead, use a randomly chosen time:
OnCalendar=*-*-* 3:35
# add extra delay, here up to 1 hour:
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target

The important bit here is the RandomizedDelaySec=1h which adds a random number of seconds to the 3:35 time to help spread things out a bit.

This is all enabled with

systemctl --user enable lego.timer

Testing

The dig tool can be used to test the new DNS over TLS listener.

dig +tls www.hardill.me.uk @dns.hardill.me.uk

Or using the raw IPv6 address.

dig +tls-ca www.hardill.me.uk @2001:8b0:2c1:4b4e::2

The +tls-ca option tells dig to validate the certificate; +tls will use DNS over TLS but not validate that the certificate matches the DNS server.

5G Broadband Fail Over

Due to an unfortunate set of circumstances I have been left in limbo between FTTC and a FTTP install.

BT ONT powered by PoE adapter

My FTTP install failed due to a collapsed duct and my FTTC line was ceased a few days later (an oversight led to this not being cancelled). This was then compounded by being informed it would not be possible to reconstitute the FTTC line (which later turned out not to be the case, but not before I’d made the wiring changes to enable the FTTP ONT to be powered via PoE).

The current timeline for fixing the duct is at least 6 weeks.

So with no fixed line broadband I’ve been running on my emergency LTE backup and making use of my ISP’s L2TP service to keep my static IP addresses routed. This works well for short periods but not long term as my only option. I’ve also been burning through the pre-paid 25GB per month SIM so had to go looking for another SIM that can cover more continuous usage. I found an “Unlimited” pay and go plan from Smarty for £20 a month, which hopefully I will only need to pay for one month, and we’ll see just what the “Fair Use” limit is, but the plan explicitly mentions tethering so fingers crossed.

I spent a while searching for a 5G USB stick to replace the 4G/LTE one, but they just don’t appear to be a thing. I only managed to find one, which came in at about £250, which seemed a bit steep.

The other options were either battery powered WiFi-only travel routers or full on home router replacement units that come with built in WiFi Access Points and a few Ethernet ports. I decided to go with the router option in the end and picked a Zyxel unit priced at about £135 (significantly cheaper than the standalone USB device).

My thought was this would be more useful in the long run.

Zyxel 5G NR 4.67 Gbps Indoor Router

The router arrived in a pretty simple box, with a power supply and a Cat5 cable (a real one with all the pairs hooked up unlike the monster with only 2 pairs connected I found in a box last week)

The power supply came with clip on adaptors for UK, Euro and US sockets.

On the back it has 2 Ethernet sockets, a USB socket and a little rubber cover over 2 connectors for external antenna (there is a little switch to swap between the internal and external antenna). There is also a small rubber flap on the base of the device which covers the micro SIM socket.

The USB socket is to plug in a hard drive that the router can expose to the LAN as shared storage.

After powering it up I hooked a USB Ethernet adaptor up to the laptop, plugged it into the top socket on the back of the device and stuck the gateway IP address (192.168.1.1) into the browser. This opened up the admin interface and, in line with recent EU/UK law, the device had a random password printed on the back to enter, and it made me change it on first login.

After a couple of speed tests, I dug into the LAN configuration settings and changed the default IP address range to 192.168.8.0/24 to match what the DHCP server on the USB LTE stick was handing out to make things easier and to make sure it didn’t clash with my existing LAN range.

I also disabled the WiFi on the device as I have no need to connect anything directly to this router over WiFi and better to reduce channel contention locally.

I then powered everything down and moved it up in the attic next to the rack and plugged the power supply into the UPS.

Mikrotik setup

As I mentioned in the previous post, the USB LTE stick shows up as an extra interface on the Mikrotik router and the stick hands out an IPv4 address via DHCP. So I just need to make sure the interface isn’t used for a default route and add a static route to my ISP’s L2TP endpoint.

The steps to switch over from the LTE USB stick to the new router (a rough RouterOS sketch follows the list):

  • I removed one of the unused ports (6) from the default bridge configuration.
  • Added an IPv4 DHCP client to the interface, unticked Use DNS and told it not to set the default route.
  • Deleted the DHCP client for the LTE interface and then removed the LTE interface (this is needed to fix left over next hop configuration)
  • Updated the static routes to the 2 L2TP endpoints to use the new router as the next hop (192.168.8.1%ether6)
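
Roughly, in RouterOS terms, it looks something like this (the interface names, the LTE interface and the L2TP endpoint address, 198.51.100.1 here, are placeholders for my actual values):

/interface bridge port remove [find interface=ether6]
/ip dhcp-client remove [find interface=lte1]
/ip dhcp-client add interface=ether6 use-peer-dns=no add-default-route=no
/ip route set [find dst-address=198.51.100.1/32] gateway=192.168.8.1%ether6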

Once these changes were made the L2TP client just connected using the new router (because the default PPPoE connection is still down).

Next

The one thing I didn’t mention so far is that I had totally failed to check Three’s 5G coverage (Smarty are an MVNO on the Three network) for my place before setting all this up. It turns out that Three do not have any 5G coverage here at the moment so things are no faster.

This is likely to change soon as Vodafone and Three have just completed a merger and will be combining their networks over the coming weeks. While hopefully my FTTP install will complete real soon now, it does mean I will have a faster backup going forward.

Updating Node-RED Alexa Skill to the latest API version

Over the last couple of months I’ve been pretty busy.

This flurry of activity was triggered by Amazon emailing all the users of the Node-RED Alexa Skill Bridge telling them it was going to stop working between November 1st and 4th.

The first I heard of this impending doom was when one of the users sent me a copy of the email.

We’re reaching out to let you know about changes to how Alexa works with Ben Hardill devices that connect to Alexa using the Node-RED Skill. The Node-RED Skill uses an outdated integration that hasn’t been updated by its developer to support all of Alexa’s existing and new functionality.

As a result, the Ben Hardill devices you have connected using their Alexa Skill will stop working with Alexa on November 1, 2025. This means the devices will no longer work with Alexa voice requests, Routines, or the Alexa app at that time.

While no immediate action is required, we want to alert you to this upcoming change. We know how important it is to have a reliable smart home experience. Works with Alexa certified devices are designed to support Alexa’s latest features and deliver the best voice control. To find products that continue to support Alexa’s growing capabilities, click here.

(Even as a user of my own service I never received this email, and after much pressing I finally got Amazon to say they had apparently informed me of this deadline in a single email 2 years ago and there had been no follow up. I can not find that email in my archives.)

I’m currently working full time for a startup, and since creating the skill I have completely shifted to using Google Assistant based voice devices (I do still have an Echo Dot in the spare room), so with this in mind I wasn’t really all that motivated to keep the skill running. But the weekend after discovering this the weather wasn’t great so I had a poke to see what might be done.

The first of the emails went out around the 28th of August giving basically 2 months before the cut off date.

Challenges

The Node-RED Alexa Skill Bridge is 9 years old (my first post about it is from the 5th November 2016), and to be honest it shows. When I built it I wasn’t really thinking about long term maintenance or presenting a simpler abstraction to the end user. I was mainly thinking how quickly could I get something to turn the lights on/off in my flat.

This meant that I exposed way too much of the messages that Amazon uses internally to the Skills to Node-RED and the end user.

The latest version of the Alexa Skill API totally changes this message format, which means that it is going to be tricky to keep the Node-RED interface the same while transitioning to the new API.

Solutions

The skill is actually made up of 3 components (OK 4, but I’m going to ignore the MQTT broker as that is just a transport at the moment).

  • AWS Lambda
  • Web Application
  • Node-RED Nodes

Of these 3, only the AWS Lambda function actually needs to interact with Alexa; this is where the commands come in and responses are sent back from. This meant that if I could craft a new Lambda to act as a translation layer between the old format and the new, then everything else should (fingers crossed) just keep working.

So this is the approach I decided to take, with the plan to only aim for parity with the existing service. If that could be achieved before the deadline I would consider it a success.

First up was making device discovery work. This wasn’t too bad as all the same device types still existed, just some of them, especially thermostat/heating/aircon type devices, had been split up into a few different sub types.

Actions came next, basic commands. Again this was mainly simple mapping, and applying some constraints to limit the new wider set of options to match just those available in the older format.

Finally queries. The old service was pretty much fire and forget; it keeps no state, nor allows querying the state of devices. There were only two exceptions to this

  • Locks
  • Thermostats

Locks allow querying if something is locked and Thermostats can query the set point and ambient temperature.

The old API had commands that these mapped to so I could re-use them.

Testing

Here things got tricky. The Alexa skill console allows for 2 versions of the skill, a deployed version and a development version. By virtue of being the author of the skill my Alexa account is permanently bound to the development instance.

So I could swap in the new Lambda in the development instance and kick the tires. This is OK, but as with any software project, end users always find ways to use things in ways the developers couldn’t possibly imagine.

There is the option to invite other users to the development version as part of a beta test program, but I just couldn’t get this to work. I added around 5 volunteers’ email addresses in the form, but none ever received the invites they were supposed to.

The other problem was that there is no real way to test the skill’s response to Alexa apart from live. The docs are not great and there is nowhere you can get any feedback if the response JSON to discovery or a command is wrong. You just need to keep modifying it until it works.

So the only option was to try and push the whole thing live. My thinking was it was better to break it 3 weeks ahead of the impending deadline and get people to report bugs and work to get them fixed. There was also the problem that getting a new development version blessed by Amazon for release is a bit of a lottery.

You hit the submit button and it can take over a week before you get a reply, which is normally some complaint that the language used in the description or on the service home page has not met their brand guidelines. So this was another reason to try and get the new version live, as it would stop the clock on the shutdown.

In the end it took 2 goes to get it live, one minor tweak to the description, and remove the link to the End of Free Ride post (because asking for donations/help with the hosting is not allowed even for Open Source skills….).

It went live with basic on/off, brightness and colour for lights. Adding colour temperature for white lights came pretty quickly along with Lock support.

Thermostats started well with set point setting and querying, but not ambient temperature; trying to add that totally broke all discovery for nearly a week, and I had to drop the “CUSTOM” heater mode as that has been removed in the latest API.

Current State of Play

I think things should basically be at parity with the old version, so I’ll call that a success.

There might be some lingering issues with trying to create devices with the same name as devices you have deleted. While the new skill still does bulk discovery, there are also new methods for signalling when devices have been removed or modified. These are a push mechanism that needs a lot of extra authentication code writing (and state tracking). I’m going to see if I can get away with not having to implement that for now.

I pushed the code for the new Lambda to a local private Forgejo git repository, I’ll try and mirror that to GitHub at some point.

My plan is to leave the skill alone for now, I have no incentive to add new device types allowed by the new skill API and the stated goal was to just keep things working. If I get some time over Christmas I may push a new version of the Node-RED nodes just to bump some dependencies.
