<h1>Self-hosting chronicles: plex-oidc, or just writing everything yourself</h1>
<p>2023-03-02 · Toby Lawrence</p>

<p>When I originally designed my home network, in conjunction with allowing outside access to some
internal apps, <a href="https://goauthentik.io">Authentik</a> was the lynchpin of providing the authentication/authorization
part of the equation. It’s configurable, supports forwarding to Plex as an authentication and
pseudo-authorization provider, and it’s open source. It was a solid choice to build things the way I
wanted, knowing that I could at least fork it if I needed to tweak its behavior, or if it went
closed source, and so on.</p>
<p>While I got Authentik configured to perform the task at hand, it didn’t come without some
<a href="/2022/11/25/self-hosting-chronicles-authentik-configuration.html">hardships</a> along the way. However, I figured once I had gotten it installed, that would
be that: no more hardships, nothing else to tweak. Over the course of a few months of usage, though,
the rough edges had built up and it got me thinking about how to shore up that corner of the
infrastructure.</p>
<h2 id="issues-with-authentik">Issues with Authentik</h2>
<h3 id="hard-to-customize-styling">Hard to customize styling</h3>
<p>This one was superficial, but given the amount of time I spent configuring Authentik just to do the
necessary authentication/authorization, it was frustrating not being able to easily change the styling
of various flows.</p>
<p>Authentik normally allows you to configure the “flow background” which is just the background image
displayed on a given flow. It also has a pre-existing stylesheet import in its HTML templates for
custom CSS, so long as you can put a CSS file at the right path on disk. Unfortunately, due to
Authentik’s UI being heavily based on custom components using the shadow DOM, they cannot be styled
simply by injecting a CSS stylesheet at the head of the HTML document.</p>
<p>This limitation made it effectively impossible to style the user-facing flows without manually
overriding the various HTML templates <em>and</em> various bits of bundled JavaScript and CSS. Doing so
would make updating to new versions of Authentik painful, since we’d have to make sure our
changes still applied cleanly and hadn’t broken UI functionality.</p>
<blockquote>
<p>Author’s note: There was an <a href="https://github.com/goauthentik/authentik/issues/2693">issue</a> filed for this shortcoming which now
theoretically has a fix, which is great!</p>
</blockquote>
<h3 id="high-memory-overhead">High memory overhead</h3>
<p>I ran Authentik on <a href="https://fly.io">fly.io</a>, which notably has a free tier: up to 3 virtual machines with 1
shared vCPU and 256MB of memory. This should be more than enough for a simple web application that
serves maybe 10-15 requests a day at most, and yet, I had to go above and beyond the free tier to
run Authentik.</p>
<p>Fundamentally, Authentik and its requirements meant we needed to deploy three applications, and thus
use up all of the allocatable free tier VMs: one for Authentik itself, one for Redis, and one for
Postgres. This would be fine if the free tier VM size was adequate, but you can probably guess
where this is going…</p>
<p>Generally speaking, Redis was able to fit within the free tier constraints because it’s already
designed to be configured with memory limits. Postgres was also usually able to fit within the
limits, but during its own startup, as well as Authentik’s startup, it would often hit the memory
limit and get OOM-killed. The amount of data stored was less than 5MB total. Still, between
pre-allocated buffers and various queries made by Authentik, it often seemed to manage to just crest
the memory limits long enough to be hit by the OOM killer. Authentik itself was the significant
outlier, frequently ballooning up to 350-400MB idle RSS over time.</p>
<p>Due to this, I had to increase the size of the two VMs used for Postgres and Authentik to have 512MB
of memory. Increasing the memory allocation knocked me out of the free tier. As I’ve said in
previous posts, fly.io is awesome and I’m okay with paying them, but it certainly rubbed me the
wrong way: the application is idle nearly 100% of the time, and yet I had to bump up the VM
size, and incur a monthly bill, just because it was a hungry hungry hippo for memory those few times
it happened to serve a request every week.</p>
<h3 id="fundamental-flow-based-design-led-to-a-janky-browser-experience">Fundamental flow-based design led to a janky browser experience</h3>
<p>While Authentik’s flow-based design is excellent from a composability standpoint, being able to
build complex flows with branching logic and so on, my needs were far simpler: always show users the
button to login via Plex on the landing page, and when the login pop-up modal closes because they
successfully authenticated to Plex, the landing page should be able to send all the data back to
Authentik, verify that the user was authenticated and authorized and so on, and then immediately
respond with the data required to trigger a browser redirect back to the OIDC SP (service provider,
Cloudflare Access in this case) right then and there.</p>
<p>In practice, Authentik added many redirect steps that were also just needlessly slow.</p>
<p>The authorization flow always starts with a landing page that provides a button to trigger a login
modal pop-up. This is necessary because without the login modal being triggered from a direct user
action, modern browsers will block the pop-up, since it performs a cross-domain request. Alright,
so we’re always stuck having to load a page where the user needs to at least click the login button.</p>
<p>Once a user successfully authenticated with Plex, the modal closes, and Authentik does its first
redirect to its authentication page, where a little flash banner quickly pops up to say “you’re
authenticated!”. In reality, the page is saying the user is authenticated to Authentik itself, and
not exactly Plex: while we authenticate the user via Plex, we have to create a user in Authentik to
map the Plex user to, and then force the user to be authenticated with Authentik. Understandable,
but slightly clunky for our usecase.</p>
<p>After that, the user is redirected <em>again</em> because now we have to authorize them, and again, they
see a flash banner pop-up that they were successfully authorized. As a user, you may now be confused
about what the hell the login pop-up did if you had to be redirected through all of these other pages. Most
users don’t understand (nor should they need to) the difference between their successful
“authentication” and their successful “authorization.”</p>
<p>Technically speaking, though, this was suboptimal because authenticating to Plex was also how we
authorized the user: do they have access to a specific Plex server? Authentik’s design, though,
didn’t allow this to be collapsed: we had to authenticate the user with Plex first and then
essentially start back at the very top of a traditional authentication/authorization flow.</p>
<h3 id="slow-oidc-performance-led-to-confusion-on-if-things-were-stuck">Slow OIDC performance led to confusion on if things were “stuck”</h3>
<p>On the final redirect, where the user was told they were now authenticated, a redirect would
eventually be generated to send the user back to the OIDC SP. This involved generating signed tokens
and other OIDC-related bits, which is fine and dandy. This part was frustratingly slow, though, due
to some inefficiencies in Authentik around loading and handling signing keys.</p>
<p>From a user perspective, the main problem was that this delay in processing created confusion around
whether or not the user was successfully authenticated/authorized, and if they were supposed to be
seeing the actual application. Of course, they only needed to wait a few seconds and then the page
would finally load, and they would be redirected to the protected resource. On a page that appears
to be fully loaded, though, with no progress indicator, having to wait three or four seconds can
certainly cause the user to think something has gone wrong.</p>
<p>Admittedly, I traced this down to signing keys being reloaded multiple times in a single request,
making a response that should have taken one second to return sometimes take nearly three to four
seconds instead. I tweaked a local checkout of Authentik to find and fix this inefficiency, but
admittedly, I never contributed it back upstream. I do still plan to.</p>
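Authentik is a Python application, but the shape of that fix is language-agnostic: load the key material once and share it, rather than reloading it on every use. A rough illustration of the pattern in Rust, where <code>load_signing_key</code> is a hypothetical stand-in for the expensive disk read and parse:

```rust
use std::sync::OnceLock;

/// Stand-in for the expensive part: reading key material from disk and
/// parsing it. In the Authentik case, work like this was happening several
/// times per request instead of once.
fn load_signing_key() -> Vec<u8> {
    // Hypothetical placeholder; a real loader would read a PEM file, etc.
    b"-----BEGIN PRIVATE KEY----- ...".to_vec()
}

static SIGNING_KEY: OnceLock<Vec<u8>> = OnceLock::new();

/// Every caller gets the same cached key; `load_signing_key` runs once,
/// no matter how many requests come through.
pub fn signing_key() -> &'static [u8] {
    SIGNING_KEY.get_or_init(load_signing_key)
}
```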
<h2 id="crafting-the-minimum-viable-solution-plex-oidc">Crafting the minimum viable solution: <code>plex-oidc</code></h2>
<p>All of this led me to think: could I just write my own solution?</p>
<p>In theory, all we need to do is make the same API calls to Plex to handle the Plex authentication
flow, and then query for the Plex user’s server associations to handle the authorization. Sprinkle
some OIDC on top of that to use it from Cloudflare Access, and voilà!</p>
<p>My day job involves writing Rust, and I had already written a <a href="https://github.com/tobz/cloudflare-access-forwardauth">ForwardAuth service for working with
Cloudflare Access tokens</a> in Rust, which has to deal with OIDC
tokens… so I had a good idea of how I’d approach this, and decided to just sketch something out and
see how far I could get in a sitting.</p>
<h3 id="handling-the-basics">Handling the basics</h3>
<p>Most of this is somewhat rote, but I started out with an equivalent skeleton based on
<code>cloudflare-access-forwardauth</code>, which uses <a href="https://docs.rs/tracing"><code>tracing</code></a> for logging/tracing, <a href="https://docs.rs/axum"><code>axum</code></a>
for the web “framework” – routing, request handlers, parsing headers/query parameters, etc. – and
a cast of other helper crates: <a href="https://docs.rs/axum-sessions"><code>axum_sessions</code></a> for handling sessions,
<a href="https://docs.rs/tower-http"><code>tower_http</code></a> for adding per-request logging and CORS,
<a href="https://docs.rs/tower_governor"><code>tower_governor</code></a> for rate limiting, <a href="https://docs.rs/include_dir"><code>include_dir</code></a> for handling
static content, and a few others. We’re also using <a href="https://docs.rs/tokio"><code>tokio</code></a> as the executor, since <code>axum</code> is
based on <a href="https://docs.rs/hyper"><code>hyper</code></a>, which is built on <code>tokio</code>.</p>
<p>This let us rig up the basics of our service: async runtime, logging, exposing an HTTP API with
sensible observability and hardened defaults. I additionally built out some of the configuration
types which I used with <a href="https://docs.rs/serde"><code>serde</code></a> to deserialize our service configuration from a
configuration file on disk.</p>
<p>Again, more or less what you’d consider the “basics.”</p>
<h3 id="handling-oidc">Handling OIDC</h3>
<p>Next, I needed to handle the OIDC flow that this service would be a part of. I’m using Cloudflare
Access to front all of my exposed self-hosted services, which meant that I only needed to handle the
standard “authorization code” flow.</p>
<p>Cloudflare Access hits the “authorize” endpoint of my service, at which point we trigger the Plex
login flow. If the user successfully authenticates with Plex, and they’re authorized to access our
Plex server, we redirect the user back to the redirect URI provided in the authorization flow
request. Finally, Cloudflare Access calls the “token” endpoint of the service behind-the-scenes to
retrieve a token for the now-authorized user, and if that goes well, the user is redirected to the
intended resource, and we’re done.</p>
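The core of that handshake is a one-time authorization code that later gets exchanged for a token. Setting aside the actual HTTP and JWT plumbing, the state shared between the “authorize” and “token” endpoints can be sketched like this (all names here are hypothetical, not the <code>openidconnect</code> API):

```rust
use std::collections::HashMap;

/// Pending authorization codes, mapping a one-time code to the subject
/// (user) it was issued for. A real server would also track the client ID,
/// redirect URI, nonce, scopes, and an expiry for each code.
#[derive(Default)]
pub struct CodeStore {
    pending: HashMap<String, String>,
    counter: u64,
}

impl CodeStore {
    /// Issue a fresh authorization code for `subject`. A real implementation
    /// would generate a cryptographically random code, not a counter.
    pub fn issue_code(&mut self, subject: &str) -> String {
        self.counter += 1;
        let code = format!("code-{}", self.counter);
        self.pending.insert(code.clone(), subject.to_string());
        code
    }

    /// Exchange a code for the subject it was issued to. Codes are
    /// single-use: exchanging the same code twice fails.
    pub fn exchange_code(&mut self, code: &str) -> Option<String> {
        self.pending.remove(code)
    }
}
```

The single-use property matters: Cloudflare Access exchanges the code exactly once, behind the scenes, and anything replaying it should be rejected.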
<p>I used the <a href="https://docs.rs/openidconnect"><code>openidconnect</code></a> crate to build the OIDC routes and handle all of the
relevant bits:</p>
<ul>
<li>a configuration endpoint which details which OAuth 2.0 flows are available, what scopes can be
requested, what signing algorithms we can use, and so on</li>
<li>a JWKS (JSON Web Key Set) endpoint, which describes which signing keys are valid in terms of
verifying a signed JWT (JSON Web Token) returned by the service</li>
<li>the authorize and token endpoints, which start the authorization flow and allow retrieving the
token of an authorized user, respectively</li>
</ul>
<p><code>openidconnect</code> is more geared towards building OIDC clients, rather than servers, but all of the
necessary and relevant primitives exist in the crate. This did mean, however, that we had to lean a
lot on the Authentik source itself, as well as <a href="https://darutk.medium.com/diagrams-of-all-the-openid-connect-flows-6968e3990660">random blog posts</a> and <a href="https://auth0.com/docs/authenticate/login/oidc-conformant-authentication/oidc-adoption-auth-code-flow">application
docs</a> from Auth0 on OIDC flows – including a surprisingly <a href="https://www.ibm.com/docs/en/was-liberty/base?topic=connect-configuring-openid-client-in-liberty">straightforward set of
OIDC docs</a> from IBM (go figure) – to understand what each step needed to do,
functionally. RFCs are utterly dense, and I find them impenetrable for understanding things
holistically like “ok, how does OIDC work?”.</p>
<p>Thanks, random bloggers, for all of your OIDC explanations.</p>
<p>I also used an in-application, in-memory cache to hold all of our pending authorization flow data.
Most full solutions (Authentik, Keycloak, etc.) will actually put this stuff in a database for you,
since they rightfully expect that things such as refresh tokens will be, you know, refreshed over
time… and that you want to keep authorized users authorized even if the service restarts. My
stakes are lower – authorizing friends and family to access media apps – so I went with the
simplest possible approach.</p>
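The idea is simple enough to approximate with the standard library alone. This is just an illustrative sketch of a TTL map, not the actual cache crate I used:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// A tiny time-to-live cache: entries silently expire after `ttl`.
/// A real caching crate adds capacity bounds, eviction of expired
/// entries, and thread safety on top of this core idea.
pub struct TtlCache<K, V> {
    ttl: Duration,
    entries: HashMap<K, (Instant, V)>,
}

impl<K: Eq + Hash, V: Clone> TtlCache<K, V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    pub fn insert(&mut self, key: K, value: V) {
        self.entries.insert(key, (Instant::now(), value));
    }

    /// Returns the value only if it hasn't outlived the TTL.
    pub fn get(&self, key: &K) -> Option<V> {
        self.entries.get(key).and_then(|(inserted, value)| {
            if inserted.elapsed() < self.ttl { Some(value.clone()) } else { None }
        })
    }
}
```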
<p>I used <a href="https://docs.rs/moka"><code>moka</code></a> for this, which is an easy-to-use and capable caching crate. <code>moka</code> also uses
<a href="https://docs.rs/quanta"><code>quanta</code></a>, one of my crates for providing cross-platform access to fast, high-precision
timing sources like TSC. Something something subscribe to my Soundcloud… but really, I wanted to
support people who support me. Thanks <code>moka</code>! 👋</p>
<h3 id="authenticating-and-authorizing-users-via-plex">Authenticating and authorizing users via Plex</h3>
<p>The final piece of the puzzle was getting the users authenticated with Plex, which we then piggyback
on to do authorization. Plex’s API lets you query if a user has access to a particular server, so
the flow ends up looking something like:</p>
<ul>
<li>user authenticates to Plex</li>
<li>once authenticated, we can query their friend list to see if a specified user ID (our user, the
admin user/owner of the given Plex server) is contained</li>
<li>if contained, they are authorized, otherwise, they’re not authorized</li>
</ul>
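The steps above reduce to a plain membership test; the <code>PlexFriend</code> type below is a hypothetical stand-in for the real (de)serialized Plex API payload:

```rust
/// A single entry from the authenticated user's Plex friends list.
/// (The real payload has many more fields; this is a minimal stand-in.)
pub struct PlexFriend {
    pub id: u64,
    pub username: String,
}

/// The user is authorized if the configured server owner's Plex user ID
/// appears in their friends list, i.e. the server is shared with them.
pub fn is_authorized(friends: &[PlexFriend], owner_id: u64) -> bool {
    friends.iter().any(|friend| friend.id == owner_id)
}
```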
<p>This is pretty straightforward, so I whipped up some helper types to use for (de)serializing request
and response payloads and wrote a simple client wrapper based on <a href="https://docs.rs/reqwest"><code>reqwest</code></a>, which was a
bit simpler to deal with than using <code>hyper</code> directly. If I were on a budget, maybe in terms of
reducing transitive dependencies or maximizing performance or memory usage or something, I might use
<code>hyper</code> directly to only do <em>exactly</em> what was needed, but that wasn’t a concern here.</p>
<h3 id="flexing-those-frontend-muscles-and-writing-some-htmlcssjs">Flexing those frontend muscles and writing some HTML/CSS/JS</h3>
<p>Ultimately, the service needs to show at least one page to the user where the Plex login flow itself
is triggered, which means we need to render some HTML and provide some JS. As mentioned early on, I
wasn’t particularly happy with the lack of styling capabilities exposed by Authentik so I took my
time here to tweak things exactly how I wanted them.</p>
<p>We don’t use any sort of CSS framework, or any JS libraries, or build tooling. All of the HTML, CSS,
and JS was hand-rolled. Yes, it’s possible to write HTML, CSS, and JS that render a decent-looking
page on modern browsers, across multiple device types and screen sizes, without having to possess
otherworldly knowledge. Could I have whipped up the UI much faster if I used something like
<a href="https://tailwindcss.com/">Tailwind CSS</a>? Yes. Would my static assets be much smaller and faster to load if I
used a build pipeline like <a href="https://webpack.js.org/">Webpack</a> or <a href="https://vitejs.dev/">Vite</a> to bundle and minify and tree shake
and all of that? Yes.</p>
<p>Is what I have more than acceptable for my use case? Unquestionably “yes.”</p>
<p>Using <code>include_dir</code>, we bundle these static assets directly into the binary. They get served
normally via a dedicated path for static assets, which means they can then be trivially CDN-ified,
but ultimately, they’re just part of the binary, which is one less thing to need to faff about with
during deployment.</p>
<h3 id="putting-it-all-together">Putting it all together</h3>
<p>Now with the service written, and the various phases complete, the lifecycle of authorizing a user
looks like this:</p>
<ul>
<li>a user navigates to a resource protected by Cloudflare Access</li>
<li>the user is not authorized, so they’re redirected to the OIDC authorization endpoint</li>
<li>we initialize the authorization flow for the user, and return the landing page with sparkly
HTML/CSS</li>
<li>the user clicks a button to pop-up a login modal for Plex</li>
<li>the landing page polls the Plex API on an interval to figure out when the user has finally
authenticated with Plex</li>
<li>once they’ve been authenticated, the landing page sends the user’s Plex token back to the service
to continue the authorization flow</li>
<li>we actually authorize the user by checking if the configured Plex server is shared with them</li>
<li>if they’re authorized, we redirect back to Cloudflare Access with the relevant OIDC authorization
code</li>
<li>Cloudflare Access calls the OIDC token endpoint to exchange the authorization code for their
access/ID token and verifies that the tokens came from us</li>
<li>if all of that succeeded, they’re eventually redirected to the underlying resource, with their
Cloudflare Access JWT token (which is where <code>cloudflare-access-forwardauth</code> would take over)</li>
<li>fin!</li>
</ul>
<h2 id="mission-accomplished">Mission accomplished?</h2>
<p>Simply put: yes.</p>
<h3 id="improved-user-experience">Improved user experience</h3>
<p>The authentication/authorization flow now feels like you’re really only hitting a single page – the
landing page – and then you’re being whisked off to the protected resource.</p>
<p>In contrast, a user going through the Authentik flow would see the landing page, and then the
authentication flow page, and then the authorization flow page. Add to that the fact that the
authorization flow page sat there for a while due to the aforementioned slowness with generating
OIDC-related responses, and now things feel vastly snappier.</p>
<p>In practice, Cloudflare Access itself is always going to do some redirects – two before the user
gets to the landing page, and two after we send the user to the OIDC redirect URI – but those pages
load quickly, which is no surprise given that they live on the Cloudflare side. Perhaps most
importantly, they load/display no user-visible content, so again, the user only ever feels like they
load a single page – the landing page – and then they’re redirected and it spins ever so briefly
before the protected resource is loaded for them.</p>
<p>This is a testament to going with a hand-rolled solution to optimize for the requirements at hand.
This isn’t so much a knock against Authentik itself, but for this use case, it was entirely overkill
and proved to provide a suboptimal UX.</p>
<h3 id="improved-styling">Improved styling</h3>
<p>Not a whole lot to say here. We control all of the HTML/CSS styling now, so we can (and did) do
whatever we want with it.</p>
<h3 id="improved-memory-overhead">Improved memory overhead</h3>
<p>We were able to go from three deployed Fly apps (Authentik, Postgres, Redis) down to one app
(<code>plex-oidc</code>). We were also able to switch back to the smallest VM size – 1 shared vCPU, 256MB
memory – and <code>plex-oidc</code> idles at around 30MB RSS.</p>
<p>Our bill is now comfortably within the free tier for Fly.io, so we pay nothing to host it. Woot!</p>
<h2 id="was-it-worth-it">Was it worth it?</h2>
<p>While I definitely achieved the things I wanted to achieve, it’s still worth considering: <em>was it
worth it?</em></p>
<p>I spent close to 2-3 weeks of part-time tinkering to build <code>plex-oidc</code> and get it functioning how I
wanted. Realistically, my users log in once a week at most. When you do the math – 5-10 logins a
week, and 6-7 seconds “wasted” by using Authentik – the amount of user time saved is very small.
That fact is inescapable.</p>
<p>Having operated this service for a few weeks now, the one thing I didn’t originally consider, or at
least didn’t have top of mind, was that Authentik burned some of <em>my</em> time, in an operations sense.
Sometimes people would hit bugs with the authentication/authorization flow, or Authentik would crash
due to OOM and get into a weird state with Postgres locks or stalled Celery tasks or whatever.</p>
<p>In almost all of those cases, I just restarted the relevant Fly app, but it still required me to
disengage from whatever I was doing, mentally and physically. This was a system I pointed friends
and family at, that I just wanted to work. Even knowing that it’s not the end of the world if I
didn’t check on a problem for a few hours, it felt like a splinter in my mind.</p>
<p>All of this said, <code>plex-oidc</code> has not had a single hiccup since I deployed it. It just works. No OOM
issues, no weird issues seemingly related to shoehorning my intended authentication/authorization
flow into Authentik’s model of flows, nothing. It just works, and keeps on working. That part made
it worth it, and continues to make it worth it every day.</p>
<h2 id="but-wheres-the-source-code-bub">But where’s the source code, bub?</h2>
<p>Admittedly, I started working on this as a public repo, because <em>why not?</em>, but then I made it
private. In fact, I bundled it into my infra monorepo of sorts, since this allowed me to iterate
faster by inlining secrets and tokens directly without having to nerdsnipe myself: oh, where should
I store these secrets? how will I deploy these secrets? There’s also the aspect of the static
content being hand-rolled for my particular set-up, which means colors and background images and
hastily-designed logos specific to my set-up.</p>
<p>I’ll likely open source this at some point in the near future once I have some time to clean up the
aforementioned things, and figure out how to generify the HTML/CSS and all of that.</p>

<h1>Self-hosting chronicles: Authentik configuration</h1>
<p>2022-11-25 · Toby Lawrence</p>

<p>In my <a href="https://notes.catdad.science/2022/10/27/self-hosting-hard-way-exposing-services.html">inaugural
post</a>, I
briefly covered the shape of my new approach to exposing self-hosted applications to the public
internet in a reasonably secure way. Most of this setup depends on a piece of software called
<a href="https://goauthentik.io/">Authentik</a>, an open-source identity provider which acts as the glue
between Cloudflare Access and the actual authentication mechanisms I want to depend on.</p>
<p>Getting Authentik set up and configured the way I wanted it to be configured was by far the hardest
part of the entire process, so it felt worth writing up on the off-chance that someone running into
the same problems as I did can potentially find this post, presented as an encapsulated list of
problem/answer tuples, and get themselves sorted that much faster.</p>
<h2 id="deployment">Deployment</h2>
<p>Authentik, as an identity provider, has to run somewhere. When you think of Google or Github or Okta
as identity providers, what’s the one thing they all share in common? Their IdP endpoints are
exposed to the public internet, because you need to be able to communicate with them from your
browser or mobile device. Authentik is no different in this regard, and so I needed to run it
somewhere where I could expose it to the public internet. As laid out in the aforementioned post, I
chose <a href="https://fly.io/">Fly.io</a> for this purpose.</p>
<p>Fly.io is a … something between an infrastructure-as-a-service provider and platform-as-a-service
provider, depending on how hard you squint. They have the trappings of a PaaS provider, where their
<code>flyctl</code> CLI tool, and a simple <code>fly.toml</code> configuration file, is essentially all you need to start
creating and deploying applications.</p>
<p>Despite this, I still hit some roadblocks that were a culmination of how Authentik expects things
to be done, and how Fly.io works.</p>
<h3 id="writing-to-standard-error-or-not">Writing to standard error, or not</h3>
<p>The first problem I encountered was that <code>/dev/stderr</code> does not seem to be available on Fly.io.</p>
<p>The <a href="https://github.com/goauthentik/authentik/blob/main/lifecycle/ak">“lifecycle” script</a> used by
Authentik has a simple logging function which pipes its output to <code>/dev/stderr</code>. I forget the exact
error message that I got… maybe “file doesn’t exist” or maybe “permission denied”, but I got an
error, consistently.</p>
<p>My original searching led me to
<a href="https://community.fly.io/t/redirect-logs-to-stdout/514">this forum post on community.fly.io</a>,
which made me originally think that this was potentially a bug with both stdout and stderr, and
perhaps it hadn’t yet been fixed for stderr? I had also stumbled onto this
<a href="https://unix.stackexchange.com/questions/38538/bash-dev-stderr-permission-denied">Unix StackExchange question</a>
where one of the answers suggests just using the <code>1>&2</code> style of redirection instead of the stderr
<code>/dev</code> entry itself.</p>
<p>Admittedly, I never investigated this more holistically and instead took the simpler approach of
just not having output redirected to <code>/dev/stderr</code>:</p>
<pre><code class="language-dockerfile">FROM ghcr.io/goauthentik/server:2022.9.0
USER root
RUN sed -i -e 's# > /dev/stderr##g' /lifecycle/ak # <-- This lil' guy right here.
USER authentik
ENTRYPOINT ["/lifecycle/ak"]
CMD ["server"]
</code></pre>
<h3 id="limitations-with-upstash-redis">Limitations with Upstash Redis</h3>
<p>The next issue I encountered, once Authentik could actually start, was that the managed Redis
service (<a href="https://docs.upstash.com/redis">Upstash Redis</a>) doesn’t support multiple Redis databases.</p>
<p>Authentik uses Redis for two purposes: caching within Django and Celery. Celery is a Python library
for distributed task processing, where you can enqueue tasks to be run by a pool of Python worker
processes, and those workers pick up the tasks and run them and report the status and so on. Celery
has a concept of “backends” which are the systems that actually get used for the transport of task
messages, of which Redis is a supported backend.</p>
<p>Authentik takes advantage of already wanting Redis for Django caching and configures Celery to use
it, too. The problem is that Authentik points itself, and Celery, at different Redis “databases” in
an attempt to isolate the data between the two use cases. Redis databases are a logical construct
for isolating the keyspace, so that keys don’t overlap and clobber each other, and so on.</p>
<p>Upstash Redis, which exists by itself but is offered as a managed service on Fly.io, doesn’t support
multiple Redis databases, instead only supporting one database, the default database, at index 0.
Luckily, Authentik already exposed the ability to change the database selection as part of its
configuration, which was simply achieved by setting the following environment variables:
<pre><code>AUTHENTIK_REDIS__CACHE_DB = "0"
AUTHENTIK_REDIS__MESSAGE_QUEUE_DB = "0"
AUTHENTIK_REDIS__WS_DB = "0"
</code></pre>
<p>This shoves all usages of Redis – both Authentik itself and Celery – onto the same Redis database.
So far, I’ve yet to experience any issues with doing so, and even in a brief conversation with the
creator of Authentik, they weren’t necessarily sure whether Authentik actually needed to isolate
the data like this any longer.</p>
<p>Admittedly, I ended up spinning up my own Redis container/application on the side because using
Upstash Redis kept leading to weird Redis issues from Celery’s perspective. There was originally a
bug with how Upstash Redis implemented <code>MULTI</code>/<code>EXEC</code> for certain commands that was definitively
wrong (compared to the official Redis behavior) which Upstash fixed a few weeks after I reproduced
and reported the behavior… but even after it was fixed, somehow, their service still acted weird
when used by Authentik/Celery.</p>
<p>I didn’t have the time or energy to keep debugging it, so I spun up my own Redis
container/application. Maybe there were more bugs with Upstash Redis that are now fixed, or maybe I
was doing something wrong with my particular configuration… who knows.</p>
<h2 id="configuration">Configuration</h2>
<p>At this point, I at least had Authentik deployed and running. I won’t go into all aspects of
configuring it prior to the first successful deployment, because the rest of it is fairly
mundane and covered by their documentation.</p>
<p>The real bugaboo, however, was actually configuring Authentik in terms of its behavior.</p>
<h3 id="how-authentik-works">How Authentik works</h3>
<p>Authentik has a very particular viewpoint/approach in terms of how it operates. This is not to say
that the approach Authentik takes is bad, just that deviating from it can often leave you
frustrated.</p>
<p>To start, Authentik has an out-of-the-box experience that configures a number of stages, policies,
and flows to give you a setup that is close to how most people might be expected to use it. You can
and <em>should</em> read the documentation, but Authentik uses the concept of <em>flows</em> and <em>stages</em> as a way
to describe the authentication/authorization journey a user takes. Configuring these flows/stages is
how you configure how Authentik works: how users authenticate (username/password? federated login?),
whether or not they’re authorized to access a particular application, enforcement of password
strength or two-factor measures, and so on.</p>
<p>To achieve this, Authentik supports a feature called “blueprints.” These are YAML files that can be
processed to programmatically create flows, stages, policies, bindings, and most all of the various
model objects that are used to configure Authentik’s behavior. These YAML files are essentially data
mocks on steroids: you define the model objects themselves, and are given access to helper logic in
the form of YAML “tags”… such as setting the value of a primary key field to <code>!KeyOf ...</code> to have
the blueprint importer search for another blueprint-managed object by a logical identifier and
insert the actual primary key for you.</p>
<h3 id="using-blueprints">Using blueprints</h3>
<p>As Authentik is an identity provider, many users do in fact use it as the source of
authentication/authorization itself… as in, users are registered in Authentik as the source of
truth, and everything flows outwards from Authentik. The out-of-the-box experience works well for
this, and in many instances, I’ve seen users be told to simply tweak the out-of-the-box flows/stages
to achieve their desired outcome.</p>
<p>As part of my deployment, though, Authentik was simply glue logic between an existing authn/authz
mechanism – Plex’s own identity provider – and Cloudflare Access. I didn’t want local users
created, and I didn’t care about specific applications, and I certainly didn’t want Authentik to
proxy any access or anything like that. I just wanted an identity provider.</p>
<p>Most importantly, I wanted to use blueprints, because they seemed to be the only way (a good way, to
be fair!) to idempotently configure Authentik.</p>
<h3 id="blueprint-pitfalls">Blueprint pitfalls</h3>
<p>As I started to look into configuring Authentik entirely via blueprints, I hit numerous pitfalls and
struggled often with finding a consistent source of documentation and answers to mystifying
behavior. I’ll list these out in no particular order.</p>
<h4 id="blueprint-file-structure-is-documented-with-varying-levels-of-specificity">Blueprint file structure is documented with varying levels of specificity</h4>
<p>Looking at the existing blueprint files can be a useful hands-on example of how to write your own,
but this leaves a lot to be desired. As blueprints are essentially model object mappings, you’re sort
of writing your very own <code>INSERT INTO table (field, field, ...) VALUES (...)</code> but in YAML. This
means you actually need to go and look at the model definition for objects you want to add, or look
at the source code.</p>
<p>There’s no documentation for the models, either by themselves or in the context of blueprint
development. This means you have to become intimately familiar with the models if you’re trying to
create something via blueprints that doesn’t already have representation in the out-of-the-box
blueprints.</p>
<p>Beyond that, there’s a lot of uncertainty around parts of the blueprint definition, such as
specifying identifiers. Identifiers are used to provide a unique identifier (duh) for a
given model object… which is fine, and makes sense. No issue there. Again, however, there’s little
to no documentation on the models, so if you don’t specify the right identifier fields, the
blueprint importer just yells at you.</p>
<p>The experience is poor, to say the least.</p>
<h4 id="blueprints-have-no-ordering-constraints">Blueprints have no ordering constraints</h4>
<p>Authentik supports what they call a “meta model”: since blueprints are tied almost exclusively to
actual models, they have a concept of “meta models” which allow operations other than creating a
model object. The only meta model currently is “apply blueprint.”</p>
<p>If you’re like me, you might read that and think “oh nice, dependency management!”. Or maybe
something along those lines. It’s certainly what I thought, given that meta models are described as:</p>
<blockquote>
<p>This meta model can be used to apply another blueprint instance within a blueprint instance. This
allows for dependency management and ensuring related objects are created.</p>
</blockquote>
<p>Alas: <em>nope</em>.</p>
<p>The meta blueprint apply operation certainly can import and apply other blueprints, but it <em>cannot</em>
use them in any sort of way that would actually qualify as dependency management. There’s no
ordering (apply blueprint first, and then create the actual model objects described further down in
the blueprint, etc) and there’s no logical inclusion, which means you can’t even reference model
objects created in meta-applied blueprints (i.e. using the special YAML “tags”) because it doesn’t
actually load the blueprint in an inclusive way, or dependably apply it first.</p>
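<p>For reference, the meta-apply entry is just another entry in a blueprint – something along these
lines (hedging here: the exact model path and attribute names are from my reading of the meta model,
so double-check them against your Authentik version):</p>

```yaml
entries:
  # "Apply blueprint" meta model: triggers another blueprint, identified by
  # name, but does not let you reference its objects from this file.
  - model: authentik_blueprints.metaapplyblueprint
    attrs:
      identifiers:
        name: my-other-blueprint
      required: true
```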
<p>That’s not to say any of this is easy to program up, but, I mean… the documentation <em>says</em> the
words “dependency management”, and there is no hint whatsoever of anything that could reasonably be
considered dependency management. All it lets you do is prevent one blueprint from running until
the meta-applied blueprints have been run successfully… so you get to wait multiple hours for
blueprint runs to finally make enough progress to cover all of the blueprint dependencies. Not great
at all.</p>
<p>I currently use a single blueprint file to reliably construct all of the necessary model objects and
be able to correctly reference them in the creation of subsequent model objects.</p>
<h3 id="actual-flow-behavior-and-figuring-out-what-matters">Actual flow behavior and figuring out what matters</h3>
<p>As I was trying to write blueprints for repeatable, idempotent configuration, I also still had to
figure out what I even needed to configure to achieve my desired setup.</p>
<p>Remember that Authentik uses a concept of flows and stages, where flows are tied to specific phases
of the journey of a user through an identity provider, such as identification, authorization,
enrollment, and so on. Stages are merely the next step down, and allow configuring the actual steps
taken within each of those flows.</p>
<p>This is where things mostly got/felt wonky because, again, the documentation has oscillating
specificity. If you go to the documentation section on “Flows”, the landing page has some decent
high-level information on flows and the various available flows, but it has more content by volume
on the various ways you can configure a flow’s web UI (flows are also weirdly intertwined with the
context they’re used in, i.e. web vs. “headless” (LDAP)), so you really have to spelunk here.</p>
<h4 id="a-lot-of-boilerplate-to-satisfy-various-flow-constraints">A lot of boilerplate to satisfy various flow constraints</h4>
<p>As an example of where this all gets wonky and cargo cult-y, let’s consider the authentication flow.</p>
<p>Authentik itself is always the main entrypoint, even if like me, you end up totally depending on a
federated social login setup a la Google/Plex/etc. This means we need an authentication flow, which
is all fine and good. We can configure an identification stage in that flow that specifies how a
user should identify themselves. There’s already a fairly intuitive bit of verbiage on the
configuration modal for an identification stage around showing buttons for specific identification
sources, such as these federated social login options. Think of it like the typical “Sign In with
Google” buttons you may have seen before.</p>
<p>“Great”, you think. You dutifully select the federated social login source you want to use, and
unselect the username/email options because we don’t want to log in as local users. You try to use it
and… weird, it doesn’t work. As it turns out, you actually need to specify an authentication flow for
the federated social login source itself. Additionally, you also need an enrollment flow for your
federated social login source.</p>
<p>This is ultimately because you need to log in to Authentik itself (remember the part about Authentik
being the “entrypoint”), which means creating a local user in Authentik even if you’re
farming out that aspect to something like Google… which necessitates the separate authentication
and enrollment flows.</p>
<p>This isn’t called out anywhere in the documentation that I have been able to find. Pure trial and
error here.</p>
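<p>In blueprint terms, that means the source object itself needs its <code>authentication_flow</code> and
<code>enrollment_flow</code> fields pointed at those source-specific flows. A hypothetical fragment (the
slugs and logical IDs are mine, not anything standard):</p>

```yaml
- model: authentik_sources_plex.plexsource
  identifiers:
    slug: plex
  attrs:
    name: Plex
    # Without these two flows, the "Sign in with Plex" button shows up
    # but logging in through the source doesn't actually work.
    authentication_flow: !KeyOf source-authentication-flow
    enrollment_flow: !KeyOf source-enrollment-flow
```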
<h4 id="making-everything-a-flow-leads-to-a-suboptimal-user-experience">Making everything a flow leads to a suboptimal user experience</h4>
<p>Despite the above, I eventually managed to get my intended flow working. Handing it over to a
user to test, though, yielded the following: “my browser refreshed a lot of times and seemed like
it didn’t actually work, and then I eventually landed on the application”.</p>
<p>What the user was describing was an idiosyncrasy of everything – including sources – needing a
flow, and how flows are executed.</p>
<p>Since our federated social login source needed source-specific flows – two of them – this actually
meant that the web-based flow redirects them three to four times in total:</p>
<ul>
<li>they land at the Authentik “identify yourself” page where they have to manually click on the
button for authenticating via Plex</li>
<li>a modal pops up for Plex login, etc, and closes once they do so</li>
<li>the page then redirects to the source authentication flow which logs them in</li>
<li>if they’re brand new, they’re also redirected to the source enrollment flow first</li>
<li>then they’re redirected to the source authorization flow (which is a no-op, in my case) which
sits around for 2-3 seconds inefficiently calculating JWTs or something</li>
<li>finally, they’re redirected back to the SP initiator (the entity that triggered the IdP flow in the
first place) and can actually use the thing</li>
</ul>
<p>Some of this could be solved by allowing for short circuiting behavior around default options, such
as not requiring the user to pick an identification source if there’s only one available… or
not redirecting to a specific flow if it’s literally a no-op. There’s an
<a href="https://github.com/goauthentik/authentik/pull/3696">open PR</a> for the former, albeit stalled on
getting pushed through.</p>
<p>These are paper cuts that make using Authentik clunky from the user perspective. I’m not happy
having to spend so much time actually configuring flows to do what I want, but at least if I can
turn that into a consistent and intuitive experience for users, then the effort was worth it. When a
user is getting redirected multiple times, or hitting pages that are so slow to execute that they
wonder if they’re stuck… then the experience is not consistent or intuitive.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I intended to write this post in a very problem/answer-oriented way, and I managed that in some
regard, but clearly I diverged a bit because recounting the sheer effort I expended to get things
configured made it hard to not vent a bit.</p>
<p>Hopefully for anyone reading this, it gives you some additional insight to get Authentik configured
with less effort than I personally spent.</p>
<p>As for me, I do intend to eventually contribute back, or try to contribute back, improvements to the
sharp edges that chewed up so much of my time. I do think there’s something to be said, though,
about focusing on the user experience aspect where possible, especially when it comes to security.
My users can’t get frustrated and actually, like… do anything less secure as a workaround to avoid
redirect hell: they use this thing I’ve configured and provided, or they don’t get access at all.
For other people deploying Authentik, though? Who knows what their use case is and if these
shortcomings might undermine their ability to provide an alternative authn/authz solution to their
users that doesn’t feel more clunky to use than whatever they came from.</p>Toby LawrenceIn my inaugural post, I briefly covered the shape of my new approach to exposing self-hosted applications to the public internet in a reasonably secure way. Most of this set up depends on a piece of software called Authentik, an open-source identity provider which acts as the glue between Cloudflare Access and the actual authentication mechanisms I want to depend on.Self-hosting the hard way: securely exposing services on the World Wide Web2022-10-27T00:00:00+00:002022-10-27T00:00:00+00:00https://notes.catdad.science/2022/10/27/self-hosting-hard-way-exposing-services<h2 id="prelude">Prelude</h2>
<p>Over the past few years, there’s been somewhat of a renaissance around self-hosting services, such
as media storage and e-mail. Whatever the reasons may be – financial, technical, moral, etc –
there’s an appetite to host services on one’s own infrastructure, under one’s control. I myself host a
number of services on my own infrastructure, complete with a cheeky location code for my basement
“datacenter”. However, most people’s lives don’t exist in a bubble that ends exactly where their wifi
network begins. Being able to use these services on the go is just as important as using them at
home where they might be hosted.</p>
<p><strong>tl;dr: Cloudflare Tunnel for avoiding having to directly expose my home infrastructure, Authentik
running on Fly.io for exposing an externally-accessible Identity Provider, and Cloudflare Access
using Authentik as an IdP for authenticating/authorizing all requests before they ever hit my
network</strong></p>
<h2 id="why-not-tailscale">Why not Tailscale?</h2>
<p>Now, I’m a tech person, and you’re probably a tech person too if you’re reading this blog, and
you’ve potentially already muttered at the screen, bewildered: “why not just use Tailscale?”.
<a href="https://tailscale.com/">Tailscale</a>, in a nutshell, provides point-to-point encrypted connections to
allow accessing remote devices as if you were on the same local network, but from anywhere. It has
clients for all major desktop and mobile platforms and using it is an absolute breeze. Tailscale is
already part of my homelab ecosystem, but it has one particular limitation: it’s not good for
sharing <em>arbitrary</em> access to self-hosted services. Sure, the client support is there, but asking
friends and family to install Tailscale on all the devices they might want to use, and then managing
that access … that’s not a task I want to take on.</p>
<h2 id="figuring-our-requirements">Figuring out our requirements</h2>
<p>With all of this in mind, I set out to find a solution to the problem as I saw it, and came up with
a list of requirements:</p>
<ol>
<li>Services must be accessible from “anywhere”</li>
<li>New software should not be required (i.e. you shouldn’t need a VPN to access the equivalent of a
hosted GMail… just your browser)</li>
<li>Should not be directly exposed to the internet (no port forwarding)</li>
<li>All requests must be authenticated/authorized <em>before</em> traffic ever reaches anything in our
infrastructure</li>
<li>Should be free (no software/hosting cost, ideally)</li>
</ol>
<p>Over the course of a few weeks, I did a bunch of research, scouring the likes of the
<a href="https://www.reddit.com/r/selfhosted/">r/selfhosted</a> subreddit, popular forums for home hosting like
<a href="https://forums.servethehome.com/index.php">ServeTheHome</a>, and general Google spelunking. After a
lot of tinkering and toiling, I finally came up with a solution that checks off all the boxes, which
I’ll go through below.</p>
<h2 id="wiring-up-our-services-to-the-internet">Wiring up our services to the internet</h2>
<p>Cloudflare has a large catalog of services they offer, but one of the more intriguing ones for
self-hosters is their “Tunnel” service.
<a href="https://www.cloudflare.com/products/tunnel/">Cloudflare Tunnel</a> (née <em>Argo Tunnel</em>) provides a way
to run a small daemon within your network that establishes a reverse connection <em>back</em> to
Cloudflare’s POPs, where in turn you can expose applications inside your network across that tunnel.
Similar to configuring, say, nginx or Apache, you point <code>cloudflared</code> at an upstream target (or
targets) to send traffic to (like <code>localhost:8080</code> or <code>svc.mylocal.network:9000</code>) and then
configure, on the Cloudflare side, what public hostnames to expose those services at. When traffic
hits Cloudflare’s edge, it gets sent across the established tunnel, and ultimately lands at your
service.</p>
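<p>If you manage the tunnel with a local configuration file instead of the Zero Trust dashboard,
the <code>cloudflared</code> side looks roughly like this – the tunnel name, hostname, and upstream
are placeholders, not anything from my actual setup:</p>

```yaml
# /etc/cloudflared/config.yml (sketch)
tunnel: example-tunnel
credentials-file: /etc/cloudflared/example-tunnel.json
ingress:
  # Route this public hostname across the tunnel to an internal upstream.
  - hostname: app.vanitydomain.com
    service: http://svc.mylocal.network:9000
  # Catch-all rule, required as the last ingress entry.
  - service: http_status:404
```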
<p>This is really, really cool and helps us check off a lot of boxes, at least partially:</p>
<ul>
<li>Our service is <em>not</em> exposed directly to the internet. Attackers could still exploit RCEs in our
service or stuff like that, but we don’t have to loosen our firewall rules one bit.</li>
<li>Our service is, for all intents and purposes, accessible to anyone with an internet connection.
We’re not concerned with impediments such as country-level firewalls, DNS blackholing, corporate
proxies, etc.</li>
<li>Cloudflare Tunnel is <em>free</em>. All of the other bits – a basic Cloudflare account, hosting DNS for
a domain, protecting subdomains with TLS, and so on – are also free.</li>
</ul>
<blockquote>
<p>Admittedly, I started looking at Cloudflare Tunnel before Cloudflare’s absolute fumbling with the
whole Kiwi Farms situation. In no uncertain terms: fuck Kiwi Farms and all of the dipshits who
gave it life.</p>
<p>If there was another company providing a service equivalent to Cloudflare Tunnel, I’d use it.
Until then, I’m not actually giving Cloudflare money, and I’m mostly trying to provide a service
to friends and family in a secure way.</p>
<p>If you know of an equivalent offering, I’m all ears!</p>
</blockquote>
<p>I repurposed my existing Cloudflare account to host the DNS for my quirky self-hosting domain (this
one!). The setup instructions for Cloudflare Tunnel and <code>cloudflared</code>, the daemon that runs in your
infrastructure, are short but straightforward. I spun up a simple “Hello world!” app on one of my
servers, and ran <code>cloudflared</code> via Docker. After a small bit of configuration in the Zero Trust
dashboard to create a new subdomain and associate it with a target on my side of the tunnel, we were
serving traffic… except I had misspelled “world” as “wordl”, so now my low-grade dyslexia was
publicly – albeit <em>securely</em> – on display to the whole… “wordl.”</p>
<p>All told, this proof of concept took less than 30 minutes. Honestly, it felt a lot like magic. I was
on a roll at this point.</p>
<p>Onward!</p>
<h2 id="we-need-a-bouncer-at-the-entrance">We need a bouncer at the entrance</h2>
<p>While having the services safely exposed to the internet was half the battle, I still needed an
answer to the problem of authentication/authorization. As alluded to above, I was trying to
imagine what the chinks in the armor might be for a set-up like this, and naturally, software
vulnerabilities came to mind. Even if a service is fronted by Cloudflare, an attacker can still get
requests to the service. I write software for a living, and I know just how much code is out there,
waiting for someone to walk by and notice how trivially it can be eviscerated.</p>
<p>I needed a way to actually protect the service before traffic was allowed through, by authenticating
and authorizing users. I considered the security of the applications themselves as out of scope
here:</p>
<ul>
<li>generally, I trust the people to whom I’ve given/will give access</li>
<li>some of these applications have a <code>node_modules</code> folder that would make a security researcher
either salivate or run away screaming… so I chose to value my sanity and avoid delving too
deep on the application side</li>
</ul>
<p>With that said, again, I tried to be regimented and came up with a list of requirements/invariants
for doing authentication/authorization:</p>
<ul>
<li>all requests must be authenticated/authorized <em>before</em> traffic reaches my home infrastructure, so
<em>we</em> can’t host any part of an authn/authz solution on said home infrastructure</li>
<li>we want to let users authenticate with identities they already have, otherwise we’re back to the
Tailscale problem, where we’re constantly managing not only access to be on the same tailnet, but
to the self-hosted applications themselves (I don’t want a ‘local network has admin’ authorization
scheme)</li>
</ul>
<p>Admittedly, I’m writing this in a different order than how I approached finding a solution because
it felt a lot like having no answer at all until all of the pieces came together. With that said,
what I ultimately landed on was based on another Cloudflare product:
<a href="https://www.cloudflare.com/products/zero-trust/access/">Cloudflare Access</a>.</p>
<p>As mentioned above, Cloudflare Tunnel is part of Cloudflare’s overall “Zero Trust” product suite,
which is their set of offerings for small businesses and enterprises to do the whole zero trust
thing: trusted devices with device posture management, being able to access internal resources by
currying favor granted to their trusted devices, and bringing all different manner of services into
the fold in a generic way, whether they’re self-hosted (Cloudflare Tunnel) or external, by providing
authorization at the Cloudflare edge (Cloudflare Access). Their product people would probably snort
here at my crude explanation of the Zero Trust product suite, but suffice it to say that it provides all
of the building blocks we need to build this solution.</p>
<p>Anyways, Cloudflare Access provides the authorization aspect by allowing you to specify identity
providers that users can authenticate to, which then allow you to authorize them with simple
policies on the Cloudflare side, and Cloudflare handles managing the cookies/tokens/proxying of
traffic to your services. In particular, it can sit in front of a service exposed by Cloudflare
Tunnel without the two CF products having to be configured to know about each other. As long as both
the tunnel and the Access configuration are set to handle the same specific domain, it just seems to
work… the authentication/authorization happens first, and then it continues on with tunneling the
traffic.</p>
<p>Honestly, a lot like magic.</p>
<h2 id="whos-on-the-guest-list">Who’s on the guest list?</h2>
<p>I knew now that Cloudflare Access was the answer to how to handle authorizing all requests before
they made it to my home infrastructure, but what I didn’t know yet was <em>how</em> to authenticate users.
I also didn’t know how to authenticate them off-site, to avoid the catch-22 of traffic needing to
hit my home infrastructure first.</p>
<p>Some more spelunking later, I uncovered a pair of names that kept showing up while sifting <strong>r/selfhosted</strong>:
<a href="https://goauthentik.io/">Authentik</a> and <a href="https://www.authelia.com/">Authelia</a>. These projects, and
others like them, are essentially “build your own identity provider” solutions. They allow you to
handle all the common aspects of running an identity provider: creating/managing users and groups,
importing users from other identity providers, doing authorization, and so on. I ultimately chose
Authentik because one of the services I run is Plex, and Authentik has support for using Plex as an
identity provider itself, which meant for services where people were already accessing my Plex
content, they could authenticate using the same identity/credentials. Further, Plex provides an API
to distinguish if a user authenticated through their identity provider has access to a specific Plex
server, which meant I could essentially get authorization for free in some cases. If a user should
only get access to an application because it’s related to their access to my Plex server, then
authentication and authorization through Plex becomes one and the same.</p>
<p>Configuring Authentik was fairly painful: it took me a while to pore over the documentation, figure
out how to create my own identity provider, configure all of the various flows/stages correctly,
wire it up to Cloudflare Access, and test it. Documenting all of the steps I took and the
configuration I landed on would be too big for this post, but is something I’m looking to post about
more in the future… ideally with an opinionated configuration to help others start from. In the
interest of brevity, here’s how I configured Authentik:</p>
<ul>
<li>Authentik gets configured to use Plex as a federated authentication source, further constrained to
users that have access to my Plex server</li>
<li>Authentik is configured to create shadow user accounts locally to mirror the users in Plex, and
assigns them to a specific group (not required, but for my own sanity)</li>
<li>Authentik exposes a dedicated OAuth2/OpenID Connect endpoint that uses this Plex-based federated
authentication for its own authentication flow, and authorization is a no-op on the Authentik side
since we get it implicitly based on how the Plex authentication works</li>
<li>Authentik is configured to send the user’s Plex token in the OIDC claims so that we can pass it to
the underlying services being protected</li>
</ul>
<p>Cloudflare Access, in addition to the cookies it uses itself for ensuring users are authenticated
before passing the traffic through, will also send a cookie/header with a
<a href="https://jwt.io/">JSON Web Token</a> of the user’s information. You get the common stuff from the
common OpenID scopes – <code>username</code>, <code>email</code>, yadda yadda – but you can also shove in custom fields
from the OIDC claims – which you can customize on the Authentik side – into the JWT. This means
not only can we validate the JWT on our side to make sure Cloudflare was really involved, but that
we can shuttle along custom data – like the authenticated user’s Plex token, which we need for
forward authentication – in the JWT as well.</p>
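<p>To give a feel for what that looks like on the application side, here’s a minimal Python sketch
that pulls custom claims out of the JWT. The claim names (<code>plex_token</code>, etc) are placeholders
– they depend on how you configure the claims on the Authentik side – and a real implementation must
first verify the token’s signature against Cloudflare’s published public keys (the
<code>/cdn-cgi/access/certs</code> endpoint for your team domain), which is elided here:</p>

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the claims segment of a JWT *without* verifying it.

    In production, verify the signature against Cloudflare's public keys
    first; this only demonstrates where the custom claims live.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


# Build a toy token to demonstrate; the header and signature are fake.
token = ".".join([
    _b64url(b'{"alg":"RS256","typ":"JWT"}'),
    _b64url(json.dumps({"email": "user@example.com", "plex_token": "abc123"}).encode()),
    "fakesig",
])

claims = jwt_claims(token)
print(claims["plex_token"])  # → abc123
```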
<h2 id="keeping-the-bouncer-protected-too">Keeping the bouncer protected, too</h2>
<p>At this point, I managed to figure out how to protect my home infrastructure when used as an origin
server, as well as how to expose an authentication/authorization mechanism and have Cloudflare
protect our resources with it, but we still had the problem that our identity provider itself needed
to be publicly accessible in order for Cloudflare Access to reach it. Even if Cloudflare was also
fronting Authentik, we still had the same potential issue: what if Authentik has a vulnerability
that can be exploited prior to getting through the authentication/authorization flow?</p>
<p>I solved this by simply… running it on external infrastructure. I had been wanting to use
<a href="https://fly.io">Fly.io</a> for a while – I know one of the engineers there, it’s a cool product,
their blog posts are great, and they’re providing a ton of value to customers – and this struck me
as the perfect opportunity. Since we don’t generally need any advanced/complex redundancy or
resiliency – Plex is our primary authentication/authorization source, so all we need to do is have
a repeatable way to configure Authentik to use it – we could afford to run Authentik in a stripped-down
way on lower-end hardware. Fly.io was a great fit for this.</p>
<p>Right off the bat, Fly.io lets you run applications at the edge so long as you can bundle them up
into a Docker image. Authentik already had a Docker image, so that’s a solved problem. We also
needed to provide Authentik a database and cache: technically, I didn’t really care about long-term
storage of anything since we could suffice with in-memory storage of OAuth2 tokens, etc… but
Authentik is more general-purpose than that and <em>requires</em> Postgres and Redis.</p>
<p>Luckily, Fly.io has been launching managed services, either by providing the management tooling
themselves (Postgres) or acting as a marketplace for third-party providers to run their managed
services on (Redis, specifically Upstash Redis). Oh yeah, and Fly.io has a generous free tier for
not only deployed applications, but also for these managed services. <em>Nice.</em></p>
<p>I configured the requisite Postgres and Redis services on Fly.io, then crafted a Dockerfile (and
ultimately a <code>fly.toml</code> deployment configuration) for Authentik. Authentik has a feature they call
“blueprints” which allows defining its configuration (some parts of it, at least) as YAML files that
can be loaded at start… which I didn’t take advantage of initially but have been working to switch
over to. Blueprints are the starting point of “how do I reconfigure this if my Fly.io account
explodes somehow?”, or if I want to migrate all of this to another cloud provider.</p>
<p>After manually configuring Authentik to do all of the OAuth2/OpenID and Plex federated
authentication bits, I had one last step which was to configure some vanity DNS to stick Authentik
behind, and then a small configuration on the Cloudflare Access side to point to it. Much elided
tweaking and futzing and hair pulling later, I had Cloudflare Access using my isolated Authentik
deployment to authenticate and authorize users, before sending traffic over Cloudflare Tunnel to an
application hosted on my home infrastructure… and it was all free!</p>
<blockquote>
<p>I ended up purchasing some credits from Fly.io because their service is great and I wanted to spin
up some extra application deployments with resources beyond what the free tier provides.
Authentik also ended up using a lot of memory when running its Django-based migrations, which
would cause OOMs.</p>
<p>Ultimately, it should cost me like $5-7/month for my upsized VMs, but I’m still passively looking
into possible performance optimizations that could be contributed upstream to Authentik that would
allow dropping back down to the free tier VM sizing.</p>
</blockquote>
<h2 id="lets-pretend-for-just-a-moment">Let’s pretend for just a moment</h2>
<p>Much ink was spilled above, so let’s briefly recap the steps and setup we undertook here:</p>
<ul>
<li>I have an existing Plex server, shared with friends and family, which acts as the authentication
(and authorization) provider. Plex already has mechanisms to share itself (network-wise) outside
of what’s described in this blog post, but I was exposing another application that needs Plex
credentials.</li>
<li>I have an application inside my network, which is accessible on-network at
<code>http://app.cluster.local:5000</code>, which I want to expose externally at
<code>https://app.vanitydomain.com</code>.</li>
<li>I created a Cloudflare account and set it up to host DNS for <code>vanitydomain.com</code>.</li>
<li>I set up Cloudflare Tunnel to proxy traffic from <code>https://app.vanitydomain.com</code> to
<code>http://app.cluster.local:5000</code>, and deployed <code>cloudflared</code> internally to support that.</li>
<li>I created a Fly.io account and deployed Authentik, using Fly.io’s Managed Postgres and Managed
Redis (Upstash Redis) services, which I then put behind a vanity subdomain of
<code>https://app.auth.vanitydomain.com</code>.</li>
<li>I configured Authentik to use Plex as a federated authentication (and de-facto authorization)
source by allowing users to authenticate to Plex (Plex as in <code>plex.tv</code>, not my Plex server
specifically) which then provides de-facto authorization as it only allows authenticated Plex
users who also have access to my Plex server.</li>
<li>I configured Authentik to expose an OAuth2/OpenID Connect endpoint, which ultimately uses the
federated Plex authentication source as a passthrough, and additionally forwards data like their
user groups, Plex token, etc, in the OIDC claims.</li>
<li>I configured Cloudflare Access with a new authentication source that was pointed at our
Authentik-based OAuth2/OpenID IdP, located at <code>https://app.auth.vanitydomain.com</code>, with a no-op
authorization policy, as Authentik handles that for us.</li>
<li>I configured Cloudflare Access to expose/protect an application located at
<code>https://app.vanitydomain.com</code> using the authentication source we just configured.</li>
</ul>
<p>As a user, all they have to do is navigate to <code>https://app.vanitydomain.com</code>. When they do that for
the first time, Cloudflare Access sees that they have no existing cookie marking them as
authenticated, and so they enter the authentication and authorization flow. Cloudflare Access
redirects to an account-specific Cloudflare Access service provider endpoint (part of the
OAuth2/OpenID flow) which then sends them to <code>https://app.auth.vanitydomain.com</code> where they
authenticate with Plex. Once the Plex authentication happens, and things look good, they’re
redirected back to the account-specific Cloudflare Access service provider endpoint, which now does
the “authorization” and if that’s successful, sends them to <code>https://app.vanitydomain.com</code>, but with
a Cloudflare Access-specific URI path. This specific path is handled by Cloudflare, but crucially,
provides a secure mechanism for cookies to be set on our application domain, by Cloudflare, on our
behalf. (Neat!) Finally, the user is redirected to the original resource, now with their
authentication cookies sent along with the request, which lets Cloudflare Access know they’re
authenticated and authorized to access the given resource. Cloudflare Tunnel takes over from
here, routing the request over the tunnel, with the cookie/header from the Cloudflare Access side so
our application can access any of the authentication/authorization data that’s relevant, and finally
the user is interacting with the application.</p>
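<p>On the application side of that last hop, Cloudflare Access forwards the user’s identity to the
origin as a JWT in the <code>Cf-Access-Jwt-Assertion</code> header. A small sketch of pulling the
claims out of that token — this only <em>decodes</em> the payload for illustration; a real
deployment must verify the token’s signature against the public keys your Cloudflare Access team
publishes before trusting anything in it:</p>

```python
import base64
import json


def decode_access_claims(jwt: str) -> dict:
    """Decode the payload of a Cloudflare Access JWT (Cf-Access-Jwt-Assertion).

    WARNING: decode-only sketch. A real app must verify the signature against
    the keys served at https://<team>.cloudflareaccess.com/cdn-cgi/access/certs
    before trusting any claim in the token.
    """
    # A JWT is three base64url segments: header.payload.signature.
    payload_b64 = jwt.split(".")[1]
    # JWT segments are unpadded base64url; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

<p>With the claims in hand, the application can key per-user behavior off the identity (e.g. an
email claim) without ever implementing a login flow of its own.</p>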
<h2 id="what-weve-learnedaccomplished">What we’ve learned/accomplished</h2>
<p>We set out to expose an application running on our home infrastructure without having to necessarily
expose it directly to the internet, and without messing with routers and firewalls. We also set out
to only allow authorized users to access said application, based on federated
authentication/authorization that these users had already onboarded with and that we ultimately
controlled.</p>
<p>We were able to do all of this without <em>any</em> traffic ever hitting our home infrastructure before the
user/request was authenticated and authorized, and without needing to host the
authentication/authorization endpoints on the very same home infrastructure, such that Cloudflare
and Fly.io bear almost all of the weight of any potential DDoS/intrusion attempts. If a bug in
Authentik were found and exploited, the authentication/authorization flow could be compromised,
potentially granting access to the protected application or our home infrastructure… but now we
can focus our time and energy on securing the Authentik deployment rather than also having to
harden every single application we want to expose.</p>
<p>Finally, but just as importantly: we were able to do this all for free*, with primarily open-source
software that can be examined and replaced if need be, save for the Tunnel/Access magic provided by
Cloudflare.</p>
<p>All in all, this isn’t the most convoluted infrastructure I’ve ever spun up, but it’s sure as
heck one of the more useful.</p>
<p>More to come.</p>