<h1>Self-hosting chronicles: plex-oidc, or just writing everything yourself</h1>
<p>2023-03-02 · Toby Lawrence</p>

<p>When I originally designed my home network, in conjunction with allowing outside access to some
internal apps, <a href="https://goauthentik.io">Authentik</a> was the lynchpin of providing the authentication/authorization
part of the equation. It’s configurable, supports forwarding to Plex as an authentication and
pseudo-authorization provider, and it’s open source. It was a solid choice to build things the way I
wanted, knowing that I could at least fork it if I needed to tweak its behavior, or if it went
closed source, and so on.</p>
<p>While I got Authentik configured to perform the task at hand, it didn’t come without some
<a href="/2022/11/25/self-hosting-chronicles-authentik-configuration.html">hardships</a> along the way. However, I figured once I had gotten it installed, that would
be that: no more hardships, nothing else to tweak. Over the course of a few months of usage, though,
the rough edges had built up and it got me thinking about how to shore up that corner of the
infrastructure.</p>
<h2 id="issues-with-authentik">Issues with Authentik</h2>
<h3 id="hard-to-customize-styling">Hard to customize styling</h3>
<p>This one was superficial, but given the amount of time I spent configuring Authentik just to do the
necessary authentication/authorization, it was frustrating not being able to easily change the styling
of various flows.</p>
<p>Authentik normally allows you to configure the “flow background” which is just the background image
displayed on a given flow. It also has a pre-existing stylesheet import in its HTML templates for
custom CSS, so long as you can put a CSS file at the right path on disk. Unfortunately, due to
Authentik’s UI being heavily based on custom components using the shadow DOM, they cannot be styled
simply by injecting a CSS stylesheet at the head of the HTML document.</p>
<p>This limitation made it effectively impossible to style the user-facing flows without manually
overriding the various HTML templates <em>and</em> various bits of bundled JavaScript and CSS. Doing so
would make updating to new versions of Authentik painful, since we’d have to make sure our
changes still applied cleanly and hadn’t broken UI functionality.</p>
<blockquote>
<p>Author’s note: There was an <a href="https://github.com/goauthentik/authentik/issues/2693">issue</a> filed for this shortcoming which now
theoretically has a fix, which is great!</p>
</blockquote>
<h3 id="high-memory-overhead">High memory overhead</h3>
<p>I ran Authentik on <a href="https://fly.io">fly.io</a>, which notably has a free tier: up to 3 virtual machines with 1
shared vCPU and 256MB of memory. This should be more than enough for a simple web application that
serves maybe 10-15 requests a day at most, and yet, I had to go above and beyond the free tier to
run Authentik.</p>
<p>Fundamentally, Authentik and its requirements meant we needed to deploy three applications, and thus
use up all of the allocatable free tier VMs: one for Authentik itself, one for Redis, and one for
Postgres. This would be fine if the free tier VM size was adequate, but you can probably guess
where this is going…</p>
<p>Generally speaking, Redis was able to fit within the free tier constraints because it’s already
designed to be configured with memory limits. Postgres was also usually able to fit within the
limits, but during its own startup, as well as Authentik’s startup, it would often hit the memory
limit and get OOM-killed. The amount of data stored was less than 5MB total. Still, between
pre-allocated buffers and various queries made by Authentik, it often seemed to manage to just crest
the memory limits long enough to be hit by the OOM killer. Authentik itself was the significant
outlier, frequently ballooning up to 350-400MB idle RSS over time.</p>
<p>Due to this, I had to increase the size of the two VMs used for Postgres and Authentik to have 512MB
of memory. Increasing the memory allocation knocked me out of the free tier. As I’ve said in
previous posts, fly.io is awesome and I’m okay with paying them, but it certainly rubbed me the
wrong way: the application is idle nearly 100% of the time, and yet I had to bump up the VM
size, and incur a monthly bill, just because it was a hungry hungry hippo for memory those few times
it happened to serve a request every week.</p>
<h3 id="fundamental-flow-based-design-led-to-a-janky-browser-experience">Fundamental flow-based design led to a janky browser experience</h3>
<p>While Authentik’s flow-based design is excellent from a composability standpoint, being able to
build complex flows with branching logic and so on, my needs were far simpler: always show users the
button to login via Plex on the landing page, and when the login pop-up modal closes because they
successfully authenticated to Plex, the landing page should be able to send all the data back to
Authentik, verify that the user was authenticated and authorized and so on, and then immediately
respond with the data required to trigger a browser redirect back to the OIDC SP (service provider,
Cloudflare Access in this case) right then and there.</p>
<p>In practice, Authentik added many redirect steps that were also just needlessly slow.</p>
<p>The authorization flow always starts with a landing page that provides a button to trigger a login
modal pop-up. This is necessary because without the login modal being triggered from a direct user
action, modern browsers will block the pop-up, since it performs a cross-domain request. Alright,
so we’re always stuck having to load a page where the user needs to at least click the login button.</p>
<p>Once a user successfully authenticated with Plex, the modal closes, and Authentik does its first
redirect to its authentication page, where a little flash banner quickly pops up to say “you’re
authenticated!”. In reality, the page is saying the user is authenticated to Authentik itself, and
not exactly Plex: while we authenticate the user via Plex, we have to create a user in Authentik to
map the Plex user to, and then force the user to be authenticated with Authentik. Understandable,
but slightly clunky for our usecase.</p>
<p>After that, the user is redirected <em>again</em> because now we have to authorize them, and again, they
see a flash banner pop-up that they were successfully authorized. As a user, you may now be confused
about what the hell the login pop-up did if you had to be redirected through all of these other pages. Most
users don’t understand (nor should they need to) the difference between their successful
“authentication” and their successful “authorization.”</p>
<p>Technically speaking, though, this was suboptimal because authenticating to Plex was also how we
authorized the user: do they have access to a specific Plex server? Authentik’s design, though,
didn’t allow this to be collapsed: we had to authenticate the user with Plex first and then
essentially start back at the very top of a traditional authentication/authorization flow.</p>
<h3 id="slow-oidc-performance-led-to-confusion-on-if-things-were-stuck">Slow OIDC performance led to confusion on if things were “stuck”</h3>
<p>On the final redirect, where the user was told they were now authenticated, a redirect would
eventually be generated to send the user back to the OIDC SP. This involved generating signed tokens
and other OIDC-related bits, which is fine and dandy. This part was frustratingly slow, though, due
to some inefficiencies in Authentik around loading and handling signing keys.</p>
<p>From a user perspective, the main problem was that this delay in processing created confusion around
whether or not the user was successfully authenticated/authorized, and if they were supposed to be
seeing the actual application. Of course, they only needed to wait a few seconds and then the page
would finally load, and they would be redirected to the protected resource. On a page that appears
to be fully loaded, though, with no progress indicator, having to wait three or four seconds can
certainly cause the user to think something has gone wrong.</p>
<p>Admittedly, I traced this down to signing keys being reloaded multiple times in a single request,
making a response that should have taken one second to return sometimes take nearly three to four
seconds instead. I tweaked a local checkout of Authentik to find and fix this inefficiency, but
admittedly, I never contributed it back upstream. I do still plan to.</p>
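Authentik is a Python application, but the shape of that fix is language-agnostic: load the key material once and share it, rather than reloading it on every use. A rough illustration of the pattern in Rust, where <code>load_signing_key</code> is a hypothetical stand-in for the expensive disk read and parse:

```rust
use std::sync::OnceLock;

/// Stand-in for the expensive part: reading key material from disk and
/// parsing it. In the Authentik case, work like this was happening several
/// times per request instead of once.
fn load_signing_key() -> Vec<u8> {
    // Hypothetical placeholder; a real loader would read a PEM file, etc.
    b"-----BEGIN PRIVATE KEY----- ...".to_vec()
}

static SIGNING_KEY: OnceLock<Vec<u8>> = OnceLock::new();

/// Every caller gets the same cached key; `load_signing_key` runs once,
/// no matter how many requests come through.
pub fn signing_key() -> &'static [u8] {
    SIGNING_KEY.get_or_init(load_signing_key)
}
```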
<h2 id="crafting-the-minimum-viable-solution-plex-oidc">Crafting the minimum viable solution: <code>plex-oidc</code></h2>
<p>All of this led me to think: could I just write my own solution?</p>
<p>In theory, all we need to do is make the same API calls to Plex to handle the Plex authentication
flow, and then query for the Plex user’s server associations to handle the authorization. Sprinkle
some OIDC on top of that to use it from Cloudflare Access, and voilà!</p>
<p>My day job involves writing Rust, and I had already written a <a href="https://github.com/tobz/cloudflare-access-forwardauth">ForwardAuth service for working with
Cloudflare Access tokens</a> in Rust, which has to deal with OIDC
tokens… so I had a good idea of how I’d approach this, and decided to just sketch something out and
see how far I could get in a sitting.</p>
<h3 id="handling-the-basics">Handling the basics</h3>
<p>Most of this is somewhat rote, but I started out with an equivalent skeleton based on
<code>cloudflare-access-forwardauth</code>, which uses <a href="https://docs.rs/tracing"><code>tracing</code></a> for logging/tracing, <a href="https://docs.rs/axum"><code>axum</code></a>
for the web “framework” – routing, request handlers, parsing headers/query parameters, etc. – and
a cast of other helper crates: <a href="https://docs.rs/axum-sessions"><code>axum_sessions</code></a> for handling sessions,
<a href="https://docs.rs/tower-http"><code>tower_http</code></a> for adding per-request logging and CORS,
<a href="https://docs.rs/tower_governor"><code>tower_governor</code></a> for rate limiting, <a href="https://docs.rs/include_dir"><code>include_dir</code></a> for handling
static content, and a few others. We’re also using <a href="https://docs.rs/tokio"><code>tokio</code></a> as the executor, since <code>axum</code> is
based on <a href="https://docs.rs/hyper"><code>hyper</code></a>, which is built on <code>tokio</code>.</p>
<p>This let us rig up the basics of our service: async runtime, logging, exposing an HTTP API with
sensible observability and hardened defaults. I additionally built out some of the configuration
types which I used with <a href="https://docs.rs/serde"><code>serde</code></a> to deserialize our service configuration from a
configuration file on disk.</p>
<p>Again, more or less what you’d consider the “basics.”</p>
<h3 id="handling-oidc">Handling OIDC</h3>
<p>Next, I needed to handle the OIDC flow that this service would be a part of. I’m using Cloudflare
Access to front all of my exposed self-hosted services, which meant that I only needed to handle the
standard “authorization code” flow.</p>
<p>Cloudflare Access hits the “authorize” endpoint of my service, at which point we trigger the Plex
login flow. If the user successfully authenticates with Plex, and they’re authorized to access our
Plex server, we redirect the user back to the redirect URI provided in the authorization flow
request. Finally, Cloudflare Access calls the “token” endpoint of the service behind-the-scenes to
retrieve a token for the now-authorized user, and if that goes well, the user is redirected to the
intended resource, and we’re done.</p>
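The core of that handshake is a one-time authorization code that later gets exchanged for a token. Setting aside the actual HTTP and JWT plumbing, the state shared between the “authorize” and “token” endpoints can be sketched like this (all names here are hypothetical, not the <code>openidconnect</code> API):

```rust
use std::collections::HashMap;

/// Pending authorization codes, mapping a one-time code to the subject
/// (user) it was issued for. A real server would also track the client ID,
/// redirect URI, nonce, scopes, and an expiry for each code.
#[derive(Default)]
pub struct CodeStore {
    pending: HashMap<String, String>,
    counter: u64,
}

impl CodeStore {
    /// Issue a fresh authorization code for `subject`. A real implementation
    /// would generate a cryptographically random code, not a counter.
    pub fn issue_code(&mut self, subject: &str) -> String {
        self.counter += 1;
        let code = format!("code-{}", self.counter);
        self.pending.insert(code.clone(), subject.to_string());
        code
    }

    /// Exchange a code for the subject it was issued to. Codes are
    /// single-use: exchanging the same code twice fails.
    pub fn exchange_code(&mut self, code: &str) -> Option<String> {
        self.pending.remove(code)
    }
}
```

The single-use property matters: Cloudflare Access exchanges the code exactly once, behind the scenes, and anything replaying it should be rejected.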
<p>I used the <a href="https://docs.rs/openidconnect"><code>openidconnect</code></a> crate to build the OIDC routes and handle all of the
relevant bits:</p>
<ul>
<li>a configuration endpoint which details which OAuth 2.0 flows are available, what scopes can be
requested, what signing algorithms we can use, and so on</li>
<li>a JWKS (JSON Web Key Set) endpoint, which describes which signing keys are valid in terms of
verifying a signed JWT (JSON Web Token) returned by the service</li>
<li>the authorize and token endpoints, which start the authorization flow and allow retrieving the
token of an authorized user, respectively</li>
</ul>
<p><code>openidconnect</code> is more geared towards building OIDC clients, rather than servers, but all of the
necessary and relevant primitives exist in the crate. This did mean, however, that we had to lean a
lot on the Authentik source itself, as well as <a href="https://darutk.medium.com/diagrams-of-all-the-openid-connect-flows-6968e3990660">random blog posts</a> and <a href="https://auth0.com/docs/authenticate/login/oidc-conformant-authentication/oidc-adoption-auth-code-flow">application
docs</a> from Auth0 on OIDC flows – including a surprisingly <a href="https://www.ibm.com/docs/en/was-liberty/base?topic=connect-configuring-openid-client-in-liberty">straightforward set of
OIDC docs</a> from IBM (go figure) – to understand what each step needed to do,
functionally. RFCs are utterly dense, and I find them impenetrable for understanding things
holistically like “ok, how does OIDC work?”.</p>
<p>Thanks, random bloggers, for all of your OIDC explanations.</p>
<p>I also used an in-application, in-memory cache to hold all of our pending authorization flow data.
Most full solutions (Authentik, Keycloak, etc.) will actually put this stuff in a database for you,
since they rightfully expect that things such as refresh tokens will be, you know, refreshed over
time… and that you want to keep authorized users authorized even if the service restarts. My
stakes are lower – authorizing friends and family to access media apps – so I went with the
simplest possible approach.</p>
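The idea is simple enough to approximate with the standard library alone. This is just an illustrative sketch of a TTL map, not the actual cache crate I used:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::time::{Duration, Instant};

/// A tiny time-to-live cache: entries silently expire after `ttl`.
/// A real caching crate adds capacity bounds, eviction of expired
/// entries, and thread safety on top of this core idea.
pub struct TtlCache<K, V> {
    ttl: Duration,
    entries: HashMap<K, (Instant, V)>,
}

impl<K: Eq + Hash, V: Clone> TtlCache<K, V> {
    pub fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    pub fn insert(&mut self, key: K, value: V) {
        self.entries.insert(key, (Instant::now(), value));
    }

    /// Returns the value only if it hasn't outlived the TTL.
    pub fn get(&self, key: &K) -> Option<V> {
        self.entries.get(key).and_then(|(inserted, value)| {
            if inserted.elapsed() < self.ttl { Some(value.clone()) } else { None }
        })
    }
}
```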
<p>I used <a href="https://docs.rs/moka"><code>moka</code></a> for this, which is an easy-to-use and capable caching crate. <code>moka</code> also uses
<a href="https://docs.rs/quanta"><code>quanta</code></a>, one of my crates for providing cross-platform access to fast, high-precision
timing sources like TSC. Something something subscribe to my Soundcloud… but really, I wanted to
support people who support me. Thanks <code>moka</code>! 👋</p>
<h3 id="authenticating-and-authorizing-users-via-plex">Authenticating and authorizing users via Plex</h3>
<p>The final piece of the puzzle was getting the users authenticated with Plex, which we then piggyback
on to do authorization. Plex’s API lets you query if a user has access to a particular server, so
the flow ends up looking something like:</p>
<ul>
<li>user authenticates to Plex</li>
<li>once authenticated, we can query their friend list to see if a specified user ID (our user, the
admin user/owner of the given Plex server) is contained</li>
<li>if contained, they are authorized, otherwise, they’re not authorized</li>
</ul>
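The steps above reduce to a plain membership test; the <code>PlexFriend</code> type below is a hypothetical stand-in for the real (de)serialized Plex API payload:

```rust
/// A single entry from the authenticated user's Plex friends list.
/// (The real payload has many more fields; this is a minimal stand-in.)
pub struct PlexFriend {
    pub id: u64,
    pub username: String,
}

/// The user is authorized if the configured server owner's Plex user ID
/// appears in their friends list, i.e. the server is shared with them.
pub fn is_authorized(friends: &[PlexFriend], owner_id: u64) -> bool {
    friends.iter().any(|friend| friend.id == owner_id)
}
```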
<p>This is pretty straightforward, so I whipped up some helper types to use for (de)serializing request
and response payloads and wrote a simple client wrapper based on <a href="https://docs.rs/reqwest"><code>reqwest</code></a>, which was a
bit simpler to deal with than using <code>hyper</code> directly. If I were on a budget, maybe in terms of
reducing transitive dependencies or maximizing performance or memory usage or something, I might use
<code>hyper</code> directly to only do <em>exactly</em> what was needed, but that wasn’t a concern here.</p>
<h3 id="flexing-those-frontend-muscles-and-writing-some-htmlcssjs">Flexing those frontend muscles and writing some HTML/CSS/JS</h3>
<p>Ultimately, the service needs to show at least one page to the user where the Plex login flow itself
is triggered, which means we need to render some HTML and provide some JS. As mentioned early on, I
wasn’t particularly happy with the lack of styling capabilities exposed by Authentik so I took my
time here to tweak things exactly how I wanted them.</p>
<p>We don’t use any sort of CSS framework, or any JS libraries, or build tooling. All of the HTML, CSS,
and JS was hand-rolled. Yes, it’s possible to write HTML, CSS, and JS that render a decent-looking
page on modern browsers, across multiple device types and screen sizes, without having to possess
otherworldly knowledge. Could I have whipped up the UI much faster if I used something like
<a href="https://tailwindcss.com/">Tailwind CSS</a>? Yes. Would my static assets be much smaller and faster to load if I
used a build pipeline like <a href="https://webpack.js.org/">Webpack</a> or <a href="https://vitejs.dev/">Vite</a> to bundle and minify and tree shake
and all of that? Yes.</p>
<p>Is what I have more than acceptable for my use case? Unquestionably “yes.”</p>
<p>Using <code>include_dir</code>, we bundle these static assets directly into the binary. They get served
normally via a dedicated path for static assets, which means they can then be trivially CDN-ified,
but ultimately, they’re just part of the binary, which is one less thing to need to faff about with
during deployment.</p>
<h3 id="putting-it-all-together">Putting it all together</h3>
<p>Now with the service written, and the various phases complete, the lifecycle of authorizing a user
looks like this:</p>
<ul>
<li>a user navigates to a resource protected by Cloudflare Access</li>
<li>the user is not authorized, so they’re redirected to the OIDC authorization endpoint</li>
<li>we initialize the authorization flow for the user, and return the landing page with sparkly
HTML/CSS</li>
<li>the user clicks a button to pop-up a login modal for Plex</li>
<li>the landing page polls the Plex API on an interval to figure out when the user has finally
authenticated with Plex</li>
<li>once they’ve been authenticated, the landing page sends the user’s Plex token back to the service
to continue the authorization flow</li>
<li>we actually authorize the user by checking if the configured Plex server is shared with them</li>
<li>if they’re authorized, we redirect back to Cloudflare Access with the relevant OIDC authorization
code</li>
<li>Cloudflare Access calls the OIDC token endpoint to exchange the authorization code for their
access/ID token and verifies that the tokens came from us</li>
<li>if all of that succeeded, they’re eventually redirected to the underlying resource, with their
Cloudflare Access JWT token (which is where <code>cloudflare-access-forwardauth</code> would take over)</li>
<li>fin!</li>
</ul>
<h2 id="mission-accomplished">Mission accomplished?</h2>
<p>Simply put: yes.</p>
<h3 id="improved-user-experience">Improved user experience</h3>
<p>The authentication/authorization flow now feels like you’re really only hitting a single page – the
landing page – and then you’re being whisked off to the protected resource.</p>
<p>In contrast, a user going through the Authentik flow would see the landing page, and then the
authentication flow page, and then the authorization flow page. Add to that the fact that the
authorization flow page sat there for a while due to the aforementioned slowness with generating
OIDC-related responses, and now things feel vastly snappier.</p>
<p>In practice, Cloudflare Access itself is always going to do some redirects – two before the user
gets to the landing page, and two after we send the user to the OIDC redirect URI – but those pages
load quickly, which is no surprise given that they live on the Cloudflare side. Perhaps most
importantly, they load/display no user-visible content, so again, the user only ever feels like they
load a single page – the landing page – and then they’re redirected and it spins ever so briefly
before the protected resource is loaded for them.</p>
<p>This is a testament to going with a hand-rolled solution to optimize for the requirements at hand.
This isn’t so much a knock against Authentik itself, but for this use case, it was entirely overkill
and proved to provide a suboptimal UX.</p>
<h3 id="improved-styling">Improved styling</h3>
<p>Not a whole lot to say here. We control all of the HTML/CSS styling now, so we can (and did) do
whatever we want with it.</p>
<h3 id="improved-memory-overhead">Improved memory overhead</h3>
<p>We were able to go from three deployed Fly apps (Authentik, Postgres, Redis) down to one app
(<code>plex-oidc</code>). We were also able to switch back to the smallest VM size – 1 shared vCPU, 256MB
memory – and <code>plex-oidc</code> idles at around 30MB RSS.</p>
<p>Our bill is now comfortably within the free tier for Fly.io, so we pay nothing to host it. Woot!</p>
<h2 id="was-it-worth-it">Was it worth it?</h2>
<p>While I definitely achieved the things I wanted to achieve, it’s still worth considering: <em>was it
worth it?</em></p>
<p>I spent close to 2-3 weeks of part-time tinkering to build <code>plex-oidc</code> and get it functioning how I
wanted. Realistically, my users log in once a week at most. When you do the math – 5-10 logins a
week, and 6-7 seconds “wasted” by using Authentik – the amount of user time saved is very small.
That fact is inescapable.</p>
<p>Having operated this service for a few weeks now, the one thing I didn’t originally consider, or at
least didn’t have top of mind, was that Authentik burned some of <em>my</em> time, in an operations sense.
Sometimes people would hit bugs with the authentication/authorization flow, or Authentik would crash
due to OOM and get into a weird state with Postgres locks or stalled Celery tasks or whatever.</p>
<p>In almost all of those cases, I just restarted the relevant Fly app, but it still required me to
disengage from whatever I was doing, mentally and physically. This was a system I pointed friends
and family at, that I just wanted to work. Even knowing that it’s not the end of the world if I
didn’t check on a problem for a few hours, it felt like a splinter in my mind.</p>
<p>All of this said, <code>plex-oidc</code> has not had a single hiccup since I deployed it. It just works. No OOM
issues, no weird issues seemingly related to shoehorning my intended authentication/authorization
flow into Authentik’s model of flows, nothing. It just works, and keeps on working. That part made
it worth it, and continues to make it worth it every day.</p>
<h2 id="but-wheres-the-source-code-bub">But where’s the source code, bub?</h2>
<p>Admittedly, I started working on this as a public repo, because <em>why not?</em>, but then I made it
private. In fact, I bundled it into my infra monorepo of sorts, since this allowed me to iterate
faster by inlining secrets and tokens directly without having to nerdsnipe myself: oh, where should
I store these secrets? how will I deploy these secrets? There’s also the aspect of the static
content being hand-rolled for my particular set-up, which means colors and background images and
hastily-designed logos specific to my set-up.</p>
<p>I’ll likely open source this at some point in the near future once I have some time to clean up the
aforementioned things, and figure out how to generify the HTML/CSS and all of that.</p>

<h1>Self-hosting chronicles: Authentik configuration</h1>
<p>2022-11-25 · Toby Lawrence</p>

<p>In my <a href="https://notes.catdad.science/2022/10/27/self-hosting-hard-way-exposing-services.html">inaugural
post</a>, I
briefly covered the shape of my new approach to exposing self-hosted applications to the public
internet in a reasonably secure way. Most of this setup depends on a piece of software called
<a href="https://goauthentik.io/">Authentik</a>, an open-source identity provider which acts as the glue
between Cloudflare Access and the actual authentication mechanisms I want to depend on.</p>
<p>Getting Authentik set up and configured the way I wanted it to be configured was by far the hardest
part of the entire process, so it felt worth writing up on the off-chance that someone running into
the same problems as I did can potentially find this post, presented as an encapsulated list of
problem/answer tuples, and get themselves sorted that much faster.</p>
<h2 id="deployment">Deployment</h2>
<p>Authentik, as an identity provider, has to run somewhere. When you think of Google or Github or Okta
as identity providers, what’s the one thing they all share in common? Their IdP endpoints are
exposed to the public internet, because you need to be able to communicate with them from your
browser or mobile device. Authentik is no different in this regard, and so I needed to run it
somewhere where I could expose it to the public internet. As laid out in the aforementioned post, I
chose <a href="https://fly.io/">Fly.io</a> for this purpose.</p>
<p>Fly.io is a … something between an infrastructure-as-a-service provider and platform-as-a-service
provider, depending on how hard you squint. They have the trappings of a PaaS provider, where their
<code>flyctl</code> CLI tool, and a simple <code>fly.toml</code> configuration file, is essentially all you need to start
creating and deploying applications.</p>
<p>Despite this, I still hit some roadblocks that were a culmination of how Authentik expects things
to be done, and how Fly.io works.</p>
<h3 id="writing-to-standard-error-or-not">Writing to standard error, or not</h3>
<p>The first problem I encountered was that <code>/dev/stderr</code> does not seem to be available on Fly.io.</p>
<p>The <a href="https://github.com/goauthentik/authentik/blob/main/lifecycle/ak">“lifecycle” script</a> used by
Authentik has a simple logging function which pipes its output to <code>/dev/stderr</code>. I forget the exact
error message that I got… maybe “file doesn’t exist” or maybe “permission denied”, but I got an
error, consistently.</p>
<p>My original searching led me to
<a href="https://community.fly.io/t/redirect-logs-to-stdout/514">this forum post on community.fly.io</a>,
which made me originally think that this was potentially a bug with both stdout and stderr, and
perhaps it hadn’t yet been fixed for stderr? I had also stumbled onto this
<a href="https://unix.stackexchange.com/questions/38538/bash-dev-stderr-permission-denied">Unix StackExchange question</a>
where one of the answers suggests just using the <code>1>&2</code> style of redirection instead of the stderr
<code>/dev</code> entry itself.</p>
<p>Admittedly, I never investigated this more holistically and instead took the simpler approach of
just not having output redirected to <code>/dev/stderr</code>:</p>
<pre><code class="language-dockerfile">FROM ghcr.io/goauthentik/server:2022.9.0
USER root
RUN sed -i -e 's# > /dev/stderr##g' /lifecycle/ak # <-- This lil' guy right here.
USER authentik
ENTRYPOINT ["/lifecycle/ak"]
CMD ["server"]
</code></pre>
<h3 id="limitations-with-upstash-redis">Limitations with Upstash Redis</h3>
<p>The next issue I encountered, once Authentik could actually start, was that the managed Redis
service (<a href="https://docs.upstash.com/redis">Upstash Redis</a>) doesn’t support multiple Redis databases.</p>
<p>Authentik uses Redis for two purposes: caching within Django and Celery. Celery is a Python library
for distributed task processing, where you can enqueue tasks to be run by a pool of Python worker
processes, and those workers pick up the tasks and run them and report the status and so on. Celery
has a concept of “backends” which are the systems that actually get used for the transport of task
messages, of which Redis is a supported backend.</p>
<p>Authentik takes advantage of already wanting Redis for Django caching and configures Celery to use
it, too. The problem is that Authentik points itself, and Celery, at different Redis “databases” in
an attempt to isolate the data between the two use cases. Redis databases are a logical construct
for isolating the keyspace, so that keys don’t overlap and clobber each other, and so on.</p>
<p>Upstash Redis, which exists by itself but is offered as a managed service on Fly.io, doesn’t support
multiple Redis databases, instead only supporting one database, the default database, at index 0.
Luckily, Authentik already exposed the ability to change the database selection as part of its
configuration, which was simply achieved by setting the following environment variables:
<pre><code>AUTHENTIK_REDIS__CACHE_DB = "0"
AUTHENTIK_REDIS__MESSAGE_QUEUE_DB = "0"
AUTHENTIK_REDIS__WS_DB = "0"
</code></pre>
<p>This shoves all usages of Redis – both Authentik itself and Celery – onto the same Redis database.
So far, I’ve yet to experience any issues with doing so, and even in a brief conversation with the
creator of Authentik, they weren’t necessarily sure whether Authentik actually needed to isolate
the data like this any longer.</p>
<p>Admittedly, I ended up spinning up my own Redis container/application on the side because using
Upstash Redis kept leading to weird Redis issues from Celery’s perspective. There was originally a
bug with how Upstash Redis implemented <code>MULTI</code>/<code>EXEC</code> for certain commands that was definitively
wrong (compared to the official Redis behavior) which Upstash fixed a few weeks after I reproduced
and reported the behavior… but even after it was fixed, somehow, their service still acted weird
when used by Authentik/Celery.</p>
<p>I didn’t have the time or energy to keep debugging it, so I spun up my own Redis
container/application. Maybe there were more bugs with Upstash Redis that are now fixed, or maybe I
was doing something wrong with my particular configuration… who knows.</p>
<h2 id="configuration">Configuration</h2>
<p>At this point, I at least had Authentik deployed and running. I won’t go into all aspects of
configuring it prior to the first successful deployment, because the rest of it is fairly
mundane and covered by their documentation.</p>
<p>The real bugaboo, however, was actually configuring Authentik in terms of its behavior.</p>
<h3 id="how-authentik-works">How Authentik works</h3>
<p>Authentik has a very particular viewpoint/approach in terms of how it operates. This is not to say
that the approach Authentik takes is bad, just that deviating from it can often leave you
frustrated.</p>
<p>To start, Authentik has an out-of-the-box experience that configures a number of stages, policies,
and flows to give you a setup that is close to how most people might be expected to use it. You can
and <em>should</em> read the documentation, but Authentik uses the concept of <em>flows</em> and <em>stages</em> as a way
to describe the authentication/authorization journey a user takes. Configuring these flows/stages is
how you configure how Authentik works: how users authenticate (username/password? federated login?),
whether or not they’re authorized to access a particular application, enforcement of password
strength or two-factor measures, and so on.</p>
<p>To achieve this, Authentik supports a feature called “blueprints.” These are YAML files that can be
processed to programmatically create flows, stages, policies, bindings, and most all of the various
model objects that are used to configure Authentik’s behavior. These YAML files are essentially data
mocks on steroids: you define the model objects themselves, and are given access to helper logic in
the form of YAML “tags”… such as setting the value of a primary key field to <code>!KeyOf ...</code> to have
the blueprint importer search for another blueprint-managed object by a logical identifier and
insert the actual primary key for you.</p>
<h3 id="using-blueprints">Using blueprints</h3>
<p>As Authentik is an identity provider, many users do in fact use it as the source of
authentication/authorization itself… as in, users are registered in Authentik as the source of
truth, and everything flows outwards from Authentik. The out-of-the-box experience works well for
this, and in many instances, I’ve seen users be told to simply tweak the out-of-the-box flows/stages
to achieve their desired outcome.</p>
<p>As part of my deployment, though, Authentik was simply glue logic between an existing authn/authz
mechanism – Plex’s own identity provider – and Cloudflare Access. I didn’t want local users
created, and I didn’t care about specific applications, and I certainly didn’t want Authentik to
proxy any access or anything like that. I just wanted an identity provider.</p>
<p>Most importantly, I wanted to use blueprints, because they seemed to be the only way (a good way, to
be fair!) to idempotently configure Authentik.</p>
<h3 id="blueprint-pitfalls">Blueprint pitfalls</h3>
<p>As I started to look into configuring Authentik entirely via blueprints, I hit numerous pitfalls and
struggled often with finding a consistent source of documentation and answers to mystifying
behavior. I’ll list these out in no particular order.</p>
<h4 id="blueprint-file-structure-is-documented-with-varying-levels-of-specificity">Blueprint file structure is documented with varying levels of specificity</h4>
<p>Looking at the existing blueprint files can be a useful hands-on example of how to write your own,
but this leaves a lot to be desired. As blueprints are essentially model object mappings, you’re sort
of writing your very own <code>INSERT INTO table (field, field, ...) VALUES (...)</code> but in YAML. This
means you actually need to go and look at the model definition for objects you want to add, or look
at the source code.</p>
<p>There’s no documentation for the models, either by themselves or in the context of blueprint
development. This means you have to become intimately familiar with the models if you’re trying to
create something via blueprints that doesn’t already have representation in the out-of-the-box
blueprints.</p>
<p>Beyond that, there’s a lot of uncertainty around parts of the blueprint definition, such as
specifying identifiers. Identifiers are used to provide a unique identifier (duh) for a
given model object… which is fine, and makes sense. No issue there. Again, however, there’s little
to no documentation on the models, so if you don’t specify the right identifier fields, the
blueprint importer just yells at you.</p>
<p>The experience is poor, to say the least.</p>
<h4 id="blueprints-have-no-ordering-constraints">Blueprints have no ordering constraints</h4>
<p>Authentik supports what they call a “meta model”: since blueprints are tied almost exclusively to
actual models, they have a concept of “meta models” which allow operations other than creating a
model object. The only meta model currently is “apply blueprint.”</p>
<p>If you’re like me, you might read that and think “oh nice, dependency management!”. Or maybe
something along those lines. It’s certainly what I thought, given that meta models are described as:</p>
<blockquote>
<p>This meta model can be used to apply another blueprint instance within a blueprint instance. This
allows for dependency management and ensuring related objects are created.</p>
</blockquote>
<p>Alas: <em>nope</em>.</p>
<p>The meta blueprint apply operation certainly can import and apply other blueprints, but it <em>cannot</em>
use them in any sort of way that would actually qualify as dependency management. There’s no
ordering (apply blueprint first, and then create the actual model objects described further down in
the blueprint, etc) and there’s no logical inclusion, which means you can’t even reference model
objects created in meta-applied blueprints (i.e. using the special YAML “tags”) because it doesn’t
actually load the blueprint in an inclusive way, or dependably apply it first.</p>
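<p>For reference, the meta-apply entry is just another entry in a blueprint – something along these
lines (hedging here: the exact model path and attribute names are from my reading of the meta model,
so double-check them against your Authentik version):</p>

```yaml
entries:
  # "Apply blueprint" meta model: triggers another blueprint, identified by
  # name, but does not let you reference its objects from this file.
  - model: authentik_blueprints.metaapplyblueprint
    attrs:
      identifiers:
        name: my-other-blueprint
      required: true
```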
<p>That’s not to say any of this is easy to program up, but, I mean… the documentation <em>says</em> the
words “dependency management”, and there is no hint whatsoever of anything that could reasonably be
considered dependency management. All it lets you do is prevent one blueprint from running until
the meta-applied blueprints have been run successfully… so you get to wait multiple hours for
blueprint runs to finally make enough progress to cover all of the blueprint dependencies. Not great
at all.</p>
<p>I currently use a single blueprint file to reliably construct all of the necessary model objects and
be able to correctly reference them in the creation of subsequent model objects.</p>
<h3 id="actual-flow-behavior-and-figuring-out-what-matters">Actual flow behavior and figuring out what matters</h3>
<p>As I was trying to write blueprints for repeatable, idempotent configuration, I also still had to
figure out what I even needed to configure to achieve my desired setup.</p>
<p>Remember that Authentik uses a concept of flows and stages, where flows are tied to specific phases
of the journey of a user through an identity provider, such as identification, authorization,
enrollment, and so on. Stages are merely the next step down, and allow configuring the actual steps
taken within each of those flows.</p>
<p>This is where things mostly got/felt wonky because, again, the documentation has oscillating
specificity. If you go to the documentation section on “Flows”, the landing page has some decent
high-level information on flows and the various available flows, but it has more content by volume
on the various ways you can configure a flow’s web UI (flows are also weirdly intertwined with the
context they’re used in, i.e. web vs. “headless” (LDAP)), so you really have to spelunk here.</p>
<h4 id="a-lot-of-boilerplate-to-satisfy-various-flow-constraints">A lot of boilerplate to satisfy various flow constraints</h4>
<p>As an example of where this all gets wonky and cargo cult-y, let’s consider the authentication flow.</p>
<p>Authentik itself is always the main entrypoint, even if like me, you end up totally depending on a
federated social login setup a la Google/Plex/etc. This means we need an authentication flow, which
is all fine and good. We can configure an identification stage in that flow that specifies how a
user should identify themselves. There’s already a fairly intuitive bit of verbiage on the
configuration modal for an identification stage around showing buttons for specific identification
sources, such as these federated social login options. Think of it like the typical “Sign In with
Google” buttons you may have seen before.</p>
<p>“Great”, you think. You dutifully select the federated social login source you want to use, and
unselect the username/email options because we don’t want to log in as local users. You try to use it
and… weird, it doesn’t work. As it turns out, you actually need to specify an authentication flow for
the federated social login source itself. Additionally, you also need an enrollment flow for your
federated social login source.</p>
<p>This is ultimately because you need to log in to Authentik itself (remember the part about Authentik
being the “entrypoint”), which means creating a local user in Authentik even if you’re
farming out that aspect to something like Google… which necessitates the separate authentication
and enrollment flows.</p>
<p>This isn’t called out anywhere in the documentation that I have been able to find. Pure trial and
error here.</p>
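<p>In blueprint terms, that means the source object itself needs its <code>authentication_flow</code> and
<code>enrollment_flow</code> fields pointed at those source-specific flows. A hypothetical fragment (the
slugs and logical IDs are mine, not anything standard):</p>

```yaml
- model: authentik_sources_plex.plexsource
  identifiers:
    slug: plex
  attrs:
    name: Plex
    # Without these two flows, the "Sign in with Plex" button shows up
    # but logging in through the source doesn't actually work.
    authentication_flow: !KeyOf source-authentication-flow
    enrollment_flow: !KeyOf source-enrollment-flow
```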
<h4 id="making-everything-a-flow-leads-to-a-suboptimal-user-experience">Making everything a flow leads to a suboptimal user experience</h4>
<p>Despite the above, I eventually managed to get my intended flow working. Handing it over to a
user to test, though, yielded the following: “my browser refreshed a lot of times and seemed like
it didn’t actually work, and then I eventually landed on the application”.</p>
<p>What the user was describing was an idiosyncrasy of everything – including sources – needing a
flow, and how flows are executed.</p>
<p>Since our federated social login source needed source-specific flows – two of them – this actually
meant that the web-based flow redirects them three to four times in total:</p>
<ul>
<li>they land at the Authentik “identify yourself” page where they have to manually click on the
button for authenticating via Plex</li>
<li>a modal pops up for Plex login, etc, and closes once they do so</li>
<li>the page then redirects to the source authentication flow which logs them in</li>
<li>if they’re brand new, they’re also redirected to the source enrollment flow first</li>
<li>then they’re redirected to the source authorization flow (which is a no-op, in my case) which
sits around for 2-3 seconds inefficiently calculating JWTs or something</li>
<li>finally, they’re redirected back to the SP initiator (the entity that triggered the IdP flow in the
first place) and can actually use the thing</li>
</ul>
<p>Some of this could be solved by allowing for short circuiting behavior around default options, such
as not requiring the user to pick an identification source if there’s only one available… or
not redirecting to a specific flow if it’s literally a no-op. There’s an
<a href="https://github.com/goauthentik/authentik/pull/3696">open PR</a> for the former, albeit stalled on
getting pushed through.</p>
<p>These are paper cuts that make using Authentik clunky from the user perspective. I’m not happy
having to spend so much time actually configuring flows to do what I want, but at least if I can
turn that into a consistent and intuitive experience for users, then the effort was worth it. When a
user is getting redirected multiple times, or hitting pages that are so slow to execute that they
wonder if they’re stuck… then the experience is not consistent or intuitive.</p>
<h2 id="conclusion">Conclusion</h2>
<p>I intended to write this post in a very problem/answer-oriented way, and I managed that in some
regard, but clearly I diverged a bit because recounting the sheer effort I expended to get things
configured made it hard to not vent a bit.</p>
<p>Hopefully for anyone reading this, it gives you some additional insight to get Authentik configured
with less effort than I personally spent.</p>
<p>As for me, I do intend to eventually contribute back, or try to contribute back, improvements to the
sharp edges that chewed up so much of my time. I do think there’s something to be said, though,
about focusing on the user experience aspect where possible, especially when it comes to security.
My users can’t get frustrated and actually, like… do anything less secure as a workaround to avoid
redirect hell: they use this thing I’ve configured and provided, or they don’t get access at all.
For other people deploying Authentik, though? Who knows what their use case is and if these
shortcomings might undermine their ability to provide an alternative authn/authz solution to their
users that doesn’t feel more clunky to use than whatever they came from.</p>Toby LawrenceIn my inaugural post, I briefly covered the shape of my new approach to exposing self-hosted applications to the public internet in a reasonably secure way. Most of this set up depends on a piece of software called Authentik, an open-source identity provider which acts as the glue between Cloudflare Access and the actual authentication mechanisms I want to depend on.Self-hosting the hard way: securely exposing services on the World Wide Web2022-10-27T00:00:00+00:002022-10-27T00:00:00+00:00https://notes.catdad.science/2022/10/27/self-hosting-hard-way-exposing-services<h2 id="prelude">Prelude</h2>
<p>Over the past few years, there’s been somewhat of a renaissance around self-hosting services, such
as media storage and e-mail. Whatever the reasons may be – financial, technical, moral, etc –
there’s an appetite to host services on one’s own infrastructure, under one’s control. I myself host a
number of services on my own infrastructure, complete with a cheeky location code for my basement
“datacenter”. However, most people’s lives don’t exist in a bubble that ends exactly where their wifi
network begins. Being able to use these services on the go is just as important as using them at
home where they might be hosted.</p>
<p><strong>tl;dr: Cloudflare Tunnel for avoiding having to directly expose my home infrastructure, Authentik
running on Fly.io for exposing an externally-accessible Identity Provider, and Cloudflare Access
using Authentik as an IdP for authenticating/authorizing all requests before they ever hit my
network</strong></p>
<h2 id="why-not-tailscale">Why not Tailscale?</h2>
<p>Now, I’m a tech person, and you’re probably a tech person too if you’re reading this blog, and
you’ve potentially already muttered at the screen, bewildered: “why not just use Tailscale?”.
<a href="https://tailscale.com/">Tailscale</a>, in a nutshell, provides point-to-point encrypted connections to
allow accessing remote devices as if you were on the same local network, but from anywhere. It has
clients for all major desktop and mobile platforms and using it is an absolute breeze. Tailscale is
already part of my homelab ecosystem, but it has one particular limitation: it’s not good for
sharing <em>arbitrary</em> access to self-hosted services. Sure, the client support is there, but asking
friends and family to install Tailscale on all the devices they might want to use, and then managing
that access … that’s not a task I want to take on.</p>
<h2 id="figuring-our-requirements">Figuring out our requirements</h2>
<p>With all of this in mind, I set out to find a solution to the problem as I saw it, and came up with
a list of requirements:</p>
<ol>
<li>Services must be accessible from “anywhere”</li>
<li>New software should not be required (i.e. you shouldn’t need a VPN to access the equivalent of a
hosted GMail… just your browser)</li>
<li>Should not be directly exposed to the internet (no port forwarding)</li>
<li>All requests must be authenticated/authorized <em>before</em> traffic ever reaches anything in our
infrastructure</li>
<li>Should be free (no software/hosting cost, ideally)</li>
</ol>
<p>Over the course of a few weeks, I did a bunch of research, scouring the likes of the
<a href="https://www.reddit.com/r/selfhosted/">r/selfhosted</a> subreddit, popular forums for home hosting like
<a href="https://forums.servethehome.com/index.php">ServeTheHome</a>, and general Google spelunking. After a
lot of tinkering and toiling, I finally came up with a solution that checks off all the boxes, which
I’ll go through below.</p>
<h2 id="wiring-up-our-services-to-the-internet">Wiring up our services to the internet</h2>
<p>Cloudflare has a large catalog of services they offer, but one of the more intriguing ones for
self-hosters is their “Tunnel” service.
<a href="https://www.cloudflare.com/products/tunnel/">Cloudflare Tunnel</a> (née <em>Argo Tunnel</em>) provides a way
to run a small daemon within your network that establishes a reverse connection <em>back</em> to
Cloudflare’s POPs, where in turn you can expose applications inside your network across that tunnel.
Similar to configuring, say, nginx or Apache, you point <code>cloudflared</code> at an upstream target (or
targets) to send traffic to (like <code>localhost:8080</code> or <code>svc.mylocal.network:9000</code>) and then
configure, on the Cloudflare side, what public hostnames to expose those services at. When traffic
hits Cloudflare’s edge, it gets sent across the established tunnel, and ultimately lands at your
service.</p>
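<p>If you manage the tunnel with a local configuration file instead of the Zero Trust dashboard,
the <code>cloudflared</code> side looks roughly like this – the tunnel name, hostname, and upstream
are placeholders, not anything from my actual setup:</p>

```yaml
# /etc/cloudflared/config.yml (sketch)
tunnel: example-tunnel
credentials-file: /etc/cloudflared/example-tunnel.json
ingress:
  # Route this public hostname across the tunnel to an internal upstream.
  - hostname: app.vanitydomain.com
    service: http://svc.mylocal.network:9000
  # Catch-all rule, required as the last ingress entry.
  - service: http_status:404
```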
<p>This is really, really cool and helps us check off a lot of boxes, at least partially:</p>
<ul>
<li>Our service is <em>not</em> exposed directly to the internet. Attackers could still exploit RCEs in our
service or stuff like that, but we don’t have to loosen our firewall rules one bit.</li>
<li>Our service is, for all intents and purposes, accessible to anyone with an internet connection.
We’re not concerned with impediments such as country-level firewalls, DNS blackholing, corporate
proxies, etc.</li>
<li>Cloudflare Tunnel is <em>free</em>. All of the other bits – a basic Cloudflare account, hosting DNS for
a domain, protecting subdomains with TLS, and so on – are also free.</li>
</ul>
<blockquote>
<p>Admittedly, I started looking at Cloudflare Tunnel before Cloudflare’s absolute fumbling with the
whole Kiwi Farms situation. In no uncertain terms: fuck Kiwi Farms and all of the dipshits who
gave it life.</p>
<p>If there was another company providing a service equivalent to Cloudflare Tunnel, I’d use it.
Until then, I’m not actually giving Cloudflare money, and I’m mostly trying to provide a service
to friends and family in a secure way.</p>
<p>If you know of an equivalent offering, I’m all ears!</p>
</blockquote>
<p>I repurposed my existing Cloudflare account to host the DNS for my quirky self-hosting domain (this
one!). The setup instructions for Cloudflare Tunnel and <code>cloudflared</code>, the daemon that runs in your
infrastructure, are short but straightforward. I spun up a simple “Hello world!” app on one of my
servers, and ran <code>cloudflared</code> via Docker. After a small bit of configuration in the Zero Trust
dashboard to create a new subdomain and associate it with a target on my side of the tunnel, we were
serving traffic… except I had misspelled “world” as “wordl”, so now my low-grade dyslexia was
publicly – albeit <em>securely</em> – on display to the whole… “wordl.”</p>
<p>All told, this proof of concept took less than 30 minutes. Honestly, it felt a lot like magic. I was
on a roll at this point.</p>
<p>Onward!</p>
<h2 id="we-need-a-bouncer-at-the-entrance">We need a bouncer at the entrance</h2>
<p>While having the services safely exposed to the internet was half the battle, I still needed an
answer to the problem of authentication/authorization. As alluded to above, I was trying to
imagine what the chinks in the armor might be for a set-up like this, and naturally, software
vulnerabilities came to mind. Even if a service is fronted by Cloudflare, an attacker can still get
requests to the service. I write software for a living, and I know just how much code is out there,
waiting for someone to walk by and notice how trivially it can be eviscerated.</p>
<p>I needed a way to actually protect the service before traffic was allowed through, by authenticating
and authorizing users. I considered the security of the applications themselves as out of scope
here:</p>
<ul>
<li>generally, I trust the people to whom I’ve given/will give access</li>
<li>some of these applications have a <code>node_modules</code> folder that would make a security researcher
either salivate or run away screaming… so I chose to value my sanity and avoid delving too
deep on the application side</li>
</ul>
<p>With that said, again, I tried to be regimented and came up with a list of requirements/invariants
for doing authentication/authorization:</p>
<ul>
<li>all requests must be authenticated/authorized <em>before</em> traffic reaches my home infrastructure, so
<em>we</em> can’t host any part of an authn/authz solution on said home infrastructure</li>
<li>we want to let users authenticate with identities they already have, otherwise we’re back to the
Tailscale problem, where we’re constantly managing not only access to be on the same tailnet, but
to the self-hosted applications themselves (I don’t want a ‘local network has admin’ authorization
scheme)</li>
</ul>
<p>Admittedly, I’m writing this in a different order than how I approached finding a solution because
it felt a lot like having no answer at all until all of the pieces came together. With that said,
what I ultimately landed on was based on another Cloudflare product:
<a href="https://www.cloudflare.com/products/zero-trust/access/">Cloudflare Access</a>.</p>
<p>As mentioned above, Cloudflare Tunnel is part of Cloudflare’s overall “Zero Trust” product suite,
which is their set of offerings for small businesses and enterprises to do the whole zero trust
thing: trusted devices with device posture management, being able to access internal resources by
currying favor granted to their trusted devices, and bringing all different manner of services into
the fold in a generic way, whether they’re self-hosted (Cloudflare Tunnel) or external, by providing
authorization at the Cloudflare edge (Cloudflare Access). Their product people would probably snort
here at my crude explanation of the Zero Trust product suite, but suffice it to say that it provides all
of the building blocks we need to build this solution.</p>
<p>Anyways, Cloudflare Access provides the authorization aspect by allowing you to specify identity
providers that users can authenticate to, which then allow you to authorize them with simple
policies on the Cloudflare side, and Cloudflare handles managing the cookies/tokens/proxying of
traffic to your services. In particular, it can sit in front of a service exposed by Cloudflare
Tunnel without the two CF products having to be configured to know about each other. As long as both
the tunnel and the Access configuration are set to handle the same specific domain, it just seems to
work… the authentication/authorization happens first, and then it continues on with tunneling the
traffic.</p>
<p>Honestly, a lot like magic.</p>
<h2 id="whos-on-the-guest-list">Who’s on the guest list?</h2>
<p>I knew now that Cloudflare Access was the answer to how to handle authorizing all requests before
they made it to my home infrastructure, but what I didn’t know yet was <em>how</em> to authenticate users.
I also didn’t know how to authenticate them off-site, to avoid the catch-22 of traffic needing to
hit my home infrastructure first.</p>
<p>Some more spelunking later, I uncovered a pair of names that kept showing up while sifting <strong>r/selfhosted</strong>:
<a href="https://goauthentik.io/">Authentik</a> and <a href="https://www.authelia.com/">Authelia</a>. These projects, and
others like them, are essentially “build your own identity provider” solutions. They allow you to
handle all the common aspects of running an identity provider: creating/managing users and groups,
importing users from other identity providers, doing authorization, and so on. I ultimately chose
Authentik because one of the services I run is Plex, and Authentik has support for using Plex as an
identity provider itself, which meant for services where people were already accessing my Plex
content, they could authenticate using the same identity/credentials. Further, Plex provides an API
to distinguish if a user authenticated through their identity provider has access to a specific Plex
server, which meant I could essentially get authorization for free in some cases. If a user should
only get access to an application because it’s related to their access to my Plex server, then
authentication and authorization through Plex becomes one and the same.</p>
<p>Configuring Authentik was fairly painful: it took me a while to pore over the documentation, figure
out how to create my own identity provider, configure all of the various flows/stages correctly,
wire it up to Cloudflare Access, and test it. Documenting all of the steps I took and the
configuration I landed on would be too big for this post, but is something I’m looking to post about
more in the future… ideally with an opinionated configuration to help others start from. In the
interest of brevity, here’s how I configured Authentik:</p>
<ul>
<li>Authentik gets configured to use Plex as a federated authentication source, further constrained to
users that have access to my Plex server</li>
<li>Authentik is configured to create shadow user accounts locally to mirror the users in Plex, and
assigns them to a specific group (not required, but for my own sanity)</li>
<li>Authentik exposes a dedicated OAuth2/OpenID Connect endpoint that uses this Plex-based federated
authentication for its own authentication flow, and authorization is a no-op on the Authentik side
since we get it implicitly based on how the Plex authentication works</li>
<li>Authentik is configured to send the user’s Plex token in the OIDC claims so that we can pass it to
the underlying services being protected</li>
</ul>
<p>Cloudflare Access, in addition to the cookies it uses itself for ensuring users are authenticated
before passing the traffic through, will also send a cookie/header with a
<a href="https://jwt.io/">JSON Web Token</a> of the user’s information. You get the common stuff from the
common OpenID scopes – <code>username</code>, <code>email</code>, yadda yadda – but you can also shove in custom fields
from the OIDC claims – which you can customize on the Authentik side – into the JWT. This means
not only can we validate the JWT on our side to make sure Cloudflare was really involved, but that
we can shuttle along custom data – like the authenticated user’s Plex token, which we need for
forward authentication – in the JWT as well.</p>
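<p>To give a feel for what that looks like on the application side, here’s a minimal Python sketch
that pulls custom claims out of the JWT. The claim names (<code>plex_token</code>, etc) are placeholders
– they depend on how you configure the claims on the Authentik side – and a real implementation must
first verify the token’s signature against Cloudflare’s published public keys (the
<code>/cdn-cgi/access/certs</code> endpoint for your team domain), which is elided here:</p>

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the claims segment of a JWT *without* verifying it.

    In production, verify the signature against Cloudflare's public keys
    first; this only demonstrates where the custom claims live.
    """
    payload_b64 = token.split(".")[1]
    # JWT segments use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


# Build a toy token to demonstrate; the header and signature are fake.
token = ".".join([
    _b64url(b'{"alg":"RS256","typ":"JWT"}'),
    _b64url(json.dumps({"email": "user@example.com", "plex_token": "abc123"}).encode()),
    "fakesig",
])

claims = jwt_claims(token)
print(claims["plex_token"])  # → abc123
```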
<h2 id="keeping-the-bouncer-protected-too">Keeping the bouncer protected, too</h2>
<p>At this point, I managed to figure out how to protect my home infrastructure when used as an origin
server, as well as how to expose an authentication/authorization mechanism and have Cloudflare
protect our resources with it, but we still had the problem that our identity provider itself needed
to be publicly accessible in order for Cloudflare Access to reach it. Even if Cloudflare was also
fronting Authentik, we still had the same potential issue: what if Authentik has a vulnerability
that can be exploited prior to getting through the authentication/authorization flow?</p>
<p>I solved this by simply… running it on external infrastructure. I had been wanting to use
<a href="https://fly.io">Fly.io</a> for a while – I know one of the engineers there, it’s a cool product,
their blog posts are great, and they’re providing a ton of value to customers – and this struck me
as the perfect opportunity. Since we don’t generally need any advanced/complex redundancy or
resiliency – Plex is our primary authentication/authorization source, so all we need to do is have
a repeatable way to configure Authentik to use it – we could afford to run Authentik in a stripped-down
way on lower-end hardware. Fly.io was a great fit for this.</p>
<p>Right off the bat, Fly.io lets you run applications at the edge so long as you can bundle them up
into a Docker image. Authentik already had a Docker image, so that’s a solved problem. We also
needed to provide Authentik a database and cache: technically, I didn’t really care about long-term
storage of anything since we could suffice with in-memory storage of OAuth2 tokens, etc… but
Authentik is more general-purpose than that and <em>requires</em> Postgres and Redis.</p>
<p>Luckily, Fly.io has been launching managed services, either by providing the management tooling
themselves (Postgres) or acting as a marketplace for third-party providers to run their managed
services on (Redis, specifically Upstash Redis). Oh yeah, and Fly.io has a generous free tier for
not only deployed applications, but also for these managed services. <em>Nice.</em></p>
<p>I configured the requisite Postgres and Redis services on Fly.io, then crafted a Dockerfile (and
ultimately a <code>fly.toml</code> deployment configuration) for Authentik. Authentik has a feature they call
“blueprints” which allows defining its configuration (some parts of it, at least) as YAML files that
can be loaded at start… which I didn’t take advantage of initially but have been working to switch
over to. Blueprints are the starting point of “how do I reconfigure this if my Fly.io account
explodes somehow?”, or if I want to migrate all of this to another cloud provider.</p>
<p>After manually configuring Authentik to do all of the OAuth2/OpenID and Plex federated
authentication bits, I had one last step which was to configure some vanity DNS to stick Authentik
behind, and then a small configuration on the Cloudflare Access side to point to it. Much elided
tweaking and futzing and hair pulling later, I had Cloudflare Access using my isolated Authentik
deployment to authenticate and authorize users, before sending traffic over Cloudflare Tunnel to an
application hosted on my home infrastructure… and it was all free!</p>
<blockquote>
<p>I ended up purchasing some credits from Fly.io because their service is great and I wanted to spin
up some extra application deployments with resources beyond what the free tier provides.
Authentik also ended up using a lot of memory when running its Django-based migrations, which
would cause OOMs.</p>
<p>Ultimately, it should cost me like $5-7/month for my upsized VMs, but I’m still passively looking
into possible performance optimizations that could be contributed upstream to Authentik that would
allow dropping back down to the free tier VM sizing.</p>
</blockquote>
<h2 id="lets-pretend-for-just-a-moment">Let’s pretend for just a moment</h2>
<p>Much ink was spilled above, so let’s briefly recap the steps and setup we undertook here:</p>
<ul>
<li>I have an existing Plex server, shared with friends and family, which acts as the authentication
(and authorization) provider. Plex already has mechanisms to share itself (network-wise) outside
of what’s described in this blog post, but I was exposing another application that needs Plex
credentials.</li>
<li>I have an application inside my network, which is accessible on-network at
<code>http://app.cluster.local:5000</code>, which I want to expose externally at
<code>https://app.vanitydomain.com</code>.</li>
<li>I created a Cloudflare account and set it up to host DNS for <code>vanitydomain.com</code>.</li>
<li>I set up Cloudflare Tunnel to proxy traffic from <code>https://app.vanitydomain.com</code> to
<code>http://app.cluster.local:5000</code>, and deployed <code>cloudflared</code> internally to support that.</li>
<li>I created a Fly.io account and deployed Authentik, using Fly.io’s Managed Postgres and Managed
Redis (Upstash Redis) services, which I then put behind a vanity subdomain of
<code>https://app.auth.vanitydomain.com</code>.</li>
<li>I configured Authentik to use Plex as a federated authentication (and de-facto authorization)
source by allowing users to authenticate to Plex (Plex as in <code>plex.tv</code>, not my Plex server
specifically) which then provides de-facto authorization as it only allows authenticated Plex
users who also have access to my Plex server.</li>
<li>I configured Authentik to expose an OAuth2/OpenID Connect endpoint, which ultimately uses the
federated Plex authentication source as a passthrough, and additionally forwards data like their
user groups, Plex token, etc, in the OIDC claims.</li>
<li>I configured Cloudflare Access with a new authentication source that was pointed at our
Authentik-based OAuth2/OpenID IdP, located at <code>https://app.auth.vanitydomain.com</code>, with a no-op
authorization policy, as Authentik handles that for us.</li>
<li>I configured Cloudflare Access to expose/protect an application located at
<code>https://app.vanitydomain.com</code> using the authentication source we just configured.</li>
</ul>
<p>As a user, all they have to do is navigate to <code>https://app.vanitydomain.com</code>. When they do that for
the first time, Cloudflare Access sees that they have no existing cookie marking them as
authenticated, and so they enter the authentication and authorization flow. Cloudflare Access
redirects to an account-specific Cloudflare Access service provider endpoint (part of the
OAuth2/OpenID flow) which then sends them to <code>https://app.auth.vanitydomain.com</code> where they
authenticate with Plex. Once the Plex authentication happens, and things look good, they’re
redirected back to the account-specific Cloudflare Access service provider endpoint, which now does
the “authorization” and if that’s successful, sends them to <code>https://app.vanitydomain.com</code>, but with
a Cloudflare Access-specific URI path. This specific path is handled by Cloudflare, but crucially,
provides a secure mechanism for cookies to be set on our application domain, by Cloudflare, on our
behalf. (Neat!) Finally, the user is redirected to the original resource, now with their
authentication cookies sent along with the request, which lets Cloudflare Access know they’re
authenticated and authorized to access the given resource. Cloudflare Tunnel takes over from
here, routing the request over the tunnel, with the cookie/header from the Cloudflare Access side so
our application can access any of the authentication/authorization data that’s relevant, and finally
the user is interacting with the application.</p>
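<p>On the application side of that last hop, Cloudflare Access forwards the user’s identity to the
origin as a JWT in the <code>Cf-Access-Jwt-Assertion</code> header. A small sketch of pulling the
claims out of that token — this only <em>decodes</em> the payload for illustration; a real
deployment must verify the token’s signature against the public keys your Cloudflare Access team
publishes before trusting anything in it:</p>

```python
import base64
import json


def decode_access_claims(jwt: str) -> dict:
    """Decode the payload of a Cloudflare Access JWT (Cf-Access-Jwt-Assertion).

    WARNING: decode-only sketch. A real app must verify the signature against
    the keys served at https://<team>.cloudflareaccess.com/cdn-cgi/access/certs
    before trusting any claim in the token.
    """
    # A JWT is three base64url segments: header.payload.signature.
    payload_b64 = jwt.split(".")[1]
    # JWT segments are unpadded base64url; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))
```

<p>With the claims in hand, the application can key per-user behavior off the identity (e.g. an
email claim) without ever implementing a login flow of its own.</p>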
<h2 id="what-weve-learnedaccomplished">What we’ve learned/accomplished</h2>
<p>We set out to expose an application running on our home infrastructure without having to necessarily
expose it directly to the internet, and without messing with routers and firewalls. We also set out
to only allow authorized users to access said application, based on federated
authentication/authorization that these users had already onboarded with and that we ultimately
controlled.</p>
<p>We were able to do all of this without <em>any</em> traffic ever hitting our home infrastructure before the
user/request was authenticated and authorized, and without needing to host the
authentication/authorization endpoints on the very same home infrastructure, such that Cloudflare
and Fly.io bear almost all of the weight of any potential DDoS/intrusion attempts. If a bug in
Authentik were found and exploited, the authentication/authorization flow could be compromised,
potentially granting access to the protected application or our home infrastructure… but now we
can focus our time and energy on securing the Authentik deployment rather than also having to
harden every single application we want to expose.</p>
<p>Finally, but just as importantly: we were able to do this all for free*, with primarily open-source
software that can be examined and replaced if need be, save for the Tunnel/Access magic provided by
Cloudflare.</p>
<p>All in all, this isn’t the most convoluted infrastructure I’ve ever spun up, but it’s sure as
heck one of the more useful.</p>
<p>More to come.</p>