Finn Runtime: Platform Engineering Exercise

Welcome, and thanks for making the time. This exercise is meant to mirror the kind of work you'd actually do on the platform team here. There are no trick questions and no gotchas. We care far more about how you think than about whether every line of HCL is perfect.

Context

Finn is one of our products. It's a self-serve research assistant that pharma and biotech scientists use to answer questions like "what's the evidence for PD-1 expression as a prognostic biomarker in triple-negative breast cancer?" Under the hood, Finn is an agent runtime:

  1. A user submits a query.
  2. An LLM planner decomposes the query into steps.
  3. Worker agents execute those steps, calling tools (web search, internal APIs, database lookups, other LLMs).
  4. Results stream back to the user's browser over Server-Sent Events as tokens arrive.

A single session can run anywhere from 30 seconds to 10 minutes and costs real money in LLM calls.

Today's architecture (the problem)

Finn's runtime is a FastAPI service. Right now it's deployed as one ECS Fargate service with one task. Every user session runs in-process on that one task. When it OOMs, every in-flight session dies and every user sees their streaming response cut off mid-sentence. Deploys are worse: we restart the task and everyone loses their work.

Signups are growing. We need this production-ready.

Your job

Design and implement the infrastructure to run the Finn agent runtime reliably. You'll spend time on three things:

Part 1: Architecture (roughly 15 minutes, discussion)

Talk us through how you'd architect this. We'll be in the room and this is a conversation, not a monologue. Draw on the whiteboard, ask us questions, push back on the constraints if something seems off. Things worth thinking about:

  • How do you scale horizontally when sessions are long-lived and stateful?
  • What happens to an in-flight session during a deploy?
  • How do you stop a runaway agent from burning tokens?
  • What do you autoscale on? CPU is not the answer.
  • How does an engineer debug a failed session at 2am?
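
For the autoscaling question, one possible non-CPU signal (purely illustrative, not the expected answer) is a custom "active sessions per task" metric emitted by the app, wired to ECS target tracking. The metric name, namespace, and resource IDs below are all assumptions:

```hcl
# Illustrative sketch: scale the service on a custom CloudWatch metric
# instead of CPU. All names here are assumptions, not part of the exercise.
resource "aws_appautoscaling_target" "finn" {
  service_namespace  = "ecs"
  resource_id        = "service/finn-cluster/finn-runtime" # assumed cluster/service
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "sessions" {
  name               = "finn-active-sessions"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.finn.service_namespace
  resource_id        = aws_appautoscaling_target.finn.resource_id
  scalable_dimension = aws_appautoscaling_target.finn.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 8 # desired active sessions per task (assumed)

    customized_metric_specification {
      metric_name = "ActiveSessions" # assumed to be published by the app
      namespace   = "Finn/Runtime"
      statistic   = "Average"
    }

    scale_in_cooldown  = 300 # scale in slowly: sessions are long-lived
    scale_out_cooldown = 60
  }
}
```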

Part 2: Terraform (roughly 35 minutes, hands-on)

Build a Terraform module for the Finn runtime service. Scope:

  • An ECS Fargate service and task definition
  • An ALB with a target group and listener
  • IAM task role and execution role, scoped appropriately
  • A Secrets Manager reference for the LLM API key
  • Autoscaling configuration

Treat this like something you'd check into mithrl/infra and reuse across staging and prod. We're not going to apply anything. We'll run terraform init, validate, and plan together against dummy credentials to sanity-check the output.

A starter layout is in ./terraform/. Use it, ignore it, or restructure it. Your call.
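
To give a sense of the expected shape, a minimal module interface might look like the sketch below. Variable names, defaults, and the trimmed-down service resource are illustrative assumptions, not a template you must follow:

```hcl
# Illustrative module interface only; names and defaults are assumptions.
variable "environment" {
  description = "Deployment environment, e.g. staging or prod"
  type        = string
}

variable "private_subnet_ids" {
  description = "Private subnet IDs for the Fargate tasks"
  type        = list(string)
}

variable "image_tag" {
  description = "Tag of the finn-runtime image in ECR"
  type        = string
}

resource "aws_ecs_service" "finn" {
  name          = "finn-runtime-${var.environment}"
  launch_type   = "FARGATE"
  desired_count = 2
  # cluster, task_definition, network_configuration, and load_balancer
  # blocks omitted for brevity; a real module needs all of them.
}

output "service_name" {
  value = aws_ecs_service.finn.name
}
```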

Part 3: Operations (roughly 10 minutes, discussion)

Walk us through what happens after you merge this PR:

  • How does a new version get deployed without dropping sessions?
  • How do you know the service is healthy beyond "ALB returns 200"?
  • What's the on-call runbook when agents start failing?
  • How does a new hire on your team get a working dev environment?
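
On the first question, one lever worth knowing about (sketched here with assumed names and values, as a starting point rather than a recommendation) is the ALB target group's deregistration delay, which gives in-flight SSE sessions time to drain before a task is killed during a deploy:

```hcl
# Sketch: give long-lived streaming sessions time to drain on deploy.
# Port, health check path, and delay value are assumptions.
resource "aws_lb_target_group" "finn" {
  name                 = "finn-runtime"
  port                 = 8000
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  target_type          = "ip" # required for Fargate (awsvpc networking)
  deregistration_delay = 600  # seconds; sessions can run up to 10 minutes

  health_check {
    path     = "/healthz" # assumed app health endpoint
    interval = 15
  }
}
```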

Assumptions you can make

  • The Finn service container is already built and published to ECR at <account>.dkr.ecr.us-west-2.amazonaws.com/finn-runtime:<tag>.
  • A VPC with public and private subnets already exists. Their IDs will be passed in as variables.
  • The Anthropic API key is already in Secrets Manager at mithrl/finn/anthropic-api-key.
  • AWS region is us-west-2.
  • You have Terraform 1.6+ and the AWS CLI installed on this laptop. Credentials are dummy values, which is fine.
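
Wiring that existing secret into the task definition (rather than baking the key into an environment variable) could be sketched as follows; the family name, sizing, and environment variable name are assumptions:

```hcl
# Sketch: pass the API key to the container as a Secrets Manager reference.
# The ECS execution role also needs secretsmanager:GetSecretValue on this ARN.
data "aws_secretsmanager_secret" "anthropic" {
  name = "mithrl/finn/anthropic-api-key" # exists already, per the assumptions
}

resource "aws_ecs_task_definition" "finn" {
  family                   = "finn-runtime"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024 # assumed sizing
  memory                   = 2048
  # execution_role_arn and task_role_arn omitted; both are required in practice

  container_definitions = jsonencode([{
    name  = "finn-runtime"
    image = "<account>.dkr.ecr.us-west-2.amazonaws.com/finn-runtime:<tag>"
    secrets = [{
      name      = "ANTHROPIC_API_KEY" # assumed env var name the app reads
      valueFrom = data.aws_secretsmanager_secret.anthropic.arn
    }]
  }])
}
```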

What we're looking for

Clean, readable Terraform. Sensible defaults. Thoughtful variable and output design. Appropriate IAM scoping. Evidence that you've thought about the "day two" questions: deploys, observability, failure modes, dev loops. We care more about the quality of a smaller slice than a sprawling half-finished module.

Any questions before you start? Go for it.