Finn Runtime: Platform Engineering Exercise

Welcome, and thanks for making the time. This exercise is meant to mirror the kind of work you'd actually do on the platform team here. There are no trick questions and no gotchas. We care far more about how you think than about whether every line of HCL is perfect.

Context

Finn is one of our products. It's a self-serve research assistant that pharma and biotech scientists use to answer questions like "what's the evidence for PD-1 expression as a prognostic biomarker in triple-negative breast cancer?" Under the hood, Finn is an agent runtime:

  1. A user submits a query.
  2. An LLM planner decomposes the query into steps.
  3. Worker agents execute those steps, calling tools (web search, internal APIs, database lookups, other LLMs).
  4. Results stream back to the user's browser over Server-Sent Events as tokens arrive.

A single session can run anywhere from 30 seconds to 10 minutes and costs real money in LLM calls.

Today's architecture (the problem)

Finn's runtime is a FastAPI service. Right now it's deployed as one ECS Fargate service with one task. Every user session runs in-process on that one task. When it OOMs, every in-flight session dies and every user sees their streaming response cut off mid-sentence. Deploys are worse: we restart the task and everyone loses their work.

Signups are growing. We need this production-ready.

Your job

Design and implement the infrastructure to run the Finn agent runtime reliably. You'll spend time on three things:

Part 1: Architecture (roughly 15 minutes, discussion)

Talk us through how you'd architect this. We'll be in the room and this is a conversation, not a monologue. Draw on the whiteboard, ask us questions, push back on the constraints if something seems off. Things worth thinking about:

  • How do you scale horizontally when sessions are long-lived and stateful?
  • What happens to an in-flight session during a deploy?
  • How do you stop a runaway agent from burning tokens?
  • What do you autoscale on? CPU is not the answer.
  • How does an engineer debug a failed session at 2am?
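
For the autoscaling question, one possible non-CPU signal (purely illustrative, not the expected answer) is a custom "active sessions per task" metric emitted by the app, wired to ECS target tracking. The metric name, namespace, and resource IDs below are all assumptions:

```hcl
# Illustrative sketch: scale the service on a custom CloudWatch metric
# instead of CPU. All names here are assumptions, not part of the exercise.
resource "aws_appautoscaling_target" "finn" {
  service_namespace  = "ecs"
  resource_id        = "service/finn-cluster/finn-runtime" # assumed cluster/service
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "sessions" {
  name               = "finn-active-sessions"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.finn.service_namespace
  resource_id        = aws_appautoscaling_target.finn.resource_id
  scalable_dimension = aws_appautoscaling_target.finn.scalable_dimension

  target_tracking_scaling_policy_configuration {
    target_value = 8 # desired active sessions per task (assumed)

    customized_metric_specification {
      metric_name = "ActiveSessions" # assumed to be published by the app
      namespace   = "Finn/Runtime"
      statistic   = "Average"
    }

    scale_in_cooldown  = 300 # scale in slowly: sessions are long-lived
    scale_out_cooldown = 60
  }
}
```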

Part 2: Terraform (roughly 35 minutes, hands-on)

Build a Terraform module for the Finn runtime service. Scope:

  • An ECS Fargate service and task definition
  • An ALB with a target group and listener
  • IAM task role and execution role, scoped appropriately
  • A Secrets Manager reference for the LLM API key
  • Autoscaling configuration

Treat this like something you'd check into mithrl/infra and reuse across staging and prod. We're not going to apply anything. We'll run terraform init, validate, and plan together against dummy credentials to sanity-check the output.

A starter layout is in ./terraform/. Use it, ignore it, or restructure it. Your call.
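
To give a sense of the expected shape, a minimal module interface might look like the sketch below. Variable names, defaults, and the trimmed-down service resource are illustrative assumptions, not a template you must follow:

```hcl
# Illustrative module interface only; names and defaults are assumptions.
variable "environment" {
  description = "Deployment environment, e.g. staging or prod"
  type        = string
}

variable "private_subnet_ids" {
  description = "Private subnet IDs for the Fargate tasks"
  type        = list(string)
}

variable "image_tag" {
  description = "Tag of the finn-runtime image in ECR"
  type        = string
}

resource "aws_ecs_service" "finn" {
  name          = "finn-runtime-${var.environment}"
  launch_type   = "FARGATE"
  desired_count = 2
  # cluster, task_definition, network_configuration, and load_balancer
  # blocks omitted for brevity; a real module needs all of them.
}

output "service_name" {
  value = aws_ecs_service.finn.name
}
```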

Part 3: Operations (roughly 10 minutes, discussion)

Walk us through what happens after you merge this PR:

  • How does a new version get deployed without dropping sessions?
  • How do you know the service is healthy beyond "ALB returns 200"?
  • What's the on-call runbook when agents start failing?
  • How does a new hire on your team get a working dev environment?
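
On the first question, one lever worth knowing about (sketched here with assumed names and values, as a starting point rather than a recommendation) is the ALB target group's deregistration delay, which gives in-flight SSE sessions time to drain before a task is killed during a deploy:

```hcl
# Sketch: give long-lived streaming sessions time to drain on deploy.
# Port, health check path, and delay value are assumptions.
resource "aws_lb_target_group" "finn" {
  name                 = "finn-runtime"
  port                 = 8000
  protocol             = "HTTP"
  vpc_id               = var.vpc_id
  target_type          = "ip" # required for Fargate (awsvpc networking)
  deregistration_delay = 600  # seconds; sessions can run up to 10 minutes

  health_check {
    path     = "/healthz" # assumed app health endpoint
    interval = 15
  }
}
```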

Assumptions you can make

  • The Finn service container is already built and published to ECR at <account>.dkr.ecr.us-west-2.amazonaws.com/finn-runtime:<tag>.
  • A VPC with public and private subnets already exists. Their IDs will be passed in as variables.
  • The Anthropic API key is already in Secrets Manager at mithrl/finn/anthropic-api-key.
  • AWS region is us-west-2.
  • You have Terraform 1.6+ and the AWS CLI installed on this laptop. Credentials are dummy values, which is fine.
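
Wiring that existing secret into the task definition (rather than baking the key into an environment variable) could be sketched as follows; the family name, sizing, and environment variable name are assumptions:

```hcl
# Sketch: pass the API key to the container as a Secrets Manager reference.
# The ECS execution role also needs secretsmanager:GetSecretValue on this ARN.
data "aws_secretsmanager_secret" "anthropic" {
  name = "mithrl/finn/anthropic-api-key" # exists already, per the assumptions
}

resource "aws_ecs_task_definition" "finn" {
  family                   = "finn-runtime"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024 # assumed sizing
  memory                   = 2048
  # execution_role_arn and task_role_arn omitted; both are required in practice

  container_definitions = jsonencode([{
    name  = "finn-runtime"
    image = "<account>.dkr.ecr.us-west-2.amazonaws.com/finn-runtime:<tag>"
    secrets = [{
      name      = "ANTHROPIC_API_KEY" # assumed env var name the app reads
      valueFrom = data.aws_secretsmanager_secret.anthropic.arn
    }]
  }])
}
```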

What we're looking for

Clean, readable Terraform. Sensible defaults. Thoughtful variable and output design. Appropriate IAM scoping. Evidence that you've thought about the "day two" questions: deploys, observability, failure modes, dev loops. We care more about the quality of a smaller slice than a sprawling half-finished module.

Any questions before you start? Go for it.