Airbnb Engineering & Data Science

It Wasn’t a Culture Problem: Upleveling Alert Development at Airbnb

laurenmackevich — Wed, 04 Mar 2026 18:50:58 +0000

Observability as Code (OaC) — defining alerts, dashboards, and SLOs via code rather than UI — is table stakes for large engineering organizations. With OaC, observability adopts software development’s version control, code review, and testing processes, achieving the same level of discipline as a result. At Airbnb’s scale (thousands of engineers and services), this is the foundation that lets teams ship confidently while maintaining the reliability our guests and hosts depend on.

Yet there’s a critical gap in most OaC workflows. While we bring rigor to alert definitions through code review and version control, the actual behavior of those alerts often can’t be validated until they’re live. Production becomes the proving ground. Problems surface either as noise that erodes trust or silence that hides real incidents.

This tolerance of high alert noise might appear to be a culture problem, but we realized it was actually a gap in the developer workflow. We solved it by building accessible, fast feedback loops to preview, validate, and surface actionable insights on alert behavior before PR submission. With these changes, development cycles collapsed from weeks to minutes, and we successfully migrated 300,000 alerts from a vendor to Prometheus, a feat that wouldn’t have been possible otherwise.

Airbnb’s OaC North Star

Our Observability as Code North Star is for product teams to receive out-of-the-box, best-practice monitoring from platform teams. When a product engineer adopts Kubernetes, a service framework, or a database, they should inherit battle-tested alerts, dashboards, and SLOs. The best monitoring gives product engineers the benefit of all of Airbnb’s infra and platform domain expertise immediately. We call this “zero touch.”

We began our OaC journey more than 10 years ago when we built Interferon, starting with 1,000 alerts. Today, we manage 300,000. By any measure, Interferon was a success, scaling our monitoring practices towards our North Star.

However, this success introduced new operational challenges. With so many alerts in production, validating any OaC changes became costly, and it became harder to iterate. Engineers faced a difficult tradeoff: tuning an alert template might reduce noise, but it also risked losing an important signal. Without a way to preview alert behavior, the safer choice was often to leave things as-is.

The problem: Traditional code review can’t validate alert behavior

The problem wasn’t Interferon itself, but rather a development workflow gap. Our North Star requires platform teams to define and maintain monitoring patterns at scale, but we lacked the necessary tooling to effectively validate those patterns against reality.

Traditional code review can validate syntax and logic, and unit tests can verify outputs. But neither can answer the questions that matter most: “How will these alerts behave in production? What noise might they generate? Will they needlessly wake up on-call engineers at 3 AM?”

That’s why you have to validate alerts against real-world data. If your assumptions about the data are wrong, your alert is wrong.

However, off-the-shelf query visualization tools are insufficient for the job. They don’t account for alert-specific parameters, and most notably, they show downsampled data that masks actual alert behavior — step sizes don’t match evaluation intervals. Also, the further you move from static configs, the harder validation becomes. With templating, reviewers must manually copy-paste queries and fill in variables. For a single alert, this is tedious and error-prone; for changes affecting hundreds of services, it’s impossible.

So in practice, developers often had to resort to a weeks-long process of deploying a new alert side-by-side with the existing one, waiting for real-world data to come in, validating, and then iterating.

The solution: Making alert behavior visible

What if an infrastructure engineer could validate a Kubernetes alert template, one that fans out to thousands of services, in 30 seconds instead of 30 days?

That question among others prompted us to rethink and rebuild our OaC platform from the ground up. Building on top of Prometheus’ open source foundations, we could develop the exact UX our engineers needed, particularly local diffs and pre-deployment validation.

The core of this workflow is local-first development: The same code and inputs run in production must run identically on a developer’s laptop and in CI. In addition, we built Change Reports that show how alerts will be modified and bulk backtesting that simulates alerts against historical data.

We rolled out this platform incrementally, with each milestone providing compounding value.

Phase 1: Text-based diffs everywhere

An example alert diff in markdown format

We first generated markdown alert diffs with field-level granularity and query links — the “terraform plan” of OaC. We met developers where they work, in terminal via CLI and in PRs via CI. This solved the basic visibility problem: engineers could finally review the OaC generated alerts without error-prone copy-pasting.

Phase 2: Dedicated Change Report UI

The left image lists all alert modifications resulting from a code change. The right image dives into a specific alert.

We then built a Change Report UI showing side-by-side alert diffs exactly as they will appear in production, removing the guesswork and mental mapping between config and UI. However, the user was still responsible for mentally simulating alert behavior, which is challenging even for Prometheus experts to get right.

Phase 3: Historical simulation via bulk backtesting

The backtesting integration into the change list (left) and individual alert (right) views

Finally, we built a backtesting system that runs proposed alerts against historical data, hooking directly into Prometheus’s rule manager. Backtesting allows users to understand which alerts would have fired, when, and why, as if they had existed the entire time. Displaying this simulated state inline in the Change Report UI answered the question that matters most: “How would this alert behave in production?”

We backtest in bulk for the entire diff — hundreds or thousands of related alerts — and surface quality signals to help reviewers focus their attention. We compute a “noisiness” metric and show alert firing timelines in the table view, letting users sort by potential problems and focus their effort.

Putting it all together

On the new platform, when a user makes an OaC change, they will now generate a Change Report via CLI or CI. We post a Change Report on all PRs.

The user reviews their changes via the UI. In this example, a one week backtest was conducted, and the changes are sorted by noisiness, seen in the “Tuning” column, to help direct users’ attention.

The user can dive into individual alerts to learn more. This one looks to be problematic, firing once per day.

The user can set overrides. Given the graph, 1.14 looks like a better trigger threshold.

They can then see the impact of their changes. No more alert firings. This looks good and is ready to ship.

Our learnings

Compatibility over novelty

A key architectural decision was compatibility over novelty. Rule groups are Prometheus’ standardized format. By taking that as input, we hooked directly into Prometheus’s rule evaluation engine rather than reimplementing it. We wrote results as Prometheus time series blocks, exposing the data via the standard query API. This meant building analysis tools once. This standardization made our system portable, allowing us to reach all developers in their existing workflows.

Guardrails aren’t optional

Simulating thousands of alerts over 30 days quickly and without service degradations required careful design. Each backtest runs in its own Kubernetes pod with autoscaling to prevent resource contention. Concurrency limits, error thresholds, and multiple circuit breakers prevent cascading failures. A backtesting system that can destabilize production is worse than no system at all.

Perfect is the enemy of shipped

Our simulator doesn’t account for recording rule dependencies. We could have built a more sophisticated dependency resolver, but users can separate this into two distinct tasks — modify the recording rule first, then backtest alerts that depend on it — assuming they know they should. The Change Report UI helps, because when modified dependencies are detected, it highlights them and prompts resolution. This turns a technical limitation into a guided workflow. We shipped the 80% solution that immediately delivers value, leveraging the UI to close gaps.

Own the full surface area

Monitoring is often an afterthought — engineers have limited time under tight deadlines. Our job is to make that time as effective as possible. Prometheus is powerful but exposes a low-level API that requires expertise most engineers don’t have. To achieve our North Star, we introduced abstractions like anomaly detection, burn rate alerts, and change detection. But abstractions only simplify things when you own all the touchpoints: the input language engineers write, the generation process, the UI that displays results, and the validation tools that provide feedback. Partial ownership creates leaky abstractions. Full ownership lets us ruthlessly optimize for developer experience.

The impact

Successful migration from a vendor to Prometheus

We migrated 300,000 alerts from a vendor to Prometheus. Rewriting every alert would have been impossible with our old workflow, but we achieved it thanks to our Change Report UI, bulk backtesting, and an additional vendor-specific integration. By codifying our domain knowledge in the UX, what originally promised to be a multi-year slog of manual effort became a structured, confident migration.

Collapsed development cycles

The typical developer workaround for making alert changes — deploy side-by-side, wait, then iterate — became obsolete. Engineers now make and validate alert changes all within a single PR. Platform teams confidently deploy template changes affecting thousands of services. What once took a month of iteration now takes an afternoon.

Culture transformation

Even though we realized we had a workflow problem, not a culture problem, solving this problem still ended up transforming our culture. We reduced companywide alert noise by 90%, and engineers stopped tolerating noisy alerts and started competing to improve them. Platform teams resumed iterating on shared patterns. Alert hygiene became a point of pride, not a chore to avoid.

Turnkey alert testing is the biggest positive improvement to alerts management in Airbnb’s history.

I’ve seen people spend hours debating the merits of changing an alert only to do nothing because of fear, uncertainty, and doubt. This new alert testing capability completely evaporates the stop energy and allows us to monitor with confidence.

– Gregory Szorc, Senior Staff Software Engineer

Conclusion

Local-first development, Change Reports, and bulk backtesting give us the necessary tools to incrementally reach our North Star. Platform teams can now confidently iterate on monitoring for their domains. Zero touch is becoming how we operate, one cycle at a time.

Now that we’ve introduced pre-deployment visibility and validation to the alert lifecycle, our next step is to introduce that same rigor to on-call analysis.

If this type of work interests you, check out some of our open roles.

Acknowledgements

Thank you to the Reliability Experience team — Kevin Goodier, Harry Shoff, Rich Unger, and Vlad Vassiliouk — and our partners across the company who helped make this a reality.

Improving Search Ranking for Maps

laurenmackevich — Wed, 25 Feb 2026 19:13:07 +0000

Malay Haldar, Hongwei Zhang, Kedar Bellare Sherry Chen

Search is the core mechanism that connects guests with Hosts at Airbnb. Results from a guest’s search for listings are displayed through two interfaces: (1) as a list of rectangular cards that contain the listing image, price, rating, and other details on it, referred to as list-results and (2) as oval pins on a map showing the listing price, called map-results. Since its inception, the core of the ranking algorithm that powered both these interfaces was the same — ordering listings by their booking probabilities and selecting the top listings for display.

But some of the basic assumptions underlying ranking, built for a world where search results are presented as lists, simply break down for maps.

What Is Different About Maps?

The central concept that drives ranking for list-results is that user attention decays starting from the top of the list, going down towards the bottom. A plot of rank vs click-through rates in Figure 1 illustrates this concept. X-axis represents the rank of listings in search results. Y-axis represents the click-through rate (CTR) for listings at the particular rank.

Figure 1: Click-through rates by listing search rank

To maximize the connections between guests and Hosts, the ranking algorithm sorts listings by their booking probabilities based on a number of factors and sequentially assigns their position in the list-results. This often means that the larger a listing’s booking probability, the more attention it receives from searchers.

But in map-results, listings are scattered as pins over an area (see Figure 2). There is no ranked list, and there is no decay of user attention by ranking position. Therefore, for listings that are shown on the map, the strategy of sorting by booking probabilities is no longer applicable.

Figure 2: Map results

Uniform User Attention

To adapt ranking to the map interface, we look at new ways of modeling user attention flow across a map. We start with the most straightforward assumption that user attention is spread equally across the map pins. User attention is a very precious commodity and most searchers only click through a few map pins (see Figure 3). A large number of pins on the map means those limited clicks may miss discovering the best options available. Conversely, limiting the number of pins to the topmost choices increases the probability of the searcher finding something suitable, but runs the risk of removing their preferred choice.

Figure 3: Number of distinct map pins clicked by percentage of searchers

We test this hypothesis, controlled by a parameter . The parameter serves as an upper bound on the ratio of the highest booking probability vs the lowest booking probability when selecting the map pins. The bounds set by the parameter controls the booking probability of the listings behind the map pins. The more restricted the bounds, the higher the average booking probability of the listings presented as map pins. Figure 4 summarizes the results from A/B testing a range of parameters.

The reduction in the average impressions to discovery metric in Figure 4 denotes the fewer number of map pins a searcher has to process before clicking the listing that they eventually book. Similarly, the reduction in average clicks to discovery shows the fewer number of map pins a searcher has to click through to find the listing they booked.

Figure 4: Exploring through online A/B experiments

Launching the restricted version resulted in one of the largest bookings improvement in Airbnb ranking history. More importantly, the gains were not only for bookings, but for quality bookings. This could be seen by the increase in trips that resulted in 5-star rating after the stay from the treatment group, in comparison to trips from the control group.

Tiered User Attention

In our next iteration of modeling user attention, we separate the map pins into two tiers. The listings with the highest booking probabilities are displayed as regular oval pins with price. Listings with comparatively lower booking probabilities are displayed as smaller ovals without price, referred to as mini-pins (Figure 5). By design, mini-pins draw less user attention, with click-through rates about 8x less than regular pins.

Figure 5: Oval pins with price and mini-pins

This comes in handy particularly for searches on desktop where 18 results are shown in a grid on the left, each of them requiring a map pin on the right (Figure 6).

Figure 6: Search results on desktop

The number of map pins is fixed in this case, and limiting them, as we did in the previous section, is not an option. Creating the two tiers prioritizes user attention towards the map pins with the highest probabilities of getting booked. Figure 7 shows the results of testing the idea through an online A/B experiment.

Figure 7: Experiment results for tiered map pins

Discounted User Attention

In our final iteration, we refine our understanding of how user attention is distributed over the map by plotting the click-through rate of map pins located at different coordinates on the map. Figure 8 shows these plots for the mobile (top) and the desktop apps (bottom).

Figure 8: Click-through rates of map pins across map coordinates.

To maximize the chances that a searcher will discover the listings with the highest booking probabilities, we design an algorithm that re-centers the map such that the listings with the highest booking probabilities appear closer to the center. The steps of this algorithm are illustrated in Figure 9, where a range of potential coordinates are evaluated and the one which is closer to the listings with the highest booking probabilities is chosen as the new center.

Figure 9: Algorithm for finding optimal center

When tested in an online A/B experiment, the algorithm improved uncancelled bookings by 0.27%. We also observed a reduction of 1.5% in map moves, indicating less effort from the searchers to use the map.

Conclusion

Users interact with maps in a way that’s fundamentally different from interacting with items in a list. By modeling the user interaction with maps in a progressively sophisticated manner, we were able to improve the user experience for guests in the real world. However, the current approach has a challenge that remains unsolved: how can we represent the full range of available listings on the map? This is part of our future work. A more in-depth discussion of the topics covered here, along with technical details, is presented in our research paper that was published at the KDD ’24 conference. We welcome all feedback and suggestions.

If this type of work interests you, we encourage you to apply for an open position today.

Building a Next-Generation Key-Value Store at Airbnb

laurenmackevich — Wed, 25 Feb 2026 18:46:09 +0000

How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.

By Shravan Gaonkar, Chandramouli Rangarajan, Yanhan Zhang

How we completely rearchitected Mussel, our storage engine for derived data, and lessons learned from the migration from Mussel V1 to V2.

Airbnb’s core key-value store, internally known as Mussel, bridges offline and online workloads, providing highly scalable bulk load capabilities combined with single-digit millisecond reads.

Since first writing about Mussel in a 2022 blog post, we have completely deprecated the storage backend of the original system (what we now call Mussel v1) and have replaced it with a NewSQL backend which we are referring to as Mussel v2. Mussel v2 has been running successfully in production for a year, and we wanted to share why we undertook this rearchitecture, what the challenges were, and what benefits we got from it.

Why rearchitect

Mussel v1 reliably supported Airbnb for years, but new requirements — real-time fraud checks, instant personalization, dynamic pricing, and massive data — demand a platform that combines real-time streaming with bulk ingestion, all while being easy to manage.

Key Challenges with v1

Mussel v2 solves a number of issues with v1, delivering a scalable, cloud-native key-value store with predictable performance and minimal operational overhead.

Operational complexity: Scaling or replacing nodes required multi-step Chef scripts on EC2; v2 uses Kubernetes manifests and automated rollouts, reducing hours of manual work to minutes.
Capacity & hotspots: Static hash partitioning sometimes overloaded nodes, leading to latency spikes. V2’s dynamic range sharding and presplitting keep reads fast (p99 < 25ms), even for 100TB+ tables.
Consistency flexibility: v1 offered limited consistency control. v2 lets teams choose between immediate or eventual consistency based on their SLA needs.
Cost & Transparency: Resource usage in v1 was opaque. v2 adds namespace tenancy, quota enforcement, and dashboards, providing cost visibility and control.

New architecture

Mussel v2 is a complete re-architecture addressing v1’s operational and scalability challenges. It’s designed to be automated, maintainable, and scalable, while ensuring feature parity and an easy migration for 100+ existing user cases.

Dispatcher

In Mussel v2, the Dispatcher is a stateless, horizontally-scalable Kubernetes service that replaces the tightly coupled, protocol-specific design of v1. It translates client API calls into backend queries/mutations, supports dual-write and shadow-read modes for migration, manages retries and rate limits, and integrates with Airbnb’s service mesh for security and service discovery.

Reads are simplified: Each dataname maps to a logical table, enabling optimized point lookups, range/prefix queries, and stale reads from local replicas to reduce latency. Dynamic throttling and prioritization maintain performance under changing traffic.

Writes are persisted in Kafka for durability first, with the Replayer and Write Dispatcher applying them in order to the backend. This event-driven model absorbs bursts, ensures consistency, and removes v1’s operational overhead. Kafka also underpins upgrades, bootstrapping, and migrations until CDC and snapshotting mature.

The architecture suits derived data and replay-heavy use cases today, with a long-term goal of shifting ingestion and replication fully to the distributed backend database to bring down latency and simplify operations.

Bulk load
Bulk load remains essential for moving large datasets from offline warehouses into Mussel for low-latency queries. v2 preserves v1 semantics, supporting both “merge” (add to existing tables) and “replace” (swap datasets) semantics.

To maintain a familiar interface, v2 keeps the existing Airflow-based onboarding and transforms warehouse data into a standardized format, uploading to S3 for ingestion. Airflow is an open-source platform for authoring, scheduling, and monitoring data pipelines. Created at Airbnb, it lets users define workflows in code as directed acyclic graphs (DAGs), enabling quick iteration and easy orchestration of tasks for data engineers and scientists worldwide.

A stateless controller orchestrates jobs, while a distributed, stateful worker fleet (Kubernetes StatefulSets) performs parallel ingestion, loading records from S3 into tables. Optimizations — like deduplication for replace jobs, delta merges, and insert-on-duplicate-key-ignore — ensure high throughput and efficient writes at Airbnb scale.

TTL

Automated data expiration (TTL) can help support data governance goals and storage efficiency. In v1, expiration relied on the storage engine’s compaction cycle, which struggled at scale.

Mussel v2 introduces a topology-aware expiration service that shards data namespaces into range-based subtasks processed concurrently by multiple workers. Expired records are scanned and deleted in parallel, minimizing sweep time for large datasets. Subtasks are scheduled to limit impact on live queries, and write-heavy tables use max-version enforcement with targeted deletes to maintain performance and data hygiene.

These enhancements provide the same retention functionality as v1 but with far greater efficiency, transparency, and scalability, meeting Airbnb’s modern data platform demands and enabling future use cases.

The migration process

Challenge

Mussel stores vast amounts of data and serves thousands of tables across a wide array of Airbnb services, sustaining mission-critical read and write traffic at high scale. Given the criticality of Mussel to Airbnb’s online traffic, our migration goal was straightforward but challenging: Move all data and traffic from Mussel v1 to v2 with zero data loss and no impact on availability to our customers.

Process

We adopted a blue/green migration strategy, but with notable complexities. Mussel v1 didn’t provide table-level snapshots or CDC streams, which are standard in many datastores. To bridge this gap, we developed a custom migration pipeline capable of bootstrapping tables to v2, selected by usage patterns and risk profiles. Once bootstrapped, dual writes were enabled on a per-table basis to keep v2 in sync as the migration progressed.

The migration itself followed several distinct stages:

Blue Zone: All traffic initially flowed to v1 (“Blue”). This provided a stable baseline as we migrated data behind the scenes.
Shadowing (Green): Once tables were bootstrapped, v2 (“Green”) began shadowing v1 — handling reads/writes in parallel, but only v1 responded. This allowed us to check v2’s correctness and performance without risk.
Reverse: After building confidence, v2 took over active traffic while v1 remained on standby. We built automatic circuit breakers and fallback logic: If v2 showed elevated error rates or lagged behind v1, we could instantly return traffic to v1 or revert to shadowing.
Cutover: When v2 passed all checks, we completed the cutover on a dataname-by-dataname basis, with Kafka serving as a robust intermediary for write reliability throughout.

To further de-risk the process, migration was performed one table at a time. Every step was reversible and could be fine-tuned per table or group of tables based on their risk profile. This granular, staged approach allowed for rapid iteration, safe rollbacks, and continuous progress without impacting the business.

Migration pipeline

As described in our previous blog post, the v1 architecture uses Kafka as a replication log — data is first written to Kafka, then consumed by the v1 backend. During the data migration to v2, we leveraged the same Kafka stream to maintain eventual consistency between v1 and v2.

To migrate any given table from v1 to v2, we built a custom pipeline consisting of the following steps:

Source data sampling: We download backup data from v1, extract the relevant tables, and sample the data to understand its distribution.
Create pre-split table on v2: Based on the sampling results, we create a corresponding v2 table with a pre-defined shard layout to minimize data reshuffling during migration.
Bootstrap: This is the most time-consuming step, taking hours or even days depending on table size. To bootstrap efficiently, we use Kubernetes StatefulSets to persist local state and periodically checkpoint progress.
Checksum verification: We verify that all data from the v1 backup has been correctly ingested into v2.
Catch-up: We apply any lagging messages that accumulated in Kafka during the bootstrap phase.
Dual writes: At this stage, both v1 and v2 consume from the same Kafka topic. We ensure eventual consistency between the two, with replication lag typically within tens of milliseconds.

Once data migration is complete and we enter dual write mode, we can begin the read traffic migration phase. During this phase, our dispatcher can be dynamically configured to serve read requests for specific tables from v1, while sending shadow requests to v2 for consistency checks. We then gradually shift to serving reads from v2, accompanied by reverse shadow requests to v1 for consistency checks, which also enables quick fallback to v1 responses if v2 becomes unstable. Eventually, we fully transition to serving all read traffic from v2.

Lessons learned

Several key insights emerged from this migration:

Consistency complexity: Migrating from an eventually consistent (v1) to a strongly consistent (v2) backend introduced new challenges, particularly around write conflicts. Addressing these required features like write deduplication, hotkey blocking, and lazy write repair — sometimes trading off storage cost or read performance.
Presplitting is critical: As we shifted from hash-based (v1) to range-based partitioning (v2), inserting large consecutive data could cause hotspots and disrupt our v2 backend. To prevent this, we needed to accurately sample the v1 data and presplit it into multiple shards based on v2’s topology, ensuring balanced ingestion traffic across backend nodes during data migration.
Query model adjustments: v2 doesn’t push down range filters as effectively, requiring us to implement client-side pagination for prefix and range queries.
Freshness vs. cost: Different use cases required different tradeoffs. Some prioritized data freshness and used primary replicas for the latest reads, while others leveraged secondary replicas to balance staleness with cost and performance.
Kafka’s role: Kafka’s proven stable p99 millisecond latency made it an invaluable part of our migration process.
Building in flexibility: Customer retries and routine bulk jobs provided a safety net for the rare inconsistencies, and our migration design allowed for per-table stage assignments and instant reversibility — key for managing risk at scale.

As a result, we migrated more than a petabyte of data across thousands of tables with zero downtime or data loss, thanks to a blue/green rollout, dual-write pipeline, and automated fallbacks — so the product teams could keep shipping features while the engine under them evolved.

Conclusion and next steps

What sets Mussel v2 apart is the way it fuses capabilities that are usually confined to separate, specialized systems. In our deployment of Mussel V2, we observe that this system can simultaneously

ingest tens of terabytes in bulk data upload,
sustain 100 k+ streaming writes per second in the same cluster, and
keep p99 reads under 25 ms

— all while giving callers a simple dial to toggle stale reads on a per-namespace basis. By pairing a NewSQL backend with a Kubernetes-native control plane, Mussel v2 delivers the elasticity of object storage, the responsiveness of a low-latency cache, and the operability of modern service meshes — rolled into one platform. Engineers no longer need to stitch together a cache, a queue, and a datastore to hit their SLAs; Mussel provides those guarantees out of the box, letting teams focus on product innovation instead of data plumbing.

Looking ahead, we’ll be sharing deeper insights into how we’re evolving quality of service (QoS) management within Mussel, now orchestrated cleanly from the Dispatcher layer. We’ll also describe our journey in optimizing bulk loading at scale — unlocking new performance and reliability wins for complex data pipelines. If you’re passionate about building large-scale distributed systems and want to help shape the future of data infrastructure at Airbnb, take a look at our Careers page — we’re always looking for talented engineers to join us on this mission.

References

https://medium.com/airbnb-engineering/mussel-airbnbs-key-value-store-for-derived-data-406b9fa1b296

From Static Rate Limiting to Adaptive Traffic Management in Airbnb’s Key-Value Store

laurenmackevich — Wed, 25 Feb 2026 18:44:11 +0000

How Airbnb hardened Mussel, our key-value store, with smarter traffic controls to stay fast and reliable during traffic spikes.

By Shravan Gaonkar, Casey Getz, Wonhee Cho

Introduction

Every request lookup on Airbnb, from stays, experiences, and services search to customer support inquiries ultimately hits Mussel, our multi-tenant key-value store for derived data. Mussel operates as a proxy service, deployed as a fleet of stateless dispatchers — each a Kubernetes pod. On a typical day, this fleet handles millions of predictable point and range reads. During peak events, however, it must absorb several-fold higher volume, terabyte-scale bulk uploads, and sudden bursts from automated bots or DDoS attacks. Its ability to reliably serve this volatile mix of traffic is therefore critical to both the Airbnb user experience and the stability of the many services that power our platform.

Given Mussel’s traffic volume and its role in core Airbnb flows, quality of service (QoS) is one of the product’s defining features. The first-generation QoS system was primarily an isolation tool. It relied on a Redis-backed counter, client quota based rate-limiter, that checked a caller’s requests per second (QPS) against a configurable fixed quota. The goal was to prevent a single misbehaving client from overwhelming the service and causing a complete outage. For this purpose, it was simple and effective.

However, as the service matured, our goal shifted from merely preventing meltdowns to maximizing goodput — that is, getting the most useful work done without degrading performance. A system of fixed, manually configured quotas can’t achieve this, as it can’t adapt in real time to shifting traffic patterns, new query shapes, or sudden threats like a DDoS attack. A truly effective QoS system needs to be adaptive, automatically exerting prioritized backpressure when it senses the system has reached its useful capacity.

To better match our QoS system to the realities of online traffic and maximize goodput, over time we evolved it to add several new layers.

Resource-aware rate control (RARC): Charges each request in request units (RU) that reflect rows, bytes, and latency, not just counts.
Load shedding with criticality tiers: Guarantees that high-priority traffic (e.g., customer support, trust and safety) stays responsive when capacity evaporates.
Hot-key detection & DDoS mitigation: Detects skewed access patterns in real time and then shields the backend — whether the surge is legitimate or a DDoS burst — by caching or coalescing the duplicate requests before they reach the storage layer.

What follows is an engineer’s view of how these layers were designed, deployed, and battle-tested, and why the same ideas may apply to any multi-tenant system that has outgrown simple QPS limits.

Progression Timeline

Background: Life with Client Quota Rate Limiter

When Mussel launched, rate-limiting was entirely handled via simple QPS rate-limiting using a Redis-based distributed counter service. Each caller received a static, per-minute quota, and the dispatcher incremented a Redis key for every incoming request. If the key’s value exceeded the caller’s quota, the dispatcher returned an HTTP 429. The design was simple, predictable, and easy to operate.

Two architectural details made this feasible. First, Mussel and its storage engine were tightly coupled; backend effort correlated reasonably well with the number of calls at the front door. Second, the traffic mix was modest in size and variety, so a single global limit per caller rarely caused trouble.

As adoption grew, two limitations became clear.

Cost variance: A one-row lookup and a 100,000-row scan were treated equally, even though their load on the backend differed by orders of magnitude. The system couldn’t distinguish high-value cheap work from low-value expensive work.
Traffic skew: Per-caller rate limits provided isolation at the client level, but were blind to the data’s access pattern. When a single key became “hot” — for example, a popular listing accessed by thousands of different callers simultaneously — the aggregate traffic could overwhelm the underlying storage shard, even if each individual caller remained within its quota. This created a localized bottleneck that degraded performance for the entire cluster, impacting clients requesting completely unrelated data. Isolation by caller was insufficient to prevent this kind of resource contention.

Addressing these gaps meant shifting from a request-counting mindset to a resource-accounting mindset and designing controls that reflect the real cost of each operation.

Resource-aware rate control

A fair quota system must account for the real work a request imposes on the storage layer. Resource-aware rate control (RARC) meets this need by charging operations in request units (RU) rather than raw requests per second.

A request unit blends four observable factors: fixed per-call overhead, rows processed, payload bytes, and — crucially — latency. Latency captures effects that rows and bytes alone miss: two one-megabyte reads can differ greatly in cost if one hits cache and the other triggers disk. In practice, we use a linear model. For both reads and writes, the cost is:


RU_read = 1 + w_r × bytes_read + w_l × latency_ms
RU_write = 6 + w_b × bytes_written / 4096 bytes + w_l × latency_ms

Weight factors w_r, w_b, and w_l come from load-test calibration 
based on the compute, network and disk I/O. 
bytes_read, bytes_written and latency is measured per request

Although approximate, the formula separates operations whose surface metrics look similar yet load the backend very differently.

Impact of Latency on RU computation

Each dispatcher continues to rely on rate-limiter for distributed counting, but the counter now represents request-unit tokens instead of raw QPS. At the start of every epoch, the dispatcher adds the caller’s static RU quota to a local token bucket and immediately debits that bucket by the RU cost of each incoming request. When the bucket is empty, the request is rejected with HTTP 419. Because all dispatchers follow the same procedure and epochs are short, their buckets remain closely aligned without additional coordination.

Adaptive protection is handled in the separate load-shedding layer; backend latency influences which traffic is dropped or delayed, not the size of the periodic RU refill. This keeps rate accounting straightforward — static quotas expressed in request units — while still reacting quickly when the storage layer shows signs of stress.

Load shedding: Staying healthy when capacity evaporates or develops hotspots

Rate limits based on request units excel at smoothing normal traffic, but they adjust on a scale of seconds. When the workload shifts faster — a bot floods a key, a shard stalls, or a batch job begins a full-table scan — those seconds are enough for queues to balloon and service-level objectives to slip. To bridge this reaction-time gap, Mussel uses a load-shedding safety net that combines three real-time signals: (1) traffic criticality, (2) a latency ratio, and (3) a CoDel-inspired queueing policy.

The latency ratio is a ratio that serves as a real-time indicator of stress on the system stress. Each dispatcher computes this ratio by dividing the long-term p95 latency by the short-term p95 latency. A stable system has a ratio near 1.0; a value dropping towards 0.3 indicates that latency is rising sharply. When that threshold is crossed, the dispatcher temporarily increases the RU cost applied to a designated client class so that its token bucket drains faster and the request rate naturally backs off. If the ratio keeps falling, the same penalty can be expanded to additional classes until latency returns to a safe range.

The estimate uses the constant-memory P² algorithm [1], requiring no raw sample storage or cross-node coordination.

Latency response over time and illustration of throttling

The Control-Delay (CoDel) thread pool tackles the second hazard: queue buildup inside the dispatcher itself [2]. It monitors the time a request waits in the queue. If that sojourn time proves the system is already saturated, the request fails early, freeing up memory and threads for higher-priority work. An optional latency penalty can also be applied to RU accounting, charging more for queries from callers that persistently trigger the latency ratio.

Together, these layers — criticality, a real-time latency ratio, and adaptive queueing — form a shield that lets guest-facing traffic ride out backend hiccups. In practice, this system has cut recovery times by about half and keeps dispatchers stable without human intervention.

Hot-key detection and DDoS defence

Request-unit limits and load shedding keep client usage fair, but they cannot stop a stampede of identical reads aimed at one record. Imagine a listing that hits the front page of a major news outlet: tens of thousands of guests refresh their browser, all asking for the same key. A misconfigured crawler — or a deliberate botnet — can generate the same access pattern, only faster. The result is shard overload, a full dispatcher queue, and rising latency for unrelated work.

Mussel neutralises this amplification with a three-step hot-key defence layer: real-time detection, local caching, and request coalescing.

Real-time detection in constant space

Every dispatcher streams incoming keys into an in-memory top-k counter. The counter is a variant of the Space-Saving algorithm [2] popularized in Brian Hayes’s “Britney Spears Problem” essay [4]. In just a few megabytes, it tracks approximate hit counts, maintains a frequency-ordered heap, and surfaces the hottest keys in real time in each individual dispatcher.

Local caching and request coalescing

When a key crosses the hot threshold, the dispatcher serves it from a process-local LRU cache. Entries expire after roughly three seconds, so they vanish as soon as demand cools; no global cache is required. A cache miss can still arrive multiple times in the same millisecond, so the dispatcher tracks in-flight reads for hot keys. New arrivals attach to the pending future; the first backend response then fans out to all waiters. In most cases only one request per hot key per dispatcher pod ever reaches the storage layer.

Impact in production

In a controlled DDoS drill that targeted a small set of keys at ≈ million-QPS scale, the hot-key layer collapsed the burst to a trickle — each dispatcher forwarded only an occasional request, well below the capacity of any individual shard — so the backend never felt the surge.

Hotkeys detected and served from dispatcher cache in real time

Retrospective and key takeaways

The journey from a single QPS counter to a layered, cost-aware QoS stack has reshaped how Mussel handles traffic and, just as importantly, how engineers think about fairness and resilience. A few themes surface when we look back across the stages described above.

The first is the value of early, visible impact. The initial release of request-unit accounting went live well before load shedding or hot-key defence. Soon after deployment it automatically throttled a caller whose range scans had been quietly inflating cluster latency. That early win validated the concept and built momentum for the deeper changes that followed.

A second lesson is to prefer to keep control loops local. All the key signals — P² latency quantiles, the Space-Saving top-k counter, and CoDel queue delay — run entirely inside each dispatcher. Because no cross-node coordination is required, the system scales linearly and continues to protect capacity even if the control plane is itself under stress.

Third, effective protection works on two different time-scales. Per-call RU pricing catches micro-spikes; the latency ratio and CoDel queue thresholds respond to macro slow-downs. Neither mechanism alone would have kept latency flat during the last controlled DDoS drill, but in concert they absorbed the shock and recovered within seconds.

Finally, QoS is a living system. Traffic patterns evolve, back-end capabilities improve, and new workloads appear. Planned next steps include database-native resource groups and automatic quota tuning from thirty-day usage curves. The principles that guided this project — measure true cost, react locally and quickly, layer defences — form a durable template, but the implementation will continue to grow with the platform it protects.

Does this type of work interest you? We’re hiring, check out open roles here.

References

Raj Jain and Imrich Chlamtac. 1985. The P² algorithm for dynamic calculation of quantiles and histograms without storing observations. Communications of the ACM, 28(10), 1076–1085. https://doi.org/10.1145/4372.4378
Erik D. Demaine, Alejandro López-Ortiz, and J. Ian Munro. 2002. Frequency estimation of internet packet streams with limited space. In Algorithms — ESA 2002: 10th Annual European Symposium, Rome, Italy, September 17–21, 2002. Rolf H. Möhring and Rajeev Raman (Eds.). Lecture Notes in Computer Science, Vol. 2461. Springer, 348–360.
Kathleen M. Nichols and Van Jacobson. 2012. Controlling queue delay. Communications of the ACM, 55(7), 42–50. https://doi.org/10.1145/2209249.2209264
Brian Hayes. 2008. Computing science: The Britney Spears problem. American Scientist, 96(4), 274–279. https://www.americanscientist.org/article/the-britney-spears-problem

Pay as a Local

laurenmackevich — Wed, 25 Feb 2026 18:40:17 +0000

How Airbnb rolled out 20+ locally relevant payment methods worldwide in just 14 months

By: Gerum Haile, Bo Shi, Yujia Liu, Yanwei Bai, Bo Yuan, Rory MacQueen, Yixia Mao

Across the more than 220 global markets that Airbnb operates in, cards are the primary way that guests pay for stays, experiences, and services. However, to help make our platform accessible to more people, reduce friction at checkout, and drive more adoption, we introduced trusted, locally preferred payment methods — called local payment methods or LPMs. By offering and supporting these payment methods, Airbnb enables guests everywhere to choose what works best for them.

In this blog post, we’ll discuss the implementation details behind our Pay as a Local initiative, which allowed us to launch 20+ local payment methods across multiple markets in just over one year.

LPMs: What they are, why they matter, and our discovery and selection process

Local payment methods go beyond traditional cards and include:

Country or region-specific digital wallets (such as M-Pesa or MTN, MoMo)
Online bank transfers (such as Online Banking Czech, Online Banking Slovakia)
Real-time or instant bank payments (such as PIX, UPI)
Local payment schemes (such as EFTPOS, Cartes Bancaires)

By embracing LPMs, Airbnb helps make travel more inclusive and seamless for people around the world. LPMs help the platform to:

Boost conversion and bookings by offering guests familiar, trusted payment options.
Unlock new markets where credit card usage is low or non-existent.
Build accessibility for guests without credit cards or traditional banking access.

Through our research on local payment methods (LPMs), we identified over 300 unique payment options worldwide. For the initial phase of the LPM initiative, we used a structured qualification framework to select which local payment methods we would support. We evaluated the top 75 travel markets and selected the top one to two payment methods per market — excluding those without a clear travel use case — and arrived at a shortlist of just over 20 LPMs best suited for integration into our payment platform.

Background on Airbnb’s payment platform

Airbnb’s payments platform is designed to decouple payment logic from the core business (i.e., stays, experiences, and services), allowing for greater flexibility and scalability. The platform efficiently coordinates both guest pay-ins and host payouts by working with regulated payment service providers and financial partners.

Beyond payment processing, the system also supports robust payment trust and compliance functions.

Modernization

As part of a multi-year replatforming initiative for our payments architecture called Payments LTA (long-term architecture), we shifted from a monolithic system to a capability-oriented services system structured by domains, using a domain-driven decomposition approach. This modernization approach reduced our time to market, increased reusability and extensibility, and empowered greater team autonomy.

The core payment domain delivers essential capabilities for pay-in, payout, and payment intermediation. It consists of multiple subdomains, including Pay-in, Payout, Transaction Fulfillment, Processing, Wallet & instruments, Ledger, Incentives & Stored Value, Issuing, and Settlement & Reconciliation.

Replatforming as an enabler for local payment method expansion

The processing subdomain enables integration with third-party payment service providers (PSPs) and supports API and file-based vendor integration, as well as switching and routing capabilities. As part of our replatforming initiative, we adopted a connector and plugin-based architecture for onboarding new third-party payment service providers. This strategy has significantly reduced the time required to integrate new PSPs in different markets.

During this replatforming effort, we also introduced Multi-Step Transactions (MST): a processor-agnostic framework that supports payment flows completed across multiple stages. MST defines a PSP-agnostic transaction language to describe the intermediate steps required in a payment, such as submitting supplemental data or handling dynamic interactions. These steps, called Actions, can include:

Redirects
Strong customer authentication (SCA) frictions (challenges, fingerprinting)
Payment method — specific flows

When a PSP indicates that an additional user action is required, its vendor plugin normalizes the request into an ActionPayload and returns it with a transaction intent status of ACTION_REQUIRED. This architecture ensures consistent handling of complex, multi-step payment experiences across diverse PSPs and markets.

LPM integration architecture

While our modernized payment platform laid the foundation for enabling LPMs, these payment methods come with a unique set of challenges. Many local methods require users to complete transactions in third-party wallet apps. This introduces complexity in app switching, session hand-off, and synchronization between Airbnb and external digital wallets.

Each local payment vendor also exposes different APIs and behaviors across charge, refund, and settlement flows, making integration and standardization difficult.

Technical approach

We analyzed the end-to-end behavior of our 20+ LPMs, and identified three foundational payment flows that capture the full spectrum of user and system interactions. By distilling LPM behaviors into these standardized payment flow archetypes, we established a unified framework for integration:

Redirect flow: Guests are redirected to a third-party site or app to complete the payment, then return to Airbnb to finalize their booking (e.g., Naver Pay, GoPay, FPX).
Async flow: Guests complete payment externally after receiving a prompt (such as a QR code or push notification), and Airbnb receives payment confirmation asynchronously via webhooks (e.g., Pix, MB Way, Blik).
Direct flow: Guests enter their payment credentials directly within Airbnb’s interface, allowing real-time processing similar to traditional card payments (e.g., Carte Bancaires, Apple Pay).

This standardized approach has enabled significant reusability across integrations and substantially reduced the engineering effort required to support new payment methods.

Asynchronous payment orchestration

Since many guests complete payments through external providers, we redesigned our payment orchestration — building on top of MST — to support payment flows that require user actions outside Airbnb (redirect flows and async flows).

For redirect flows, where guests complete the payment on a third-party app or website:

Airbnb’s payments platform sends a charge request to the local payment vendor, whose response includes a redirectUrl.
Our platform redirects the user to the external app or website to complete the payment.
Once the payment is successfully completed, the user is redirected back to Airbnb with a result token. Airbnb’s payments platform then uses this token to securely confirm and finalize the payment with the local processor.

For async flows (which typically involve scanning a QR code):

Airbnb’s payments platform sends a charge request to the local payment vendor, whose response includes a qrCodeData.
The checkout page displays the QR code for the user to scan and complete the payment in their wallet app.
After the payment succeeds, the vendor sends a webhook notification to Airbnb’s payments platform, which updates the payment status to success and confirms the user’s order.

Naver Pay: Redirect To Naver Pay Website

Naver Pay is one of the fastest-growing digital payment methods in South Korea. As of early 2025, it has reached over 30.6 million active users, representing approximately 60% of the South Korean population. Enabling Naver Pay in the South Korean market not only helps deliver a more seamless and familiar payment experience for local guests, but also expands Airbnb’s reach to new users who prefer using Naver Pay as their primary payment method.

Pix: Scan A QR Code

Pix is an instant payment system developed by the Central Bank of Brazil, enabling 24/7 real-time money transfers through methods such as QR codes or Pix keys. Its adoption has been extraordinary — by late 2024, more than 76% of Brazil’s population was using Pix, making it the country’s most popular payment method, surpassing cash, credit, and debit cards. In 2024 alone, Pix processed over BRL 26.4 trillion (approximately USD 4.6 trillion) in transaction volume, underscoring its pivotal role in Brazil’s digital payment ecosystem.

Config-driven payment method integration

Airbnb embraced a config-driven approach, powered by a central YAML-based Payment Method Config that acts as a single source of truth for flows, eligibility, input fields, refund rules, and more. Instead of scattering payment method logic across the frontend, backend, and various services, we consolidate all relevant details in this config. Both core payment services and frontend experiences dynamically reference this single source of truth, ensuring consistency for eligibility checks, UI rendering, and business rules. This unified approach dramatically reduces duplication, manual updates, and errors across the stack, making integration and maintenance faster and more reliable.

These configs also drive automated code generation for backend services using code generation tools, producing Java classes, DTOs, enums, schema, and integration scaffolding. As a result, integrating or updating a payment method is largely declarative — just a config change. This streamlines launches from months to weeks and makes ongoing maintenance far simpler.

Payment widget

Our payment widget — the payment method UI embedded into the checkout page — includes the list of available payment methods and handles the user’s inputs. Local payment methods often require specialized input forms (such as CPF for Pix) and have unique country/currency eligibility.

Rather than hardcoding forms and rules into the client, we centralize both form-field specification and eligibility checks in the backend. Servers send configuration payloads to clients defining exactly which fields to collect, which validation rules to apply, and which payment options to render. This empowers the frontend to dynamically adapt UI and validation for each payment method, accelerating launches and keeping user experiences fresh without frequent client releases.

For example, Pix in Brazil requires the guest’s first name, last name, and CPF (tax ID), which we collect and transmit as required to complete the payment.

Below is a diagram illustrating how dynamic payment method configurations are delivered from the backend to the frontend, enabling tailored checkout presentations for each payment method.

Building confidence through better testability

Testing local payment methods can be difficult, because developers often don’t have access to local wallets. Yet with such a broad range of payment methods and complex flows, comprehensive testing is essential to prevent regressions and ensure seamless functionality.

To address this, we enhanced Airbnb’s in-house Payment Service Provider (PSP) Emulator, enabling realistic simulation of PSP interactions for both redirect and asynchronous payment methods. The Emulator allows developers to test end-to-end payment scenarios without relying on unstable (or nonexistent) PSP sandboxes. For redirect payments, the Emulator provides a simple UI mirroring PSP acquirer pages, allowing testers to explicitly approve or decline transactions for precise scenario control. For async methods, it returns QR code details and automatically schedules webhook emission tasks upon receiving a /payments request — delivering a complete, reliable testing environment across diverse LPMs.

Scaling observability for local payment methods

Maintaining high reliability and availability is critical for Airbnb’s global payment system. As we expand to support many new local payment methods, we face increasing complexity: greater dependencies on external PSPs and wide variations in payment behaviors. For example, a real-time card payment and a redirect flow like Naver Pay follow completely different technical paths. That diversity makes observability difficult — a single “payment success rate” may represent card health well, but say little about an asynchronous LPM. Without proper visibility, regressions can go unnoticed until they affect real users. As dozens of new LPMs go live, observability has become the foundation of reliability.

To address this, we built a centralized monitoring framework that unifies metrics across all layers, from client to PSP. When launching a new LPM, onboarding now requires a single config change; add the method name, and metrics begin streaming automatically:

Client metrics — user-level flow health from clients
Payment backend metrics — API-level metrics for payment flows
PSP metrics — API-level visibility between Airbnb and the PSP
Webhook metrics — async completion status for redirect methods or refunds

We have also standardized the alerting rules across our platform’s Client, Backend, PSP, and Webhook layers using composite alerts and anomaly detection. Each alert follows a consistent pattern (failure count, rate, time window), e.g., “Naver Pay resume failures > 5 and failure rate > 20% in 30 minutes.” This design minimizes false positives during low-traffic periods.

This framework scales effectively, providing end-to-end visibility from user click to PSP confirmation. It enables engineers to trace issues in minutes rather than hours, whether those issues were caused by internal changes or external outages. By turning observability into a shared, automated layer, we were able to strengthen the backbone of payment reliability while accelerating the rollout of new LPMs worldwide.

Impact

The Pay as a Local initiative delivered significant business and technical impact:

Meaningful booking uplift: We observed meaningful uplift in bookings and new users in markets where we launched local payment methods
Faster integrations: Reduced integration time significantly through reusable flows and config-driven automation.
Stronger reliability: Improved observability for early outage detection, standardized testing to prevent regressions, and streamlined vendor escalation and on-call processes for global resilience.

Conclusion

Supporting local payment methods helps Airbnb to stay competitive and relevant in the global travel industry. These payment options help improve checkout conversion, drive adoption, and unlock new growth opportunities.

This post outlined how the Airbnb payment platform has evolved to support local payment methods at scale — through asynchronous payment orchestration, config-driven onboarding, centralized observability, and robust testability. Together, these capabilities enable faster integrations, lower maintenance overhead, and offer a more seamless, localized checkout experience for guests worldwide.

As Airbnb continues to expand globally, our payments platform will keep evolving with the same principles of extensibility, reliability, and scalability, ensuring that guests everywhere can pay confidently, using the methods they know and trust.

Acknowledgments

We had many people at Airbnb contributing to this big rearchitecture, but countless thanks to Mini Atwal, Ashish Singla, Musaab At-Taras,Linmin Yang, Yong Rhyu, Yohannes Tsegay, Livar Cunha, Praveena Subrahmanyam, Steve Ickes, Vijaykumar Borkar, Vibhu Ramani, Aashna Jain, Abhishek Ghosh, Abhishek Patel, Adithya Tammavarapu, Akai Hsieh, Akash Budhia, Amar Parkash, Amee Mewada, Ankita Balakrushan Tate, Bharath Kumar Chandramouli, Bo Shi, Bo Yuan. Callum Li. Carlos Townsend Pico, Chanakya Daparthy, Charles Tang, Cibi Pari, Cindy Jaimez, Cindy Shi, Dan Yo, Daniela Nobre, Danielle Zegelstein, David Cordoba, David Drinan, Dawei Wang, Dechuan Xu, Denise Francisco, Denny Liang, Dimi Matcovschi, Divya Verma, Feifeng Yang, Gabriel Siqueira, Sunny Wallia, Prashant Jamlakar, Daniel Kriske, Giovanni Iniguez, Haojie Zhang, Haokun Chen, Haoti Zhong, Harriet Russell, Harshit Gupta, Henrique Moreira Indio do Brasil, Ishan Ishan, Jenny Shen, Jerroid Marks, Jiafang Jiang, Joey Yin, Jon Chew, Karen Kuo, Katie Turley, Letian Zhang, Maneesh Lall, Manish Singhal, Maria Daneri, Mark Jang, Mengfei Ren, Michelle Desiderio, Mohit Dhawan, Nam Kim, Nerea Ruiz Alvarez, Nikita Kapoor, Oliver Zhang, Omer Faruk Gul, Pallavi Sharma, Prateek Sri, Rae Huang, Rohit Krishnan Dandayudham, Rory MacQueen, Ruize Liu, Sam Bitter, Sam Tang, Saran Singh. Sardana Sai Anil, Serdar Yildirim, Shwetha Saibanna, Silvia Crespo Sanchez, Simon Xia, Stella Dong, Stella Su, Stephanie Leung, Steve Cao, Sumit Ranjan, Tay Rauch, Thanigaivelan Manickavelu, Tiffany Selby, Toland Hon, Trish Burgess,Vishal Garg, Vivian Lue, Vyom Rastogi, William Betz, Xi Wen, Xing Xing, Xuanxuan Wu, Yangguang Li, Yanwei Bai, Yeung Song, Yixia Mao, Yujia Liu. Yun Cho, Zhenhui Zhu, Ziyun Ye

****************

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.

Academic Publications & Airbnb Tech: 2025 Year in Review

laurenmackevich — Wed, 25 Feb 2026 18:35:57 +0000

2025 was a big year for research at Airbnb, as we made significant progress toward our mission to use AI, data science, and machine learning to become the best travel and living platform.

Specifically, we doubled down on our presence at long-standing venues like KDD and CIKM — two of the most selective conferences in machine learning. At the same time, we expanded our research footprint by sharing our work in NLP, optimization, and measurement science at conferences such as COLING, LION, and VLDB.

Across these conferences, Airbnb researchers engaged directly with academic and industry peers by publishing and presenting papers, learning about the latest innovations, launching new collaborations, and mentoring emerging researchers. In this blog post, we’ll recap the conferences and key papers we presented in 2025, organized by research themes.

Applied machine learning for search, ranking, and personalization

KDD (Knowledge and Data Mining)

KDD is a flagship conference in data science research. Hosted annually by a special interest group of the Association for Computing Machinery (ACM), it’s where researchers learn about some of the most groundbreaking developments in data mining, knowledge discovery, and large-scale data analytics, which are critical to Airbnb’s efforts to improve core products like search and recommendations.

Our participation

We’ve been presenting at KDD since 2018, and 2025 was another strong year for us. We received multiple contributions across the applied data science track and workshops, which were well-received by the broader community and even inspired us to consider open-sourcing some of our technology. We were also inspired by the related research in this area and are eager to explore these methods through new collaborations.

Research highlights

Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking: While A/B tests are crucial for developing ranking algorithms and recommender systems, they’re difficult to set up and can take extensive time to reach statistical significance (especially for products with long conversion cycles, like accommodation booking). In this paper, we shared techniques for rapid pre-A/B online assessments that help teams identify the most promising experiments, streamlining the overall process without sacrificing accuracy.
High Precision Audience Expansion via Extreme Classification in a Two-Sided Marketplace: Airbnb search balances diverse global inventory with varied guest preferences for location, amenities, style, and price. This process requires efficient location retrieval to find the listings guests might realistically book by determining which geographic areas to query. We introduce a new approach to location retrieval by using a set of relevant, high-precision categorical location cells.

Link to all papers

Harnessing the Power of Interleaving and Counterfactual Evaluation for Airbnb Search Ranking (Qing Zhang, Alex Deng, Michelle Du, Huiji Gao, Liwei He, Sanjeev Katariya)
High Precision Audience Expansion via Extreme Classification in a Two-Sided Marketplace (Dillon Davis, Huiji Gao, Thomas Legrand, Juan Manuel Caicedo Carvajal, Malay Haldar, Kedar Bellare, Moutupsi Paul, Soumyadip Banerjee, Liwei He, Stephanie Moyerman, and Sanjeev Katariya)
TSMO: Two-sided Marketplace Optimization

CIKM (Conference on Information and Knowledge Management)

CIKM is a premier forum for discussing and presenting research at the intersection of information and knowledge management, including topics like AI, data mining, database systems, and information retrieval. Many of these topics directly intersect with our core product challenges, such as search, ranking, and recommendations.

Our participation

At CIKM 2025, Airbnb’s Relevance and Personalization team had five peer-reviewed papers accepted for publication, building on our participation in 2023 and 2024. These papers focused on advanced AI/ML techniques for search and recommendations, and sharing real-world insights from using these technologies at Airbnb’s scale. Industry and academic researchers, especially those working on two-sided marketplaces, engaged with our work and provided valuable feedback.

Research highlights

Augmenting Guest Search Results with Recommendations at Airbnb: When guests use overly narrow criteria to search for accommodations, they often receive insufficient results, leading to a frustrating experience. This paper introduces a recommendation system that dynamically suggests alternatives — different dates, relaxed amenities, or adjusted price ranges — to help guests find suitable accommodations and improve the platform’s booking rate. Authors: Haowei Zhang, Philbert Lin, Dishant Ailawadi, Soumyadip Banerjee, Shashank Dabriwal, Hao Li, Kedar Bellare, Liwei He, Sanjeev Katariya
Maps Ranking Optimization in Airbnb: Maps play a crucial role in Airbnb search and bookings, accounting for roughly 80% of search interactions. Yet map ranking has traditionally reused feed-ranking assumptions, which break down when we examine the NDCG (Normalized Discounted Cumulative Gain) metric. This paper explains why list-based NDCG fails to model user attention on maps, introduces a map-specific NDCG, and reports experiments showing that optimizing it yields booking gains. Authors: Hongwei Zhang, Malay Haldar, Kedar Bellare, Sherry Chen, Soumyadip Banerjee, Xiaotang Wang, Mustafa Abdool, Huiji Gao, Pavan Tapadia, Liwei He, Sanjeev Katariya, Stephanie Moyerman
BListing: Modality Alignment for Listings: To improve search ranking, we introduce BiListing (Bimodal Listing) embeddings to use unstructured text and photo listing data as ranking signals. BiListing leverages large-language models and pretrained language-image models to create unified representations of diverse unstructured data into a single embedding vector per list and modality. Our experiment results show a 0.425% increase in NDCB (Normalized Discounted Cumulative Booking) gain and drove tens of millions in incremental revenue. Authors: Guillaume Guy, Mihajlo Grbovic, Chun How Tan, Han Zhao
Beyond Pairwise Learning-To-Rank At Airbnb: In this paper, we introduce a method to improve the accuracy of pairwise learning-to-rank algorithms, the bedrock of modern search stacks. This approach captures interactions between items during pairwise comparisons, thereby giving us a better sense of what searchers truly want. We also share ways to implement this algorithm performantly, and results from online and offline experiments. Authors: Malay Haldar, Daochen Zha, Huiji Gao, Liwei He, Sanjeev Katariya
Learning to Comparison-Shop: Traditional ranking models often evaluate items in isolation, disregarding the context in which users compare multiple items on a search results page. In this paper, we propose a novel ranking architecture, the Learning-to-Comparison-Shop (LTCS) System, that explicitly models and learns users’ comparison-shopping behaviors. Our experiments show statistically significant improvements of 1.7% in Normalized Discounted Cumulative Gain (NDCG) and 0.6% in booking conversion rate. Authors: Jie Tang, Daochen Zha, Xin Liu, Huiji Gao, Liwei He, Stephanie Moyerman, Sanjeev Katariya

NLP & building LLM systems in production

EMNLP (Empirical Methods in Natural Language Processing)

EMNLP is a top-tier NLP conference that brings together practitioners and researchers to discuss new architectures and training strategies for language models, safety and evaluation strategies for LLMs, and real-world NLP applications. These research areas directly intersect with many of Airbnb’s product surfaces, such as customer support, search & discovery, and trust & safety. Additionally, each EMNLP cycle includes the release of new datasets, evaluation suites, and open-source libraries to help teams benchmark their progress against community standards.

Our participation

In 2025, we sponsored EMNLP and presented two papers on humans-in-the-loop in AI systems and advanced summarization techniques. We also used EMNLP’s community datasets to benchmark our system, which showcased where we excel and where we can build upon our success with additional best practices. The conference deepened academic collaborations through discussions on LLM evaluation, safety, and agentic AI design, including mentoring students and early-career researchers.

Research highlights

Agent-in-the-Loop, A Data Flywheel for Continuous Improvement in LLM-based Customer Support: To improve our LLM-based customer support system, this paper introduces an Agent-in-the-Loop (AITL) framework that leverages new interaction data to continuously enhance model performance. This flywheel can help the system stay up to date with new product features, shifting user preferences, and updated support policies and procedures. We launched a pilot in the US, and the results demonstrate significant improvement in accuracy and helpfulness. Authors: Cen Mia Zhao, Tiantian Zhang, Hanchen Su, Yufeng Wayne Zhang, Shaowei Su, Mingzhi Xu, Yu Elaine Liu, Wei Han, Jeremy Werner, Claire Na Cheng, Yashar Mehdad
Incremental Summarization for Customer Support via Progressive Note-Taking and Agent Feedback: Customer service agents multitask during support interactions, identifying core issues, tracking prior actions, and producing accurate notes. To streamline this workflow, we introduced an incremental summarization system that intelligently determines when to generate concise bullet notes during conversations, reducing agents’ context-switching effort without sacrificing quality. To improve the system over time, we also introduced a learning framework that enables agents to make real-time edits, immediately refining online note generation. Authors: Yisha Wu, Cen Mia Zhao, Yuanpei Cao, Xiaoqing Su, Yashar Mehdad, Mindy Ji, Claire Na Cheng

COLING (International Conference on Computational Linguistics)

COLING is a top-tier NLP conference that covers both foundational research and industry applications of language models, including reasoning, evaluation, multilingual NLP, and real-world LLM systems. The work presented at this conference helps validate Airbnb’s technical direction and directly informs future investments.

Our participation

In 2025, Airbnb presented at COLING for the first time, sharing a paper titled “LLM-Friendly Knowledge Representation for Customer Support” by Hanchen Su, Wei Luo, Wei Han, Yu Elaine Liu, Yufeng Wayne Zhang, Cen Mia Zhao, Ying Joy Zhang, and Yashar Mehdad. The paper presents a new format, Intent, Context, and Action (ICA), for structuring business knowledge in LLM-based QA and customer support workflows. Initial experiments in production show promising results. We also discovered relevant research in knowledge retrieval, LLM evaluation, and hallucination detection that will inspire future projects.

Optimization, causal inference, and measurement science

MIT CODE (Conference on Digital Experimentation)

MIT CODE is one of the premier venues for researchers and practitioners to discuss topics in online digital experimentation, causal inference, and data-driven product innovation. The conference supports our commitment to data-driven decision-making and using experimentation to understand the long-term impacts on guests, hosts, and marketplace health.

Our participation

In 2025, we had another strong showing at CODE, with a cohort of 6 data scientists and 3 academic collaborators. We gave talks in two sessions and presented a poster, which led to meaningful discussions with peer companies and interest in collaborating with academic research groups.

Research highlights

Beyond the Experiment Window: Prospective Impacts Under Long-Term Ranking Dynamics: Product teams frequently leverage A/B tests to assess different rankers. While these experiments are typically conducted over shorter periods, we also recognize the value of understanding longer-term dynamics (such as seasonality and user evolution) to further support sustained business objectives, like marketplace health. To solve this problem, we developed a causal framework that allows us to estimate the long-term impacts of ranking changes with strategic goals (like marketplace health) using A/B testing data.
Trustworthy Bayesian Inference in Batch-Adaptive Experimentation: Adaptive experimentation, like multi-arm bandit methods, can improve experiment efficiency by reallocating traffic toward promising treatments. Continued advancements in these approaches are expanding our ability to maintain high standards of statistical validity. This paper introduces a practical Bayesian framework for inference in batch-adaptive experiments, specifically tailored to the operational realities of online platforms.

Link to all papers

Beyond the Experiment Window: Prospective Impacts Under Long-Term Ranking Dynamics (Lo-Hua Yuan)
Trustworthy Bayesian Inference in Batch-Adaptive Experimentation (Yicheng Li)
Experimental Design for Product Launches with Collaborative User Networks (Monu Kala)

INFORMS (Institute for Operations Research and the Management Sciences)

INFORMS brings together academics and industry professionals to discuss and share research across data science, machine learning, economics, behavioral science, and analytics.

Our participation

In 2025, our data science team was invited to INFORMS to present two talks in a session about bridging the gap between statistical methods and industry applications.

Research highlights

Beyond Multi-Arm Bandits: Tackling Challenges in Adaptive Experiments at Airbnb. In this talk, we walked through the metrics and infrastructure challenges when using classic bandit algorithms, which make it difficult to operationalize adaptive experiments. We propose a hybrid approach that incorporates bandit algorithms into A/B experiments to enable adaptive testing. We also discussed how we onboard and validate adaptive experiments across individual product domains at Airbnb.

Link to all papers

Beyond the Experiment Window: Prospective Impacts Under Long-Term Ranking Dynamics (Lo-Hua Yuan)
Beyond Multi-Arm Bandits: Tackling Challenges in Adaptive Experiments at Airbnb (Yicheng Li)

LION (Learning and Intelligent Optimization)

The LION conference is a premier gathering of researchers exploring the intersection of machine learning, artificial intelligence, and mathematical optimization.

Our participation

While Airbnb has attended LION in the past, 2025 was the first time we presented at the conference. Nathan Brixius presented “Optimal Matched Block Design For Multi-Arm

Experiments,” which introduces a new optimization formula using mixed-integer programming (MIP) to group subjects in multi-armed experiments, leading to more balanced groups and, in turn, more accurate experimental results. We also connected with leading experts in metaheuristics and AI fairness to help shape our future roadmap and sponsored the awards for the best papers presented at the conference.

Data systems

VLDB (Very Large Data Bases)

The VLDB Conference is one of the top 2 flagship conferences in data management and large-scale data systems, with over 1,500 researchers and practitioners attending.

Our participation

“In 2025, we published our first paper at VLDB: ‘SQL:Trek Automated Index Design at Airbnb’ by Sam Lightstone and Ping Wang. The paper presents a novel approach for automated index design (code-named SQL:Trek). It uses query compiler cost models to identify effective indexes across many relational databases, including most MySQL and PostgreSQL derivatives. Additionally, the Airbnb team attended sessions on system efficiency, graph computing, and AI databases, and had the opportunity to meet other researchers.

Conclusion

Conferences remain a big part of our research program at Airbnb, helping us validate and refine our ideas through community feedback and providing a forum to share real-world insights that advance the field. In 2025, we doubled down on this vision by publishing papers for the first time at conferences in domains such as NLP, optimization, causal inference, and data systems, reflecting our ongoing commitment to using these technologies to create the best possible travel experiences.

As we look to 2026, we’re eager to expand our presence at these conferences and discover new ways to use AI, machine learning, and data science to build a best-in-class travel and living platform. If you’re interested in doing this type of work with us, consider joining us. Apply for one of our open positions.

Safeguarding Dynamic Configuration Changes at Scale

laurenmackevich — Wed, 25 Feb 2026 18:34:20 +0000

By Cosmo Qiu, Bo Teng, Siyuan Zhou, Ankur Soni, Willis Harvey

Dynamic configuration is a core infrastructure capability in modern systems. It allows developers to change runtime behavior without restarting or redeploying services, even as the number of services and requests grows. In practice, that might mean rolling out a new address form for a region launch, tightening an authorization rule, or adjusting timeouts when a dependency is slow.

Like any powerful tool, dynamic configuration is a double-edged sword. While it enables fast iteration and rapid incident response, a bad change can cause regressions or even outages. This is a common challenge across the industry: balancing developer flexibility with system reliability.

In this post, we will outline the expectations of a modern dynamic configuration platform, then walk through the high-level architecture of Airbnb’s dynamic config platform and how its core components work together to enable safe, flexible config changes.

Modern config platform essentials

As Airbnb’s business grows, our expectations for the dynamic config platform have evolved over time through our own learnings as well as industry best practices. These shape our view of what a good dynamic config platform should provide, including:

A coherent experience for config management: The platform provides a streamlined, end-to-end experience for defining, reviewing, testing, and rolling out config changes. It covers the most common needs out of the box with rich built-in features, while still offering escape hatches for edge cases.
Strong reliability, availability and safety guarantees: All config changes are validated, reviewed, and rolled out progressively, with clear ownership and well-defined access control. Treating config as code is a key focus: config changes are versioned, reviewed, and auditable like service code, but remain dynamic at runtime. The platform itself must be highly available so that services can reliably fetch and apply configs. Changes should be observable, with support for fast rollbacks when needed.
Safe testing in isolated environments: Developers can validate config changes in isolated local or canary environments before they reach production.
Flexible multi-tenant support: In a multi-tenant platform, different tenants have different risk profiles. The platform should allow config owners to customize how their configs behave per tenant, including deployment triggers, guardrails, and rollout strategy (for example, AWS zone or Kubernetes pod percentage-based rollouts).
Fast and controlled incident response: During an incident, responders can ship emergency configs as needed with clear auditability. The platform also provides observability for config changes, so incident responders can tell what changed, who was affected, when the change was made, and who made the change. This enables them to effectively identify the culprit and take action.

High-level architecture

At Airbnb, Sitar is the internal name for our dynamic config platform. It provides a common way for teams to manage runtime behavior safely. At a high level, Sitar has four main parts: a developer-facing layer, a control plane, a data plane, and the clients and agents that run alongside application code.

The developer-facing layer is where config changes are created and reviewed. By default, configs are managed through a Git-based workflow, while a few exceptions are managed in the web interface (sitar-portal), which is also used for admin operations such as emergency deployments.

The control plane is responsible for orchestrating config changes. It enforces schema validation, ownership, and access control, and decides how each change should be rolled out: for example, which environments or AWS zone to target, what percentage of Kubernetes pods to start with, and how to progress the rollout over time. The control plane also specifies how to roll back the changes when needed, and supports routing in-flight configs to specific environments or slices of subscribers for fast testing.

The data plane provides scalable storage and efficient distribution of configs. It acts as the source of truth for config values and versions, and propagates updates to services reliably, consistently, and quickly.

On the product services side, an agent sidecar running alongside each service fetches the subscribed configs from the data plane and maintains a local cache. Client libraries inside the service then read from this cache and expose configs to application logic with fast, in-process access and optional fallbacks.

Putting these together, a typical change starts from a Git flow, proceeds through control-plane validation and rollout decisions, into the data plane for distribution, and finally to agents and client libraries that apply the config updates to application logic.

Key design choices

In this section, we highlight a few key design choices that shape how the platform looks and is operated.

Configs as code with a Git-based workflow

Config changes are by default managed by a Git-centric workflow. We use GitHub as the primary interface for managing configs, because we have an established and responsive internal team to manage GitHub Enterprise. GitHub integrates naturally with our existing CI/CD tooling, so we can reuse rich validation and deployment pipelines without re-inventing the wheel. This approach gives developers a consistent experience to make code changes: open a pull request, get reviews, merge, and deploy. GitHub also brings additional benefits such as mandatory reviewers, review and approval flows, and a change history. Configs under the same theme are grouped into tenants, with clear owners, customizable tests, and a dedicated CD pipeline.

While the Git-based flow is the default, we keep a UI portal for teams that prefer a portal-based experience and as a shortcut for specific operational needs, such as fast emergency config updates that can bypass the normal CI/CD pipeline.

Staged rollouts and fast rollbacks

When a change is proposed, schema validation (checking that the config matches the expected structure and types) and other automated checks run in CI. The change is always reviewed and approved before rollout.

Once merged in the main branch, the control plane performs a staged rollout where the change is first deployed to a limited scope, then gradually expanded to a larger scope if things look good. At each stage of this rollout, the change is evaluated, the author and the stakeholders are notified if regressions are detected, and a fast rollback can be triggered if needed. Staged rollouts can greatly reduce the blast radius of bad changes and improve the overall reliability of the platform.

Separated control and data planes

We separate the “decide” and “deliver” responsibilities. The control plane focuses on validation, authorization, and rollout decisions, while the data plane focuses on storing configs and distributing them reliably at scale. This separation allows us to evolve rollout strategies and policies without disrupting the underlying storage and delivery mechanisms, and vice versa.

Local caching and resilient clients

On the product services side, we introduced a local caching layer between the agent sidecar and the client library to improve resilience and availability. The agent sidecar runs alongside the main service container, regardless of which language the service is written in, and periodically fetches subscribed configs from the backend and persists them locally. The client libraries then read from this local cache. Even if the backend is temporarily unavailable or degraded, services can continue operating on the last known good configs from the local cache.

Impact on product teams

It is essential for the Sitar system to make life easier for product teams. In practice, its architecture changes how teams ship and operate in a few ways:

Rollouts become safer and more predictable. New behaviors, such as refined authorization rules, can be introduced gradually, verified on a small slice of traffic or in a specific environment, and rolled back quickly if needed. Teams spend less time worrying about “big bang” releases and more time iterating on behavior.
Teams get more flexibility in how their configs are managed and rolled out. Each team can tailor a config flow to its own risk profile and release schedules. For example, teams can choose between automatic, manual, or cron rollouts, select the rollout strategy, and add extra checks. This lets teams keep their existing ways of working while still benefiting from a common platform and shared guardrails.
Incident mitigation becomes faster and more controlled. When something goes wrong in production, incident responders can use observability tools that integrate config events to quickly locate the culprit change, then take quick action using the portal’s emergency flow. These emergency updates are fully auditable for future review.

Besides these examples, the platform includes other improvements in usability, safety, and observability that we will not cover in detail here. Together, they contribute to a smoother day-to-day experience for teams that rely on dynamic configuration.

Conclusions and next steps

Dynamic configuration is a foundational capability of modern infrastructure. It enables fast iteration and rapid incident response, but only when it is equipped with strong safety features and provides a good developer experience. In this post, we shared how we think about a modern dynamic config platform at Airbnb, and how we developed Sitar’s architecture to meet those expectations.

The work is ongoing. As Airbnb’s business grows, we are continuing to refine rollout strategies, improve config testing, invest in observability and smart incident response tooling, and evolve other platform components.

In future posts, we plan to dive deeper into specific areas of the platform, such as how we optimize the Kubernetes sidecar that delivers config updates and how we design the developer experience around config management.

If this type of work interests you check out our open roles.

Acknowledgments

Our progress with Sitar would not have been possible without the support and contributions of many people. We’d like to thank Craig Sosin, Nikolaj Nielsen, Daniel Fagnan, Alex Edwards, Xian Gao, Nick Morgan, Carolina Calderon, Hanfei Lin, Joyce Li, Yunong Liu, Alex Berghage, Brian Wolfe, Yann Ramin, Denis Sheahan, Richa Khandelwal, Swetha Vaidy, Abhishek Parmar, Adam Kocoloski, Adam Miskiewicz, and all the other engineers and teams at Airbnb who joined design reviews and offered valuable feedback, as this work would not have been possible without them.

My Journey to Airbnb — Anna Sulkina

laurenmackevich — Wed, 25 Feb 2026 18:29:27 +0000

Anna Sulkina has always been a traveler, and we’re lucky her travels have brought her to Airbnb. Anna is a Senior Director of Engineering, and she’s responsible for Application & Cloud infrastructure. She brings over two decades of industry experience to Airbnb, including work spanning the stack from the frontend to the backend to the plumbing that makes everything come together. Anna is a mother, a passionate trail runner, and an accomplished leader. Here’s Anna’s story in her own words.

Discovering a passion after the Soviet Union

I grew up in Eastern Ukraine, and the year I was graduating from high school, the Soviet Union collapsed. Despite the political turmoil, it was an interesting time to get into technology, and I have my brother to thank for that.

I was always a nerdy kid, at school and at home, and my older brother really stoked that curiosity. He was studying computer hardware in Moscow, and he’d bring home computer parts to play with. I still remember the first computer he’d assembled, which required a cassette player to load programs. Only after many minutes of buzzing and clicking would the computer finally whirr to life.

Thinking back, that was really my first inspiration to work in technology. Seeing the inner workings of this new thing, a computer, and watching how the parts came together to form a whole — that’s what made me realize I wanted to work with computers, too.

Of course, I didn’t know the Soviet Union would end, which made studying in Moscow impossible. But technology was still my future.

English: Harder than any programming language

I got my start learning programming in a local Ukrainian university, and after four years of studying, I immigrated to America.

When I arrived, I knew how to program, and I knew how to write and read English, but I couldn’t communicate well. I took ESL classes at a community college and, in parallel, enrolled in Berkeley Extension classes to advance my C++ knowledge and learn Java, which was still very new at the time.

Throughout my first couple of jobs, I was more likely to run into challenges with the English language rather than with programming languages.

My first job was in computer hardware diagnostics at a tiny company with only five engineers, where we communicated directly with hardware manufacturers. This was right before the dot-com bubble burst.

I almost didn’t get the job, though. The interview process for this job included a written portion that tested my knowledge of key computer science terms before getting to solve the coding problems. Given my prior education, I knew all the terms, but I ran out of time because the language gap slowed me down. Luckily, my interviewer happened to be taking the same Java class at Berkeley, and when I explained what happened, he gave me the chance to come back. I finished the test, got the job, and the rest is history.

In subsequent jobs, I transitioned fully from C++ to Java, which became my primary programming language for many years. I eventually got the hang of speaking English more confidently, but for a while, it still felt like Russian was my first language, Java was my second, and English was only my third.

Going deeper in the stack and taking on leadership roles

At various times, my career often felt all over the place. But looking back, I see a trajectory I wasn’t aware of at the time. I started with a brief stint in hardware diagnostics, but after that, I worked in the frontend and, over time, descended the software stack from frontend to backend to the deeper infrastructure I work with today.

Parallel to this trajectory down the stack was an upward trajectory in responsibility. Leadership wasn’t an obvious path for me at first — I had to be pitched multiple times — but the more I tried it, the more interesting and enjoyable it felt.

When I worked at Caymas Systems, a telecom startup, my manager was quick to recognize my leadership potential. He was really encouraging, but even more encouraging was witnessing the difference between teams with good leaders and those without.

After Caymas Systems, I worked at Comcast, where I eventually switched from an IC to an engineering manager. Once I experienced the joys of coaching people, building cool software together, and developing high-performing teams, I knew this was the path I wanted to take.

Fail whales and distributed systems

This path took me through a formative time in my career: the almost nine years I spent working at Twitter. I began as a first-line manager and, over time, worked through some of Twitter’s biggest events, including the “fail whale” era and the Ellen DeGeneres “selfie that broke Twitter” moment.

This was an exciting time. I was working at the heart of Twitter’s tech stack, supporting teams that powered its consumer and revenue verticals. This is where I grew into a senior manager and, eventually, a director. Looking back over nearly a decade of work, two major lessons stand out: one technical and one cultural.

The technical lesson was about failure — namely, its inevitability.

Over my tenure, the Twitter stack transitioned from a monolith to a microservices architecture. This resulted in a set of robust, high-scale, low-latency distributed systems, and it was here that I learned that, when building resilient distributed systems, you need to design for failure, not hope to avoid it.

I often think back to How Complex Systems Fail — the more complex a system, the more likely it is to fail. I remembered that lesson every time we were called to the Twitter command center to deal with an incident; it was all hands on deck until everything was back online.

The cultural lesson was about adoption and what it takes to fuel great ideas.

Today, almost two-thirds of enterprises use GraphQL in production, but in its early days, it was a new, largely untested idea. During a hack week, a couple of engineers laid the groundwork for using this technology at Twitter. I worked closely with them and bootstrapped the team that eventually built Twitter’s GraphQL API, replacing the legacy REST services.

I still think about this experience today. It required convincing leadership and building consensus across numerous teams and stakeholders, but once we did, the payoff was significant: this one technical choice accelerated the velocity of product feature teams across the company.

Why I picked Airbnb

When Airbnb reached out in 2022, I realized my time at Twitter was coming to a close. By that point, my organization was well-run and high-performing — a success, but also a sign that I was ready for my next adventure.

Airbnb immediately stood out because the company offered, for the first time in my career, a true alignment between my personal and technical interests. I love traveling, and I have been a long-time Airbnb guest since 2013. I had always wanted to work for a company that built a product I truly cared about, and this was my chance.

I only got more excited when I learned about the people and teams I’d be working with. The Developer Platform organization, which was responsible for supporting all of Airbnb’s engineers, faced challenges I’d seen before. There was a lot of good work happening in silos, and folks were longing for a clear strategy and direction. Also, I saw an opportunity to not only improve developer experience but also build trust with the rest of the engineers and stakeholders.

So, I started at the beginning. We focused on setting up the organization, coaching leadership, and building internal alignment within the team, as well as external alignment across all the teams we supported. Fundamental questions like “Why are we here together?” and “Where are we going?” all had to be answered.

After a year or two of this work, we had a high-performing team with a clear strategy and strong execution, consistently delivering business value and improving the developer experience and productivity at Airbnb. Even more importantly, we earned the rest of the engineers’ trust, and we enabled our technical teams to perform better.

We saw this reflected in the bi-annual DevX surveys (which we built out), and the results showed overall developer satisfaction increasing about 10% year over year during my time on the team.

Solving new problems while working from anywhere

Today, I’m Senior Director of Engineering for Application & Cloud Infrastructure, which includes compute, networking, core services, and the GraphQL application platform. Our mission is to deliver reliable, secure, and efficient platforms for building, operating, and scaling applications, services, and workloads at Airbnb.

My primary users are still the engineers at Airbnb. When they need to compute, they don’t wrangle AWS themselves — we provide a layer of abstraction that helps them use low-level infrastructure. Similarly, if they need authorization, authentication, configuration management, and a host of other services, they come to us rather than starting from scratch.

I’m excited to come to work every day because of the people I get to work with and the opportunities we face together. The culture is excellent, the people are smart and collaborative, and the engineers we support appreciate the work we do.

The setup is empowering, too, and as you solve problems, you can grow and expand to tackle bigger problems that span teams and organizations. Add in the ability to work from anywhere, and for me, it feels like the sky is the limit.

As I look back on my career, and really, my entire life, I tend to see it now through the lens of long-distance trail running — a major hobby of mine.

After working at a startup, having twins, and running my first marathon, I felt like I could do anything. At work and on the trails, I think about how to prepare for the journeys ahead and how to maintain a pace that allows me and the people around me to thrive in the long run. Recovery is necessary, but so is strategy, drive, discipline, and finding the people who will go with you as well as cheer you on along the way.

I’m happy this path, as unpredictable as it has been, has taken me to Airbnb. Airbnb is in that ideal position between a startup and a long-established company. The systems and workflows are mature, but there are still many interesting problems to solve and opportunities to pursue. If that’s of interest to you, I encourage you to check out openings at Airbnb.

My Journey to Airbnb: Peter Coles

laurenmackevich — Fri, 30 Jan 2026 17:51:36 +0000

Public school to PhD

The story of Airbnb’s Head Economist for Policy and Director of Data Science involves geology, co-teaching with a Nobel Prize winner, and CSI. (No, not the hit TV franchise.)

Peter Coles was born and raised in Milwaukee, Wisconsin. He studied math at Princeton, earned his PhD in economics at Stanford, and taught at Harvard Business School before joining eBay and becoming a Data Science leader at Airbnb.

As you’ll see from his story, Peter has a deep interest in how marketplaces work. By transitioning from academia to the business world, he not only gets to study first-hand data about millions of guests and hosts, but also to influence product and policy decisions. And he still gets to hang out with academics. Check out all the research Peter and his team are doing here.

My fascination with marketplaces goes back a long time.

Sometime around second grade in Milwaukee, Wisconsin (where I grew up), my friends and I had the great idea to run a rock stand. It was like a lemonade stand, but instead we would sell rocks. Rocks we found in the street. Neighborhood kids could find their own rocks and sell them, and we’d take 25%. Nobody got any sales. Fortunately I’ve learned a bit more about marketplaces since then — more about that shortly.

From kindergarten through high school, I was a public school kid. My parents valued education, and helping others — my father was a doctor, my mother a nutritionist — and those are values I still hold dear.

While I played soccer and tennis and was moderately social, this period was probably best defined by an obsession with competitions. Math, Chess, Science Olympiad, Quiz Bowl, Academic Decathlon, puzzle races with my younger brother — there was hardly a nerdy competition offered where I didn’t compete. Time well spent? Let’s just say I missed a lot of high school parties while studying to become the five-time Wisconsin State Rocks, Minerals, and Fossil Identification champion — so you can be the judge.

By the time college came around, it seemed time to rebel. I wanted to be done with the nerdy stuff. I applied and was accepted to Princeton, and started studying ancient history. After one semester and (in my view) an underappreciated essay on the Hittites, I was back to majoring in math. At least I was good at that! I figured I could work on practical skills later.

After graduating, I accepted a fellowship in Germany and continued to Stanford for a PhD in economics — a somewhat more applied science, though I focused on game theory, at the intersection of math and strategy. I had the good fortune there to be the second graduate student of Jon Levin, now the President of Stanford University, who taught me the importance of simplification in research — even when the subject matter itself is complex.

Even while in this still-theoretical space, I kept my feet on the ground — or at least on the pedals. During a monthlong break in my classes in Germany, I biked around Europe, crashing with friends and family members of my classmates — people I had never met before staying with them. In a sense, I was prototyping Airbnb well before it existed!

My time in graduate school was Silicon Valley in the 2000s, after the dot-com crash, so tech was in the midst of a renaissance. Many of my friends were at growing companies like Google and Amazon. It was very tempting to stay in California to be a part of this, but I ended up with one more stop in academia.

Markets in theory, markets in practice

Harvard Business School, known for its focus on managerial science, was perhaps the most compelling place in the academic world that would allow me to stay close to the tech industry. I got a double stroke of good fortune: not only was I offered an assistant professorship there, but I also got to co-teach with Al Roth, a founder of the field of Market Design. Al is still an important mentor to me, and later won a Nobel Prize!

In my time researching and teaching graduate students, I was exposed to many examples of market design, conducting research on the topic of “Matching”; that is, mechanisms to pair users from two groups, often when price cannot be used to clear the market. This covered strategy of participants, signaling in markets, and I even had a chance to improve the market for PhD economists. I also wrote a number of case studies, including on Zillow, Microsoft, Craigslist, and more. The teaching and writing was a lot of fun, but I also came to realize I wasn’t a fit for academia in the long term. My attention span was too short to dedicate most of my time to research papers (and especially peer reviews), but I was enormously appreciative of this phase of my career.

By this point it was 2013, and two simultaneous and interrelated phenomena were exploding in tech: mobile and the sharing economy. It was a perfect time to head back west and finally enter the tech world.

I landed at eBay, which for a student of marketplaces was an ideal company: just about everything is for sale, and it was ripe for market design. Steve Tadelis, a mentor from my Stanford days, had created one of the first economics teams inside a tech company, which I took over when Steve left. At the same time, eBay was getting on the data science train — this was before every company had a DS team — and my group joined another to form eBay’s Data Labs. One of my favorite projects there was a project called “What’s it Worth” (which I worked on with Airbnb colleague Dean Chen), where we developed a methodology for determining the fair market value of items. Some hands-on practical work, some modeling — this was just what I was hoping for.

In 2015 Riley Newman, one of Airbnb’s first employees and then its Head of Data Science, presented an even more enticing opportunity. The Airbnb platform was growing quickly, and for the first time attracting substantial regulatory attention. They needed an economics team to partner with the growing policy team, to jointly address the question of Airbnb’s relationship to cities. This was a new way for me to apply economics. I was all in.

Addressing Airbnb’s critical questions with data

When I think back to my eight and a half years so far at Airbnb, I view this as entailing three “phases.” In the first, I worked to address economic questions by establishing a global team of data scientists and economists to analyze the relationship of short-term rentals to the world.

Meanwhile, as Airbnb continued to grow, execs were asking big questions that couldn’t be answered by any specific data science team. They needed a group with visibility across the whole organization. So in this second phase, Jackson Wang and I founded a team called Central Strategy & Insights, or CSI.

The acronym was no coincidence: we saw ourselves as forensic investigators, piecing together stories as we collected evidence. One important period of CSI’s work addressed changes brought on by the pandemic — in particular a major adjustment in where guests were looking to stay, and the supply we’d need to accommodate them. We also led the company’s business reviews, and generated analyses to describe the business to shareholders ahead of the IPO.

My third phase at Airbnb started a lot like the first, but supersized: developing models to inform a well-considered approach to policy considerations, this time as travel rebounded after the pandemic and governments were no longer fully occupied with a public health emergency. Our newly expanded group of economics PhDs and analysts also came up with ways to evaluate Airbnb’s impact on guests, hosts, and society, including via our US Economic Impact Report.

A great balance of academic and applied science

Almost all of my first several years at Airbnb had been internally facing. That’s changed in recent times, as we’ve spun up and expanded a program to collaborate with academic researchers to analyze Airbnb’s data and improve the experience for users.

The first step was to figure out how to collaborate with external researchers, while respecting privacy and legal limitations. Collaboration interest then came quickly. We have now published well-received papers with professors from MIT, Berkeley, Stanford, UCLA, NYU and more, with others in progress. One paper I wrote with colleagues and academic partners develops foundations for what “quality” means in platforms, from an economic perspective.

We’ve also launched a monthly seminar where we invite our academic collaborators to discuss research with Airbnb data scientists and technologists. Developing research is great, but there’s nothing like live discussions to cross-pollinate and foster ideas. This builds on a strong collaborative learning tradition at Airbnb, with internal classes and reading groups to grow our skills and keep up with tech developments.

Alongside engaging with academia, I’m so excited my data science colleagues and I have a mandate to be innovative and proactive. We have the space and encouragement to work on big ideas, even if they might take a year or two to prove out — and perhaps more importantly, even if some of the ideas fail. But nothing is more important than the people. I am proud of the students, scientists, and even professors I have hired here over the years, and love seeing them grow and find success.

A license to tackle big topics, continual education, research on the product as well as its relationship to the outside world, amazing colleagues, and a direct connection to academia all make Airbnb a unique place to be a market designer, economist, and data scientist. Whether or not you spent your free time in eighth grade studying rocks.

If you want to learn more about the research happening at Airbnb you can read our published papers here. If this type of work interests you, check out our open roles.

Code of conduct

laurenmackevich — Wed, 07 Jan 2026 22:16:54 +0000

Our Pledge

We as members, contributors, and leaders pledge to make participation in our community a harassment-free experience for everyone, regardless of age, body size, visible or invisible disability, ethnicity, sex characteristics, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

We pledge to act and interact in ways that contribute to an open, welcoming, diverse, inclusive, and healthy community.

Our Standards

Examples of behavior that contributes to a positive environment for our community include:

Demonstrating empathy and kindness toward other people
Being respectful of differing opinions, viewpoints, and experiences
Giving and gracefully accepting constructive feedback
Accepting responsibility and apologizing to those affected by our mistakes, and learning from the experience
Focusing on what is best not just for us as individuals, but for the overall community

Examples of unacceptable behavior include:

The use of sexualized language or imagery, and sexual attention or advances of any kind
Trolling, insulting or derogatory comments, and personal or political attacks
Public or private harassment
Publishing others’ private information, such as a physical or email address, without their explicit permission
Other conduct which could reasonably be considered inappropriate in a professional setting

Enforcement Responsibilities

Community leaders are responsible for clarifying and enforcing our standards of acceptable behavior and will take appropriate and fair corrective action in response to any behavior that they deem inappropriate, threatening, offensive, or harmful.

Community leaders have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, and will communicate reasons for moderation decisions when appropriate.

Scope

This Code of Conduct applies within all community spaces, and also applies when an individual is officially representing the community in public spaces. Examples of representing our community include using an official email address, posting via an official social media account, or acting as an appointed representative at an online or offline event.

Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be reported to the community leaders responsible for enforcement at [email protected]. All complaints will be reviewed and investigated promptly and fairly.

All community leaders are obligated to respect the privacy and security of the reporter of any incident.

Enforcement Guidelines

Community leaders will follow these Community Impact Guidelines in determining the consequences for any action they deem in violation of this Code of Conduct:

1. Correction

Community Impact: Use of inappropriate language or other behavior deemed unprofessional or unwelcome in the community.

Consequence: A private, written warning from community leaders, providing clarity around the nature of the violation and an explanation of why the behavior was inappropriate. A public apology may be requested.

2. Warning

Community Impact: A violation through a single incident or series of actions.

Consequence: A warning with consequences for continued behavior. No interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, for a specified period of time. This includes avoiding interactions in community spaces as well as external channels like social media. Violating these terms may lead to a temporary or permanent ban.

3. Temporary Ban

Community Impact: A serious violation of community standards, including sustained inappropriate behavior.

Consequence: A temporary ban from any sort of interaction or public communication with the community for a specified period of time. No public or private interaction with the people involved, including unsolicited interaction with those enforcing the Code of Conduct, is allowed during this period. Violating these terms may lead to a permanent ban.

4. Permanent Ban

Community Impact: Demonstrating a pattern of violation of community standards, including sustained inappropriate behavior, harassment of an individual, or aggression toward or disparagement of classes of individuals.

Consequence: A permanent ban from any sort of public interaction within the community.

Attribution

This Code of Conduct is adapted from the Contributor Covenant, version 2.0, available at https://www.contributor-covenant.org/version/2/0/code_of_conduct.html.

Community Impact Guidelines were inspired by Mozilla’s code of conduct enforcement ladder.

For answers to common questions about this code of conduct, see the FAQ at https://www.contributor-covenant.org/faq. Translations are available at https://www.contributor-covenant.org/translations.