<![CDATA[Netflix TechBlog - Medium]]> https://netflixtechblog.com?source=rss----2615bd06b42e---4 https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png Netflix TechBlog - Medium https://netflixtechblog.com?source=rss----2615bd06b42e---4 Medium Mon, 16 Mar 2026 13:18:50 GMT <![CDATA[Scaling Global Storytelling: Modernizing Localization Analytics at Netflix]]> https://netflixtechblog.com/scaling-global-storytelling-modernizing-localization-analytics-at-netflix-816f47290641?source=rss----2615bd06b42e---4 https://medium.com/p/816f47290641 Fri, 06 Mar 2026 15:01:27 GMT 2026-03-06T15:01:28.509Z Valentin Geffrier, Tanguy Cornuau

Each year, we bring the Analytics Engineering community together for an Analytics Summit — a multi-day internal conference to share analytical deliverables across Netflix, discuss analytic practice, and build relationships within the community. This post is one of several topics presented at the Summit highlighting the breadth and impact of Analytics work across different areas of the business.

At Netflix, our goal is to entertain the world, which means we must speak the world’s languages. Given the company’s growth to serving 300 million+ members in more than 190+ countries and 50+ languages, the Localization team has had to scale rapidly in creating more dubs and subtitle assets than ever before. However, this growth created technical debt within our systems: a fragmented landscape of analytics workflows, duplicated pipelines, and siloed dashboards that we are now actively modernizing.

The Challenge: “Who Made This Dub?”

Historically, business logic for localization metrics was replicated across isolated domains. A question as simple as “Who made this dub/subtitle?” is actually complex — it requires mapping multiple data sources through intricate and constantly changing logic, which varies depending on the specific language asset type and creation workflow.

When this logic is copied into isolated pipelines for different use cases it creates two major risks: inconsistency in reporting and a massive maintenance burden whenever upstream logic changes. We realized we needed to move away from these vertical silos.

Our Modernization Strategy

To address this, we defined a vision centered on consolidation, standardization, and trust, executed through three strategic pillars:

1. The Audit and Consolidation Playbook

We initiated a comprehensive audit of over 40 dashboards and tools to assess usage and code quality. Our focus has shifted from patching frontend visualizations to consolidating backend pipelines. For example, we are currently merging three legacy dashboards related to dubbing partner KPIs (around operational performance, capacity, and finances), focusing first on a unified data and backend layer that can support a variety of future frontend iterations.

2. Reducing “Not-So-Tech” Debt

Technical debt isn’t just about code; it is also about the user experience. We define “Not-So-Tech Debt” as the friction stakeholders feel when tools are hard to interpret or can benefit from better storytelling. To fix this, we revamped our Language Asset Consumption tool — instead of reporting dub and subtitle metrics independently, we combine audio and text languages into one consumption language that helps differentiate Original Language versus Localized Consumption and measure member preferences between subtitles, dubs, or a combination of both for a given language. This unlocks more intuitive insights based on actual recurring stakeholder use cases.

3. Investing in Core Building Blocks

We are shifting to a write once, read many architecture. By centralizing business logic into unified tables — such as a “Language Asset Producer” table — we solve the “Who made this dub?” problem once. This centralized source now feeds into multiple downstream domains, including our Dub Quality and Translation Quality metrics, ensuring that any logic update propagates instantly across the ecosystem.

The Future: Event-Level Analytics

Looking ahead, we are moving beyond asset-level metrics to event-level analytics. We are building a generic data model to capture granular timed-text events, such as individual subtitle lines. This data helps us understand how subtitle characteristics (e.g. reading speed) affect member engagement and, in turn, refine the style guidelines we provide to our subtitle linguists to improve the member experience with localized content.

Ultimately, this modernization effort is about scaling our ability to measure and enhance the joy and entertainment we deliver to our diverse global audience, ensuring that every member, regardless of their language, has the best possible Netflix experience.


Scaling Global Storytelling: Modernizing Localization Analytics at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Optimizing Recommendation Systems with JDK’s Vector API]]> https://netflixtechblog.com/optimizing-recommendation-systems-with-jdks-vector-api-30d2830401ec?source=rss----2615bd06b42e---4 https://medium.com/p/30d2830401ec Tue, 03 Mar 2026 01:36:46 GMT 2026-03-03T18:17:06.424Z By Harshad Sane

Ranker is one of the largest and most complex services at Netflix. Among many things, it powers the personalized rows you see on the Netflix homepage, and runs at an enormous scale. When we looked at CPU profiles for this service, one feature kept standing out: video serendipity scoring — the logic that answers a simple question:

“How different is this new title from what you’ve been watching so far?”

This single feature was consuming about 7.5% of total CPU on each node running the service. What started as a simple idea — “just batch the video scoring feature” — turned into a deeper optimization journey. Along the way we introduced batching, re-architected memory layout and tried various libraries to handle the scoring kernels.

Read on to learn how we achieved the same serendipity scores, but at a meaningfully lower CPU per request, resulting in a reduced cluster footprint.

Problem: The Hotspot in Ranker

At a high level, serendipity scoring works like this: A candidate title and each item in a member’s viewing history are represented as embeddings in a vector space. For each candidate, we compute its similarity against the history embeddings, find the maximum similarity, and convert that into a “novelty” score. That score becomes an input feature to the downstream recommendation logic.

The original implementation was straightforward but expensive. For each candidate we fetch its embedding, loop over the history to compute cosine similarity one pair at a time and track the maximum similarity score. Although it is easy to reason about, at Ranker’s scale, this results in significant sequential work, repeated embedding lookups, scattered memory access, and poor cache locality. Profiling confirmed this.

Flamegraph showing inefficient scoring

A flamegraph made it clear: One of the top hotspots in the service was Java dot products inside the serendipity encoder. Algorithmically, the hotspot was a nested loop structure of M candidates × N history items where each pair generates its own cosine similarity i.e. O(M×N) separate dot product operations.

Solution

The Original Implementation: Single video cosine loop

In simplified form the code looked like this:

for (Video candidate : candidates) {
Vector c = embedding(candidate); // D-dimensional
double maxSim = -1.0;

for (Video h : history) {
Vector v = embedding(h); // D-dimensional
double sim = cosine(c, v); // dot(c, v) / (||c|| * ||v||)
maxSim = Math.max(maxSim, sim);
}

double serendipity = 1.0 - maxSim;
emitFeature(candidate, serendipity);
}

The nested for loop with O(M×N) separate dot products brought upon its own overheads. One interesting detail we learned by instrumenting traffic shapes: most requests (about 98%) were single-video, but the remaining 2% were large batch requests. Because those batches were so large, the total volume of videos processed ended up being roughly 50:50 between single and batch jobs. This made batching worth pursuing even if it didn’t help the median request.

Step 1 : Batching, from Nested Loops to Matrix Multiply

The first idea was to stop thinking in terms of “many small dot products” and instead treat the work as a matrix operation. i.e. For batch candidates, implement a data layout to parallelize the math in a single operation i.e. matrix multiply. If D is the embedding dimension:

  1. Pack all candidate embeddings into a matrix A of shape M x D
  2. Pack all history embeddings into a matrix B of shape N x D
  3. Normalize all rows to unit length.
  4. Compute: cosine similarities as
    [ C = A x B^T ]; where C is an M x N matrix of cosine similarities.

In pseudo‑code:

// Build matrices
double[][] A = new double[M][D]; // candidates
double[][] B = new double[N][D]; // history

for (int i = 0; i < M; i++) {
A[i] = embedding(candidates[i]).toArray();
}
for (int j = 0; j < N; j++) {
B[j] = embedding(history[j]).toArray();
}

// Normalize rows to unit vectors
normalizeRows(A);
normalizeRows(B);

// Compute C = A * B^T
double[][] C = matmul(A, B);
C[i][j] = cosine(candidates[i], history[j])

// Derive serendipity
for (int i = 0; i < M; i++) {
double maxSim = max(C[i][0..N-1]);
double serendipity = 1.0 - maxSim;
emitFeature(candidates[i], serendipity);
}

This turns M×N separate dot products into a single matrix multiply, which is exactly what CPUs and optimized kernels are built for. We integrated this into the existing framework by supporting both, encode()for single videos and batchEncode() for batches, while maintaining backward compatibility. At this point it seemed like we were “done”, but we weren't.

Step 2: When Batching Isn’t Enough

Once we had a batched implementation, we ran canaries and saw something surprising: about a 5% performance regression. The algorithm wasn’t the issue — turning M×N separate dot products into a matrix multiplication is mathematically sound. The problem was the overhead we introduced in the first implementation.

  1. Our initial version built double[][] matrices for candidates, history, and results on every batch. Those large, short-lived allocations created GC pressure, and the double[][] layout itself is non-contiguous in memory, which meant extra pointer chasing and worse cache behavior.
  2. On top of that, the first-cut Java matrix multiply was a straightforward scalar implementation, so it couldn’t take advantage of SIMD. In other words, we paid the cost of batching without getting the compute efficiency we were aiming for.

The lesson was immediate: algorithmic improvements don’t matter if the implementation details—memory layout, allocation strategy, and the compute kernel—work against you. That set up the next step for making the data layout cache-friendly and eliminating per-batch allocations before revisiting the matrix multiply kernel.

Step 3: Flat Buffers & ThreadLocal Reuse

We reworked the data layout to be cache-friendly and allocation-light. Instead of double[m][n], we moved to flat double[] buffers in row-major order. That gave us contiguous memory and predictable access patterns. Then we introduced a ThreadLocal<BufferHolder> that owns reusable buffers for candidates, history, and any other scratch space. Buffers grow as needed but never shrink, which avoids per-request allocation while keeping each thread isolated (no contention). A simplified sketch:

class BufferHolder {  
double[] candidatesFlat = new double[0];
double[] historyFlat = new double[0];

double[] getCandidatesFlat(int required) {
if (candidatesFlat.length < required) {
candidatesFlat = new double[required];
}
return candidatesFlat;
}

double[] getHistoryFlat(int required) {
if (historyFlat.length < required) {
historyFlat = new double[required];
}
return historyFlat;
}
}

private static final ThreadLocal<BufferHolder> threadBuffers =
ThreadLocal.withInitial(BufferHolder::new);

This change alone made the batched path far more predictable: fewer allocations, less GC pressure, and better cache locality.

Now the remaining question was the one we originally thought we were answering: what’s the best way to do the matrix multiply?

Step 4: BLAS: Great in Tests, Not in Production

The obvious next step was BLAS (Basic Linear Algebra Subprograms). In isolation, microbenchmarks looked promising. But once integrated into the real batch scoring path, the gains didn’t materialize. A few things were working against us:

  • The default netlib-java path was using F2J (Fortran-to-Java) BLAS rather than a truly native implementation.
  • Even with native BLAS, we paid overhead for setup and JNI transitions.
  • Java’s row-major layout doesn’t match the column-major expectations of many BLAS routines, which can introduce conversion and temporary buffers.
  • Those extra allocations and copies mattered in the full pipeline, especially alongside TensorFlow embedding work.

BLAS was still a useful experiment — it clarified where time was being spent, but it wasn’t the drop-in win we wanted. What we needed was something that stayed pure Java, fit our flat-buffer architecture, and could still exploit SIMD.

Step 5: JDK Vector API to the rescue

A Short Note on the JDK Vector API: The JDK Vector API is an incubating feature that provides a portable way to express data-parallel operations in Java — think “SIMD without intrinsics”. You write in terms of vectors and lanes, and the JIT maps those operations to the best SIMD instructions available on the host CPU (SSE/AVX2/AVX-512), with a scalar fallback when needed. More crucially for us, it’s pure Java: no native dependencies, no JNI transitions, and a development model that looks like normal Java code rather than platform-specific assembly or intrinsics.

This was a particularly good match for our workload because we had already moved embeddings into flat, contiguous double[] buffers, and the hot loop was dominated by large numbers of dot products. The final step was to replace BLAS with a pure-Java SIMD implementation using the JDK Vector API. By this point we already had the right shape for high performance — batching, flat buffers, and ThreadLocal reuse. So the remaining work was to swap out the compute kernel without introducing JNI overhead or platform-specific code. We did that behind a small factory. At class load time, MatMulFactory selects the best available implementation:

  • If jdk.incubator.vector is available, use a Vector API implementation.
  • Otherwise, fall back to a scalar implementation with a highly optimized loop-unrolled dot product (implemented by my colleague Patrick Strawderman, inspired by patterns used in Lucene)

In the Vector API implementation, the inner loop computes a dot product by accumulating a * b into a vector accumulator using fma() (fused multiply-add). DoubleVector.SPECIES_PREFERRED lets the runtime pick an appropriate lane width for the machine. Here’s a simplified sketch of the inner loop:

// Vector API path (simplified)  
for (int i = 0; i < M; i++) {
for (int j = 0; j < N; j++) {

DoubleVector acc = DoubleVector.zero(SPECIES);
int k = 0;
// SPECIES.length() (e.g. often 4 doubles on AVX2 and 8 doubles on AVX-512).
for (; k + SPECIES.length() <= D; k += SPECIES.length()) {
DoubleVector a = DoubleVector.fromArray(SPECIES, candidatesFlat, i*D + k);
DoubleVector b = DoubleVector.fromArray(SPECIES, historyFlat, j*D + k);
acc = a.fma(b, acc); // fused multiply-add
}
double dot = acc.reduceLanes(VectorOperators.ADD);
// handle tail k..D-1
similaritiesFlat[i*N + j] = dot;
}
}

Figure below shows how the Vector API utilizes SIMD hardware to process multiple doubles per instruction (e.g., 4 lanes on AVX2 and 8 lanes on AVX‑512). What used to be many scalar multiply-adds becomes a smaller number of vector fma() operations plus a reduction—same algorithm, much better use of the CPU’s vector units.

Vectorization with SIMD

Fallbacks & Safety: When the Vector API Isn’t Available

Because the Vector API is still incubating, it requires a runtime flag: --add-modules=jdk.incubator.vector We didn’t want correctness or availability to depend on that flag. So we designed the fallback behavior explicitly: At startup, we detect Vector API support and use the SIMD batched matmul when available; otherwise we fall back to an optimized scalar path, with single-video requests continuing to use the per-item implementation.

That gives us a clean operational story: services can opt in to the Vector API for maximum performance, but the system remains safe and predictable without it.

Results in Production:

With the full design in place with batching, flat buffers, ThreadLocal reuse, and the Vector API, we ran canaries that run production traffic. We observed a ~7% drop in CPU utilization and ~12% drop in average latency. To normalize across any small throughput differences, we also tracked CPU/RPS (CPU consumed per request-per-second). That metric improved by roughly 10%, meaning we could handle the same traffic with about 10% less CPU, and we saw similar numbers hold after full production rollout.

CPU/RPS on Ranker

At the function operator level, we saw the CPU drop from the initial 7.5% to a merely ~1% with the optimization in place. At the assembly level, the shift was clear: from loop-unrolled scalar dot products to a vectorized matrix multiply on AVX-512 hardware.

Assembly snippet from batchEncode

Closing Thoughts

This optimization ended up being less about finding the “fastest library” and more about getting the fundamentals right: choosing the right computation shape, keeping data layout cache-friendly, and avoiding overheads that can erase theoretical wins. Once those pieces were in place, the JDK Vector API was a great fit, as it let us express SIMD-style math in pure Java, without JNI, while still keeping a safe fallback path. Another bonus was the low developer overhead: compared to lower-level approaches, the Vector API let us replace a much larger, more complex implementation with a relatively small amount of readable Java code, which made it easier to review, maintain, and iterate on.

Have you tried the Vector API in a real service yet? I’d love to hear what workloads it helped (or didn’t), and what you learned about benchmarking and rollout in production.

Special thanks to Jason Koch, Patrick Strawderman, Daniel Huang, Fan Yang, and the Performance Engineering team at Netflix


Optimizing Recommendation Systems with JDK’s Vector API was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Mount Mayhem at Netflix: Scaling Containers on Modern CPUs]]> https://netflixtechblog.com/mount-mayhem-at-netflix-scaling-containers-on-modern-cpus-f3b09b68beac?source=rss----2615bd06b42e---4 https://medium.com/p/f3b09b68beac Sat, 28 Feb 2026 22:55:53 GMT 2026-02-28T22:55:54.664Z Authors: Harshad Sane, Andrew Halaney

Imagine this — you click play on Netflix on a Friday night and behind the scenes hundreds of containers spring to action in a few seconds to answer your call. At Netflix, scaling containers efficiently is critical to delivering a seamless streaming experience to millions of members worldwide. To keep up with responsiveness at this scale, we modernized our container runtime, only to hit a surprising bottleneck: the CPU architecture itself.

Let us walk you through the story of how we diagnosed the problem and what we learned about scaling containers at the hardware level.

The Problem

When application demand requires that we scale up our servers, we get a new instance from AWS. To use this new capacity efficiently, pods are assigned to the node until its resources are considered fully allocated. A node can go from no applications running to being maxed out within moments of being ready to receive these applications.

As we migrated more and more from our old container platform to our new container platform, we started seeing some concerning trends. Some nodes were stalling for long periods of time, with a simple health check timing out after 30 seconds. An initial investigation showed that the mount table length was increasing dramatically in these situations, and reading it alone could take upwards of 30 seconds. Looking at systemd’s stack it was clear that it was busy processing these mount events as well and could lead to complete system lockup. Kubelet also timed out frequently talking to containerd in this period. Examining the mount table made it clear that these mounts were related to container creation.

The affected nodes were almost all r5.metal instances, and were starting applications whose container image contained many layers (50+).

Challenge

Mount Lock Contention

The flamegraph in Figure 1 clearly shows where containerd spent its time. Almost all of the time is spent trying to grab a kernel-level lock as part of the various mount-related activities when assembling the container’s root filesystem!

Figure 1: Flamegraph depicting lock contention

Looking closer, containerd executes the following calls for each layer if using user namespaces:

  1. open_tree() to get a reference to the layer / directory
  2. mount_setattr() to set the idmap to match the container’s user range, shifting the ownership so this container can access the files
  3. move_mount() to create a bind mount on the host with this new idmap applied

These bind mounts are owned by the container’s user range and are then used as the lowerdirs to create the overlayfs-based rootfs for the container. Once the overlayfs rootfs is mounted, the bind mounts are then unmounted since they are not necessary to keep around once the overlayfs is constructed.

If a node is starting many containers at once, every CPU ends up busy trying to execute these mounts and umounts. The kernel VFS has various global locks related to the mount table, and each of these mounts requires taking that lock as we can see in the top of the flamegraph. Any system trying to quickly set up many containers is prone to this, and this is a function of the number of layers in the container image.

For example, assume a node is starting 100 containers, each with 50 layers in its image. Each container will need 50 bind mounts to do the idmap for each layer. The container’s overlayfs mount will be created using those bind mounts as the lower directories, and then all 50 bind mounts can be cleaned up via umount. Containerd actually goes through this process twice, once to determine some user information in the image and once to create the actual rootfs. This means the total number of mount operations on the start up path for our 100 containers is 100 * 2 * (1 + 50 + 50) = 20200 mounts, all of which require grabbing various global mount related locks!

Diagnosis

What’s Different In The New Runtime?

As alluded to in the introduction, Netflix has been undergoing a modernization of its container runtime. In the past a virtual kubelet + docker solution was used, whereas now a kubelet + containerd solution is being used. Both the old runtime and the new runtime used user namespaces, so what’s the difference here?

  1. Old Runtime:
    All containers shared a single host user range. UIDs in image layers were shifted at untar time, so file permissions matched when containers accessed files. This worked because all containers used the same host user.
  2. New Runtime:
    Each container gets a unique host user range, improving security — if a container escapes, it can only affect its own files. To avoid the costly process of untarring and shifting UIDs for every container, the new runtime uses the kernel’s idmap feature. This allows efficient UID mapping per container without copying or changing file ownership, which is why containerd performs many mounts.

Figure 2 below is a simplified example of how this idmap feature looks like:

Figure 2: idmap feature

Why Does Instance Type Matter?

As noted earlier, the issue was predominantly occurring on r5.metal instances. Once we identified the root issue we could easily reproduce by creating a container image with many layers and sending hundreds of workloads using the image to a test node.

To better understand why this bottleneck was more profound on some instances compared to others, we benchmarked container launches on different AWS instance types:

  • r5.metal (5th gen Intel, dual-socket, multiple NUMA domains)
  • m7i.metal-24xl (7th gen Intel, single-socket, single NUMA domain)
  • m7a.24xlarge (7th gen AMD, single-socket, single NUMA domain)

Baseline Results

Figure 3 shows the baseline results from scaling containers on each instance type

  • At low concurrency (≤ ~20 containers), all platforms performed similarly
  • As concurrency increased, r5.metal began to fail around 100 containers
  • 7th generation AWS instances maintained lower launch times and higher success rates as concurrency grew
  • m7a instances showed the most consistent scaling behavior with the lowest failure rates even at high concurrency

Deep Dive

Using perf record and custom microbenchmarks, we can see the hottest code path was in the Linux kernel’s Virtual Filesystem (VFS) path lookup code — specifically, a tight spin loop waiting on a sequence lock in path_init(). The CPU spent most of its time executing the pause instruction, indicating many threads were spinning, waiting for the global lock, as shown in the disassembly snippet below

path_init():

mov mount_lock,%eax
test $0x1,%al
je 7c
pause

Using Intel’s Topdown Microarchitecture Analysis (TMA), we observed:

  • 95.5% of pipeline slots were stalled on contested accesses (tma_contested_accesses).
  • 57% of slots were due to false sharing (multiple cores accessing the same cache line).
  • Cache line bouncing and lock contention were the primary culprits.

Given a high amount of time being spent in contested accesses, the natural thinking from a perspective of hardware variations led to investigation of NUMA and Hyperthreading impact coming from the architecture to this subset

NUMA Effects

Non-Uniform Memory Access (NUMA) is a system design where each processor has its own local memory for faster access but relies on an interconnect to access the memory attached to a remote processor. Introduced in the 1990s to improve scalability in multiprocessor systems, NUMA boosts performance but also introduces higher latency when a CPU needs to access memory attached to another processor. Figure 4 is a simple image describing local vs remote access patterns of a NUMA architecture

Figure 4: Source: https://pmem.io/images/posts/numa_overview.png

AWS instances come in a variety of shapes and sizes. To obtain the largest core count, we tested the 2-socket 5th generation metal instances (r5.metal), on which containers were orchestrated by the titus agent. Modern dual-socket architectures implement NUMA design, leading to faster local but higher remote access latencies. Although container orchestration can maintain locality, global locks can easily run into high latency effects due to remote synchronization. In order to test the impact of NUMA, we tested an AWS 48xl sized instance with 2 NUMA nodes or sockets versus an AWS 24xl sized instance, which represents a single NUMA node or socket. As seen from Figure 5, the extra hop introduces high latencies and hence failures very quickly.

Figure 5: Numa Impact

Hyperthreading Effects

  • Hyperthreading (HT): Disabling HT on m7i.metal-24xl (Intel) improved container launch latencies by 20–30% as seen in Figure 6, since hyperthreads compete for shared execution resources, worsening the lock contention. When hyperthreading is enabled, each physical CPU core is split into two logical CPUs (hyperthreads) that share most of the core’s execution resources, such as caches, execution units, and memory bandwidth. While this can improve throughput for workloads that are not fully utilizing the core, it introduces significant challenges for workloads that rely heavily on global locks. By disabling hyperthreading, each thread runs on its own physical core, eliminating this competition for shared resources between hyperthreads. As a result, threads can acquire and release global locks more quickly, reducing overall contention and improving latency for operations that generally share underlying resources.
Figure 6: Hyperthreading impact

Why Does Hardware Architecture Matter?

Centralized Cache Architectures

Some modern server CPUs use a mesh-style interconnect to link cores and cache slices, with each intersection managing cache coherence for a subset of memory addresses. In these designs, all communication passes through a central queueing structure, which can only handle one request for a given address at a time. When a global lock (like the mount lock) is under heavy contention, all atomic operations targeting that lock are funneled through this single queue, causing requests to pile up and resulting in memory stalls and latency spikes.

In some well-known mesh-based architectures as shown in Figure 7 below, this central queue is called the “Table of Requests” (TOR), and it can become a surprising bottleneck when many threads are fighting for the same lock. If you’ve ever wondered why certain CPUs seem to “pause for breath” under heavy contention, this is often the culprit.

Figure 7: Public document from one of the major CPU vendors Source:https://www.intel.com/content/dam/developer/articles/technical/ddio-analysis-performance-monitoring/Figure1.png

Distributed Cache Architectures

Some modern server CPUs use a distributed, chiplet-based architecture (Figure 8), where multiple core complexes, each with their own local last-level cache — are connected via a high-speed interconnect fabric. In these designs, cache coherence is managed within each core complex, and traffic between complexes is handled by a scalable control fabric. Unlike mesh-based architectures with centralized queueing structures, this distributed approach spreads contention across multiple domains, making severe stalls from global lock contention less likely. For those interested in the technical details, public documentation from major CPU vendors provides deeper insight into these distributed cache and chiplet designs.

Figure 8: Public document from one of the major CPU vendors, Source: (AMD EPYC 9004 Genoa Chiplet Architecture 8x CCD — ServeTheHome)

Here is a comparison of the same workload run on m7i (centralized cache architecture) vs m7a (distributed cache architecture). Note that, in order to make it closely comparable, Hyperthreading (HT) was disabled on m7i, given previous regression seen in Figure 6, and experiments were run using same core counts. The result clearly shows a fairly consistent difference in performance of approximately 20% as shown in Figure 9

Figure 9: Architectural impact between m7i and m7a

Microbenchmark Results

To prove the above theory related to NUMA, HT and micro-architecture, we developed a small microbenchmark which basically invokes a given number of threads that then spins on a globally contended lock. Running the benchmark at increasing thread counts reveals the latency characteristics of the system under different scenarios. For example, Figure 10 below is the microbenchmark results with NUMA, HT and different microarchitectures.

Figure 10: Global lock contention benchmark results

Results from this custom synthetic benchmark (pause_bench) confirmed:

  • On r5.metal, eliminating NUMA by only using a single socket significantly drops latency at high thread counts
  • On m7i.metal-24xl, disabling hyperthreading further improves scaling
  • On m7a.24xlarge, performance scales the best, demonstrating that a distributed cache architecture handles cache-line contention in this case of global locks more gracefully.

Improving Software Architecture

While understanding the impacts of the hardware architecture is important for assessing possible mitigations, the root cause here is contention over a global lock. Working with containerd upstream we came to two possible solutions:

  1. Use the newer kernel mount API’s fsconfig() lowerdir+ support to supply the idmap’ed lowerdirs as fd’s instead of filesystem paths. This avoids the move_mount() syscall mentioned prior which requires global locks to mount each layer to the mount table
  2. Map the common parent directory of all the layers. This makes the number of mount operations go from O(n) to O(1) per container, where n is the number of layers in the image

Since using the newer API requires using a new kernel, we opted to make the latter change to benefit more of the community. With that in place, no longer do we see containerd’s flamegraph being dominated by mount-related operations. In fact, as seen in Figure 11 below we had to highlight them in purple below to see them at all!

Figure 11: Optimized solution

Conclusion

Our journey migrating to a modern kubelet + containerd runtime at Netflix revealed just how deeply intertwined software and hardware architecture can be when operating at scale. While kubelet/containerd’s usage of unique container users brought significant security gains, it also surfaced new bottlenecks rooted in kernel and CPU architecture — particularly when launching hundreds of many layered container images in parallel. Our investigation highlighted that not all hardware is created equal for this workload: centralized cache management amplified cache contention while distributed cache design smoothly scaled under load.

Ultimately, the best solution combined hardware awareness with software improvements. For an immediate mitigation we chose to route these workloads to CPU architectures that scaled better under these conditions. By changing the software design to minimize per-layer mount operations, we eliminated the global lock as a launch-time bottleneck — unlocking faster, more reliable scaling regardless of the underlying CPU architecture. This experience underscores the importance of holistic performance engineering: understanding and optimizing both the software stack and the hardware it runs on is key to delivering seamless user experiences at Netflix scale.

We trust these insights will assist others in navigating the evolving container ecosystem, transforming potential challenges into opportunities for building robust, high-performance platforms.

Special thanks to the Titus and Performance Engineering teams at Netflix.


Mount Mayhem at Netflix: Scaling Containers on Modern CPUs was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix]]> https://netflixtechblog.com/mediafm-the-multimodal-ai-foundation-for-media-understanding-at-netflix-e8c28df82e2d?source=rss----2615bd06b42e---4 https://medium.com/p/e8c28df82e2d Mon, 23 Feb 2026 18:24:32 GMT 2026-02-24T01:29:37.245Z Avneesh Saluja, Santiago Castro, Bowei Yan, Ashish Rastogi

Introduction

Netflix’s core mission is to connect millions of members around the world with stories they’ll love. This requires not just an incredible catalog, but also a deep, machine-level understanding of every piece of content in that catalog, from the biggest blockbusters to the most niche documentaries. As we onboard new types of content such as live events and podcasts, the need to scalably understand these nuances becomes even more critical to our productions and member-facing experiences.

Many of these media-related tasks require sophisticated long-form video understanding e.g., identifying subtle narrative dependencies and emotional arcs that span entire episodes or films. Previous work has found that to truly grasp the content’s essence, our models must leverage the full multimodal signal. For example, the audio soundtrack is a crucial, non-visual modality that can help more precisely identify clip-level tones or when a new scene starts. Can we use our collection of shows and movies to learn how to a) fuse modalities like audio, video, and subtitle text together and b) develop robust representations that leverage the narrative structure that is present in long form entertainment? Consisting of tens of millions of individual shots across multiple titles, our diverse yet entertainment-specific dataset provides the perfect foundation to train multimodal media understanding models that enable many capabilities across the company such as ads relevancy, clip popularity prediction, and clip tagging.

For these reasons, we developed the Netflix Media Foundational Model (MediaFM), our new, in-house, multimodal content embedding model. MediaFM is the first tri-modal (audio, video, text) model pretrained on portions of the Netflix catalog. Its core is a multimodal, Transformer-based encoder designed to generate rich, contextual embeddings¹ for shots from our catalog by learning the temporal relationships between them through integrating visual, audio, and textual information. The resulting shot-level embeddings are powerful representations designed to create a deeper, more nuanced, and machine-readable understanding of our content, providing the critical backbone for effective cold start of newly launching titles in recommendations, optimized promotional assets (like art and trailers), and internal content analysis tools.

Figure 1: MediaFM Architecture

Input Representation & Preprocessing

The model’s fundamental unit of input is a shot, derived by segmenting a movie or episode (collectively referred to as “title”) using a shot boundary detection algorithm. For each shot, we generate three distinct embeddings from its core modalities:

  • Video: an internal model called SeqCLIP (a CLIP-style model fine-tuned on video retrieval datasets) is used to embed frames sampled at uniform intervals from segmented shots
  • Audio: the audio samples from the same shots are embedded using Meta FAIR’s wav2vec2
  • Timed Text: OpenAI’s text-embedding-3-large model is used to encode the corresponding timed text (e.g., closed captions, audio descriptions, or subtitles) for each shot

For each shot, the three embeddings² are concatenated and unit-normed to form a single 2304-dimensional fused embedding vector. The transformer encoder is trained on sequences of shots, so each example in our dataset is a temporally-ordered sequence of these fused embeddings from the same movie or episode (up to 512 shots per sequence). We also have access to title-level metadata which is used to provide global context for each sequence (via the [GLOBAL]token). The title-level embedding is computed by passing title-level metadata (such as synopses and tags) through the text-embedding-3-large model.

Model Architecture and Training Objective

The core of our model is a transformer encoder, architecturally similar to BERT. A sequence of preprocessed shot embeddings is passed through the following stages:

  1. Input Projection: The fused shot embeddings are first projected down to the model’s hidden dimension via a linear layer.
  2. Sequence Construction & Special Tokens: Before entering the Transformer, two special embeddings are prepended to the sequence:
    • a learnable [CLS] embedding is added at the very beginning.
    • the title-level embedding is projected to the model’s hidden dimension and inserted after the [CLS] token as the [GLOBAL] token, providing title-level context to every shot in the sequence and participating in the self-attention process.
  3. Contextualization: The sequence is enhanced with positional embeddings and fed through the Transformer stack to provide shot representations based on their surrounding context.
  4. Output Projection: The contextualized hidden states from the Transformer are passed through a final linear layer, projecting them from the hidden layers back up to the 2304-dimensional fused embedding space for prediction.

We train the model using a Masked Shot Modeling (MSM) objective. In this self-supervised task, we randomly mask 20% of the input shot embeddings in each sequence by replacing them with a learnable [MASK] embedding. The model’s objective is to predict the original, unmasked fused embedding for these masked positions. The model is optimized by minimizing the cosine distance between its predicted embedding and the ground-truth embedding for each masked shot.

We optimized the hidden parameters with Muon and the remaining parameters with AdamW. It’s worth noting that the switch to Muon resulted in noticeable improvements.

Evaluation

To evaluate the learned embeddings, we learn task-specific linear layers on top of frozen representations (i.e., linear probes). Most of the tasks are clip-level, i.e., each example is a short clip ranging from a few seconds to a minute which are often presented to our members while recommending a title to them. When embedding these clips, we find that “embedding in context”, namely extracting the embeddings from within a larger sequence (e.g., the episode containing the clip), naturally does much better than embedding only the shots from a clip.

Tasks

Our embeddings are foundational and we find that they bring value to applications across Netflix. Here are a few:

  • Ad Relevancy: A multilabel classification task to categorize Netflix clips for relevant ad placement, measured by Average Precision. In this task, these representations operate at the retrieval stage, where they help in identifying the candidate set and in turn are fed into the ad serving system for relevance optimization.
  • Clip Popularity Ranking: A ranking task to predict the relative performance (in click-through rate, CTR) of a media clip relative to other clips from that show or movie, measured by a ten-fold with Kendall’s tau correlation coefficient.
  • Clip Tone: A multi-label classification of hook clips into 100 tone categories (e.g., creepy, scary, humorous) from our internal Metadata & Ratings team, measured by micro Average Precision (averaged across tone categories).
  • Clip Genre: A multi-label classification of clips into eleven core genres (Action, Anime, Comedy, Documentary, Drama, Fantasy, Horror, Kids, Romance, Sci-fi, Thriller) derived from the genre of the parent title, measured by macro Average Precision (averaged across genres).
  • Clip Retrieval: a binary classification of clips from movies or episodes into “clip-worthy” (i.e., a good clip to showcase the title) or not, as determined by human annotators, and as measured by Average Precision. The positive to negative clip ratio is 1:3, and for each title we select 6–10 positive clips and the corresponding number of negatives.

It’s worth noting that for the tasks above (as well as other tasks that use our model), the model outputs are utilized as information that the relevant teams use when driving to a decision rather than being used in a completely end-to-end fashion. Many of the improvements are also in various stages of deployment.

Results

Figure 2³ compares MediaFM to several strong baselines:

Figure 2: Performance of MediaFM vs. external and internal models.

On all tasks, MediaFM is better than the baselines. Improvements seem to be larger for tasks that require more detailed narrative understanding e.g., predicting the most relevant ads for an ad break given the surrounding context. We look further into this next.

Ablations

MediaFM’s primary improvements over previous Netflix work stem from two key areas: combining multiple modalities and learning to contextualize shot representations. To determine the contribution of each factor across different tasks, we compared MediaFM to a baseline. This baseline concatenates the three input embeddings, essentially providing the same complete, shot-level input as MediaFM but without the contextualization step. This comparison allows us to isolate which tasks benefit most from the contextualization aspect.

Additional modalities help somewhat for tone but the main improvement comes from contextualization.

Oddly, multiple uncontextualized modalities hurts the clip popularity ranking model, but adding contextualization significantly improves performance.

For clip retrieval we see a natural progression of around 15% for each improvement.

Next Steps

MediaFM presents a way to learn how to fuse and/or contextualize shot-level information by leveraging Netflix’s catalog in a self-supervised manner. With this perspective, we are actively investigating how pretrained multimodal (audio, video/image, text) LLMs like Qwen3-Omni, where the modality fusion has already been learned, can provide an even stronger starting point for subsequent model generations.

Next in this series of blog posts, we will present our method to embed title-level metadata and adapt it to our needs. Stay tuned!

Footnotes

  1. We chose embeddings over generative text outputs to prioritize modular design. This provides a tighter, cleaner abstraction layer: we generate the representation once, and it is consumed across our entire suite of services. This avoids the architectural fragility of fine-tuning, allowing us to enhance our existing embedding-based workflows with new modalities more flexibly.
  2. All of our data has audio and video; we zero-pad for missing timed text data, which is relatively likely to occur (e.g., in shots without dialogue).
  3. The title-level tasks couldn’t be evaluated with the VertexAI MM and Marengo embedding models as the videos exceed the length limit set by the APIs.

Acknowledgements

We would like to thank Matt Thanabalan and Chaitanya Ekanadham for their contributions to this work.


MediaFM: The Multimodal AI Foundation for Media Understanding at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Scaling LLM Post-Training at Netflix]]> https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194?source=rss----2615bd06b42e---4 https://medium.com/p/0046f8790194 Fri, 13 Feb 2026 08:05:33 GMT 2026-02-13T08:05:35.311Z Baolin Li, Lingyi Liu, Binh Tang, Shaojing Li

Introduction

Pre-training gives Large Language Models (LLMs) broad linguistic ability and general world knowledge, but post-training is the phase that actually aligns them to concrete intents, domain constraints, and the reliability requirements of production environments. At Netflix, we are exploring how LLMs can enable new member experiences across recommendation, personalization, and search, which requires adapting generic foundation models so they can better reflect our catalog and the nuances of member interaction histories. At Netflix scale, post-training quickly becomes an engineering problem as much as a modeling one: building and operating complex data pipelines, coordinating distributed state across multi-node GPU clusters, and orchestrating workflows that interleave training and inference. This blog describes the architecture and engineering philosophy of our internal Post-Training Framework, built by the AI Platform team to hide infrastructure complexity so researchers and model developers can focus on model innovation — not distributed systems plumbing.

A Model Developer’s Post-Training Journey

Post-training often starts deceptively simply: curate proprietary domain data, load an open-weight model from Hugging Face, and iterate batches through it. At the experimentation scale, that’s a few lines of code. But when fine-tuning production-grade LLMs at scale, the gap between “running a script” and “robust post-training” becomes an abyss of engineering edge cases.

Figure 1. Simple steps to post-train an open-weight model.

Getting the data right

On paper, post-training is straightforward: choose a tokenizer, preprocess the dataset, and build a dataloader. In practice, data preparation is where things break. High-quality post-training — instruction following, multi-turn dialogue, Chain-of-Thought — depends on precisely controlling which tokens contribute to the loss. Hugging Face chat templates serialize conversations, but don’t specify what to train on versus ignore. The pipeline must apply explicit loss masking so only assistant tokens are optimized; otherwise the model learns from prompts and other non-target text, degrading quality.

Variable sequence length is another pitfall. Padding within a batch can waste compute, and uneven shapes across FSDP workers can cause GPU synchronization overhead. A more GPU-efficient approach is to pack multiple samples into fixed-length sequences and use a “document mask” to prevent cross-attention across samples, reducing padding and keeping shapes consistent.

Setting up the model

Loading an open-source checkpoint sounds simple until the model no longer fits on one GPU. At that point you need a sharding strategy (e.g., FSDP, TP) and must load partial weights directly onto the device mesh to avoid ever materializing the full model on a single device.

After loading, you still need to make the model trainable: choose full fine-tuning vs. LoRA, and apply optimizations like activation checkpointing, compilation, and correct precision settings (often subtle for RL, where rollout and policy precision must align). Large vocabularies (>128k) add a further memory trap: logits are [batch, seq_len, vocab] and can spike peak memory. Common mitigations include dropping ignored tokens before projection and computing logits/loss in chunks along the sequence dimension.

Starting the training

Even with data and models ready, production training is not a simple “for loop”. The system must support everything from SFT’s forward/backward pass to on-policy RL workflows that interleave rollout generation, reward/reference inference, and policy updates.

At Netflix scale, training runs as a distributed job. We use Ray to orchestrate workflows via actors, decoupling modeling logic from hardware. Robust runs also require experiment tracking (model quality metrics like loss and efficiency metrics like MFU) and fault tolerance via standardized checkpoints to resume cleanly after failures.

These challenges motivate a post-training framework that lets developers focus on modeling rather than distributed systems and operational details.

The Netflix Post-Training Framework

We built Netflix’s LLM post-training framework so Netflix model developers can turn ideas like those in Figure 1 into scalable, robust training jobs. It addresses the engineering hurdles described above, and also constraints that are specific to the Netflix ecosystem. Existing tools (e.g., Thinking Machines’ Tinker) work well for standard chat and instruction-tuning, but their structure can limit deeper experimentation. In contrast, our internal use cases often require architectural variation (for example, customizing output projection heads for task-specific objectives), expanded or nonstandard vocabularies driven by semantic IDs or special tokens, and even transformer models pre-trained from scratch on domain-specific, non-natural-language sequences. Supporting this range requires a framework that prioritizes flexibility and extensibility over a fixed fine-tuning paradigm.

Figure 2. The post-training library within Netflix stack

Figure 2 shows the end-to-end stack from infrastructure to trained models. At the base is Mako, Netflix’s internal ML compute platform, which provisions GPUs on AWS. On top of Mako, we run robust open-source components — PyTorch, Ray, and vLLM — largely out of the box. Our post-training framework sits above these foundations as a library: it provides reusable utilities and standardized training recipes for common workflows such as Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), Reinforcement Learning (RL), and Knowledge Distillation. Users typically express jobs as configuration files that select a recipe and plug in task-specific components.

Figure 3. Main components developed for the post-training framework

Figure 3 summarizes the modular components we built to reduce complexity across four dimensions. As with most ML systems, training success hinges on three pillars — Data, Model, and Compute — and the rise of RL fine-tuning adds a fourth pillar: Workflow, to support multi-stage execution patterns that don’t fit a simple training loop. Below, we detail the specific abstractions and features the framework provides for each of these dimensions:

  • Data: Dataset abstractions for SFT, reward modeling, and RL; high-throughput streaming from cloud and disk for datasets that exceed local storage; and asynchronous, on-the-fly sequence packing to overlap CPU-heavy packing with GPU execution and reduce idle time.
  • Model: Support for modern architectures (e.g., Qwen3, Gemma3) and Mixture-of-Experts variants (e.g., Qwen3 MoE, GPT-OSS); LoRA integrated into model definitions; and high-level sharding APIs so developers can distribute large models across device meshes without writing low-level distributed code.
  • Compute: A unified job submission interface that scales from a single node to hundreds of GPUs; MFU (Model FLOPS Utilization) monitoring that remains accurate under custom architectures and LoRA; and comprehensive checkpointing (states of trained parameters, optimizer, dataloader, data mixer, etc.) to enable exact resumption after interruptions.
  • Workflow: Support for training paradigms beyond SFT, including complex online RL. In particular, we extend Single Program, Multiple Data (SPMD) style SFT workloads to run online RL with a hybrid single-controller + SPMD execution model, which we’ll describe next.

Today, this framework supports research use cases ranging from post-training large-scale foundation models to fine-tuning specialized expert models. By standardizing these workflows, we’ve lowered the barrier for teams to experiment with advanced techniques and iterate more quickly.

Learnings from Building the Post-Training Framework

Building a system of this scope wasn’t a linear implementation exercise. It meant tracking a fast-moving open-source ecosystem, chasing down failure modes that only appear under distributed load, and repeatedly revisiting architectural decisions as the post-training frontier shifted. Below are three engineering learnings and best practices that shaped the framework.

Scaling from SFT to RL

We initially designed the library around Supervised Fine-Tuning (SFT): relatively static data flow, a single training loop, and a Single Program, Multiple Data (SPMD) execution model. That assumption stopped holding in 2025. With DeepSeek-R1 and the broader adoption of efficient on-policy RL methods like GRPO, SFT became table stakes rather than the finish line. Staying close to the frontier required infrastructure that could move from “offline training loop” to “multi-stage, on-policy orchestration.”

SFT’s learning signal is dense and immediate: for each token position we compute logits over the full vocabulary and backpropagate a differentiable loss. Infrastructure-wise, this looks a lot like pre-training and maps cleanly to SPMD — every GPU worker runs the same step function over a different shard of data, synchronizing through Pytorch distributed primitives.

On-policy RL changes the shape of the system. The learning signal is typically sparse and delayed (e.g., a scalar reward at the end of an episode), and the training step depends on data generated by the current policy. Individual sub-stages — policy updates, rollout generation, reference model inference, reward model scoring — can each be implemented as SPMD workloads, but the end-to-end algorithm needs explicit coordination: you’re constantly handing off artifacts (prompts, sampled trajectories, rewards, advantages) across stages and synchronizing their lifecycle.

In our original SFT architecture, the driver node was intentionally “thin”: it launched N identical Ray actors, each encapsulating the full training loop, and scaling meant launching more identical workers. That model breaks down for RL. RL required us to decompose the system into distinct roles — Policy, Rollout Workers, Reward Model, Reference Model, etc. — and evolve the driver into an active controller that encodes the control plane: when to generate rollouts, how to batch and score them, when to trigger optimization, and how to manage cluster resources across phases.

Figure 4. Architectural differences of SFT and RL framework

Figure 4 highlights this shift. To add RL support without reinventing distributed orchestration from scratch, we integrated the core infrastructure from the open-source Verl library to manage Ray actor lifecycle and GPU resource allocation. Leveraging Verl’s backend let us focus on the “modeling surface area” — our Data/Model/Compute abstractions and internal optimizations — while keeping orchestration concerns decoupled. The result is a hybrid design: a unified user interface where developers can move between SFT and RL workflows without adopting an entirely different mental model or API set.

Hugging Face-Centric Experience

The Hugging Face Hub has effectively become the default distribution channel for open-weight LLMs, tokenizers, and configs. We designed the framework to stay close to that ecosystem rather than creating an isolated internal standard. Even when we use optimized internal model representations for speed, we load and save checkpoints in standard Hugging Face formats. This avoids “walled garden” friction and lets teams pull in new architectures, weights, and tokenizers quickly.

This philosophy also shaped our tokenizer story. Early on, we bound directly to low-level tokenization libraries (e.g., SentencePiece, tiktoken) to maximize control. In practice, that created a costly failure mode: silent training–serving skew. Our inference stack (vLLM) defaults to Hugging Face AutoTokenizer, and tiny differences in normalization, special token handling, or chat templating can yield different token boundaries — exactly the kind of mismatch that shows up later as inexplicable quality regressions. We fixed this by making Hugging Face AutoTokenizer the single source of truth. We then built a thin compatibility layer (BaseHFModelTokenizer) to handle post-training needs — setting padding tokens, injecting generation markers to support loss masking, and managing special tokens / semantic IDs — while ensuring the byte-level tokenization path matches production.

We do take a different approach for model implementations. Rather than training directly on transformers model classes, we maintain our own optimized, unified model definitions that can still load/save Hugging Face checkpoints. This layer is what enables framework-level optimizations — e.g., FlexAttention, memory-efficient chunked cross-entropy, consistent MFU accounting, and uniform LoRA extensibility — without re-implementing them separately for every model family. A unified module naming convention also makes it feasible to programmatically locate and swap components (Attention, MLP, output heads) across architectures, and provides a consistent surface for Tensor Parallelism and FSDP wrapping policies.

The trade-off is clear: supporting a new model family requires building a bridge between the Hugging Face reference implementation and our internal definition. To reduce that overhead, we use AI coding agents to automate much of the conversion work, with a strict logit verifier as the gate: given random inputs, our internal model must match the Hugging Face logits within tolerance. Because the acceptance criterion is mechanically checkable, agents can iterate autonomously until the implementation is correct, dramatically shortening the time-to-support for new architectures.

Today, this design means we can only train architectures we explicitly support — an intentional constraint shared by other high-performance systems like vLLM, SGLang, and torchtitan. To broaden coverage, we plan to add a fallback Hugging Face backend, similar to the compatibility patterns these projects use: users will be able to run training directly on native transformers models for rapid exploration of novel architectures, with the understanding that some framework optimizations and features may not apply in that mode.

Providing Differential Value

A post-training framework is only worth owning if it delivers clear value beyond assembling OSS components. We build on open source for velocity, but we invest heavily where off-the-shelf tools tend to be weakest: performance tuned to our workload characteristics, and integration with Netflix-specific model and business requirements. Here are some concrete examples:

First, we optimize training efficiency for our real use cases. A representative example is extreme variance in sequence length. In FSDP-style training, long-tail sequences create stragglers: faster workers end up waiting at synchronization points for the slowest batch, lowering utilization. Standard bin-packing approaches help, but doing them offline at our data scale can add substantial preprocessing latency and make it harder to keep datasets fresh. Instead, we built on-the-fly sequence packing that streams samples from storage and dynamically packs them in memory. Packing runs asynchronously, overlapping CPU work with GPU compute. Figure 5 shows the impact: for our most skewed dataset, on-the-fly packing improved the effective token throughput by up to 4.7x.

Figure 5. Training throughput on two of our internal datasets on A100 and H200 GPUs

We also encountered subtler performance cliffs around vocabulary expansion. Our workloads frequently add custom tokens and semantic IDs. We found that certain vocabulary sizes could cause the language model head to fall back from a highly optimized cuBLAS kernel to a much slower CUTLASS path, tripling that layer’s execution time. The framework now automatically pads vocabulary sizes to multiples of 64 so the compiler selects the fast kernel, preserving throughput without requiring developers to know these low-level constraints.

Second, owning the framework lets us support “non-standard” transformer use cases that generic LLM tooling rarely targets. For example, some internal models are trained on member interaction event sequences rather than natural language, and may require bespoke RL loops that integrate with highly-customized inference engines and optimize business-defined metrics. These workflows demand custom environments, reward computation, and orchestration patterns — while still needing the same underlying guarantees around performance, tracking, and fault tolerance. The framework is built to accommodate these specialized requirements without fragmenting into one-off pipelines, enabling rapid iteration.

Wrap up

Building the Netflix Post-Training Framework has been a continual exercise in balancing standardization with specialization. By staying anchored to the open-source ecosystem, we’ve avoided drifting into a proprietary stack that diverges from where the community is moving. At the same time, by owning the core abstractions around Data, Model, Compute, and Workflow, we’ve preserved the freedom to optimize for Netflix-scale training and Netflix-specific requirements.

In the process, we’ve moved post-training from a loose collection of scripts into a managed, scalable system. Whether the goal is maximizing SFT throughput, orchestrating multi-stage on-policy RL, or training transformers over member interaction sequences, the framework provides a consistent set of primitives to do so reliably and efficiently. As the field shifts toward more agentic, reasoning-heavy, and multimodal architectures, this foundation will help us translate new ideas into scalable GenAI prototypes — so experimentation is constrained by our imagination, not by operational complexity.

Acknowledgements

This work builds on the momentum of the broader open-source ML community. We’re especially grateful to the teams and contributors behind Torchtune, Torchtitan, and Verl, whose reference implementations and design patterns informed many of our training framework choices — particularly around scalable training recipes, distributed execution, and RL-oriented orchestration. We also thank our partner teams in Netflix AI for Member Systems for close collaboration, feedback, and shared problem-solving throughout the development and rollout of the Post-Training Framework, and the Training Platform team for providing the robust infrastructure and operational foundation that makes large-scale post-training possible.


Scaling LLM Post-Training at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Automating RDS Postgres to Aurora Postgres Migration]]> https://netflixtechblog.com/automating-rds-postgres-to-aurora-postgres-migration-261ca045447f?source=rss----2615bd06b42e---4 https://medium.com/p/261ca045447f Thu, 12 Feb 2026 14:07:19 GMT 2026-02-12T18:21:50.401Z Ram Srivasta Kannan, Wale Akintayo, Jay Bharadwaj, John Crimmins, Shengwei Wang, Zhitao Zhu

Introduction

In 2024, the Online Data Stores team at Netflix conducted a comprehensive review of the relational database technologies used across the company. This evaluation examined functionality, performance, and total cost of ownership across our database ecosystem. Based on this analysis, we decided to standardize on Amazon Aurora PostgreSQL as the primary relational database offering for Netflix teams.

Several key factors influenced this decision:

  • PostgreSQL already underpinned the majority of our relational workloads, which made it a natural foundation for standardization. Internal evaluations revealed that Aurora PostgreSQL had supported over 95% of the applications and workloads running on other relational databases across our internal services.
  • Industry momentum had continued to shift toward PostgreSQL, driven by its open ecosystem, strong community support, and broad adoption across modern data platforms.
  • Aurora’s cloud-native, distributed architecture provided clear advantages in scalability, high availability, and elasticity compared to traditional single-node PostgreSQL deployments.
  • Aurora PostgreSQL offered a rich feature set, along with a strong, forward-looking roadmap aligned with the needs of large-scale, globally distributed applications.

A Clear Migration Path Forward

As part of this strategic shift, one of our key initiatives for 2024/2025 was migrating existing users to Aurora PostgreSQL. This effort began with RDS PostgreSQL migrations and will expand to include migrations from other relational systems in subsequent phases.

As a data platform organization, our goal is to make this evolution predictable, well-supported, and minimally disruptive. This allows teams to adopt Aurora PostgreSQL at a pace that aligns with their product and operational roadmaps, while we move toward a unified and scalable relational data platform across the organization.

Database Migration: More Than a Simple Transfer

Migrating a database involves far more than copying rows from one system to another. It is a coordinated process of transitioning both data and database functionality while preserving correctness, availability, and performance. At scale, a well-designed migration must minimize disruption to applications and ensure a clean, deterministic handoff from the old system to the new one.

Most database migrations follow a common set of high-level steps:

  1. Data Replication: Data is first copied from the source database to the destination, typically using replication, so that ongoing changes are continuously captured and applied.
  2. Quiescence: Write traffic to the source database is halted, allowing the destination to fully catch up and eliminate any remaining divergence.
  3. Validation: The system verifies that the source and destination databases are fully synchronized and contain identical data.
  4. Cutover: Client applications are reconfigured to point to the destination database, which becomes the new primary source of truth.

Challenges

Operational Challenges

Migrating to a new relational database at Netflix scale presents substantial operational challenges. With a fleet approaching 400 PostgreSQL clusters, manually migrating each one is simply not scalable for the data platform team. Such an approach would require a significant amount of time, introduce the risk of human error, and necessitate considerable hands-on engineering effort. Compounding the problem, coordinating downtime across the many interconnected services that depend on each database is extremely cumbersome at this scale.

To address these challenges, we designed a self-service migration workflow that enables service owners to run their own RDS PostgreSQL to Aurora PostgreSQL migrations. The workflow automatically handles orchestration, safety checks, and correctness guarantees end-to-end, resulting in lower operational overhead and a predictable, reliable migration experience.

Technical challenges

  • Zero data loss — We must guarantee that all data from the source cluster is fully and safely migrated to the destination within a very tight window, with no possibility of data loss.
  • Minimal downtime — Some downtime is unavoidable during migration, as applications must briefly pause write traffic while cutting over to Aurora PostgreSQL. For higher-tier services that power critical parts of the Netflix ecosystem, this window must be kept extremely short to prevent user-facing impact and maintain service reliability.
  • No control over client applications — As the platform team, we manage the databases, but application teams handle the read and write operations. We cannot assume that they have the ability to pause writes on demand, nor do we want to expose such controls to them, as mistakes could lead to data inconsistencies post migration. Therefore, building a self-service migration pipeline requires creative control-plane solutions to halt traffic, ensuring that no writes occur during the validation and cutover phases.
  • No direct access to RDS credentials — The migration automation must perform replication, quiescence, and validation without requesting database credentials from users or relying on manual authentication. Source databases are often tightly secured, allowing access only from client applications, but more importantly, requiring credential access — even if it were possible — would significantly increase operational overhead and risk. At the same time, the migration platform may operate in environments without direct access to the source database, making traditional verification or parity checks impossible.
  • No Degradation in Performance — The migration process must not impact the performance or stability of production databases once they are running in the Aurora PostgreSQL ecosystem.
  • Full Ecosystem Parity — Beyond migrating the core database, associated components such as parameter groups, read replicas, and replication slots must also be migrated to ensure functional equivalence.

Minimal User Effort — Since we rely on teams who are not database experts to perform migrations, the process must be simple, intuitive, and fully self-guided.

AWS recommended migration techniques

Using a snapshot

One of the simplest AWS-recommended approaches for migrating from RDS PostgreSQL to Aurora PostgreSQL is based on snapshots. In this model, write traffic to the source PostgreSQL database is first stopped. A manual snapshot of the RDS PostgreSQL instance is then taken and migrated to Aurora, where AWS converts it into an Aurora-compatible format.

Once the conversion completes, a new Aurora PostgreSQL cluster is created from the snapshot. After the cluster is brought online and validated, application traffic is redirected to the Aurora endpoint, completing the migration.

Reference

Using an Aurora read replica

In the read-replica–based approach, an Aurora PostgreSQL read replica is created from an existing RDS PostgreSQL instance. AWS establishes continuous, asynchronous replication from the RDS source to the Aurora replica, allowing ongoing changes to be streamed in near real time.

Because replication runs continuously, the Aurora replica remains closely synchronized with the source database. This enables teams to provision and validate the Aurora environment — including configuration, connectivity, and performance characteristics — while production traffic continues to flow to the source.

When the replication lag is sufficiently low, write traffic is briefly paused to allow the replica to fully catch up. The Aurora read replica is then promoted to a standalone Aurora PostgreSQL cluster, and application traffic is redirected to the new Aurora endpoint. This approach significantly reduces downtime compared to snapshot-based migrations and is well-suited for production systems that require minimal disruption.

Migration Strategy Trade-Offs

These differences represent the key considerations when choosing a migration strategy from RDS PostgreSQL to Aurora PostgreSQL. For our automation, we opted for the Aurora Read Replica approach, trading increased implementation complexity for a significantly shorter downtime window for client applications.

Netflix RDS PostgreSQL Deployment Architecture

In Netflix’s RDS setup, a Data Access Layer (DAL) sits between applications and backend databases, acting as middleware that centralizes database connectivity, security, and traffic routing on behalf of client applications.

On the client side, applications connect through a forward proxy that manages mutual TLS (mTLS) authentication and establishes a secure tunnel to the Data Gateway service. The Data Gateway, acting as a reverse proxy for database servers, terminates client connections, enforces centralized authentication and authorization, and forwards traffic to the appropriate RDS PostgreSQL instance.

This layered design ensures that applications never handle raw database credentials, provides a consistent and secure access pattern across all datastore types, and delivers isolated, transparent connectivity to managed PostgreSQL clusters. While the primary goal of this architecture is to enforce strong security controls and standardize how applications access external AWS data stores, it also allows backend databases to be switched transparently via configuration, enabling controlled, low-downtime migrations.

Migration Process

The Platform team’s goal is to deliver a fully automated, self-service workflow that helps with the migration of customer RDS PostgreSQL instances to Aurora PostgreSQL clusters. This migration tool orchestrates the entire process — from preparing the source environment, initializing the Aurora read replica, and maintaining continuous synchronization, all the way through to cutover — without requiring any database credentials or manual intervention from the customer.

Designed for minimal downtime and seamless user experience, the workflow ensures full ecosystem parity between RDS and Aurora, preserving performance characteristics and operational behavior while enabling customers to benefit from Aurora’s improved scalability, resilience, and cost efficiency.

Data Replication Phase

Enable Automated Backups

Automated backups must be enabled on the source database because the Aurora read replica is initialized from a consistent snapshot of the source and then kept in sync through continuous replication. Automated backups provide the stable snapshot required to bootstrap the replica, along with the continuous streaming of write-ahead log (WAL) records needed to keep the read replica closely synchronized with the source.

Port RDS parameters to an Aurora parameter group

We create a dedicated Aurora parameter group for each cluster and migrate all RDS-compatible parameters from the source RDS instance. This ensures that the Aurora cluster inherits the same configuration settings — such as memory configuration, connection limits, query planner behavior, and other PostgreSQL engine parameters that have equivalents in Aurora. Parameters that are unsupported or behave differently in Aurora are either omitted or adjusted according to Aurora best practices.

Create an Aurora read replica cluster and instance

Creating an Aurora read replica cluster is a critical step in migrating from RDS PostgreSQL to Aurora PostgreSQL. At this stage, the Aurora cluster is created and attached to the RDS PostgreSQL primary as a replica, establishing continuous replication from the source RDS PostgreSQL instance. These Aurora read replicas stay nearly in sync with ongoing changes by streaming write-ahead logs (WAL) from the source, enabling minimal downtime during cutover. The cluster is fully operational for validation and performance testing, but it is not yet writable — RDS remains the authoritative primary.

Quiescence Phase

The goal of the quiescence phase is to transition client applications from the source RDS PostgreSQL instance to the Aurora PostgreSQL cluster as the new primary database, while preserving data consistency during cutover.

The first step in this process is to stop all write traffic to the source RDS PostgreSQL instance to guarantee consistency. To achieve this, we instruct users to halt application-level traffic, which helps prevent issues such as retry storms, queue backlogs, or unnecessary resource consumption when connectivity changes during cutover. This coordination also gives teams time to prepare operationally, for example, by suppressing alerts, notifying downstream consumers, or communicating planned maintenance to their customers.

However, relying solely on application-side controls is unreliable. Operational gaps, misconfigurations, or lingering connections can still modify the source database state, potentially resulting in changes that are not replicated to the destination and leading to data inconsistency or loss. To enforce a clean and deterministic cutover, we also block traffic at the infrastructure layer. This is done by detaching the RDS instance’s security groups to prevent new inbound connections, followed by a reboot of the instance. With security groups removed, no new SQL sessions can be established, and the reboot forcibly terminates any existing connections.

This approach intentionally avoids requiring database credentials or logging into the PostgreSQL server to manually terminate connections. While it may be slower than application- or database-level intervention, it provides a reliably automated and repeatable mechanism to fully quiesce the source RDS PostgreSQL instance before Aurora promotion, eliminating the risk of divergent writes or an inconsistent WAL state.

Validation Phase

To determine whether the Aurora read replica has fully caught up with the source RDS PostgreSQL instance, we track replication progress using Aurora’s OldestReplicationSlotLag metric. This metric represents how far the Aurora replica is behind the source in applying write-ahead log (WAL) records.

Once client traffic is halted during quiescence, the source RDS PostgreSQL instance stops producing meaningful WAL entries. At that point, the replication lag should converge to zero, indicating that all WAL records corresponding to real writes have been fully replayed on Aurora.

However, in practice, our experiments show that the metric never settles at a steady zero. Instead, it briefly drops to 0, then quickly returns to 64 MB, repeating this pattern every few minutes as shown in the figure below.

OldestReplicationSlotLag

This behavior stems from how OldestReplicationSlotLag is calculated. Internally, the lag is derived using the following query:

SELECT
slot_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS slot_lag_bytes
FROM pg_replication_slots;

Conceptually, this translates to:

OldestReplicationSlotLag = current_WAL_position_on_RDS 
– restart_lsn

See AWS references here and here.

The restart_lsn represents the oldest write-ahead log (WAL) record that PostgreSQL must retain to ensure a replication consumer can safely resume replication.

When PostgreSQL performs a WAL segment switch, Aurora typically catches up almost immediately. At that moment, the restart_lsn briefly matches the source’s current WAL position, causing the reported lag to drop to 0. During idle periods, PostgreSQL performs an empty WAL segment rotation approximately every five minutes, driven by the archive_timeout = 300s setting in the database parameter group.

Immediately afterward, PostgreSQL begins writing to the new WAL segment. Since this new segment has not yet been fully flushed or consumed by Aurora, the WAL position in source RDS PostgreSQL advances ahead of the restart_lsn of Aurora PostgreSQL by exactly one segment. As a result, OldestReplicationSlotLag jumps to 64 MB, which corresponds to the configured WAL segment size at database initialization, and remains there until the next segment switch occurs.

Because idle PostgreSQL performs an empty WAL rotation approximately every five minutes, this zero-then-64 MB oscillation is expected. Importantly, the moment when the lag drops to 0 indicates that all meaningful WAL records have been fully replicated, and the Aurora read replica is fully caught up with the source.

Cutover Phase

Once the Aurora read replica has fully caught up with the source RDS PostgreSQL instance — as confirmed through replication lag analysis — the final step is to promote the replica and redirect application traffic. Promoting the Aurora read replica converts it into an independent, writable Aurora PostgreSQL cluster with its own writer and reader endpoints. At this point, the source RDS PostgreSQL instance is no longer the authoritative primary and is made inaccessible.

Because Netflix’s RDS ecosystem is fronted by a Data Access Layer (DAL), consisting of client-side forward proxies and a centralized Data Gateway, switching databases does not require application code changes or database credential access. Instead, traffic redirection is handled entirely through configuration updates in the reverse-proxy layer. Specifically, we update the runtime configuration of the Envoy-based Data Gateway to route traffic to the newly promoted Aurora cluster. Once this configuration change propagates, all client-initiated database connections are transparently routed through the DAL to the Aurora writer endpoint, completing the migration without requiring any application changes.

This proxy-level cutover, combined with Aurora promotion, enables a seamless transition for service owners, minimizes downtime, and preserves data consistency throughout the migration process.

Customer Experience: Migrating a Business-Critical Partner Platform

One of the critical teams to adopt the RDS PostgreSQL to Aurora PostgreSQL migration workflow was the Enablement Applications team. This team owns a set of databases that model Netflix’s entire ecosystem of partner integrations, including device manufacturers, discovery platforms, and distribution partners. These databases power a suite of enterprise applications that partners worldwide rely on to build, test, certify, and launch Netflix experiences on their devices and services.

Because these databases sit at the center of Netflix’s partner enablement and certification workflows, they are consumed by a diverse set of client applications across both internal and external organizations. Internally, reliability teams use this data to identify streaming failures for specific devices and configurations, supporting quality improvements across the device ecosystem. At the same time, these databases directly serve external partners operating across many regions. Device manufacturers rely on them to configure, test, and certify new hardware, while payment partners use them to set up and launch bundled offerings with Netflix.

Simplified Enablement Applications Overview

Device Lifecycle Management

Netflix works with a wide range of device partners to ensure Netflix streams seamlessly across a diverse ecosystem of consumer devices. A core responsibility of Device Lifecycle Management is to provide tools and workflows that allow partners to develop, test, and certify Netflix integrations on their devices.

As part of the device lifecycle, partners run Netflix-provided test suites against their NRDP implementation. We store signals that represent the current stage for each device in the certification process. This certification data forms the backbone of Netflix’s device enablement program, ensuring that only validated devices can launch Netflix experiences.

Partner Billed Integrations

In addition to device enablement, the same partner metadata is also consumed by Netflix’s Partner Billed Integrations organization. This group enables external partners to offer Netflix as part of bundled subscription and billing experiences.

Any disruption in these databases affects partner integration workflows. If the database is unavailable, partners may be unable to configure or launch service bundles with Netflix. Maintaining high availability and data correctness is essential to preserving smooth integration operations.

The global nature of these workflows makes it difficult to schedule downtime windows. Any disruption would impact partner productivity and risk eroding trust in Netflix’s integration and certification processes.

Preparation

Given the criticality of the Enablement Applications databases, thorough preparation was essential before initiating the migration. The team invested significant effort upfront to understand traffic patterns, identify all consumers, and establish clear communication channels.

Understand Client Fan-Out and Traffic Patterns
The first step was to gain a complete view of how the databases were being used in production. Using observability tools like CloudWatch metrics, the team analyzed PostgreSQL connection counts, read and write patterns, and overall load characteristics. This helped establish a baseline for normal behavior and ensured there were no unexpected traffic spikes or hidden dependencies that could complicate the migration.

Just as importantly, this baseline gave the Enablement Applications team a rough idea of the post-migration behavior on Aurora. For example, they expected to see a similar number of active database connections and comparable traffic patterns after cutover, making it easier to validate that the migration had preserved operational characteristics.

Identify and Enumerate All Database Consumers
Unlike most databases, where the set of consumers is well known to the owning team, these databases were accessed by a wide range of internal services and external-facing systems that were not fully enumerated upfront. To address this, we leveraged a tool called flowlogs, an eBPF-based network attribution tooling was used to capture TCP flow data to identify the services and applications establishing connections to the database(link).

This approach allowed the team to enumerate active consumers, including those that were not previously documented, ensuring no clients were missed during migration planning.

Establish Dedicated Communication Channels
Once all consumers were identified, a dedicated communication channel was created to provide continuous updates throughout the migration process. This channel was used to share timelines, readiness checks, status updates, and cutover notifications, ensuring that all stakeholders remained aligned and could respond quickly if issues arose.

Migration Process

After completing application-side preparation, the Enablement Applications team initiated the data replication phase of the migration workflow. The automation successfully provisioned the Aurora read replica cluster and ported the RDS PostgreSQL parameter group to a corresponding Aurora parameter group, bringing the destination environment up with equivalent configuration.

Unexpected Replication Slot Behavior

However, shortly after replication began, we observed that the OldestReplicationSlotLag metric was unexpectedly high. This was counterintuitive, as Aurora read replicas are designed to remain closely synchronized with the source database by continuously streaming write-ahead logs (WAL).

Further investigation revealed the presence of an inactive logical replication slot on the source RDS PostgreSQL instance. An inactive replication slot can cause elevated OldestReplicationSlotLag because PostgreSQL must retain all WAL records required by the slot’s last known position (restart_lsn), even if no client is actively consuming data from it. Replication slots are intentionally designed to prevent data loss by ensuring that a consumer can resume replication from where it left off. As a result, PostgreSQL will not recycle or delete WAL segments needed by a replication slot until the slot advances. When a slot becomes inactive — such as when a client migration task is stopped or abandoned — the slot’s position no longer moves forward. Meanwhile, the database continues to generate WAL, forcing PostgreSQL to retain increasingly older WAL files. This growing gap between the current WAL position and the slot’s restart_lsn manifests as a high OldestReplicationSlotLag.

Identifying and addressing these inactive replication slots was a critical prerequisite to proceeding safely with the migration and ensuring accurate replication state during cutover.

Successful Migration After Remediation
After identifying the inactive logical replication slot, the team safely cleaned it up on the source RDS PostgreSQL instance and resumed the migration workflow. With the stale slot removed, replication progressed as expected, and the Aurora read replica quickly converged with the source. The migration then proceeded smoothly through the quiescence phase, with no unexpected behavior or replication anomalies observed.

Following promotion, application traffic transitioned seamlessly to the newly writable Aurora PostgreSQL cluster. Through the Data Access Layer, new client connections were automatically routed to Aurora, and observability metrics confirmed healthy behavior — connection counts, read/write patterns, and overall load closely matched pre-migration baselines. From the application and partner perspective, the cutover was transparent, validating both the correctness of the migration workflow and the effectiveness of the preparation steps.

Open questions

How do we select target Aurora PostgreSQL instance types based on the existing production RDS PostgreSQL instance?

When selecting the target Aurora PostgreSQL instance type for a production migration, our guidance is intentionally conservative. We prioritize stability and performance first, and optimize for cost only after observing real workload behavior on Aurora.

In practice, the recommended approach is to adopt Graviton2-based instances (particularly the r6g family) whenever possible, maintain the same instance family and size where feasible, and — at minimum — preserve the memory footprint of the existing RDS instance.

Unlike RDS PostgreSQL, Aurora does not support the m-series, making a direct family match impossible for those instances. In such cases, simply keeping the same “size” (e.g., 2xlarge → 2xlarge) is not meaningful because the memory profiles differ across families. Instead, we map instances by memory equivalence. For example, an Aurora r6g.xlarge provides a memory footprint comparable to an RDS m5.2xlarge, making it a practical replacement. This memory-aligned strategy offers a safer and more predictable baseline for production migrations.

Downtime During RDS → Aurora Cutover?

To achieve minimal downtime during an RDS PostgreSQL → Aurora PostgreSQL migration, we front-load as much work as possible into the preparation phase. By the time we reach cutover, the Aurora read replica is already provisioned and continuously replicating WAL from the source RDS instance. Before initiating downtime, we ensure that the replication lag between Aurora and RDS has stabilized within an acceptable threshold. If the lag is large or fluctuating significantly, forcing a cutover will only inflate downtime.

Downtime begins the moment we remove the security groups from the source RDS instance, blocking all inbound traffic. We then reboot the instance to forcibly terminate existing connections, which typically takes up to a minute. From this point forward, no writes can be performed.

After traffic is halted, the next objective is to verify that Aurora has fully replayed all meaningful WAL records from RDS. We track this using OldestReplicationSlotLag. We first wait for the metric to drop to 0, indicating that Aurora has consumed all WAL with real writes. Under normal idle behavior, PostgreSQL triggers an empty WAL switch every five minutes. After observing one data point at 0, we wait for an additional idle WAL rotation and confirm that the lag oscillates within the expected 0 → 64 MB pattern — signifying that the only remaining WAL segments are empty ones produced during idle time. At this point, we know the Aurora replica is fully caught up and can be safely promoted.

While these validation steps run, we perform the configuration updates on the Envoy reverse proxy in parallel. Once promotion completes and Envoy is restarted with the new runtime configuration, all client-initiated connections begin routing to the Aurora cluster. In practice, the total write-downtime observed across services averages around 10 minutes, dominated largely by the RDS reboot and the idle WAL switch interval.

Optimization: Reducing Idle-Time Wait

For services requiring stricter downtime budgets, waiting the full five minutes for an idle WAL switch can be prohibitively expensive. In such cases, we can force a WAL rotation immediately after traffic is cut off by issuing:

SELECT pg_switch_wal();

Once the switch occurs, OldestReplicationSlotLag will drop to 0 again as Aurora consumes the new (empty) WAL segment. This approach eliminates the need to wait for the default archive_timeout interval, which can significantly reduce overall downtime.

How do we migrate CDC consumers?

As part of the data platform organization in Netflix, we provide a managed Change Data Capture (CDC) service across a variety of datastores. For PostgreSQL, logical replication slots is the way of implementing change data capture. At Netflix, we build a managed abstraction on top of these replication slots called datamesh to manage customers who are leveraging them (link).

Each logical replication slot tracks a consumer’s position in the write-ahead log (WAL), ensuring that WAL records are retained until the consumer has successfully processed them. This guarantees ordered and reliable delivery of row-level changes to downstream systems. At the same time, it tightly couples the lifecycle of replication slots to database operations, making their management a critical consideration during database migrations.

A key challenge in migrating from RDS PostgreSQL to Aurora PostgreSQL is transitioning these CDC consumers safely — without data loss, stalled replication, or extended downtime — while ensuring that replication slots are correctly managed throughout the cutover process.

Each row-level change in PostgreSQL is emitted as a CDC event with an operation type of INSERT, UPDATE, DELETE, or REFRESH. REFRESH events are generated during backfills by querying the database directly and emitting the current state of rows in chunks. Downstream consumers are designed to be idempotent and eventually consistent, allowing them to safely process retries, replays, and backfills.

Handling Replication Slots During Migration

Before initiating database cutover, we temporarily pause CDC consumption by stopping the infrastructure responsible for consuming from PostgreSQL replication slots and writing into datamesh source. This also drops the replication slot from the database and cleans up our internal state around replication slot offsets. This essentially resets the state of the connector to one of a brand new one.

This step is critical for two reasons. First, it prevents replication slots from blocking WAL recycling during migration. Second, it ensures that no CDC consumers are left pointing at the source database once traffic is quiesced and cutover begins. While CDC consumers are paused, downstream systems temporarily stop receiving new change events, but remain stable. Once CDC consumers are paused, we proceed with stopping other client traffic and executing the RDS-to-Aurora cutover.

Reinitializing CDC After Cutover

After the Aurora PostgreSQL cluster has been promoted and traffic has been redirected, CDC consumers are reconfigured to point to the Aurora endpoint and restarted. Because their previous state was intentionally cleared, consumers initialize as if they are starting fresh.

On startup, new logical replication slots are created on Aurora, and a full backfill is performed by querying the database and emitting REFRESH events for all existing rows. These events let the consumer know that a manual refresh was done from Aurora and to treat this as an upsert operation. This establishes a clean and consistent baseline from which ongoing CDC can resume. Consumers are expected to handle these refresh events correctly as part of normal operation.

By explicitly managing PostgreSQL replication slots as part of the migration workflow, we are able to migrate CDC consumers safely and predictably, without leaving behind stalled slots, retained WAL, or consumers pointing to the wrong database. This approach allows CDC pipelines to be cleanly re-established on Aurora while preserving correctness and operational simplicity.

How do we roll back in the middle of the process?

Pre-quiescence
Rolling back before the pre-quienscence phase is quite easy. Your primary RDS database is still the source. Rolling back before the quiescence phase is straightforward. At this stage, the primary RDS PostgreSQL instance continues to serve as the sole source of truth, and no client traffic has been redirected.

If a rollback is required, the migration can be safely aborted by deleting the newly created Aurora PostgreSQL cluster along with its associated parameter groups. No changes are needed on the application side, and normal operations on RDS PostgreSQL can continue without impact.

During-quiescence
Rolling back during the quiescence phase is more involved. At this point, client traffic to the source RDS PostgreSQL instance has already been stopped by detaching its security groups. To roll back safely, access must first be restored by reattaching the original security groups to the RDS instance, allowing client connections to resume. In addition, any logical replication slots removed during the migration must be recreated so that CDC consumers can continue processing changes from the source database.

Once connectivity and replication slots are restored, the RDS PostgreSQL instance can safely resume its role as the primary source of truth.

Post-quiescence
Rolling back after cutover, once the Aurora PostgreSQL cluster is serving production traffic, is significantly more complex. At this stage, Aurora has become the primary source of truth, and client applications may already have written new data to it.

In this scenario, rollback requires setting up replication in the opposite direction, with Aurora as the source and RDS PostgreSQL as the destination. This can be achieved using a service such as AWS Database Migration Service (DMS). AWS provides detailed guidance for setting up this reverse replication flow, which can be followed to migrate data back to RDS if necessary.

Conclusion

Standardizing and reducing the surface area of data technologies is crucial for any large-scale platform. For the Netflix platform team, this strategy allows us to concentrate engineering effort, deliver deeper value on a smaller set of well-understood systems, and significantly cut the operational overhead of running multiple database technologies that serve similar purposes. Within the relational database ecosystem, Aurora PostgreSQL has become the paved-path datastore — offering strong scalability, resilience, and consistent operational patterns across the fleet.

Migrations of this scale demand solutions that are reliable, low-touch, and minimally disruptive for service owners. Our automated RDS PostgreSQL → Aurora PostgreSQL workflow represents a major step forward, providing predictable cutovers, strong correctness guarantees, and a migration experience that works uniformly across diverse workloads.

As we continue this journey, the Relational Data Platform team is building higher-level abstractions and capabilities on top of Aurora, enabling service owners to focus less on the complexities of database internals and more on delivering product value. More to come — stay tuned.

Acknowledgements

Special thanks to our other stunning colleagues/customers who contributed to the success of the RDS PostgreSQL to Aurora PostgreSQL migration. Sumanth Pasupuleti, Cole Perez, Ammar Khaku


Automating RDS Postgres to Aurora Postgres Migration was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[The AI Evolution of Graph Search at Netflix]]> https://netflixtechblog.com/the-ai-evolution-of-graph-search-at-netflix-d416ec5b1151?source=rss----2615bd06b42e---4 https://medium.com/p/d416ec5b1151 Mon, 26 Jan 2026 19:01:27 GMT 2026-01-27T18:50:19.610Z The AI Evolution of Graph Search at Netflix: From Structured Queries to Natural Language

By Alex Hutter and Bartosz Balukiewicz

Our previous blog posts (part 1, part 2, part 3) detailed how Netflix’s Graph Search platform addresses the challenges of searching across federated data sets within Netflix’s enterprise ecosystem. Although highly scalable and easy to configure, it still relies on a structured query language for input. Natural language based search has been possible for some time, but the level of effort required was high. The emergence of readily-available AI, specifically Large Language Models (LLMs), has created new opportunities to integrate AI search features, with a smaller investment and improved accuracy.

While Text-to-Query and Text-to-SQL are established problems, the complexity of distributed Graph Search data in the GraphQL ecosystem necessitates innovative solutions. This is the first in a three-part series where we will detail our journey: how we implemented these solutions, evaluated their performance, and ultimately evolved them into a self-managed platform.

The Need for Intuitive Search: Addressing Business and Product Demands

Natural language search is the ability to use everyday language to retrieve information as opposed to complex, structured query languages like the Graph Search Filter Domain Specific Language (DSL). When users interact with 100’s of various UIs within the suite of Content and Business Products applications, a frequent task is filtering a data table like the one below:

Example Content and Business Products application view

Ideally, a user simply wants to satisfy a query like “I want to see all movies from the 90s about robots from the US.” Because the underlying platform operates on the Graph Search Filter DSL, the application acts as an intermediary. Users input their requirements through UI elements — toggling facets or using query builders — and the system programmatically converts these interactions into a valid DSL query to filter the data.

The Complexity of filtering and DSL generation

This process presents a few issues.

Today, many applications have bespoke components for collecting user input — the experience varies across them and they have inconsistent support for the DSL. Users need to “learn” how to use each application to achieve their goals.

Additionally, some domains have hundreds of fields in an index that could be faceted or filtered by. A subject matter expert (SME) may know exactly what they want to accomplish, but be bottlenecked by the inefficient pace of filling out a large scale UI form and translating their questions in order to encode it in a representation Graph Search needs.

Most importantly, users think and operate using natural language, not technical constructs like query builders, components, or DSLs. By requiring them to switch contexts, we introduce friction that slows them down or even prevents their progress.

With readily-available AI components, our users can now interact with our systems through natural language. The challenge now is to make sure our offering, searching Netflix’s complex enterprise state with natural language, is an intuitive and trustworthy experience.

Natural language queries translated into Graph Search Filter DSL

We’ve made a decision to pursue generating Graph Search Filter statements from natural language to meet this need. Our intention is to augment and not replace existing applications with retrieval augmented generation (RAG), providing tooling and capabilities so that applications in our ecosystem have newly accessible means of processing and presenting their data in their distinct domain flavours. It should be noted that all the work here has direct application to building a RAG system on top of Graph Search in the future.

Under the Hood: Our Approach to Text-to-Query

The core function of the text-to-query process is converting a user’s (often ambiguous) natural language question into a structured query. We primarily achieve this through the use of an LLM.

Before we dive deeper, let’s quickly revisit the structure of Graph Search Filter DSL. Each Graph Search index is defined by a GraphQL query, made up of a collection of fields. Each field has a type e.g. boolean, string, and some have their permitted values governed by controlled vocabularies — a standardized and governed list of values (like an enumeration, or a foreign key). The names of those fields can be used to construct expressions using comparison (e.g. > or ==) or inclusion/exclusion operators (e.g. IN). In turn those expressions can be combined using logical operators (e.g. AND) to construct complex statements.

Graph Search Filter DSL

With that understanding, we can now more rigorously define the conversion process. We need the LLM to generate a Graph Search Filter DSL statement that is syntactically, semantically, and pragmatically correct.

Syntactic correctness is easy — does it parse? To be syntactically correct, the generated statement must be well formed i.e. follow the grammar of the Graph Search Filter DSL.

Semantic correctness adds some additional complexity as it requires more knowledge of the index itself. To be semantically correct:

  • it must respect the field types i.e. only use comparisons that make sense given the underlying type;
  • it must only use fields that are actually present in the index, i.e. does not hallucinate;
  • when the values of a field are constrained to a controlled vocabulary, any comparison must only use values from that controlled vocabulary.

Pragmatic correctness is much more difficult. It asks the question: does the generated filter actually capture the intent of the user’s query?

The following sections will detail how we pre-process the user’s question to create appropriate context for the instructions that we will provide to the LLM — both of which are fundamental to LLM interaction — as well as post-processing we perform on the generated statement to validate it, and help users understand and trust the results they receive.

At a high level that process looks like this:

Graph Search FIlter DSL generation process

Context Engineering

Preparation for the filter generation task is predominantly engineering the appropriate context. The LLM will need access to the fields of an index and their metadata in order to construct semantically correct filters. As the indices are defined by GraphQL queries, we can use the type information from the GraphQL schema to derive much of the required information. For some fields, there is additional information we can provide beyond what’s available in the schema as well, in particular permissible values that pull from controlled vocabularies.

Each field in the index is associated with metadata as seen below, and that metadata is provided as part of the context.

Graph Search index representation
  • The field is derived from the document path as characterized by the GraphQL query.
  • The description is the comment from the GraphQL schema for the field.
  • The type is derived from the GraphQL schema for the field e.g. Boolean, String, enum. We also support an additional controlled vocabulary type we will discuss more of shortly.
  • The valid values are derived from enum values for the enum type or from a controlled vocabulary as we will now discuss.

A controlled vocabulary is a specific field type that consists of a finite set of allowed values, which are defined by a SMEs or domain owners. Index fields can be associated with a particular controlled vocabulary, e.g. countries with members such as Spain and Thailand, and any usage of that field within a generated statement must refer to values from that vocabulary.

Naively providing all the metadata as context to the LLM worked for simple cases but did not scale. Some indices have hundreds of fields and some controlled vocabularies have thousands of valid values. Providing all of those, especially the controlled vocabulary values and their accompanying metadata, expands the context; this proportionally increases latency and decreases the correctness of generated filter statements. Not providing the values wasn’t an option as we needed to ground the LLMs generated statements- without them, the LLM would frequently hallucinate values that did not exist.

Curating the context to an appropriate subset was a problem we addressed using the well known RAG pattern.

Field RAG

As mentioned previously, some indices have hundreds of fields, however, most user’s questions typically refer only to a handful of them. If there was no cost in including them all, we would, but as mentioned prior, there is a cost in terms of the latency of query generation as well as the correctness of the generated query (e.g. needle-in-the-hackstack problem) and non-deterministic results.

To determine which subset of fields to include in the context, we “match” them against the intent of the user’s question.

  • Embeddings are created for index fields and their metadata (name, description, type) and are indexed in a vector store
  • At filter generation time, the user’s question is chunked with an overlapping strategy. For each chunk, we perform a vector search to identify the top K most relevant values and the fields to which they belong.
  • Deduplication: The top K fields from each chunk are both consolidated and deduplicated before being provided as context to the system instructions.
Field RAG process (chunking, merge, deduplicate)

Controlled Vocabularies RAG

Index fields of the controlled vocabulary type are associated with a particular controlled vocabulary, again, countries are one example. Given a user’s question, we can infer whether or not it refers to values of a particular controlled vocabulary. In turn, by knowing which controlled vocabulary values are present, we can identify additional, related index fields that should be included in the context that may not have been identified by the field RAG step.

Each controlled vocabulary value has:

  • a unique identifier within its type;
  • a human readable display name;
  • a description of the value;
  • also-known-as values or AKA display names, e.g. “romcom” for “Romantic Comedy”.

To determine which subset of values to include in the context for controlled vocabulary fields (and also possibly infer additional fields), we “match” them against the user’s question.

  • Embeddings are created for controlled vocabulary values and their metadata, and these are indexed in a vector store. The controlled vocabularies are available via GraphQL and are regularly fetched and reindexed so this system stays up to date with any changes in the domain.
  • At filter generation time, the user’s question is chunked. For each chunk, we perform a vector search to identify the top K most relevant values (but only for the controlled vocabularies that are associated with fields in the index)
  • The top K values from each chunk are deduplicated by their controlled vocabulary type. The associated field definition is then injected into the context along with the matched values.
Controlled Vocabularies RAG

Combining both approaches, the RAG of fields and controlled vocabularies, we end up with the solution that each input question resolves in available and matched fields and values:

Field and CV RAG

The quality of results generated by the RAG tool can be significantly enhanced by tuning its various parameters, or “levers.” These include strategies for reranking, chunking, and the selection of different embedding generation models. The careful and systematic evaluation of these factors will be the focus of the subsequent parts of this series.

The Instructions

Once the context is constructed, it is provided to the LLM with a set of instructions and the user’s question. The instructions can be summarised as follows: Given a natural language question, generate a syntactically, semantically, and pragmatically correct filter statement given the availability of the following index fields and their metadata.”

  • In order to generate a syntactically correct filter statement, the instructions include the syntax rules of the DSL.
  • In order to generate a semantically correct filter statement, the instructions tell the LLM to ground the generated statement in the provided context.
  • In order to generate a pragmatically correct filter statement, so far we focus on better context engineering to ensure that only the most relevant fields and values are provided. We haven’t identified any instructions that make the LLM just “do better” at this aspect of the task.
Graph Search Filter DSL generation

After the filter statement is generated by the LLM, we deterministically validate it prior to returning the values to the user.

Validation

Syntactic Correctness

Syntactic correctness ensures the LLM output is a parsable filter statement. We utilize an Abstract Syntax Tree (AST) parser built for our custom DSL. If the generated string fails to parse into a valid AST, we know immediately that the query is malformed and there is a fundamental issue with the generation.

The other approach to solve this problem could be using the structured outputs modes provided by some LLMs. However, our initial evaluation yielded mixed results, as the custom DSL is not natively supported and requires further work.

Semantic Correctness

Despite careful context engineering using the RAG pattern, the LLM sometimes hallucinates both fields and available values in the generated filter statement. The most straightforward way of preventing this phenomenon is validating the generated filters against available index metadata. This approach does not impact the overall latency of the system, as we are already working with an AST of the filter statement, and the metadata is freely available from the context engineering stage.

DSL verification & hallucinations

If a hallucination is detected it can be returned as an error to a user, indicating the need to refine the query, or can be provided back to the LLM in the form of a feedback loop for self correction.

This increases the filter generation time, so should be used cautiously with a limited number of retries.

Building Confidence

You probably noticed we are not validating the generated filter for pragmatic correctness. That task is the hardest challenge: The filter parses (syntactic) and uses real fields (semantic), but is it what the user meant? When a user searches for “Dark”, do they mean the specific German sci-fi series Dark, or are they browsing for the mood category “dark TV shows”?

The gap between what a user intended and the generated filter statement is often caused by ambiguity. Ambiguity stems from the compression of natural language. A user says “German time-travel mystery with the missing boy and the cave” but the index contains discrete metadata fields like releaseYear, genreTags, and synopsisKeywords.

How do we ensure users aren’t inadvertently led to wrong answers or to answers for questions they didn’t ask?

Showing Our Work

One way we are handling ambiguity is by showing our work. We visualise the generated filters in the UI in a user-friendly way allowing them to very clearly see if the answer we’re returning is what they were looking for so they can trust the results..

We cannot show a raw DSL string (e.g., origin.country == ‘Germany’ AND genre.tags CONTAINS ‘Time Travel’ AND synopsisKeywords LIKE ‘*cave*’) to a non-technical user. Instead, we reflect its underlying AST into UI components.

After the LLM generates a filter statement, we parse it into an AST, and then map that AST to the existing “Chips” and “Facets” in our UI (see below). If the LLM generates a filter for origin.country == ‘Germany’, the user sees the “Country” dropdown pre-selected to “Germany.” This gives users immediate visual feedback and the ability to easily fine-tune the query using standard UI controls when the results need improvement or further experimentation.

Generated filters visualisation

Explicit Entity Selection

Another strategy we’ve developed to remove ambiguity happens at query time. We give users the ability to constrain their input to refer to known entities using “@mentions”. Similar to Slack, typing @ lets them search for entities directly from our specialized UI Graph Search component, giving them easy access to multiple controlled vocabularies (plus other identifying metadata like launch year) to feel confident they’re choosing the entity they intend.

If a user types, “When was @dark produced”, we explicitly know they are referring to the Series controlled vocabulary, allowing us to bypass the RAG inference step and hard-code that context, significantly increasing pragmatic correctness (and building user trust in the process).

Example @mentions usage in the UI

End-to-end architecture

As mentioned previously, the solution architecture is divided into pre-processing, filter statement generation, and then post-processing stages. The pre-processing handles context building and involves a RAG pattern for similarity search, while the post-processing validation stage checks the correctness of the LLM-generated filter statements and provides visibility into the results for end users. This design strategically balances LLM involvement with more deterministic strategies.

End-to-end architecture

The end-to-end process is as follows:

  1. A user’s natural language question (with optional `@mentions` statements) are provided as input, along with the Graph Search index context
  2. The context is scoped by using the RAG pattern on both fields and possible values
  3. The pre-processed context and the question are fed into the LLM with an instruction asking for a syntactically and semantically correct filter statement
  4. The generated filer statement DSL is verified and checked for hallucinations
  5. The final response contains the related AST in order to build “Chips” and “Facets”

Summary

By combining our existing Graph Search infrastructure with the power and flexibility of LLMs, we’ve bridged the gap between complex filter statements and user intent. We moved from requiring users to speak our language (DSL) to our systems understanding theirs.

The initial challenge for our users was successfully addressed. However, our next steps involve transforming this system into a comprehensive and expandable platform, rigorously evaluating its performance in a live production environment, and expanding its capabilities to support GraphQL-first user interfaces. These topics, and others, will be the focus of the subsequent installments in this series. Be sure to follow along!

You may have noticed that we have a lot more to do on this project, including named entity recognition and extraction, intent detection so we can route questions to the appropriate indices, and query rewriting among others. If this kind of work interests you, reach out! We’re hiring in our Warsaw office, check for open roles here.

Credits

Special thanks to Alejandro Quesada, Yevgeniya Li, Dmytro Kyrii, Razvan-Gabriel Gatea, Orif Milod, Michal Krol, Jeff Balis, Charles Zhao, Shilpa Motukuri, Shervine Amidi, Alex Borysov, Mike Azar, Bernardo Gomez Palacio, Haoyuan He, Eduardo Ramirez, Cynthia Xie.


The AI Evolution of Graph Search at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[How Temporal Powers Reliable Cloud Operations at Netflix]]> https://netflixtechblog.com/how-temporal-powers-reliable-cloud-operations-at-netflix-73c69ccb5953?source=rss----2615bd06b42e---4 https://medium.com/p/73c69ccb5953 Mon, 15 Dec 2025 23:51:59 GMT 2025-12-16T00:01:15.478Z By Jacob Meyers and Rob Zienert

Temporal is a Durable Execution platform which allows you to write code “as if failures don’t exist”. It’s become increasingly critical to Netflix since its initial adoption in 2021, with users ranging from the operators of our Open Connect global CDN to our Live reliability teams now depending on Temporal to operate their business-critical services. In this post, I’ll give a high-level overview of what Temporal offers users, the problems we were experiencing operating Spinnaker that motivated its initial adoption at Netflix, and how Temporal helped us reduce the number of transient deployment failures at Netflix from 4% to 0.0001%.

A Crash Course on (some of) Spinnaker

Spinnaker is a multi-cloud continuous delivery platform that powers the vast majority of Netflix’s software deployments. It’s composed of several (mostly nautical themed) microservices. Let’s double-click on two in particular to understand the problems we were facing that led us to adopting Temporal.

In case you’re completely new to Spinnaker, Spinnaker’s fundamental tool for deployments is the Pipeline. A Pipeline is composed of a sequence of steps called Stages, which themselves can be decomposed into one or more Tasks, or other Stages. An example deployment pipeline for a production service may consist of these stages: Find Image -> Run Smoke Tests -> Run Canary -> Deploy to us-east-2 -> Wait -> Deploy to us-east-1.

An example Spinnaker Pipeline
An example Spinnaker Pipeline for a Netflix service

Pipeline configuration is extremely flexible. You can have Stages run completely serially, one after another, or you can have a mix of concurrent and serial Stages. Stages can also be executed conditionally based on the result of previous stages. This brings us to our first Spinnaker service: Orca. Orca is the orca-stration engine of Spinnaker. It’s responsible for managing the execution of the Stages and Tasks that a Pipeline unrolls into and coordinating with other Spinnaker services to actually execute them.

One of those collaborating services is called Clouddriver. In the example Pipeline above, some of the Stages will require interfacing with cloud infrastructure. For example, the canary deployment involves creating ephemeral hosts to run an experiment, and a full deployment of a new version of the service may involve spinning up new servers and then tearing down the old ones. We call these sorts of operations that mutate cloud infrastructure Cloud Operations. Clouddriver’s job is to decompose and execute Cloud Operations sent to it by Orca as part of a deployment. Cloud Operations sent from Orca to Clouddriver are relatively high level (for example: createServerGroup), so Clouddriver understands how to translate these into lower-level cloud provider API calls.

Pain points in the interaction between Orca and Clouddriver and the implementation details of Cloud Operation execution in Clouddriver are what led us to look for new solutions and ultimately migrate to Temporal, so we’ll next look at the anatomy of a Cloud Operation. Cloud Operations in the OSS version of Spinnaker still work as described below, so motivated readers can follow along in source code, however our migration to Temporal is entirely closed-source following a fork from OSS in 2020 to allow Netflix to make larger pivots to the product such as this one.

The Original Cloud Operation Flow

A Cloud Operation’s execution goes something like this:

  1. Orca, in orchestrating a Pipeline execution, decides a particular Cloud Operation needs to be performed. It sends a POST request to Clouddriver’s /ops endpoint with an untyped bag-of-fields.
  2. Clouddriver attempts to resolve the operation Orca sent into a set of AtomicOperation s— internal operations that only Clouddriver understands.
  3. If the payload was valid and Clouddriver successfully resolved the operation, it will immediately return a Task ID to Orca.
  4. Orca will immediately begin polling Clouddriver’s GET /task/<id> endpoint to keep track of the status of the Cloud Operation.
  5. Asynchronously, Clouddriver begins executing AtomicOperations using its own internal orchestration engine. Ultimately, the AtomicOperations resolve into cloud provider API calls. As the Cloud Operation progresses, Clouddriver updates an internal state store to surface progress to Orca.
  6. Eventually, if all went well, Clouddriver will mark the Cloud Operation complete, which eventually surfaces to Orca in its polling. Orca considers the Cloud Operation finished, and the deployment can progress.
A sequence diagram of a Cloud Operation execution

This works well enough on the happy path, but veer off the happy path and dragons begin to emerge:

  1. Clouddriver has its own internal orchestration system independent of Orca to allow Orca to query the progress of Cloud Operation. This is largely undifferentiated lifting relative to Clouddriver’s goal of actuating cloud infrastructure changes, and ultimately adds complexity and surface area for bugs to the application. Additionally, Orca is tightly coupled to Clouddriver’s orchestration system — it must understand how to poll Clouddriver, interpret the status, and handle errors returned by Clouddriver.
  2. Distributed systems are messy — networks and external services are unreliable. While executing a Cloud Operation, Clouddriver could experience transient network issues, or the cloud provider it’s attempting to call into may be having an outage, or any number of issues in between. Despite all of this, Clouddriver must be as reliable as reasonably possible as a core platform service. To deal with this shape of issue, Clouddriver internally evolved complex retry logic, further adding cognitive complexity to the system.
  3. Remember how a Cloud Operation gets decomposed by Clouddriver into AtomicOperations? Sometimes, if there’s a failure in the middle of a Cloud Operation, we need to be able to roll back what was done in AtomicOperations prior to the failure. This led to a homegrown Saga framework being implemented inside Clouddriver. While this did result in a big step forward in reliability of Cloud Operations facing transient failures because the Saga framework also allowed replaying partially-failed Cloud Operations, it added yet more undifferentiated lifting inside the service.
  4. The task state kept by Clouddriver was instance-local. In other words, if the Clouddriver instance carrying out a Cloud Operation crashed, that Cloud Operation state was lost, and Orca would eventually time out polling for the task status. The Saga implementation mentioned above mitigated this for certain operations, but was not widely adopted across all cloud providers supported by Spinnaker.

We introduced a lot of incidental complexity into Clouddriver in an effort to keep Cloud Operation execution reliable, and despite all this deployments still failed around 4% of the time due to transient Cloud Operation failures.

Now, I can already hear you saying: “So what? Can’t people re-try their deployments if they fail?” While true, some pipelines take days to complete for complex deployments, and a failed Cloud Operation mid-way through requires re-running the whole thing. This was detrimental to engineering productivity at Netflix in a non-trivial way. Rather than continue trying to build a faster horse, we began to look elsewhere for our reliable orchestration requirements, which is where Temporal comes in.

Temporal: Basic Concepts

Temporal is an open source product that offers a durable execution platform for your applications. Durable execution means that the platform will ensure your programs run to completion despite adverse conditions. With Temporal, you organize your business logic into Workflows, which are a deterministic series of steps. The steps inside of Workflows are called Activities, which is where you encapsulate all your non-deterministic logic that needs to happen in the course of executing your Workflows. As your Workflows execute in processes called Workers, the Temporal server durably stores their execution state so that in the event of failures your Workflows can be retried or even migrated to a different Worker. This makes Workflows incredibly resilient to the sorts of transient failures Clouddriver was susceptible to. Here’s a simple example Workflow in Java that runs an Activity to send an email once every 30 days:

@WorkflowInterface
public interface SleepForDaysWorkflow {
@WorkflowMethod
void run();
}

public class SleepForDaysWorkflowImpl implements SleepForDaysWorkflow {

private final SendEmailActivities emailActivities =
Workflow.newActivityStub(
SendEmailActivities.class,
ActivityOptions.newBuilder()
.setStartToCloseTimeout(Duration.ofSeconds(10))
.build());

@Override
public void run() {
while (true) {
// Activities already carry retries/timeouts via options.
emailActivities.sendEmail();

// Pause the workflow for 30 days before sending the next email.
Workflow.sleep(Duration.ofDays(30));
}
}
}

@ActivityInterface
public interface SendEmailActivities {
void sendEmail();
}

There’s some interesting things to note about this Workflow:

  1. Workflows and Activities are just code, so you can test them using the same techniques and processes as the rest of your codebase.
  2. Activities are automatically retried by Temporal with configurable exponential backoff.
  3. Temporal manages all the execution state of the Workflow, including timers (like the one used by Workflow.sleep). If the Worker executing this workflow were to have its power cable unplugged, Temporal would ensure another Worker continues to execute it (even during the 30 day sleep).
  4. Workflow sleeps are not compute-intensive, and they don’t tie up the process.

You might already begin to see how Temporal solves a lot of the problems we had with Clouddriver. Ultimately, we decided to pull the trigger on migrating Cloud Operation execution to Temporal.

Cloud Operations with Temporal

Today, we execute Cloud Operations as Temporal workflows. Here’s what that looks like.

  1. Orca, using a Temporal client, sends a request to Temporal to execute an UntypedCloudOperationRunner Workflow. The contract of the Workflow looks something like this:
@WorkflowInterface
interface UntypedCloudOperationRunner {
/**
* Runs a cloud operation given an untyped payload.
*
* WorkflowResult is a thin wrapper around OutputType providing a standard contract for
* clients to determine if the CloudOperation was successful and fetching any errors.
*/
@WorkflowMethod
fun <OutputType : CloudOperationOutput> run(stageContext: Map<String, Any?>, operationType: String): WorkflowResult<OutputType>
}

2. The Clouddriver Temporal worker is constantly polling Temporal for work. A worker will eventually see a task for an UntypedCloudOperationRunner Workflow and start executing it.

3. Similar to before with resolution into AtomicOperations, Clouddriver does some pre-processing of the bag-of-fields in stageContext and resolves it to a strongly typed implementation of the CloudOperation Workflow interface based on the operationType input and the stageContext:

interface CloudOperation<I : CloudOperationInput, O : CloudOperationOutput> {
@WorkflowMethod
fun operate(input: I, credentials: AccountCredentials<out Any>): O
}

4. Clouddriver starts a Child Workflow execution of the CloudOperation implementation it resolved. The child workflow will execute Activities which handle the actual cloud provider API calls to mutate infrastructure.

5. Orca uses its Temporal Client to await completion of the UntypedCloudOperationRunner Workflow. Once it’s complete, Temporal notifies the client and sends the result and Orca can continue progressing the deployment.

Sequence diagram of a Cloud Operation execution with Temporal

Results and Lessons Learned from the Migration

A shiny new architecture is great, but equally important is the non-glamorous work of refactoring legacy systems to fit the new architecture. How did we integrate Temporal into critical dependencies of all Netflix engineers transparently?

The answer, of course, is a combination of abstraction and dynamic configuration. We built a CloudOperationRunner interface in Orca to encapsulate whether the Cloud Operation was being executed via the legacy path or Temporal. At runtime, Fast Properties (Netflix’s dynamic configuration system) determined which path a stage that needed to execute a Cloud Operation would take. We could set these properties quite granularly — by Stage type, cloud provider account, Spinnaker application, Cloud Operation type (createServerGroup), and cloud provider (either AWS or Titus in our case). The Spinnaker services themselves were the first to be deployed using Temporal, and within two quarters, all applications at Netflix were onboarded.

Impact

What did we have to show for it all? With Temporal as the orchestration engine for Cloud Operations, the percentage of deployments that failed due to transient Cloud Operation failures dropped from 4% to 0.0001%. For those keeping track at home, that’s a four and a half order of magnitude reduction. Virtually eliminating this failure mode for deployments was a huge win for developer productivity, especially for teams with long and complex deployment pipelines.

Beyond the improvement in deployment success metrics, we saw a number of other benefits:

  1. Orca no longer needs to directly communicate with Clouddriver to start Cloud Operations or poll their status with Temporal as the intermediary. The services are less coupled, which is a win for maintainability.
  2. Speaking of maintainability, with Temporal doing the heavy lifting of orchestration and retries inside of Clouddriver, we got to remove a lot of the homegrown logic we’d built up over the years for the same purpose.
  3. Since Temporal manages execution state, Clouddriver instances became stateless and Cloud Operation execution can bounce between instances with impunity. We can treat Clouddriver instances more like cattle and enable things like Chaos Monkey for the service which we were previously prevented from doing.
  4. Migrating Cloud Operation steps into Activities was a forcing function to re-write the logic to be idempotent. Since Temporal retries activities by default, it’s generally recommended they be idempotent. This alone fixed a number of issues that existed previously when operations were retried in Clouddriver.
  5. We set the retry timeout for Activities in Clouddriver to be two hours by default. This gives us a long leash to fix-forward or rollback Clouddriver if we introduce a regression before customer deployments fail — to them, it might just look like a deployment is taking longer than usual.
  6. Cloud Operations are much easier to introspect than before. Temporal ships with a great UI to help visualize Workflow and Activity executions, which is a huge boon for debugging live Workflows executing in production. The Temporal SDKs and server also emit a lot of useful metrics.
A Cloud Operation Workflow as seen from the Temporal UI. This operation executes 3 Activities: DescribeAutoScalingGroup, GetHookConfigurations, and ResizeServerGroup
Execution of a resizeServerGroup Cloud Operation as seen from the Temporal UI. This operation executes 3 Activities: DescribeAutoScalingGroup, GetHookConfigurations, and ResizeServerGroup

Lessons Learned

With the benefit of hindsight, there are also some lessons we can share from this migration:

1. Avoid unnecessary Child Workflows: Structuring Cloud Operations as an UntypedCloudOperationRunner Workflow that starts Child Workflows to actually execute the Cloud Operation’s logic was unnecessary and the indirection made troubleshooting more difficult. There are situations where Child Workflows are appropriate, but in this case we were using them as a tool for code organization, which is generally unnecessary. We could’ve achieved the same effect with class composition in the top-level parent Workflow.

2. Use single argument objects: At first, we structured Workflow and Activity functions with variable arguments, much as you’d write normal functions. This can be problematic for Temporal because of Temporal’s determinism constraints. Adding or removing an argument from a function signature is not a backward-compatible change, and doing so can break long-running workflows — and it’s not immediately obvious in code review your change is problematic. The preferred pattern is to use a single serializable class to host all your arguments for Workflows and Activities — these can be more freely changed without breaking determinism.

3. Separate business failures from workflow failures: We like the pattern of the WorkflowResult type that UntypedCloudOperationRunner returns in the interface above. It allows us to communicate business process failures without failing the Workflow itself and have more overall nuance in error handling. This is a pattern we’ve carried over to other Workflows we’ve implemented since.

Temporal at Netflix Today

Temporal adoption has skyrocketed at Netflix since its initial introduction for Spinnaker. Today, we have hundreds of use cases, and we’ve seen adoption double in the last year with no signs of slowing down.

One major difference between initial adoption and today is that Netflix migrated from an on-prem Temporal deployment to using Temporal Cloud, which is Temporal’s SaaS offering of the Temporal server. This has let us scale Temporal adoption while running a lean team. We’ve also built up a robust internal platform around Temporal Cloud to integrate with Netflix’s internal ecosystem and make onboarding for our developers as easy as possible. Stay tuned for a future post digging into more specifics of our Netflix Temporal platform.

Acknowledgement

We all stand on the shoulders of giants in software. I want to call out that I’m retelling the work of my two stunning colleagues Chris Smalley and Rob Zienert in this post, who were the two aforementioned engineers who introduced Temporal and carried out the migration.


How Temporal Powers Reliable Cloud Operations at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Netflix Live Origin]]> https://netflixtechblog.com/netflix-live-origin-41f1b0ad5371?source=rss----2615bd06b42e---4 https://medium.com/p/41f1b0ad5371 Mon, 15 Dec 2025 17:38:16 GMT 2025-12-15T17:38:14.921Z Xiaomei Liu, Joseph Lynch, Chris Newton

Introduction

Behind the Streams: Building a Reliable Cloud Live Streaming Pipeline for Netflix introduced the architecture of the streaming pipeline. This blog post looks at the custom Origin Server we built for Live — the Netflix Live Origin. It sits at the demarcation point between the cloud live streaming pipelines on its upstream side and the distribution system, Open Connect, Netflix’s in-house Content Delivery Network (CDN), on its downstream side, and acts as a broker managing what content makes it out to Open Connect and ultimately to the client devices.

Live Streaming Distribution and Origin Architecture

Netflix Live Origin is a multi-tenant microservice operating on EC2 instances within the AWS cloud. We lean on standard HTTP protocol features to communicate with the Live Origin. The Packager pushes segments to it using PUT requests, which place a file into storage at the particular location named in the URL. The storage location corresponds to the URL that is used when the Open Connect side issues the corresponding GET request.

Live Origin architecture is influenced by key technical decisions of the live streaming architecture. First, resilience is achieved through redundant regional live streaming pipelines, with failover orchestrated at the server-side to reduce client complexity. The implementation of epoch locking at the cloud encoder enables the origin to select a segment from either encoding pipeline. Second, Netflix adopted a manifest design with segment templates and constant segment duration to avoid frequent manifest refresh. The constant duration templates enable Origin to predict the segment publishing schedule.

Multi-pipeline and multi-region aware origin

Live streams inevitably contain defects due to the non-deterministic nature of live contribution feeds and strict real-time segment publishing timelines. Common defects include:

  • Short segments: Missing video frames and audio samples.
  • Missing segments: Entire segments are absent.
  • Segment timing discontinuity: Issues with the Track Fragment Decode Time.

Communicating segment discontinuity from the server to the client via a segment template-based manifest is impractical, and these defective segments can disrupt client streaming.

The redundant cloud streaming pipelines operate independently, encompassing distinct cloud regions, contribution feeds, encoder, and packager deployments. This independence substantially mitigates the probability of simultaneous defective segments across the dual pipelines. Owing to its strategic placement within the distribution path, the live origin naturally emerges as a component capable of intelligent candidate selection.

The Netflix Live Origin features multi-pipeline and multi-region awareness. When a segment is requested, the live origin checks candidates from each pipeline in a deterministic order, selecting the first valid one. Segment defects are detected via lightweight media inspection at the packager. This defect information is provided as metadata when the segment is published to the live origin. In the rare case of concurrent defects at the dual pipeline, the segment defects can be communicated downstream for intelligent client-side error concealment.

Open Connect streaming optimization

When the Live project started, Open Connect had become highly optimised for VOD content delivery — nginx had been chosen many years ago as the Web Server since it is highly capable in this role, and a number of enhancements had been added to it and to the underlying operating system (BSD). Unlike traditional CDNs, Open Connect is more of a distributed origin server — VOD assets are pre-positioned onto carefully selected server machines (OCAs, or Open Connect Appliances) rather than being filled on demand.

Alongside the VOD delivery, an on-demand fill system has been used for non-VOD assets — this includes artwork and the downloadable portions of the clients, etc. These are also served out of the same nginx workers, albeit under a distinct server block, using a distinct set of hostnames.

Live didn’t fit neatly into this ‘small object delivery’ model, so we extended the proxy-caching functionality of nginx to address Live-specific needs. We will touch on some of these here related to optimized interactions with the Origin Server. Look for a future blog post that will go into more details on the Open Connect side.

The segment templates provided to clients are also provided to the OCAs as part of the Live Event Configuration data. Using the Availability Start Time and Initial Segment number, the OCA is able to determine the legitimate range of segments for each event at any point in time — requests for objects outside this range can be rejected, preventing unnecessary requests going up through the fill hierarchy to the origin. If a request makes it through to the origin, and the segment isn’t available yet, the origin server will return a 404 Status Code (indicating File Not Found) with the expiration policy of that error so that it can be cached within Open Connect until just before that segment is expected to be published.

If the Live Origin knows when segments are being pushed to it, and knows what the live edge is — when a request is received for the immediately next object, rather than handing back another 404 error (which would go all the way back through Open Connect to the client), the Live Origin can ‘hold open’ the request, and service it once the segment has been published to it. By doing this, the degree of chatter within the network handling requests that arrive early has been significantly reduced. As part of this, millisecond grain caching was added to nginx to enhance the standard HTTP Cache Control, which only works at second granularity, a long time when segments are generated every 2 seconds.

Streaming metadata enhancement

The HTTP standard allows for the addition of request and response headers that can be used to provide additional information as files move between clients and servers. The HTTP headers provide notifications of events within the stream in a highly scalable way that is independently conveyed to client devices, regardless of their playback position within the stream.

These notifications are provided to the origin by the live streaming pipeline and are inserted by the origin in the form of headers, appearing on the segments generated at that point in time (and persist to future segments — they are cumulative). Whenever a segment is received at an OCA, this notification information is extracted from the response headers and used to update an in-memory data structure, keyed by event ID; and whenever a segment is served from the OCA, the latest such notification data is attached to the response. This means that, given any flow of segments into an OCA, it will always have the most recent notification data, even if all clients requesting it are behind the live edge. In fact, the notification information can be conveyed on any response, not just those supplying new segments.

Cache invalidation and origin mask

An invalidation system has been available since the early days of the project. It can be used to “flush” all content associated with an event by altering the key used when looking up objects in cache — this is done by incorporating a version number into the cache key that can then be bumped on demand. This is used during pre-event testing so that the network can be returned to a pristine state for the test with minimal fuss.

Each segment published by the Live Origin conveys the encoding pipeline it was generated by, as well as the region it was requested from. Any issues that are found after segments make their way into the network can be remedied by an enhanced invalidation system that takes such variants into account. It is possible to invalidate (that is, cause to be considered expired) segments in a range of segment numbers, but only if they were sourced from encoder A, or from Encoder A, but only if retrieved from region X.

In combination with Open Connect’s enhanced cache invalidation, the Netflix Live Origin allows selective encoding pipeline masking to exclude a range of segments from a particular pipeline when serving segments to Open Connect. The enhanced cache invalidation and origin masking enable live streaming operations to hide known problematic segments (e.g., segments causing client playback errors) from streaming clients once the bad segments are detected, protecting millions of streaming clients during the DVR playback window.

Origin storage architecture

Our original storage architecture for the Live Origin was simple: just use AWS S3 like we do for SVOD. This served us well initially for our low-traffic events, but as we scaled up we discovered that Live streaming has unique latency and workload requirements that differ significantly from on-demand where we have significant time ahead-of-time to pre-position content. While S3 met its stated uptime guarantees, our strict 2-second retry budget inherent to Live events (where every write is critical) led us to explore optimizations specifically tailored for real-time delivery at scale. AWS S3 is an amazing object store, but our Live streaming requirements were closer to those of a global low-latency highly-available database. So, we went back to the drawing board and started from the requirements. The Origin required:

  1. [HA Writes] Extremely high write availability, ideally as close to full write availability within a single AWS region, with low second replication delay to other regions. Any failed write operation within 500ms is considered a bug that must be triaged and prevented from re-occurring.
  2. [Throughput] High write throughput, with hundreds of MiB replicating across regions
  3. [Large Partitions] Efficiently support O(MiB) writes that accumulate to O(10k) keys per partition with O(GiB) total size per event.
  4. [Strong Consistency] Within the same region, we needed read-your-write semantics to hit our <1s read delay requirements (must be able to read published segments)
  5. [Origin Storm] During worst-case load involving Open Connect edge cases, we may need to handle O(GiB) of read throughput without affecting writes.

Fortunately, Netflix had previously invested in building a KeyValue Storage Abstraction that cleverly leveraged Apache Cassandra to provide chunked storage of MiB or even GiB values. This abstraction was initially built to support cloud saves of Game state. The Live use case would push the boundaries of this solution, however, in terms of availability for writes (#1), cumulative partition size (#3), and read throughput during Origin Storm (#5).

High Availability for Writes of Large Payloads

The KeyValue Payload Chunking and Compression Algorithm breaks O(MiB) work down so each part can be idempotently retried and hedged to maintain strict latency service level objectives, as well as spreading the data across the full cluster. When we combine this algorithm with Apache Cassandra’s local-quorum consistency model, which allows write availability even with an entire Availability Zone outage, plus a write-optimized Log-Structured Merge Tree (LSM) storage engine, we could meet the first four requirements. After iterating on the performance and availability of this solution, we were not only able to achieve the write availability required, but did so with a P99 tail latency that was similar to the status quo’s P50 average latency while also handling cross-region replication behind the scenes for the Origin. This new solution was significantly more expensive (as expected, databases backed by SSD cost more), but minimizing cost was not a key objective and low latency with high availability was:

Storage System Write Performance

High Availability Reads at Gbps Throughputs

Now that we solved the write reliability problem, we had to handle the Origin Storm failure case, where potentially dozens of Open Connect top-tier caches could be requesting multiple O(MiB) video segments at once. Our back-of-the-envelope calculations showed worst-case read throughput in the O(100Gbps) range, which would normally be extremely expensive for a strongly-consistent storage engine like Apache Cassandra. With careful tuning of chunk access, we were able to respond to reads at network line rate (100Gbps) from Apache Cassandra, but we observed unacceptable performance and availability degradation on concurrent writes. To resolve this issue, we introduced write-through caching of chunks using our distributed caching system EVCache, which is based on Memcached. This allows almost all reads to be served from a highly scalable cache, allowing us to easily hit 200Gbps and beyond without affecting the write path, achieving read-write separation.

Final Storage Architecture

In the final storage architecture, the Live Origin writes and reads to KeyValue, which manages a write-through cache to EVCache (memcached) and implements a safe chunking protocol that spreads large values and partitions them out across the storage cluster (Apache Cassandra). This allows almost all read load to be handled from cache, with only misses hitting the storage. This combination of cache and highly available storage has met the demanding needs of our Live Origin for over a year now.

Storage System High Level Architecture

Delivering this consistent low latency for large writes with cross-region replication and consistent write-through caching to a distributed cache required solving numerous hard problems with novel techniques, which we plan to share in detail during a future post.

Scalability and scalable architecture

Netflix’s live streaming platform must handle a high volume of diverse stream renditions for each live event. This complexity stems from supporting various video encoding formats (each with multiple encoder ladders), numerous audio options (across languages, formats, and bitrates), and different content versions (e.g., with or without advertisements). The combination of these elements, alongside concurrent event support, leads to a significant number of unique stream renditions per live event. This, in turn, necessitates a high Requests Per Second (RPS) capacity from the multi-tenant live origin service to ensure publishing-side scalability.

In addition, Netflix’s global reach presents distinct challenges to the live origin on the retrieval side. During the Tyson vs. Paul fight event in 2024, a historic peak of 65 million concurrent streams was observed. Consequently, a scalable architecture for live origin is essential for the success of large-scale live streaming.

Scaling architecture

We chose to build a highly scalable origin instead of relying on the traditional origin shields approach for better end-to-end cache consistency control and simpler system architecture. The live origin in this architecture directly connects with top-tier Open Connect nodes, which are geographically distributed across several sites. To minimize the load on the origin, only designated nodes per stream rendition at each site are permitted to directly fill from the origin.

Netflix Live Origin Scalability Architecture

While the origin service can autoscale horizontally using EC2 instances, there are other system resources that are not autoscalable, such as storage platform capacity and AWS to Open Connect backbone bandwidth capacity. Since in live streaming, not all requests to the live origin are of the same importance, the origin is designed to prioritize more critical requests over less critical requests when system resources are limited. The table below outlines the request categories, their identification, and protection methods.

Publishing isolation

Publishing traffic, unlike potentially surging CDN retrieval traffic, is predictable, making path isolation a highly effective solution. As shown in the scalability architecture diagram, the origin utilizes separate EC2 publishing and CDN stacks to protect the latency and failure-sensitive origin writes. In addition, the storage abstraction layer features distinct clusters for key-value (KV) read and KV write operations. Finally, the storage layer itself separates read (EVCache) and write (Cassandra) paths. This comprehensive path isolation facilitates independent cloud scaling of publishing and retrieval, and also prevents CDN-facing traffic surges from impacting the performance and reliability of origin publishing.

Priority rate limiting

Given Netflix’s scale, managing incoming requests during a traffic storm is challenging, especially considering non-autoscalable system resources. The Netflix Live Origin implemented priority-based rate limiting when the underlying system is under stress. This approach ensures that requests with greater user impact are prioritized to succeed, while requests with lower user impact are allowed to fail during times of stress in order to protect the streaming infrastructure and are permitted to retry later to succeed.

Leveraging Netflix’s microservice platform priority rate limiting feature, the origin prioritizes live edge traffic over DVR traffic during periods of high load on the storage platform. The live edge vs. DVR traffic detection is based on the predictable segment template. The template is further cached in memory on the origin node to enable priority rate limiting without access to the datastore, which is valuable especially during periods of high datastore stress.

To mitigate traffic surges, TTL cache control is used alongside priority rate limiting. When the low-priority traffic is impacted, the origin instructs Open Connect to slow down and cache identical requests for 5 seconds by setting a max-age = 5s and returns an HTTP 503 error code. This strategy effectively dampens traffic surges by preventing repeated requests to the origin within that 5-second window.

The following diagrams illustrate origin priority rate limiting with simulated traffic. The nliveorigin_mp41 traffic is the low-priority traffic and is mixed with other high-priority traffic. In the first row: the 1st diagram shows the request RPS, the 2nd diagram shows the percentage of request failure. In the second row, the 1st diagram shows datastore resource utilization, and the 2nd diagram shows the origin retrieval P99 latency. The results clearly show that only the low-priority traffic (nliveorigin_mp41) is impacted at datastore high utilization, and the origin request latency is under control.

Origin Priority Rate Limiting

404 storm and cache optimization

Publishing isolation and priority rate limiting successfully protect the live origin from DVR traffic storms. However, the traffic storm generated by requests for non-existent segments presents further challenges and opportunities for optimization.

The live origin structures metadata hierarchically as event > stream rendition > segment, and the segment publishing template is maintained at the stream rendition level. This hierarchical organization allows the origin to preemptively reject requests with an HTTP 404(not found)/410(Gone) error, leveraging highly cacheable event and stream rendition level metadata, avoiding unnecessary queries to the segment level metadata:

  • If the event is unknown, reject the request with 404
  • If the event is known, but the segment request timing does not match the expected publishing timing, reject the request with 404 and cache control TTL matching the expected publishing time
  • If the event is known, the requested segment is never generated or misses the retry deadline, reject the request with a 410 error, preventing the client from repeatedly requesting

At the storage layer, metadata is stored separately from media data in the control plane datastore. Unlike the media datastore, the control plane datastore does not use a distributed cache to avoid cache inconsistency. Event and rendition level metadata benefits from a high cache hit ratio when in-memory caching is utilized at the live origin instance. During traffic storms involving non-existent segments, the cache hit ratio for control plane access easily exceeds 90%.

The use of in-memory caching for metadata effectively handles 404 storms at the live origin without causing datastore stress. This metadata caching complements the storage system’s distributed media cache, providing a complete solution for traffic surge protection.

Summary

The Netflix Live Origin, built upon an optimized storage platform, is specifically designed for live streaming. It incorporates advanced media and segment publishing scheduling awareness and leverages enhanced intelligence to improve streaming quality, optimize scalability, and improve Open Connect live streaming operations.

Acknowledgement

Many teams and stunning colleagues contributed to the Netflix live origin. Special thanks to Flavio Ribeiro for advocacy and sponsorship of the live origin project; to Raj Ummadisetty, Prudhviraj Karumanchi for the storage platform; to Rosanna Lee, Hunter Ford, and Thiago Pontes for storage lifecycle management; to Ameya Vasani for e2e test framework; Thomas Symborski for orchestrator integration; to James Schek for Open Connect integration; to Kevin Wang for platform priority rate limit; to Di Li, Nathan Hubbard for origin scalability testing.


Netflix Live Origin was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[AV1 — Now Powering 30% of Netflix Streaming]]> https://netflixtechblog.com/av1-now-powering-30-of-netflix-streaming-02f592242d80?source=rss----2615bd06b42e---4 https://medium.com/p/02f592242d80 Thu, 04 Dec 2025 20:09:30 GMT 2025-12-04T20:09:25.801Z AV1 — Now Powering 30% of Netflix Streaming

Liwei Guo, Zhi Li, Sheldon Radford, Jeff Watts

Streaming video has become an integral part of our daily lives. At Netflix, our top priority is delivering the best possible entertainment experience to our members, regardless of their devices or network conditions. One of the key technologies enabling this is AV1, a modern, open video codec that is rapidly transforming both how we stream content and how users experience it. Today, AV1 powers approximately 30% of all Netflix viewing, marking a major milestone in our efforts to bring more efficient and higher-quality streaming to our members.

In this post, we’ll revisit Netflix’s AV1 journey to date, highlight emerging use cases, and share adoption trends across the device ecosystem. Having witnessed AV1’s significant impact,and with AV2 on the horizon, we’re more excited than ever about how open codecs will continue to revolutionize streaming for everyone.

AV1: A Modern, Open Codec

Since entering the streaming business in 2007, Netflix has primarily relied on H.264/AVC as its streaming format. However, we quickly recognized that a modern, open codec would benefit not only Netflix, but the entire multimedia industry. In 2015, together with a group of like-minded industry leaders, Netflix co-founded the Alliance for Open Media (AOMedia) to develop and promote next generation, open source media technologies. The AV1 codec became the first major project of this collaboration, with ambitious goals: to deliver significant improvements in compression efficiency over state-of-the-art codecs, and to introduce rich features that enable new use cases. After three years of collaborative development, AV1 was officially released in 2018.

Netflix’s AV1 Journey: From Android to TVs and Beyond

Piloting on Android Mobile

When we first set out to bring AV1 streaming to Netflix members, Android was the ideal starting point. Android’s flexibility allowed us to quickly integrate a software AV1 decoder using the efficient dav1d library, which was already optimized for ARM chipsets in mobile devices.

AV1’s superior compression efficiency was especially valuable for mobile users, many of whom are mindful of their data usage and network conditions. By adopting AV1, we were able to deliver noticeably better video quality at lower bitrates. For members relying on cellular data, this meant crisper images with fewer compression artifacts, even when bandwidth was limited. Launching AV1 support on Android in 2020 marked a significant step forward for Netflix on mobile, making high-quality streaming more accessible and enjoyable for members everywhere.

Front-and-Center for Netflix VOD Streaming

The success of our AV1 launch on Android proved its value for Netflix streaming, motivating us to expand support to smart TVs and other large-screen devices, where most of our members watch their favorite shows.

Smart TVs depend on hardware decoders for efficient high-quality playback. We worked closely with device manufacturers and SoC vendors to certify these devices, ensuring they are both conformant and performant. This collaborative effort enabled our AV1 streaming to TV devices in late 2021. Shortly thereafter, we expanded AV1 streaming to web browsers (in 2022) and continued to broaden device support. In 2023, this included Apple devices with the introduction of AV1 hardware support in the new M3 and A17 Pro chips.

As more devices began shipping with AV1 hardware support, a rapidly growing share of our members could enjoy the benefits of this advanced codec. Combined with our investment in adding AV1 streams across the entire catalog, AV1 viewing share has been consistently increasing in recent years. Today, AV1 accounts for approximately 30% of all Netflix streaming, making it our second most-used codec — and it’s on track to become number one very soon. The payoff has been substantial.

  • Elevating Streaming Experience Across the Board: Large-screen TVs and other devices demand higher bitrates to deliver stunning 4K, high frame rate (HFR) experiences. AV1’s superior compression efficiency has allowed us to provide these experiences using less data, making high-quality streaming more accessible and reliable. On average, AV1 streaming sessions achieve VMAF scores¹ that are 4.3 points higher than AVC and 0.9 points higher than HEVC sessions. At the same time, AV1 sessions use one-third less bandwidth than both AVC and HEVC, resulting in 45% fewer buffering interruptions. Moreover, Netflix’s diverse content catalog benefits universally from AV1, with improvements across all content types.
  • Driving Network Efficiency Worldwide: Netflix streams are delivered through our own content delivery network (Open Connect), in partnership with local ISPs around the globe. With more than 300 million members, Netflix streaming constitutes a non-trivial portion of global internet traffic. Because AV1 is a more efficient codec, its streams are smaller in size (while providing even better visual quality). By shifting a substantial share of our streaming to AV1, we reduce overall internet bandwidth consumption, and lessen system and network load for both Netflix and our partners.

Unlocking Advanced Experiences

In addition to its superior compression efficiency, AV1 was designed to support a rich set of features. Once we established a robust framework for the continuous expansion of AV1 streaming, we quickly shifted our focus towards exploring AV1’s unique features to unlock even more advanced and immersive experiences for our members.

High-Dynamic-Range(HDR)
HDR brings enhanced detail, vivid colors, and greater clarity to images. As a premium streaming service, Netflix has been a pioneer in adopting HDR, offering HDR streaming since 2016. In March 2025, we launched AV1 HDR streaming. We chose HDR10+ as the HDR format for its use of dynamic metadata, which enabled us to adapt the tone mapping per device in a scene-dependent manner.

As anticipated, the combination of AV1 and HDR10+ allows us to deliver images with greater detail, more vibrant colors, and an overall heightened sense of immersion for our members. At the moment, 85% of our HDR catalog (from the perspective of view-hours) has AV1-HDR10+ coverage, and this number is expected to reach 100% in the next couple of months.

Photographs of devices displaying the same (cropped) frame with HDR10 metadata (left) and HDR10+ metadata (right). Notice the preservation of the flashlight detail in the HDR10+ capture, and the over-exposure of the region under the flashlight in the HDR10 one.

Cinematic Film Grain
Film grain is a hallmark of the cinematic experience, widely used in the movie industry to enhance a film’s depth, texture, and realism. However, because film grain is inherently random, faithfully representing it in digital video requires a significant amount of data. This presents a unique challenge for streaming: restricting the bitrate can result in grain that appears unnatural or distorted, while increasing the bitrate to accurately preserve cinematic grain almost inevitably leads to elevated rebuffering. The AV1 specification incorporates a unique solution called Film Grain Synthesis (FGS). Instead of encoding grain as part of every frame, the grain is stripped out before encoding and then resynthesized at the decoder using parameters sent in the bitstream, delivering a realistic cinematic film grain experience without the usual data costs.

This approach represents a significant shift from traditional compression and streaming techniques. Our team invested substantial effort in fine-tuning the media processing pipeline, ensuring FGS delivers robust performance at scale. In July 2025, we successfully productized AV1 FGS, and the results were astonishing: AV1 with FGS could deliver videos with cinematic film grain at a bitrate well within the capabilities of typical household internet connections. For non-FGS AV1 encodings, even at much higher bitrate, they may not be able to achieve comparable quality.

The same (cropped) frame from source (left), regular AV1 stream encoded at 8274kbps (middle) and AV1 FGS stream encoded at 2804 kbps (right). The AV1 FGS stream reduces the bitrate by 66% while delivering clearly better quality.

Beyond VOD Streaming

So far, our AV1 journey has been mainly on VOD, but we see significant opportunities for AV1 beyond traditional VOD streaming. On a mission to entertain the world, Netflix has constantly explored and established other ways to bring joy to our members, and we believe AV1 could contribute to the success of these new products.

Live Streaming
Debuting in 2023, live streaming has experienced rapid growth at Netflix, becoming a key part of our streaming offerings in just two short years. We are actively evaluating the use of AV1 in live streaming, as we believe it could help further scale Netflix’s live programming:

  • Hyper-scale concurrent viewership: Live streaming at Netflix means delivering content to tens of millions of viewers simultaneously. AV1’s superior compression efficiency could significantly reduce the required bandwidth, enabling us to deliver high-quality live experiences to large audiences without compromising video quality.
  • Customizable graphics overlay: for live sport events such as football, tennis and boxing, graphics overlays have become an integral part of the member experience — from embedding game statistics to delivering sponsorships. AV1 offers an opportunity to make the graphics highly customizable: layered coding is supported in AV1’s main profile, allowing encoding the main content in the base layer, and graphics in the enhancement layer, and easily swapping out one version of the enhancement layer with another. We envision that the use of AV1’s layered coding can greatly simplify the live streaming workflow and reduce delivery costs.

Cloud Gaming
Cloud gaming is a new Netflix offering that is currently in the beta phase and is available to members in select countries. The game engines run on cloud servers, while the rendered graphics are streamed directly to members’ devices. By removing barriers and transforming every Netflix-enabled device into a game console, Cloud gaming aims to deliver a seamless, “play anywhere” experience for our members. For a glimpse of this in action, watch as Co-CEO Greg Peters and CTO Elizabeth Stone play a round of Boggle Party — powered entirely by Netflix’s cloud gaming platform!

Unlike traditional video streaming, cloud gaming requires that every player action is reflected instantly on the screen to ensure a responsive and immersive experience. This makes delivering high-quality video frames with extremely low latency, despite fluctuating network conditions, one of the biggest challenges in cloud gaming.

Our team is actively working on productizing AV1 for cloud gaming. Given AV1’s high compression efficiency, we can reduce frame sizes, helping video frames get through even when network conditions become challenging. This positions AV1 as a promising technology for enabling a high-quality, low-latency gaming experience across a wide range of devices.

A Device Ecosystem United for AV1

Netflix is a streaming company, and we have worked diligently to create highly efficient and standards-conformant AV1 streams for our catalog. However, an equally, if not more, important factor in AV1’s success is the widespread support from device manufacturers. Throughout our AV1 journey, we have been impressed by the unprecedented pace at which the device ecosystem has embraced AV1.

Just six months after the AV1 specification was finalized, the open-source AV1 decoder library sponsored by AOM, dav1d, was released. Small, performant, and highly resource-efficient, dav1d bridged the gap for early adopters like Netflix while hardware solutions were still in development. Continuous improvements to its performance and compatibility have made dav1d the preferred choice for a wide range of platforms and practical applications. Today, it serves as Android’s default software decoder. Additionally, it plays a key role in web browsers — for Netflix, it powers approximately 40% of our browser playback. This broad adoption has significantly expanded access to high-quality AV1 streaming, even in the absence of dedicated hardware decoders.

Netflix maintains a close working relationship with device manufacturers and SoC vendors, and we have witnessed first-hand their enthusiasm for adopting AV1. To ensure optimal streaming performance, Netflix has a rigorous certification process to verify proper support for our streaming formats on devices. AV1 was added to this certification process in 2019, and since then, we have seen a steady increase in the number of devices with full AV1 decoding capabilities. Over the past five years (2021–2025), 88% of large-screen devices, including TVs, set-top boxes, and streaming sticks, submitted for Netflix certification have supported AV1, with the vast majority offering full 4K@60fps capability. Notably, since 2023, almost all devices we have received for certification are AV1-capable.

We have also been impressed by the robustness of AV1 implementations across these devices. As mentioned earlier, FGS is an innovative tool that departs from traditional codec architectures and was not included in our initial full-scale AV1 streaming rollout. When we launched FGS this July, we worked closely with our partners to ensure broad device compatibility. We are pleased with the successful progress made, and AV1 with FGS is now supported across a significant and growing number of in-field devices.

Looking Ahead: AV1 Today, AV2 Tomorrow

As we reflect on our AV1 journey, it’s clear that the codec has already transformed the streaming experience for hundreds of millions of Netflix members worldwide. Thanks to industry-wide collaboration and rapid device adoption, AV1 is delivering higher quality, greater efficiency, and new cinematic features to more screens than ever before.

Looking ahead, we are excited about the forthcoming release of AV2, announced by the Alliance for Open Media for the end of 2025. AV2 is poised to set a new benchmark for compression efficiency and streaming capabilities, building on the solid foundation laid by AV1. At Netflix, we remain committed to adopting the best open technologies to delight our members around the globe. While AV2 represents the future of streaming, AV1 is very much the present — serving as the backbone of our platform and powering exceptional entertainment experiences across a vast and ever-expanding ecosystem of devices.

Acknowledgement

The success of AV1 at Netflix is the result of the dedication, expertise, and collaboration of many teams across the company — including Encoding, Clients, Device Certification, Partner Engineering, Data Science & Engineering, Infra, Platform, etc.

We would also like to thank Artem Danylenko, Aditya Mavlankar, Anne Aaron, Cyril Concolato, Allan Zhou and Anush Moorthy for their valuable comments and feedback on earlier drafts of this post.

Footnotes

  1. These numbers represent a snapshot of data from November 13, 2025. Actual values may vary slightly from day to day and across different regions, depending on the mix of content, devices, and internet connectivity.

AV1 — Now Powering 30% of Netflix Streaming was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>