<![CDATA[Tubi Engineering - Medium]]> https://code.tubitv.com?source=rss----644fdce427d3---4 https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png Tubi Engineering - Medium https://code.tubitv.com?source=rss----644fdce427d3---4 Medium Wed, 18 Mar 2026 00:18:36 GMT <![CDATA[Releasing RedisCluster for Elixir]]> https://code.tubitv.com/releasing-rediscluster-for-elixir-33a8ffc379c9?source=rss----644fdce427d3---4 https://medium.com/p/33a8ffc379c9 Mon, 02 Mar 2026 23:55:38 GMT 2026-03-02T23:55:37.041Z

The RedisCluster library was created for Tubi’s needs in scaling Redis clusters and making them more reliable and cost effective. It is now being released publicly because the Elixir community deserves nice things too. Check it out on GitHub.

What is Redis?

For those unfamiliar, Redis is a common key-value storage system. It supports hundreds of commands across a dozen data types. The Redis engine runs as a single-threaded process, servicing commands sequentially. Data is stored in a master node with zero or more replicas, which can handle read-only operations. To scale further, the data can be sharded across several master nodes, each with its own replicas.

Motivation

Our biggest challenge with Redis has not been memory capacity or CPU. It is bandwidth. Our Redis usage is very read-heavy, and Tubi’s viewership is growing far faster than the content being served. This requires more bandwidth, especially during large events.

Before RedisCluster, our best way to raise our bandwidth cap was to increase the instance sizes, i.e. vertical scaling. Since Redis is single-threaded, the added CPU resources are largely wasted. ElastiCache’s enhanced I/O multiplexing can use some of this extra compute, but the Redis engine is still the limiting factor.

We could add more master nodes, i.e. horizontal scaling. However, this requires resharding the data. That is not a good solution when the cluster is already under stress. It takes time, and bandwidth, to shuffle the data around all the nodes.

The other horizontal scaling option is adding replicas. This is less intrusive than resharding the entire cluster and easier than changing the instance size of every node. It also lets us keep each instance appropriately sized.

The catch, however, is that none of the current Elixir or Erlang Redis libraries directly support replicas.

Goals

Here are the key features we wanted to achieve with RedisCluster.

Cluster Support

Cluster support is vital. Simple Redis replication only scales so far. Eventually the data must be broken into shards. RedisCluster handles cluster discovery and redirects automatically.

Replica Support

As mentioned, this is the key differentiator. We needed replicas, specifically so we could read data from them and take load off the masters.

We have actually used replicas for a long time for redundancy. However, without library support they sat idle, waiting for rare moments to replace a failed master node. It is costly to pay for idle, over-provisioned servers.

Easy Parallel Operations

With a sharded Redis cluster, we cannot send a simple MGET command to fetch a batch of values. Each key must be mapped to its hash slot, and each slot to the node that owns it. This results in many requests to several nodes. We have found substantial latency improvements by sending these requests in parallel. Out of the box, RedisCluster provides parallel operations for GET, SET, and DEL.
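To make the key-to-slot mapping concrete, here is a minimal Python sketch of Redis Cluster’s key hashing (CRC16-CCITT modulo 16,384 slots) and of grouping a batch of keys by slot before fanning requests out in parallel. It illustrates the wire protocol’s hashing rules, not RedisCluster’s Elixir internals.

```python
from collections import defaultdict

def crc16(data: bytes) -> int:
    # CRC16-CCITT (XModem), the checksum Redis Cluster uses for key hashing.
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # Honor "hash tags": if the key contains {...} with a non-empty body,
    # only that substring is hashed, pinning related keys to one slot.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1 : end]
    return crc16(key.encode()) % 16384

def group_by_slot(keys):
    # A sharded MGET becomes one request per slot group, sent in parallel.
    groups = defaultdict(list)
    for key in keys:
        groups[hash_slot(key)].append(key)
    return dict(groups)
```

Each slot group can then be dispatched concurrently to the node that owns it, which is where the latency win comes from.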

Easy Extensibility

With well over 200 available Redis commands, RedisCluster cannot (and should not) implement them all. Instead it implements the common commands such as GET, SET, and DEL. RedisCluster includes examples for executing these commands efficiently in parallel. Projects that need to support Hashes, Streams, Bitmaps, or others can easily add the commands they need.

For example, if you need to use the Hash datatype, you can make a wrapper module. This will include all the basics plus your custom functions.

defmodule MyApp.Redis do
  use RedisCluster, otp_app: :my_app

  def hset(key, pairs) do
    fields = Enum.flat_map(pairs, fn {k, v} -> [k, v] end)
    cmd = ["HSET", key | fields]

    RedisCluster.Cluster.command(config(), cmd, key, role: :master)
  end

  def hget(key, field) do
    cmd = ["HGET", key, field]
    # Get the field from the owning master or a replica.
    RedisCluster.Cluster.command(config(), cmd, key, role: :any)
  end

  def hdel(_key, []) do
    0
  end

  def hdel(key, fields) do
    cmd = ["HDEL", key | fields]
    RedisCluster.Cluster.command(config(), cmd, key, role: :master)
  end
end

Developer Experience

RedisCluster includes several other features to make it easy to work with.

Standalone Redis

For a standalone Redis instance, RedisCluster is overkill. However, it can be convenient to use a standalone Redis instance for staging or testing. RedisCluster will happily work in each environment.

Telemetry

Using the standard :telemetry library, you can easily hook in and track key events. This includes sending commands, discovering cluster topology, checking out connections, and handling redirects.

Connection Pooling

RedisCluster manages several connections to the nodes in a cluster. It creates connections per node, rather than per hash slot. This keeps a consistent number of connections to each node even when the hash slot assignments become fragmented.

Broadcasting

With RedisCluster.Cluster.broadcast, you can send a batch of commands to all nodes. This is great for getting a quick health check with DBSIZE or INFO MEMORY.

Livebook Support

RedisCluster integrates into Livebook with custom cells. This makes it easy to put together demo and test notebooks. See the example below.

Foundation

We did not start from scratch. The excellent Redix library by Andrea Leopardi does the heavy lifting. It opens the connections and speaks the protocol. RedisCluster extends it with cluster support.

Obligatory hash ring diagram
Example topology: 4 shards with 2 replicas in each.

Like many distributed caching solutions, Redis stores data in a hash ring. Each node owns a portion of that ring. Ownership is not always contiguous. RedisCluster queries the cluster to get this ownership data. It can then quickly find which server to ask for the desired data.

Clusters are not static entities. Nodes can be added and removed. When Redis notifies clients that a key-value pair has moved, RedisCluster then rediscovers the new layout.

RedisCluster does not actively monitor the cluster for topology changes. It waits until it receives a MOVED redirect. With high traffic the topology changes are picked up quickly, avoiding the overhead of polling the cluster info.
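For illustration, a MOVED redirect is easy to parse. This hypothetical Python helper shows the shape of the reply that triggers rediscovery; RedisCluster does the equivalent in Elixir.

```python
def parse_moved(error: str):
    # A redirect reply looks like: "MOVED 3999 127.0.0.1:6381" --
    # the slot that moved, plus the host:port now serving it.
    kind, slot, addr = error.split()
    assert kind in ("MOVED", "ASK")
    host, _, port = addr.rpartition(":")
    return int(slot), host, int(port)
```

On seeing such a reply, a cluster-aware client retries against the indicated node and refreshes its slot table.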

Results

The effects of reading from both masters and replicas were immediate.

With our read-heavy load, the Redis engine’s CPU use is generally low. Even so, you can see CPU use jump up on the replicas and fall on the masters.

Redis engine CPU usage

There’s a similar trend in outgoing network bytes. The replicas and masters don’t end up equal since the masters have to sync updates with their respective replicas.

Network bytes out

More dramatically, you can see the Get-type commands (GET, HGET, etc.) cut in half per node. You can also see one node running especially hot because of a popular key. We have since fixed the hot key, but until then the replicas relieved a lot of strain on the masters.

Get-type command counts per node

We also saw a sharp decrease in the latency of hash-based commands (HGET, HSET, HMGET, HGETALL).

Hash-based commands latency

For the metrics pictured above, a single replica per master was enough to improve the metrics substantially. Under heavier traffic, more replicas could handle higher loads.

Try it Out

RedisCluster is designed to work nicely with Livebooks. It provides a couple of smart cells: one for connecting and another for running pipelines.

Connecting chiefly involves setting the hostname and port.

Once connected, you can send a pipeline of commands.

The cells can be converted to code and customized further.

RedisCluster has a demo Livebook you can try now. It also explains important Redis concepts you may not have known.

Conclusion

RedisCluster came out of practical necessity, not as a side project. It solves real scaling challenges we’ve faced in production and fits naturally into existing Elixir systems. If you’re dealing with clusters, replicas, or just tired of reinventing this wheel, we hope RedisCluster saves you time and effort. Let us know how it works for you — or where it falls short.


Releasing RedisCluster for Elixir was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[AdRise Brand Intelligence: How We Use AI to recognize brands in ad video]]> https://code.tubitv.com/adrise-brand-intelligence-how-we-use-ai-to-recognize-brands-in-ad-video-bc036aeeabf1?source=rss----644fdce427d3---4 https://medium.com/p/bc036aeeabf1 Fri, 06 Feb 2026 00:02:20 GMT 2026-02-06T00:02:19.478Z

I. The Core Problem and the AI Breakthrough

Every day, millions of video advertisements flow through streaming platforms, each representing a brand trying to reach its audience. Behind the scenes, a critical question must be answered thousands of times per second: Which brand entity is actually behind this ad?

For years, the industry has relied on traditional computer vision approaches — specifically, logo detection powered by deep learning object detectors. These systems work by splitting videos into frames and scanning each frame for known brand logos. While this approach has served the industry adequately for simple use cases, it fundamentally breaks down when faced with the scale, diversity, and sophistication of modern advertising content.

At AdRise, we recognized this bottleneck was directly impacting core business operations at Tubi. We proved that a carefully engineered AI solution — one based on Large Language Models (LLMs) and semantic reasoning — could automate this process and dramatically increase data quality. Our project was greenlit to provide the Tubi Revenue, Finance, and Operations teams with the high-quality, real-time data needed to make critical business decisions, with additional benefits for our adtech capabilities. This marked our fundamental shift from low-accuracy logo detection to high-fidelity brand understanding.

II. Why Traditional Visual AI Struggles with Modern Ad Content

Traditional visual AI systems face three limitations that make them inadequate for modern streaming platforms.

The Limitations

1. The Cold Start and Maintenance Complexity

Logo detection systems require expensive manual annotation and model retraining for every new brand or logo redesign. For mid-market, regional, and direct-to-consumer brands, which form a significant part of inventory, this is economically prohibitive, leading to a system that is perpetually out of date.

2. Visual Ambiguity and Physical Limitations

Pure visual systems struggle with logos that are small, partially obscured, or viewed from unusual angles. They lack the human-like reasoning to distinguish a subject’s logo from a background brand. For example, the logo-based system might wrongly identify a cereal ad as being for a refrigerator brand, simply because the logo on a refrigerator is clearly visible in the kitchen background.

3. The Context Gap

Modern advertisers often use subtle cues like product placement, unique packaging, or voiceovers mentioning brand names, rather than prominent logos. The traditional visual system is blind to these semantic cues because it doesn’t “understand” what it’s seeing; it only detects visual patterns. It cannot connect a spoken phrase like “the ultimate driving machine” with a luxury automotive brand if the logo isn’t visible.

Real-World Impact

Consider a real-world scenario: an advertisement for a specific laundry detergent brand is filmed in a major retail partner. The legacy logo detection system confidently identifies the large, clear logo of the retailer in the background and mistakenly attributes the ad to the retailer. The actual advertiser is represented only by subtle visual cues. The business impact is severe: the system misses the true advertiser, leading to incorrect campaign attribution and substantial cumulative loss in revenue and targeting effectiveness.

III. The Shift: LLMs for Semantic Understanding

Our solution represents a fundamental rethinking of the brand detection problem. Instead of asking “Can we detect logos more accurately?”, we asked a different question: “Can we teach a system to understand what an ad is about, the way a human would?”

This shift in perspective led us to LLMs — systems that can reason about content semantically rather than just recognizing visual patterns. But using LLMs for video understanding is far from straightforward. Videos are temporal, high-dimensional, and computationally expensive to process. Brand names are diverse and brand relationships are complex. How do you feed a video to a language model and get an accurate brand entity?

The answer lies in a two-system architecture: a real-time recognition pipeline and an offline knowledge graph maintenance engine.

System 1: The Online Brand Recognition Architecture

Let’s start with the production system that processes incoming ads. This is a two-stage pipeline where each stage uses a different LLM for a specific purpose, transforming from raw pixels and audio to a canonical brand entity.

Caption: A detailed view of the System 1 online recognition pipeline. Video and audio are processed into a multi-modal prompt, which LLM #1 uses to extract a raw brand name. LLM #2 then uses tools to query the knowledge graph and resolve that name to a canonical entity.

Stage 1: Raw Brand Extraction via Temporal Montage

The first challenge is converting a temporal video into something a language model can process efficiently. Our solution is the temporal montage — a technique that compresses an entire video timeline into a single composite image.

For a typical 15–60 second advertisement, we:

  1. Extract frames at one-second intervals (15–60 frames total).
  2. Resize each frame to a manageable resolution.
  3. Stitch them into a grid arrangement (e.g., 6 columns, N rows) reading left-to-right, top-to-bottom.

The resulting montage is a single image that preserves temporal ordering, like a comic strip or storyboard. This compact representation reduces the LLM’s visual token consumption from 30+ separate image inputs to a single composite image.
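As a rough sketch of the grid assembly (using NumPy arrays in place of decoded video frames; `build_montage` is a hypothetical helper, not part of the production pipeline):

```python
import numpy as np

def build_montage(frames, cols=6):
    # frames: list of equally sized (H, W, 3) uint8 arrays sampled at 1 fps.
    h, w, c = frames[0].shape
    rows = -(-len(frames) // cols)  # ceiling division
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)  # left-to-right, top-to-bottom ordering
        grid[r * h : (r + 1) * h, col * w : (col + 1) * w] = frame
    return grid
```

A 15-frame ad with 6 columns yields a 3-row grid, with unused cells left black.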

But the visual montage alone isn’t enough. Audio content is equally important. We transcribe the ad’s audio track and provide it to an advanced multi-modal LLM alongside the montage.

The prompt instructs the LLM to:

  • Synthesize the visual narrative with the spoken content.
  • Correct transcription errors by referencing visual evidence (brand names are sometimes misspelled by speech-to-text).
  • Identify the primary advertiser, ignoring incidental background brands.
  • Output the raw brand name and description with a confidence level.

The most astonishing finding from our experiments was the LLM’s ability to perform temporal reasoning from these static montages. When presented with a grid of frames, the model could identify the narrative arc (“The ad shows a problem, then introduces the product”), recognize fleeting text overlays, and track objects through the sequence.

The output of this first stage is a simple, unstructured string. This needs to be mapped to a canonical entity, which brings us to Stage 2.

Stage 2: Entity Resolution via Knowledge Graph

The raw brand name must be mapped to a canonical brand entity in our Neo4j knowledge graph. This is where the second LLM (a gpt-5 model) comes in, acting as a reasoning agent with access to graph search tools.

This stage uses function calling (tool use), where the LLM can invoke a set of graph search APIs we provide:

  • semantic_search(query): Finds brands by meaning similarity using vector embeddings. This is useful for catching thematic relationships.
  • fulltext_search(query): Finds brands by keyword matching on names and aliases. This is good for direct hits.
  • get_ancestors(brand_id): Explores parent/child relationships to find the ultimate corporate owner.
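In OpenAI-style function calling, those three APIs would be declared roughly as follows. The JSON schemas below are assumptions for illustration; the post specifies only the function names and arguments.

```python
def tool(name, arg, description):
    # Minimal function-calling schema: each graph API takes one string argument.
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {arg: {"type": "string"}},
                "required": [arg],
            },
        },
    }

tools = [
    tool("semantic_search", "query",
         "Find brands by meaning similarity using vector embeddings."),
    tool("fulltext_search", "query",
         "Find brands by keyword match on names and aliases."),
    tool("get_ancestors", "brand_id",
         "Walk ownership edges up to the ultimate corporate owner."),
]
```

The model picks among these tools each round, and the runtime executes the corresponding graph query.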

The LLM orchestrates a multi-round dialogue with these tools. For an illustrative input “Zenith of Maplewood” accompanied by its context-rich description, the flow looks like this:

  1. Initial Query: The LLM first calls semantic_search(“Zenith of Maplewood”). The top result is “Zenith Motors Corporation” with a high similarity score.
  2. Hypothesis & Verification: The LLM hypothesizes that “Zenith of Maplewood” is a dealership based on the description. It reasons that the pattern [Brand] of [Location] is common for regional franchises. To verify, it might call get_ancestors(“zenith motors corporation”) to confirm Zenith is a root-level automotive company.
  3. Final Decision: Based on the high semantic similarity and the dealership pattern, the LLM confidently concludes this is a dealership-derived match and maps the ad to the canonical parent, Zenith Motors Corporation.

This two-LLM approach provides specialization and debuggability. The first model focuses purely on video understanding; the second on graph-based reasoning. We can inspect the intermediate raw brand name to diagnose failures in either stage.

System 2: The Offline Knowledge Graph Maintenance Engine

While System 1 handles incoming ads, System 2 works in the background to continuously grow and refine the brand knowledge graph. This is where our multi-agent architecture shines.

Caption: The System 2 offline maintenance pipeline. A new brand seed triggers a three-agent process (Triage, Research, Curator) that results in an atomic update to the knowledge graph.

The Knowledge Graph Structure

Our graph schema is deliberately minimal but expressive. It focuses purely on Brand entities and their Ownership relationships. This streamlined structure allows us to efficiently model complex corporate hierarchies — from local franchises up to global holding companies — enabling rapid traversal and semantic searching without getting bogged down in unnecessary metadata.
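In miniature, the schema reduces to Brand nodes plus child-to-parent Ownership edges. This sketch (brand names borrowed from the anonymized case study later in the post; the helper name is hypothetical) shows how ancestor traversal works:

```python
# Child -> parent Ownership edges over Brand nodes.
OWNERSHIP = {
    "apex-luminawash": "apex-lumina",
    "apex-lumina": "apex-cg",
    "apex-aura": "apex-cg",
}

def get_ancestors(brand_id):
    # Follow Ownership edges upward until a root holding company is reached.
    chain = []
    while brand_id in OWNERSHIP:
        brand_id = OWNERSHIP[brand_id]
        chain.append(brand_id)
    return chain
```

In production this traversal runs as a Neo4j query, but the shape of the answer is the same: the chain of owners up to the root.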

The Multi-Agent Research Pipeline

Any brand name and description appearing in the platform becomes a “seed” that triggers the offline research pipeline. This pipeline leverages the OpenAI Agents SDK and uses three specialized agents working in sequence.

Agent 1: TriageAgent — The Intelligent Gatekeeper

The first agent prevents unnecessary work. It uses lightweight graph queries and web popularity signals to quickly determine if the brand is already known, a simple derivative of a known brand, or genuinely new. Based on its findings, it can decide to Skip research, or trigger LIGHT or FULL research modes, optimizing for cost and speed.

Agent 2: ResearchAgent — The Knowledge Gatherer

When research is needed, this agent conducts web research to understand the brand’s corporate structure. In LIGHT mode, it uses a faster, cheaper model to establish core parent-child relationships. In FULL mode, it uses a more powerful deep research model to conduct comprehensive investigations into complex ownership chains and regional variants. The agent produces a human-readable brand family research report in markdown, citing its sources.

Agent 3: CuratorAgent — The Structure Extractor

The final agent parses the research report and extracts structured data. It identifies all brand entities and their relationships, then builds a batch of proposed changes. It submits this batch to our commit API for an atomic brand graph update.

Ensuring Data Integrity

To keep this graph reliable while multiple agents work concurrently, we utilize a strict atomic commit protocol. Just like a financial transaction, every proposed update to the graph is validated for consistency — checking for duplicates or circular ownership anomalies — before being applied in an all-or-nothing operation. This ensures our brand intelligence remains a trusted source of truth, free from corrupted or conflicting data.
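One of those consistency checks, rejecting circular ownership, can be sketched as follows. This is a simplified stand-in for the actual commit validation:

```python
def has_ownership_cycle(edges):
    # edges: {child: parent}. A batch is rejected if following parent
    # links from any node ever revisits a node already on the path.
    for start in edges:
        seen = {start}
        node = start
        while node in edges:
            node = edges[node]
            if node in seen:
                return True
            seen.add(node)
    return False
```

A batch that fails this check (or the duplicate check) is rejected in full, keeping the all-or-nothing guarantee.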

IV. Results and Business Impact

The transformation from logo detection to semantic understanding delivered measurable improvements across every key metric.

A big increase in brand coverage means we can now recognize thousands of additional brands without any explicit training, including regional advertisers and emerging direct-to-consumer brands that were previously invisible.

The high lift in accuracy means fewer attribution errors, which translates directly to more accurate campaign reporting and better contextual targeting. At the scale of millions of ads per month, this has a significant positive impact on revenue.

Case Study

To illustrate this impact, let’s analyze a representative advertisement. (Note: Brand names and entities in this example have been anonymized for illustrative purposes.) This case demonstrates how our system resolves complex attribution challenges that often lead to gaps in traditional logo-based methods.

Below are the two primary inputs received for the “LuminaWash Gel” advertisement:

  1. Temporal Montage: The entire visual narrative of the video — including laundry scenes, product close-ups, and fast-wash icons — is compressed into a single composite image.
  2. Audio Transcript: The full-text transcription of the voiceover: “LuminaWash Gel. Just flip it, pop it, and squeeze it. Powerful cleaning even in 15 minutes. It’s brilliant.”

Caption: Montage image extracted from the case ad video.

The solution is then handled by our two-stage online recognition architecture (System 1):

  • Stage 1: Raw Brand Extraction. The multi-modal LLM (LLM #1) synthesizes the visual narrative from the montage with the spoken content from the transcript. Even if the transcript contains minor errors, the model references visual evidence to accurately extract the raw name: “LuminaWash Gel”.
  • Stage 2: Entity Resolution via Knowledge Graph. This raw string is passed to the reasoning model (LLM #2), which orchestrates tool use to find the matching entity and its root parent within our graph.
Apex Consumer Group [id: apex-cg] - Global consumer-goods giant.
├── Lumina [id: apex-lumina] - International fabric and home care brand.
│ └── LuminaWash [id: apex-luminawash] - High-efficiency laundry gel line.
├── Aura [id: apex-aura] - Premium air-care and home fragrance.
├── VitaPure [id: apex-vitapure] - Personal hygiene and skin care line.
└── ……

Caption: An illustrative brand family tree generated by System 2 to model complex corporate hierarchies.

The system correctly attributes the “LuminaWash Gel” advertisement to its ultimate corporate owner, Apex Consumer Group, even though the parent company’s name is never mentioned in the video or audio. This high lift in accuracy reduces attribution errors and directly improves campaign reporting and contextual targeting.

V. The Path Forward

This architecture establishes a foundational pattern for contextual intelligence in AdTech. The journey from traditional logo detection to semantic brand understanding represents more than a technical upgrade — it’s a fundamental rethinking of how machines comprehend video content, showcasing AdRise’s ability to execute cutting-edge solutions using AI & LLMs.

This project currently serves platforms like Tubi, optimizing critical revenue, finance, and operations decisions, and ensuring reliable spend attribution and enhanced competitive separation. Moving forward, AdRise is positioned to integrate this scalable and transformative capability to serve more streaming platforms, unlocking advanced ad serving features and greater monetization efficiency across the industry. The shift from detection to understanding is just beginning — promising even greater autonomy and relevance for advertisers and viewers alike.


AdRise Brand Intelligence: How We Use AI to recognize brands in ad video was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Beyond Discovery: Teaching Recommenders to Resume the Story]]> https://code.tubitv.com/beyond-discovery-teaching-recommenders-to-resume-the-story-0dd93aca16d0?source=rss----644fdce427d3---4 https://medium.com/p/0dd93aca16d0 Fri, 30 Jan 2026 00:09:42 GMT 2026-01-30T00:34:24.189Z

Overview

Every streaming session begins with a question: “Do I want to start something new, or continue what I left unfinished?”

  • In the exploration mode, users browse, discover, and sample new content — movies, shows, or genres they’ve never seen before.
  • In the continuation mode, they return to finish what they’ve started — a half-watched movie, or a series they plan to pick up again when the next season drops.

Most recommender systems, however, don’t explicitly distinguish between these two modes. Their training data simply reflects whether a user interacted with an item — not why. As a result, the model learns correlations based on aggregate watch patterns, without understanding the underlying intent. When a user opens the app, the system doesn’t know whether they’re in an exploration mindset (looking for something new) or a continuation mindset (wanting to resume something familiar).

This ambiguity becomes especially important for long-form streaming platforms, where users often return to ongoing stories after long gaps between seasons. That’s the challenge we faced at Tubi.

Tubi is a free, ad-supported streaming service that offers thousands of movies and TV shows across every genre imaginable — from Hollywood hits to hidden indie gems. Unlike short-video apps, where engagement is driven by endless novelty and rapid discovery, Tubi focuses on long-form storytelling. Our users invest time — starting a movie, pausing midway, and coming back later, or watching a series across multiple seasons.

This means our recommendation system must do more than predict what is new or trending. It must recognize what users are trying to continue. That realization led us to design a new class of features we call unfinished signals, which capture a user’s intent to resume partially watched content. These signals help the system remember what a user started but has not finished, providing longer term context that many traditional recommenders do not model explicitly.

In this post, we share how we improved Tubi’s recommendation system by teaching it to better recognize viewing intent, whether a user is exploring something new or returning to something familiar.

  • Our approach centers on a set of affinity signals that reflect a user’s longer term relationship with content, including measures of completion, freshness, and format, since behaviors differ significantly between episodic and one-shot viewing.
  • We then explored how to incorporate these signals into deep models using feature transformations that make skewed behavioral signals easier for the model to learn from, and more robust under real traffic.
  • Finally, we strengthened long term modeling by summarizing historical interactions in a way that emphasizes stronger continuation intent, allowing unfinished, high affinity titles to remain salient even when they are not the most recent.

We also carried the same idea into the retrieval stage so unfinished titles could be surfaced earlier in the pipeline, showing that unfinished signals can improve long term memory throughout the personalization stack.

Defining Unfinished Signals

Before we could model unfinished-watching behavior, we first needed to define it in measurable terms.

While the idea of “unfinished” seems intuitive — a user starts something but hasn’t completed it — translating that into reliable features requires careful consideration of content structure, temporal context, and Tubi’s platform characteristics.

1. Completion Ratio: Quantifying Continuation

The notion of “completion” naturally differs between movies and TV series, which follow fundamentally distinct engagement patterns.

  • Movies are typically one-shot experiences — users either finish them or stop partway through.
  • TV shows, in contrast, are episodic and sequential — users often watch several episodes, pause, and return when a new season arrives.

To reflect these behavioral differences, we defined completion separately for each type:

  • Movies: watched time / total duration
  • TV Shows: watched episodes / total episodes

This definition aligns closely with how users engage with long-form storytelling on Tubi.

For TV series in particular, we measure completion within each season rather than across multiple seasons — a deliberate simplification that reflects Tubi’s annual content release cadence.

For example, a user who watched half of Season 1 but hasn’t started Season 2 would have a completion ratio near 0.5, signaling a strong continuation potential once the next season becomes available.
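In code, the two definitions amount to a small branch on content type. The field names below are illustrative, not Tubi’s actual feature schema:

```python
def completion_ratio(item: dict) -> float:
    # Movies: fraction of runtime watched. TV: fraction of the current
    # season's episodes watched (completion is measured per season).
    if item["type"] == "movie":
        return item["watched_seconds"] / item["duration_seconds"]
    return item["watched_episodes"] / item["season_episodes"]
```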

2. Recency: Capturing Temporal Relevance

Completion alone doesn’t capture when the engagement occurred, which is critical for understanding whether an unfinished title still holds interest. To address this, we added recency as a companion feature — measuring how long it has been since the user last interacted with a title. This signal helps differentiate between “recently paused” content (still top of mind) and “long-dormant” titles (possibly forgotten).

Let:

  • t_last : timestamp of the user’s most recent interaction with a title
  • t_now : timestamp at the time of scoring

Then the Recency feature is defined as:

Recency = (t_now − t_last) in days

We extended our historical window to capture long-term viewing patterns and meaningful continuation intent, while maintaining computational feasibility.
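Expressed directly (timestamps assumed to be timezone-aware datetimes):

```python
from datetime import datetime, timezone

def recency_days(t_last: datetime, t_now: datetime) -> float:
    # Recency = (t_now - t_last) in days since the last interaction.
    return (t_now - t_last).total_seconds() / 86400.0
```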

Integrating Unfinished Signals into the Model

With unfinished signals defined, the next question was how to make the model use them effectively. We explored three progressively enhanced configurations, each extending the same base architecture with different ways of representing and leveraging unfinished signals.

Figure 1. Ranker architecture integrating unfinished signals across three configurations. The first adds lightweight numeric features, the second uses a richer learned encoding of those signals, and the third aggregates longer term history in a continuation aware way to emphasize stronger resume intent.

T1: Numeric Features as Lightweight Memory

In this configuration, the model enriches each candidate with two lightweight signals derived from the user’s longer term viewing history:

  • a measure of how much of a prior interaction was completed
  • a measure of how recent that interaction was

When the history does not contain a relevant interaction for a given candidate, the model falls back to a standard missing value representation. These features are then added directly into the ranker’s feature set.

Even this basic version produced measurable lifts in offline ranking metrics such as NDCG at K, which was an early indicator that unfinished viewing behavior can capture longer term intent beyond short session signals.

T2: Learned Representations to Capture Nonlinear Patterns

Next, we replaced raw numeric inputs with a learned representation that gives the model more expressive power. This allowed it to capture nonlinear relationships in unfinished signals, for example distinguishing between a title a user paused recently and a title they finished long ago. Offline results were broadly similar to the first configuration, but this version behaved more consistently under live traffic, suggesting better generalization under real world distribution shifts.

T3: Completion Aware Pooling over Longer Term History

Finally, we explored a richer history representation that aggregates past interactions rather than relying on a single lookup. Instead of using a single pair of signals, this approach constructs a compact summary of longer term behavior, where partially completed interactions can contribute differently from fully completed ones. This allows the model to emphasize stronger intent signals while still preserving long range preferences.
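One way to realize such completion-aware pooling is a weighted average over history embeddings in which partially finished titles are up-weighted. The specific weighting below peaks at a completion ratio of 0.5 and is purely illustrative, not Tubi’s production formula:

```python
import numpy as np

def completion_aware_pool(embeddings: np.ndarray, completions: np.ndarray) -> np.ndarray:
    # embeddings: (N, D) vectors for N history items; completions in [0, 1].
    # Weight w = 1 + 4c(1 - c) peaks for half-finished titles, keeping
    # strong continuation candidates salient without discarding the rest.
    w = 1.0 + 4.0 * completions * (1.0 - completions)
    w = w / w.sum()
    return w @ embeddings
```

The pooled vector then stands in for the user’s longer term history in the ranker.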

Comparative Offline Results

Across all three configurations, we observed consistent improvements in both offline and online evaluations.

Figure 2. Offline evaluation results across T1–T3. Users with more unfinished titles in their watch history show larger gains in NDCG.

After integrating unfinished-watching signals, these titles are ranked higher in the recommendation lists, confirming that the model became more sensitive to continuation intent as expected.

Figure 3. Ranking-position shift of unfinished titles after unfinished-signal integration. Unfinished titles appear higher in recommendation lists, reflecting stronger continuation intent modeling.

Extending to the Retrieval Model

Beyond the ranking model, we also extended the same idea to the retrieval model. This stage is responsible for selecting candidate titles before ranking. By incorporating unfinished-watching signals into this earlier stage, we allowed the system to resurface users’ long-gap or unfinished titles as additional recall candidates.

The change was simple but effective: each user’s unfinished titles from the past several years were explicitly added to the recall pool. This ensured that returning series, sequels, or movies the user had partially watched were not overlooked by the retrieval model.
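The adjustment can be sketched in a few lines (function and title names are hypothetical): union the user's unfinished titles into the retrieved candidate set without duplicating what the retriever already surfaced.

```python
def augment_candidates(retrieved, unfinished_history, max_extra=50):
    """Add a user's unfinished titles to the recall pool, preserving order
    and skipping titles the retrieval model already returned."""
    seen = set(retrieved)
    extras = [t for t in unfinished_history if t not in seen][:max_extra]
    return retrieved + extras

retrieved = ["new_movie_1", "new_series_2"]
unfinished = ["old_series_s1", "new_movie_1", "paused_film"]
print(augment_candidates(retrieved, unfinished))
```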

Offline tests showed that this adjustment significantly improved recall coverage, and online experiments confirmed a measurable lift.

Together, these enhancements made the overall recommendation pipeline more aware of long-term unfinished content — able to remember and resurface what users cared about, even after long viewing gaps.

Online Performance and System-Level Impact

When deployed in production, the combined improvements from the ranker (T1–T3) and the retrieval model produced a substantial step forward in personalization quality and user engagement. Each component added value on its own, but the real impact appeared when they worked together, creating a lift that exceeded the sum of their individual contributions. In online A/B testing, this unified approach resulted in nearly a 1% increase in total viewing time (TVT), demonstrating the system-level strength of reinforcing unfinished-watching intent throughout the recommendation pipeline.

The online results also highlighted an important principle: long-term memory plays a critical role not only within individual models but across the entire recommendation stack. By incorporating unfinished-watching signals in both retrieval and ranking, the system developed a more complete understanding of each user’s viewing journey. It could retain context about what users had started, identify when that interest became relevant again, and bring those titles back at the right moment. What began as a small modeling enhancement ultimately evolved into a system-wide capability that made our recommendations feel more intuitive, more personal, and more aligned with real user behavior.

Conclusion: Remembering What Matters

In streaming, engagement isn’t just about helping users find something new — it’s also about helping them continue what they already love.

By teaching our recommendation system to recognize and remember unfinished watching behaviors, we made Tubi’s recommendations more human and more intuitive.

This project reminded us that even in highly optimized systems, sometimes the most valuable signals are the ones we overlook — the quiet, unfinished stories waiting to be continued.

Thank you for reading!
If you’re interested in recommendation systems, personalization, or large-scale machine learning, follow the Tubi Tech Blog for more behind-the-scenes insights from our data science and engineering teams.


Beyond Discovery: Teaching Recommenders to Resume the Story was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[The Evolution of Service Mesh in the Cloud Native Era: Tubi’s Journey]]> https://code.tubitv.com/the-evolution-of-service-mesh-in-the-cloud-native-era-tubis-journey-d2d6b8136c76?source=rss----644fdce427d3---4 https://medium.com/p/d2d6b8136c76 Fri, 26 Sep 2025 01:55:02 GMT 2025-09-26T01:54:58.845Z

Zhiquan and Dong

As cloud-native technologies mature, Kubernetes has become the de facto standard for deploying and managing modern applications, thanks to its powerful features and ease of use. A growing number of technology companies have gradually transitioned their infrastructure from traditional bare metal to being fully containerized. This transformation has introduced numerous challenges, such as container networking, distributed storage, and security permission management. Among these, the evolution of inter-service traffic management is particularly noteworthy, giving rise to the concept of the “Service Mesh.”

What is a Service Mesh?

A Service Mesh is an infrastructure layer dedicated to handling service-to-service communication. It is responsible for reliably delivering requests across the complex topology of services that constitute a modern cloud-native application. In practice, a Service Mesh is typically implemented as an array of lightweight network proxies deployed alongside application code, without the application needing to be aware of the proxy’s existence.

The microservices architecture has led to a finer granularity of services, which in turn brings challenges like the unified management of cross-service capabilities such as circuit breaking, rate limiting, and security controls. If each service were to implement these features within its business logic, the work would not only be repetitive but also difficult to maintain. Therefore, abstracting the service communication logic from the business logic into a unified service mesh has become a core requirement in a microservices architecture.

As shown in the diagram, a Service Mesh primarily consists of a control plane and a data plane. The control plane is responsible for generating and distributing policy configurations, while the data plane is composed of proxies that directly handle traffic flowing in and out of services. In the early era of bare-metal deployments, the data plane could run as a host-level service. In a Kubernetes environment, however, the Sidecar pattern has become the best practice, allowing it to coexist with the business container.

Tubi has progressively transitioned from traditional bare-metal infrastructure to a fully cloud-native architecture, leveraging technologies such as containerization, Kubernetes orchestration, and CI/CD automation. Alongside this shift, the service mesh has evolved into a foundational layer of the platform. Starting from simple static configurations and custom service discovery, Tubi adopted modern service mesh solutions built on Envoy and Istio to enable centralized traffic control, deep observability, and progressive delivery.

In the following sections, we provide a detailed review and analysis of this architectural evolution.

Part 1: Pre-Kubernetes (Service Mesh 1.0)

Pre-Kubernetes Stage: The First-Generation Service Mesh

Before adopting Kubernetes, Tubi used Terraform to create servers on AWS and deployed services with Ansible. We used Consul for service discovery and built our first-generation in-house mesh architecture with Envoy. Although it provided basic capabilities, this architecture lacked the opinionated scaling models, tailored configurations, and comprehensive tooling ecosystem that enjoy a wealth of support in the Kubernetes world.

Starting in 2020, Tubi began to progressively migrate its services to Kubernetes.

Service Mesh 1.0: Envoy + In-House Control Plane

Within Tubi’s technical framework, we planned a gradual evolution from bare-metal deployments to a Kubernetes cluster, supplemented by cloud-native CI/CD and other peripheral components. Before and during this cloud-native transition, we used a custom service mesh based on Envoy. The specific architecture was as follows:

As the diagram illustrates, the service mesh at this stage was composed of three parts:

  1. CPS (Control-Plane-Server): Acting as the control plane for the entire service mesh, it generated corresponding Envoy configurations based on configuration files and sent them down to the data plane via xDS requests. The data plane consisted of Envoy instances co-located with the services.
  2. Envoy: As the data plane, it received configuration files from the control plane, generated the corresponding ingress and egress listeners, proxied incoming and outgoing traffic for business services, and implemented features like circuit breaking, rate limiting, and observability.
  3. CPC (Control-Plane-Configuration): It generated control plane configuration files for each service based on templates and business-side settings, then submitted them to the CPS to build out the service mesh.

In this architecture, the data plane (Envoy) could run either as a standalone service instance on bare metal or be patched as a sidecar container into the corresponding business pod. Developers would write control plane configuration files based on service characteristics (e.g., port, request volume, circuit breaker thresholds). These files were then processed by auxiliary tools to generate standard Envoy configurations, which were applied to the control plane and distributed to the corresponding service instances or pods, completing the service mesh.

Under this design, we established a convention: the data plane Envoy would listen on a specific port (defaulting to 8001) and proxy all ingress traffic, routing it to different applications based on the Host header. Similarly, all egress traffic was proxied through a corresponding port and forwarded to different external services.
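As a rough illustration of that convention (ports, domains, and cluster names here are hypothetical, not Tubi's actual configuration), a sidecar Envoy ingress listener routing by Host header might look like:

```yaml
# Sketch only: one ingress listener on port 8001 that routes to different
# local applications based on the Host header of the incoming request.
static_resources:
  listeners:
    - address:
        socket_address: { address: 0.0.0.0, port_value: 8001 }
      filter_chains:
        - filters:
            - name: envoy.filters.network.http_connection_manager
              typed_config:
                "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
                stat_prefix: ingress
                route_config:
                  virtual_hosts:
                    - name: service-a
                      domains: ["service-a.example.com"]  # matched against Host
                      routes:
                        - match: { prefix: "/" }
                          route: { cluster: service_a }
```

Egress traffic followed the same pattern in reverse: a local listener per external destination, forwarding to the corresponding upstream cluster.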

As Kubernetes clusters were rolled out at Tubi, we began to encounter several pressing issues:

  • Version Upgrades: The Envoy version was outdated, preventing us from using new features.
  • Maintenance Costs: Due to personnel changes, the maintenance cost of some components gradually increased, and some issues could not be resolved by relying on the experience of the open-source community.
  • Security: We expected a new service mesh architecture to deliver security-focused patches and adopt new features in line with evolving industry security standards — something our aging setup could not provide.

For these reasons, we began exploring new service mesh components to replace our existing setup.

Part 2: Evaluating Istio vs Others (Service Mesh 2.0)

Technology Selection Comparison

Before deciding which tool to use, we first needed to clarify our requirements. We expected the following features:

Traffic Management: Improve flexibility in traffic shaping

  • Service Discovery
  • Custom URL Routing
  • Topology-aware Routing
  • Ingress & Egress Gateways

Reliability & Performance: New controls to improve stability and reduce latency

  • Load Balancers & Connection Pooling
  • Retries, Timeouts, and Rate Limits
  • Compression & Caching
  • Circuit Breakers & Fault Injection

Observability: Standardize and automate the collection of data

  • Latency, Error rates, Connection Count
  • Standardized Logging
  • Distributed Tracing

Security: Standardize and Automate Security Requirements

  • mTLS Encryption
  • Authorization Policy CRDs
  • Certificate management

At the time, there were many service mesh solutions available, both from the open-source community and as cloud services from major providers. Some of the more common options included:

  • Istio
  • Kuma
  • Cilium
  • Linkerd
  • AWS App Mesh
  • NGINX Service Mesh

Below is a feature list for these products (note: this information is from 2 years ago).

In addition, we conducted some benchmark tests on a few of the more feature-rich tools. We ran the benchmark with the tool Istio provides against different mesh projects. The test involved one client pod connecting to a server pod running on a different node.

  • Kuma had the lowest latency, and its resource consumption was lower.
  • Istio was slightly behind Kuma, but the two were close; the latencies we measured were aggregated over 1,000 requests. Since the data planes of Istio and Kuma are both Envoy, most of the difference likely comes from their different filter chains in Envoy, which can be optimized in actual use cases.
  • Cilium’s latency was slightly better at the P90 level.

For the three products that performed well in the benchmark — Istio, Kuma, and Cilium — their community support was as follows:

Istio

  • Active maintainers are from many well-known companies, e.g., Google, IBM.
  • The roadmap is decided within the community. Platform Independent.

Kuma

  • Active maintainers are from Kong.
  • The roadmap might be heavily coupled with products from Kong, like Kong Gateway.

Cilium

  • Actively maintained by Isovalent.
  • The roadmap is decided within the community. Platform Independent.

The technology stacks for these three tools are as follows:

Istio

  • Sidecar is Envoy
  • Developed with Golang

Kuma

  • Sidecar is Envoy
  • Developed with Golang

Cilium

  • Sidecar-free mode, but it also uses Envoy/eBPF at the node level
  • Developed with Golang

After the above comparison, we ultimately decided to use Istio for the following reasons:

  • Stable, rich, and extensible features
  • Active community
  • Platform independent
  • Good performance

A particularly important factor was its use of the Envoy sidecar. This aligned well with our previous mesh technology stack. Our developers were already familiar with Envoy, and many of the monitors and dashboards we had built based on Envoy metrics could continue to be used.

Considering Migration Effort and Future Maintenance Costs

Because Istio mesh also uses Envoy as its proxy, the migration was effectively a transition from a manually maintained Envoy sidecar to an istiod-injected Envoy (istio-proxy) sidecar. Although both are Envoy, their operational models are quite different.

  • Istio uses a mutating webhook to inject the sidecar into the pod. There is no longer a need to maintain the sidecar manifest yourself.
  • Istio comes with its own control plane, istiod, eliminating the need to create and maintain a separate control plane. By default, it pushes all services in the cluster to the sidecar, so you no longer need to manually maintain these Envoy configurations.
  • istio-proxy automatically takes over all traffic in the pod by default, making the entire process transparent to the user. Users do not need to explicitly set the service port to the port listened to by Envoy.
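For example, opting a namespace into the mesh is just a label; istiod's mutating webhook then injects the istio-proxy sidecar into every new pod in it (the namespace name below is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    istio-injection: enabled   # istiod injects istio-proxy into new pods here
```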

As you can see, after the migration, developers’ daily development and maintenance work would become simpler. They could essentially ignore the existence of istio-proxy and automatically benefit from its conveniences.

Part 3: Migration Challenges & Lessons

Performance and Latency Issues

Since we were, in principle, switching from one Envoy sidecar to another (istio-proxy), everyone expected their performance to be comparable. However, when we actually started using it, we found that this wasn’t the case: the performance of istio-proxy was slightly worse than our previously used Envoy. Our research showed that Istio takes over all of a pod’s traffic through iptables rules, including traffic to external dependencies such as PostgreSQL, Redis, and message queues. To avoid an “apples to oranges” comparison, we decided to verify the difference using A/B testing. The method was to deploy two Deployments in the cluster simultaneously, both with the same app label. We then used this label in the Service to select pods, meaning both Deployments served the same Service at the same time; when a request came in, Kubernetes would distribute it randomly among the pods of both Deployments. During the comparison, we found we also needed to eliminate interference from the following factors:

  • Ensuring that istio-proxy had sufficient CPU and memory, with no obvious throttling.
  • Configuring RDS traffic to bypass istio-proxy. Our traditional Envoy sidecar did not handle RDS traffic.
  • Ensuring that the nodes where the pods of both deployments were located had comparable performance. For example, we observed some minor performance differences between pods on c6i and c5 nodes. There were even some differences between different instance sizes within the c6i series.

After our tests, the general conclusion was that switching to istio-proxy added 5–10 ms of latency for most of our services.

Headless Service Compatibility Issues

Why We Used Headless Services

A Kubernetes Headless Service is a special type of Service that does not assign a Cluster IP. Instead of load-balancing traffic to a set of pods, it allows clients to discover and connect directly to individual pod IPs. You can refer to the official Kubernetes documentation for more details.

The reason Tubi used Headless Services goes back to our use of the Envoy sidecar. Envoy supports health checking for the endpoints of an upstream cluster. When we used our Envoy sidecar, the upstream cluster was generally configured manually. We would configure the service DNS as the address of the upstream cluster and, in conjunction with Consul (remember our pre-Kubernetes setup?), use the resolved IPs as endpoints and perform health checks on them. Based on this use case, we implemented the same functionality in Kubernetes using Headless Services.

Why Headless Services are a Problem in Istio

In Istio, clusters are distributed to each istio-proxy via xDS, and the endpoint information for each cluster is sent to istio-proxy via EDS. This means the Envoy used by istio-proxy no longer relies on DNS for service discovery, nor does it need to perform health checks on endpoints. Istio has good support for non-headless Services, but its support for Headless Services has some problems, with several related issues filed in the community, such as:

  • During application deployments, after a pod’s rolling update, the old pod’s IP might still exist in the corresponding endpoints, causing access failures.
  • Lack of support for some advanced features, such as locality-based traffic distribution.

By setting PILOT_ENABLE_EDS_FOR_HEADLESS_SERVICES for Pilot, you can have the endpoints of headless services also distributed to istio-proxy via EDS. However, at the time, we judged this feature to be experimental and that it could potentially cause other unknown problems. Therefore, we decided to switch the service type to completely resolve this issue.

So, we needed to switch the type of service we were using from Headless to ClusterIP.

Resource Consumption Issues

Istio-proxy Container OOM Problems

Another problem we encountered during the migration was resource consumption. When we used our self-configured Envoy sidecar, we configured the clusters ourselves. Istio, however, defaults to pushing every service in the cluster as a cluster configuration to the istio-proxy sidecar. Our cluster had many services, which caused istio-proxy to use much more memory than before. The solution was one, or a combination, of the following methods:

  • Increase the Memory request/limit.
  • Use the Sidecar resource to limit the number of clusters distributed.

After comparison, we found that option one would lead to ever-increasing resource consumption as the cluster scaled, and synchronizing a large number of Envoy configurations across the cluster would also create a network performance bottleneck. Therefore, option two was more suitable in our scenario.
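A minimal Sidecar resource for option two might look like the following sketch (namespace and hosts are illustrative): it restricts the clusters pushed to istio-proxy to same-namespace services plus the control-plane namespace, instead of every service in the mesh.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Sidecar
metadata:
  name: default
  namespace: my-app        # hypothetical application namespace
spec:
  egress:
    - hosts:
        - "./*"            # only services in the same namespace
        - "istio-system/*" # plus control-plane / telemetry services
```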

Taking one application as an example, before enabling the Sidecar resource, the memory consumption of the istio-proxy container could reach 1 GiB. After enabling it, the memory usage quickly dropped and stabilized at 220 MiB.

CPU Throttle Issues

This problem stems mainly from the working principle of istio-proxy. The Istio sidecar intercepts all TCP traffic of a pod by default. This means that requests to some RDS instances and other external services also pass through the sidecar. Before using Istio, our Envoy sidecar only processed requests to internal services. This difference meant the new sidecar needed to handle more traffic, leading to higher CPU usage. There were also two solutions here, to be used alone or together:

  1. Increase the CPU request/limit with annotations.
  2. Configure traffic to RDS and other non-cluster services to bypass the istio-proxy. Using an annotation like traffic.sidecar.istio.io/excludeOutboundPorts, requests to certain ports can be excluded from istio-proxy’s handling.
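Both options can be expressed as pod-template annotations; a sketch with illustrative values (the port and CPU figures are examples, not our production settings):

```yaml
metadata:
  annotations:
    # Option 1: give istio-proxy more CPU.
    sidecar.istio.io/proxyCPU: "500m"
    sidecar.istio.io/proxyCPULimit: "2"
    # Option 2: let traffic to RDS (e.g. PostgreSQL on 5432) bypass the sidecar.
    traffic.sidecar.istio.io/excludeOutboundPorts: "5432"
```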

Both solutions were simple to implement. However, option one would lead to resource redundancy, while option two would result in the loss of some metrics and traffic management features. Therefore, we left the decision to the developers, allowing them to choose the solution that best fits their service’s characteristics and needs.

In real-world operations, we observed that the resource consumption of pilot/istiod increases significantly with cluster scale. Under high-concurrency scenarios and frequent cross-namespace interactions, istiod had to be scaled up to 10 CPUs and managed with HPA to maintain stability. This illustrates that the control plane itself faces saturation and optimization challenges, with performance bottlenecks closely tied not only to the scale of xDS configurations but also to cross-namespace service access patterns. These lessons directly influenced the evolution toward Ambient mode — which redefines the scope of the mesh from the “sum of all applications” to the “cluster level,” thereby alleviating control plane pressure and greatly simplifying performance optimization.

Locality Load Balancing and Pod Distribution Optimization

Locality Load Balancing has different names in different technical contexts, but they all refer to roughly the same concept:

  • In Istio, it’s called Locality Load Balancing.
  • A similar feature in Kubernetes is called Topology-aware Routing.
  • In Envoy, a similar feature is called Zone-aware routing.

Why Use This Feature?

This is a feature of Istio (and Envoy) that we were very keen to use. By using it, traffic can be kept within the same availability zone. This can:

  • Slightly reduce latency.
  • Reduce cross-zone traffic. As I understand it, some cloud providers do not charge for cross-zone traffic, but AWS does, and this cost is not insignificant for us each month. So, reducing this traffic helps us save some money.

Enabling this feature in Istio is very simple. However, right after enabling it, we immediately ran into the problem of unbalanced load due to uneven traffic distribution. This might sound a bit abstract, so let’s illustrate with an example.

Suppose service A has a total of 10 pods and it needs to get some information from service B, which also has 10 pods. Without locality load balancing enabled, requests from each of A’s pods are evenly distributed across B’s pods. Thus, each pod of B receives 1/10 = 0.1 of the total requests from A.

After enabling locality load balancing, the requests received by B’s pods depend on the distribution of pods between zones. Let’s assume there are two zones, az1 and az2: A has 3 pods in az1 and 7 in az2, while B has 6 pods in az1 and 4 in az2.

Here you can see that because the pod distribution is uneven, each pod of B in az2 must handle traffic from the 7 pods of A in az2. This means each of B’s pods in az2 handles (7/10) / 4 = 0.175 of the total traffic — 1.75 times the load of the scenario without locality load balancing. This traffic imbalance leads to several problems:
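The arithmetic can be reproduced directly (pod counts per zone are the assumed example distribution):

```python
# Assumed example distribution: A: 3 pods in az1, 7 in az2; B: 6 in az1, 4 in az2.
a_pods = {"az1": 3, "az2": 7}
b_pods = {"az1": 6, "az2": 4}
total_a = sum(a_pods.values())

# Without locality LB: A's traffic spreads evenly over all of B's pods.
baseline = 1.0 / sum(b_pods.values())                  # 0.1 of total per B pod

# With locality LB: az2's share of A's traffic splits among B's az2 pods only.
az2_share = (a_pods["az2"] / total_a) / b_pods["az2"]  # 0.175 of total per B pod

print(az2_share / baseline)   # ~1.75x the baseline load
```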

  • Pod requests are difficult to configure. If you configure them based on the high-traffic pods, the low-traffic pods will waste resources.
  • HPA targets are difficult to configure. Since HPA is usually based on the average CPU usage of all pods, load imbalance can mean that some pods are already under high load, but the average CPU usage has not yet reached the HPA target.

To successfully use locality load balancing, we first need to ensure that pods are evenly distributed across zones. For pods to be evenly distributed, the nodes must first be evenly distributed across zones. This is not easy to solve, as the ability to create nodes in the desired zone also depends on the cloud provider’s capacity. Different cloud services will behave differently in different regions. This article does not intend to delve into this; we will assume that nodes are relatively evenly distributed.

Pod Topology Spread Constraints (TSC)

By default, the Kubernetes scheduler tries to distribute a deployment’s pods as evenly as possible across the cluster. Please refer to the official documentation. If you feel this configuration is not enough, you can also define cluster-level TSCs.

Additionally, as mentioned in that article, you can define Pod-level TSCs.
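A pod-level TSC is a short addition to the pod spec; a sketch with a hypothetical app label, spreading pods across availability zones with a maximum skew of 1:

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway  # DoNotSchedule enforces strictly but can block scaling
      labelSelector:
        matchLabels:
          app: my-service
```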

So, does defining TSCs solve the distribution problem?

  • Not entirely. Unless whenUnsatisfiable is set to DoNotSchedule, new pods may still be placed unevenly across zones when node resources are insufficient. If DoNotSchedule is used, you risk sacrificing immediate scaling capability.
  • Because TSC only takes effect when a new pod is scheduled, an uneven distribution can still occur after a service scales down.

Descheduler

Based on the above discussion, we need a controller that can continuously check the pods’ TSC configuration and keep the pods evenly distributed. This is where the Descheduler comes in. Before enabling a Descheduler policy for applications, the best practice is to enable a PDB (Pod Disruption Budget) to ensure that the application will not be interrupted by evictions.

The Descheduler supports many features, but we will only discuss its RemovePodsViolatingTopologySpreadConstraint plugin. This plugin’s logic is quite simple: it periodically checks for pods that violate TSC configurations. If it finds any, it evicts them, causing them to be recreated. This allows them to be rescheduled onto more appropriate new nodes.
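A Descheduler policy enabling this plugin might look like the following sketch (written against the v1alpha2 policy format; check the docs for your Descheduler version):

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: topology-spread
    pluginConfig:
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          constraints:
            - DoNotSchedule  # which TSC violations trigger eviction
    plugins:
      balance:
        enabled:
          - "RemovePodsViolatingTopologySpreadConstraint"
```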

Conclusion

Once we took all these factors into account, we could generally ensure that a service’s pods would be mostly evenly distributed across the cluster. Achieving a perfectly even distribution is understandably difficult, and a service with a small number of pods is usually very sensitive to any imbalance. It is therefore generally recommended to enable locality load balancing only for services with a sufficiently large number of pods (e.g., more than 30). This way, even if some imbalance still occurs, its impact will be relatively small.

Part 4: What’s Next

Support Multi-Cluster Architecture

As Tubi’s business grows, our current single, gigantic cluster model is gradually becoming obsolete. A single cluster lacks service-to-service isolation, posing challenges to permission and security management and making it difficult to comply with certain legal and regulatory requirements. In the future, we hope to expand to a multi-cluster model. In a multi-cluster setup, as the number of maintained clusters and the complexity of inter-service calls grow, the need for cross-cluster service communication and traffic management becomes more frequent and important. Istio’s multi-cluster mode is exactly what we need. By deploying an Istio service mesh across multiple Kubernetes clusters, they can work together as a single, unified mesh. The multi-cluster mode can provide an enhanced inter-cluster service mesh, for example:

  • Services can be discovered and accessed across clusters as if they were in the same cluster.
  • Unified traffic management and security policies, regardless of which cluster a service is in.
  • Improved availability: if one cluster has an issue, traffic can be shifted to another.
  • Support for multi-cloud or hybrid cloud deployments for more flexible resource utilization.

Therefore, we are investigating Istio’s various multi-cluster models, such as the shared control plane and primary-remote models, to build a more resilient, secure, efficient, and highly available service mesh.

Explore Ambient Mesh Mode

Istio’s Ambient Mesh mode is a lightweight service mesh architecture designed to reduce intrusiveness on applications while improving performance and security, aimed at solving performance and maintainability issues in large-scale deployments. In the traditional Istio model, each service instance requires a sidecar proxy for traffic management and security control. In Ambient Mesh mode, the strong dependency on sidecars is removed: a shared proxy (ztunnel) runs on each node to handle L4 traffic, and centralized components (waypoint proxies) handle L7 traffic in the mesh.

This mode will bring us:

  • Lower resource overhead: No need to start a sidecar for every pod.
  • Simpler operations: Reduce the complexity of deployment and upgrades.
  • Stronger observability and security: Still retains the traffic management, security controls, and telemetry features provided by Istio.

We have some reservations about this model. For example, in a real environment, the services running on each node will be different, so their traffic handling pressures will also differ. Allocating resources for this node-level agent will be a difficult problem to solve. We are currently conducting some tests and solution research.

Epilogue

As of this writing, Tubi has fully completed the migration of its service mesh from an in-house architecture to Istio, significantly enhancing our service governance capabilities, observability, and system stability. Throughout this process, the Infra team ensured a smooth migration through comprehensive documentation, processes, and collaboration with developers. We have accumulated valuable engineering experience, laying a solid foundation for the future evolution towards a multi-cluster architecture.


The Evolution of Service Mesh in the Cloud Native Era: Tubi’s Journey was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Tubi’s VOD Preflight QC: Automated Quality Control of Submitted Content]]> https://code.tubitv.com/tubis-vod-preflight-qc-automated-quality-control-of-submitted-content-f147263eb8a7?source=rss----644fdce427d3---4 https://medium.com/p/f147263eb8a7 Wed, 03 Sep 2025 00:07:22 GMT 2025-09-03T00:07:19.761Z

Tubi offers the largest collection of premium on-demand content, including over 300,000 movies and TV episodes and more than 400 exclusive originals. To keep this vast library growing and up to date, we continuously ingest video files from a wide network of content partners.

But not all incoming videos are created equal. Some video files provided by content partners may not meet Tubi’s submission guidelines. Such problem videos can slow down our processing pipeline, increase operational costs, and — if left unchecked — result in broken playback experiences for viewers. Several things can go wrong:

  • Parsing, transcoding, or packaging the video might fail.
  • Processing might succeed only to have Tubi’s Content QA team discover a problem, which would then need to be manually fixed. Once fixed, processing must start again from the beginning.
  • In the worst case, the problem video goes to production and is seen by the end viewer.

Tubi’s VOD Preflight QC project is intended to evaluate media files up front in order to flag problems. It is designed to

  • catch problems using a fast process before sending videos to much slower, more costly full processing,
  • give Tubi’s Content Processing team actionable information they can use for manual fixes, and
  • lay the groundwork for future automated fixes.

Note that Preflight QC checks only for audio and video issues; other potential content-delivery issues fall outside its scope.

Currently Supported Media Inspections

Preflight QC performs the following inspections on submitted media:

  • Is the container format supported?
  • Are the video and audio codecs supported?
  • Do the audio and video track durations match (within a threshold)?
  • Does the file meet Tubi’s minimum duration requirements for long-form content?
  • For MP4 and MOV files, can the file be parsed without error?
  • Is there one, and only one, video track?
  • Is the video bitrate sufficient for a good-quality video?
  • Is the video framerate constant, and is it one of the supported framerates (23.976, 24, 25, 29.97, 30, 50, 59.94, 60 fps)?
  • Is the video resolution at least standard definition (SD) or higher?
  • Are there any undesirable colorbars, counting leaders, production slates, or timecode at the beginning or end of the video?
  • Are there undesirable test tones at the beginning or end of the audio?
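Several of these checks reduce to simple metadata comparisons. As a rough sketch of the duration-match inspection (in Python, with a hypothetical threshold; Zeus' actual threshold is not stated here):

```python
def durations_match(video_s: float, audio_s: float, threshold_s: float = 0.5) -> bool:
    """Do the audio and video track durations agree within a threshold?
    The 0.5s default here is illustrative, not Zeus' actual value."""
    return abs(video_s - audio_s) <= threshold_s

# A 0.2s drift passes the inspection; a 2s drift fails it.
assert durations_match(3600.0, 3600.2)
assert not durations_match(3600.0, 3598.0)
```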

Preflight QC is a Feature of Tubi’s Zeus

Zeus is Tubi’s media processing system. It is directly responsible for parsing, analyzing, transcoding, and packaging videos that will be streamed to viewers.

The Zeus QC job was created to check for problems with transcode and packaging output, but is also the basis for the new Preflight inspections. For Preflight QC, we wrote additional inspections applicable to partner source videos.

The QC job reads in a media file and runs Zeus MediaParse to gather metadata.

Zeus uses a pipeline of media filters and codecs to process audio and video. The filters may be LibAV filters, or may be custom Zeus filters. In the case of Zeus MediaParse, the pipeline uses packet logger filters to extract metadata describing the video and audio.

The gathered metadata is then evaluated in a series of inspections. Each inspection will seek out the metadata it needs to look for issues.

Zeus QC Results

QC Results output include a list of inspections run for each source file. Each inspection result includes

  • Inspection category (e.g., container, video, audio)
  • Inspection name
  • Required (boolean)
  • Test rule: an explanation of the rule being tested
  • Status (pass, fail, warning)
  • Notes: if the inspection failed, a detailed explanation of why it failed and, if appropriate, the specific media time when the problem occurred
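Put together, a single inspection result might look like the following. This is a hypothetical entry: the field names and values are illustrative, not Zeus' exact schema.

```python
# A hypothetical QC result entry carrying the fields listed above.
result = {
    "category": "video",                       # inspection category
    "name": "video_no_colorbars",              # inspection name
    "required": True,                          # required (boolean)
    "test_rule": "No colorbars at the beginning or end of the video.",
    "status": "fail",                          # pass, fail, or warning
    "notes": "Colorbars detected from 00:00:00.000 to 00:00:05.120.",
}
assert result["status"] in ("pass", "fail", "warning")
```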

QC Inspections in Detail

Basic Media Info Inspections

Some inspections can be run using basic data read by libavformat. This data is similar to what you might get from running FFProbe or MediaInfo. These inspections are very fast, assuming the file is a header-based format such as .mov, .mp4, or .wav.

Colorbar Detection

The video_no_colorbars inspection can flag a variety of color-bar flavors including those pictured below.

The video_no_colorbars inspection detects colorbars using the following techniques:

  • A custom cascade classifier trained on images of various colorbars.
  • A high amount of vertical repetition.
  • Saturated color.

When colorbars are detected, the inspection fails and the beginning and end times are reported in the QC results.

Slate and Timecode Detection

Slate and timecode inspections both require OCR, so they use the same metadata. The image below left shows an example of a production slate. The image below right shows an example of a burned-in timecode.

The video_no_slate inspection flags production slates by performing OCR, looking for a list of words that would appear with production info. (e.g., fps, aspect, ratio, codec, runtime, etc.)

The video_no_timecode inspection performs OCR on video frames, then uses a regex to search for groups of digits separated by colons and semicolons.
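The digit-group search can be sketched with a short regular expression. This pattern is illustrative, not the exact regex Zeus uses:

```python
import re

# Illustrative pattern: groups of two digits separated by colons or
# semicolons, as in burned-in SMPTE timecode (drop-frame uses ';').
TIMECODE = re.compile(r"\b\d{2}[:;]\d{2}[:;]\d{2}[:;]\d{2}\b")

for text, expected in [
    ("01:02:03:04", True),        # non-drop-frame burn-in
    ("00:59:59;28", True),        # drop-frame variant
    ("Scene 12, Take 3", False),  # ordinary OCR text
]:
    assert bool(TIMECODE.search(text)) == expected
```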

Note: OCR is run once for both inspections. Additionally, OCR is performed on a lower resolution, lower frame rate version of the video to save processing time.

When production info and/or timecode are detected, the inspection fails and the beginning and end times are reported in the QC results.

Countdown Leader Detection

The video_no_countdown inspection can flag a variety of counting leader flavors including those pictured below.

The video_no_countdown inspection detects countdown leaders using the following techniques:

  • A custom cascade classifier trained on images of various counting leaders.
  • A check for low color saturation in frames, to rule out false positives.

When a countdown leader is detected, the inspection fails and the beginning and end times of the leader are reported in the QC results.

Test Tone Detection

The audio_no_tone inspection flags 440Hz and 1kHz test tones by running spectral analysis, looking for a prominent centroid at one of these frequencies. The image below shows what a tone might look like in a spectral analysis tool.

When tones are detected, the inspection fails and the beginning and end times are reported in the QC results.
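One lightweight way to check for a prominent tone at a known frequency is the Goertzel algorithm, which evaluates a single DFT bin. This is an illustrative sketch, not necessarily the analysis Zeus performs:

```python
import math

def goertzel_power(samples, sample_rate, freq):
    """Signal power at a single frequency, via the Goertzel algorithm."""
    n = len(samples)
    k = round(n * freq / sample_rate)            # nearest DFT bin
    coeff = 2 * math.cos(2 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

# A synthetic 1 kHz tone should dwarf the 440 Hz bin.
rate = 8000
tone = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(rate)]
p_1k = goertzel_power(tone, rate, 1000)
p_440 = goertzel_power(tone, rate, 440)
assert p_1k > 100 * p_440
```

In practice a real detector would also require the tone to persist for some minimum duration before flagging it.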

User Interface

The results of Preflight QC inspections can be displayed in a tabular format, similar to the image below.

Conclusion

VOD Preflight QC is a new Zeus feature designed to proactively catch content issues before they disrupt the processing pipeline. By automating quality assurance checks at the earliest stage, it helps prevent costly rework, protects the end-user experience, and allows our content processing team to operate with greater speed and confidence.

Preflight QC marks a significant step toward a more resilient and scalable content ingestion system at Tubi. As we continue to expand our catalog, the need for automation and reliability becomes even more critical — and tools like this are paving the way forward.


Tubi’s VOD Preflight QC: Automated Quality Control of Submitted Content was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Argos: a Data-Driven Distributed Load Test Tool that Powers Super Bowl]]> https://code.tubitv.com/argos-a-data-driven-distributed-load-test-tool-that-powers-super-bowl-47e549c24f87?source=rss----644fdce427d3---4 https://medium.com/p/47e549c24f87 Fri, 04 Jul 2025 01:48:42 GMT 2025-07-04T01:48:42.560Z

Tian Chen and Tristen Wen

Introduction

In the fast-paced world of digital streaming, where events like the Super Bowl attract millions of viewers in a matter of seconds, ensuring system reliability under extreme load is non-negotiable. Traditional load testing tools like Locust, k6, and Akamai CloudTest have served the industry well, but they often stumble when faced with modern demands: high costs for scaling, sluggish adaptability, and complex scripting requirements. To address these gaps, we developed Argos — a cutting-edge, data-driven load testing tool designed to deliver unparalleled scalability, simplicity, and performance, as demonstrated during the preparation for Super Bowl, one of the most demanding live events of the year.

The Genesis of Argos

The limitations of legacy tools became glaringly apparent when preparing for massive traffic spikes, such as those seen during the Super Bowl. Scaling traditional tools to simulate hundreds of thousands of requests per second (RPS) often meant spinning up costly clusters of workers, with expenses ballooning rapidly. Their architectures struggled to scale up — or down to zero — efficiently, often bottlenecked by centralized master nodes. Resource sharing across teams was nearly impossible, as clusters were locked to single tests. Writing and debugging test scripts in Python or JavaScript added overhead, and customization options for post-test analysis were limited.

Argos was born from this frustration, with a mission to redefine load testing. It had to be cost-effective, leveraging AWS Lambda to scale dynamically and pay only for what’s used. It needed to be simple — think launching a test with a curl command — and lightning-fast, spinning up or shutting down in seconds. Above all, it had to provide rich, actionable data, both in real-time and for post-test deep dives, to ensure systems could handle the intensity of events like the Super Bowl without breaking a sweat.

What Makes Argos Unique

Argos stands out by blending flexibility, scalability, and data-driven insights into a seamless package. Here’s what sets it apart:

  • Simplicity: Launch a test with a single curl command — no scripting languages or domain-specific syntax required. For advanced scenarios, Jinja2 templates provide dynamic request generation.
  • Scalability: Built on AWS Lambda, Argos spawns thousands of functions (we call them Arglets) to simulate massive traffic, scaling to zero when tests finish or are killed, keeping costs low.
  • Data Depth: Every test detail is captured — raw data stored in ClickHouse, aggregated via Materialized Views for real-time monitoring, and accessible through built-in visualizations (TUI/WebUI) or tools like Apache Superset.

Testing Single APIs and Complex Flows

Argos excels at testing both individual APIs and intricate flows of interconnected APIs, a capability that’s critical for real-world applications like streaming platforms or e-commerce systems.

  • Single API Testing: With Argos, you can load test a single endpoint by just putting the curl command in the test flow config.
  • Flow Testing: Where Argos truly shines is in testing sequences of APIs where each step depends on the previous one. Consider a user journey: [register, login, like_content]. During execution, each Arglet — the Lambda function running the test — remembers headers and response bodies from prior calls and uses them in Jinja2 templates for subsequent requests.

This flow testing capability eliminates the need for custom scripts to manage dependencies, making it ideal for various scenarios like user onboarding, or content interaction.

The Separation of Argos and Arglet

Argos is architecturally split into two distinct layers: the control plane and the data plane.

  • Argos (Control Plane): Argos is the brain of the operation. It manages test configurations (via files like argos.yml), spawns thousands of Arglets, and coordinates test execution. It’s lightweight, focusing on orchestration and result aggregation, ensuring the system scales without bottlenecks.
  • Arglet (Data Plane): Arglets are the workhorses — AWS Lambda functions that execute the actual test cases. Each Arglet runs independently, handling a portion of the test flow, processing requests, and returning results. With thousands of Arglets running in parallel, Argos can simulate millions of requests per second effortlessly.

This separation is key to Argos’ scalability. For instance, during one of many pre-Super Bowl load tests, Argos spawned over 6,000 Arglets to generate 1.5 million rps for analytics endpoints alone. The control plane remains agile, while the data plane absorbs the heavy lifting, leveraging Lambda’s elasticity to scale up or down instantly.

Here’s a simplified view of the architecture:

Users can run the Argos CLI on their laptops (requiring ClickHouse and a properly configured AWS environment). Based on the configuration, a fleet of Arglets will be spawned, with each assigned dedicated test cases.

Each Arglet executes its test cases independently, collects test results, and accumulates them in a channel before streaming the data back to the Argos CLI. Once received, Argos batches the data and inserts it into ClickHouse, triggering materialized views to aggregate per-second results. Finally, the visualization subsystem processes the aggregated results and renders them in either TUI or WebUI, as needed.

The Data Bus: The Heartbeat of Argos

At the core of Argos’ architecture lies the Data Bus, a high-throughput data layer that manages test results received from Arglets. It separates data production from data consumption, making Argos highly extensible. We use ClickHouse as the data bus for Argos. The Data Bus handles:

  • Raw result storage: Arglets stream raw test data (up to 20MB per response) to Argos, which then stores it via the Argos reporter every X ms (depending on config).
  • Aggregation: The Data Bus processes this data via materialized views, generating per-second aggregated results for real-time insights (e.g., visualization).

Here’s a visual representation:

How Argos Operates: A Step-by-Step Breakdown

Argos’ workflow is a symphony of distributed efficiency. When a user starts the Argos CLI with:

$ argos server start

Argos reads argos.yml, arglet.yml, and the associated test and context configurations to establish its core settings. This configuration-driven approach ensures maximum flexibility.

Next, Argos resolves all Jinja2 templates, dynamically generating the test configuration for each Arglet based on the provided parameters. Argos then spawns Arglets, passing them their respective test configurations as parameters.

Each Arglet parses its assigned configuration, independently executes its test cases, and sends the test results for each API execution back to Argos.

Now let’s look at the Argos config.

Argos config

The Argos configuration specifies key parameters, including:

  • The ClickHouse database location
  • The Lambda function to execute and the number of instances
  • The shared S3 bucket used by both Argos and Arglets

The shared S3 bucket plays a crucial role in Lambda function termination, which we will discuss in detail later.

---
name: my-test
users_per_worker: 20
port: 9999
clickhouse:
  url: http://localhost:8123
  database: argos
  username: default
  options:
    async_insert: 1
    wait_for_async_insert: 0
  process_interval: 250
lambda:
  function_name: arglet
  concurrency: 1000
  invoke_delay: 10
  regions: [us-west-1, us-west-2, us-east-1, us-east-2]
  profile: sandbox
s3:
  bucket: arglet-bucket
  profile: sandbox

Arglet config

The Arglet configuration specifies the test duration and the specific flows that each Arglet will execute.

---
duration: 1500
ramp_up_duration: 100
tear_down_duration: 100
send_interval: 250
error_samples:
  500: 2
  502: 10
  503: 500
  504: 100
follow_redirect: false
flows:
  - name: homescreen
    count: 1
    steps: [homescreen]
  - name: container_and_content
    count: 2
    # step interval in milliseconds
    interval: 2
    steps: [container, content_by_resp]

Test flow config

All APIs to test can be defined using cURL commands, with dynamic parameters formatted in Jinja2.

In Jinja2 templates, double curly braces ({{ … }}) are typically used for interpolation, enabling the embedding of dynamic content within templates. However, for placeholders that should not be immediately rendered — such as data derived from a previous response — we use double square brackets ([[ … ]]) to enable delayed template interpolation.

Homepage/categories example

## The homepage api response
{
  "categories": [
    "featured",
    "comedy",
    "action",
    ...
  ],
  ...
}

## Argos templates config
homescreen: |
  curl -X GET "https://{{api.domains|random}}/api/{{api.version|random}}/homepage" \
    -H "Authorization: Bearer {{user.token}}" \
    -H "User-Agent: {{ua[user.platform]}}"
container_by_resp: |
  # This will leverage the homescreen API response to retrieve the container_id
  curl -X GET "https://{{api.domains|random}}/api/{{api.version|random}}/categories/[[resp.body.categories|last]]" \
    -H "Authorization: Bearer {{user.token}}" \
    -H "User-Agent: {{ua[user.platform]}}"

User flow example

## templates:
user_register: |
  curl --location 'https://{{api.domains|random}}/api/v1/register' \
    --header 'Content-Type: application/json' \
    --data '{
      "email": "test-user-[[uuid()]]@example.com",
      "password": "strong-password",
      "name": "TestUser"
    }'

user_login: |
  curl --location 'https://{{api.domains|random}}/api/v1/login' \
    --header 'Content-Type: application/json' \
    --data '{
      "email": "[[resp.body.email]]",
      "password": "strong-password"
    }'

user_profile: |
  curl --location 'https://{{api.domains|random}}/api/v1/profile' \
    --header 'X-Load-Test: 1' \
    --header 'Authorization: Bearer [[resp.body.access_token]]'

## arglet:

- name: user_signup_flow
  count: 10
  steps: [user_register, user_login, user_profile]

Using template syntax like [[resp.body.email]] and [[resp.body.access_token]], Argos can easily extract fields from the response of a previous request and inject them into the next one.
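A minimal sketch of this delayed interpolation, assuming a two-pass renderer where `{{ … }}` is resolved up front and `[[resp.body.<field>]]` is filled in only after the previous step's response arrives (names here are illustrative, not Argos internals):

```python
import json
import re

# Placeholders of the form [[resp.body.<field>]] resolved from the
# previous step's response body.
DELAYED = re.compile(r"\[\[resp\.body\.(\w+)\]\]")

def render_delayed(template: str, prev_body: dict) -> str:
    """Replace [[resp.body.<field>]] placeholders with values from prev_body."""
    return DELAYED.sub(lambda m: str(prev_body[m.group(1)]), template)

# The login response feeds the next step's Authorization header.
login_resp = json.loads('{"email": "user@example.com", "access_token": "abc123"}')
step = "curl -H 'Authorization: Bearer [[resp.body.access_token]]' https://example.com/api/v1/profile"
print(render_delayed(step, login_resp))
```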

API context

The API context is a separate configuration used by APIs, independent of the API configuration itself. This separation ensures loose coupling, enabling multiple API configurations to share the same API context efficiently.

---
api:
version: [v3, v4]
domains:
- a.tubitv.com
- b.tubitv.com
ua:
iphone: Mozilla/5.0 (iPhone; ...) Mobile/15E148
users:
- id: 1
device_id: abcd
platform: iphone
token: <jwt>
...
- id: 2
...

These configurations are pretty simple and straightforward.

Raw Test Result and Its Aggregation

We captured the key information from the API response to describe a test result. Its data structure looks like this:

#[derive(Debug, Default, Serialize, Deserialize)]
#[cfg_attr(feature = "argos", derive(clickhouse::Row))]
pub struct TestResult {
    /// The name of the test (this is input when running `argos server start`).
    /// It is used for isolating different test runs.
    #[serde(default)]
    pub name: String,
    /// The timestamp of the test, in milliseconds.
    pub ts: i64,
    /// The unique ID of the test.
    pub id: String,
    /// The user ID.
    pub user_id: String,
    /// The endpoint name.
    pub endpoint: String,
    /// The status of the test.
    pub status: u16,
    /// The latency of the test, in milliseconds.
    pub latency: usize,
    /// The server type.
    pub server_type: ServerType,
    /// Detailed error message if any.
    pub error: String,
}

In the event of an API error, raw error samples are collected based on the error_samples configuration in arglet.yml (see the config above) and returned alongside the test results in the error field. This allows engineers to analyze the specific errors encountered during load testing. All test results are stored in a standard MergeTree table in ClickHouse, defined as follows:

-- raw data table
CREATE TABLE IF NOT EXISTS test_results (
    name String,
    ts DateTime64(3),
    id String,
    user_id String,
    endpoint String,
    status UInt16,
    latency UInt64,
    server_type Enum8('CloudFront' = 1, 'Akamai' = 2, 'Tubi' = 3, 'Other' = 4),
    error String
) ENGINE = MergeTree()
ORDER BY (name, ts);

Data Aggregation

We utilize materialized views to aggregate per-second results for real-time visualization and final reporting. Below is an example of a materialized view:

-- per-second ETL materialized view
CREATE TABLE IF NOT EXISTS test_results_per_sec (
    name String,
    ts DateTime64(3),
    total UInt64,
    status_2xx UInt64,
    status_4xx UInt64,
    status_5xx UInt64,
    max_latency SimpleAggregateFunction(max, UInt64),
    min_latency SimpleAggregateFunction(min, UInt64),
    avg_latency AggregateFunction(avg, UInt64),
    p90_latency AggregateFunction(quantile(0.90), UInt64),
    p99_latency AggregateFunction(quantile(0.99), UInt64),
    cloudfront UInt64,
    tubi UInt64,
    akamai UInt64
) ENGINE = SummingMergeTree()
ORDER BY (name, ts);

CREATE MATERIALIZED VIEW IF NOT EXISTS test_results_per_sec_mv TO test_results_per_sec AS
SELECT
    name,
    toStartOfInterval(ts, INTERVAL 1 SECOND) AS ts,
    count(*) AS total,
    countIf(status >= 200 AND status < 300) AS status_2xx,
    countIf(status >= 400 AND status < 500) AS status_4xx,
    countIf(status >= 500) AS status_5xx,
    maxSimpleState(latency) AS max_latency,
    minSimpleState(latency) AS min_latency,
    avgState(latency) AS avg_latency,
    quantileState(0.90)(latency) AS p90_latency,
    quantileState(0.99)(latency) AS p99_latency,
    countIf(server_type = 'CloudFront') AS cloudfront,
    countIf(server_type = 'Tubi') AS tubi,
    countIf(server_type = 'Akamai') AS akamai
FROM test_results
GROUP BY name, ts
ORDER BY name, ts;

ClickHouse is powerful enough to handle a wide range of data aggregation requirements, and it fits Argos’ needs well.

Besides the above-mentioned aggregation, we also provide several more:

  • test_results_per_sec: per-second standard aggregation
  • test_results_per_sec_status: per-second status-related aggregation
  • test_results_per_sec_endpoint: per-second endpoint-related aggregation
  • test_results_per_sec_server_type: per-second server-type-related aggregation

Argos users can also add their own data aggregations easily.

Visual Reporting

The aggregated results are then utilized by the visualization subsystem for real-time charts and for generating the final report once the test concludes.

Comparison with Locust

Argos runs workers (Arglets) on AWS Lambda, while Locust typically uses dedicated servers (e.g., EC2) with a master-worker model. This makes it hard to ensure identical network and runtime conditions across tools.

To demonstrate the performance of Argos, we conducted a simple comparison based on QPS (queries per second). We selected a straightforward resource located behind a CDN for testing, such as test-cdn.com/contents/1.

Locust

We used an EC2 cluster running in the us-east-2 region to execute the Locust load testing tool. The cluster configuration was as follows:

1 master node + 10 worker nodes (320 worker processes); each node was an m5.16xlarge EC2 instance (64 vCPUs and 256GB memory).

Screenshot

Argos

In Argos, concurrency is influenced by several factors: the number of Lambda functions, the number of users per function, and the number of tasks each user initiates. The “concurrent users” in the table below are represented in the format {lambda_function_count}-{users_per_worker}-{task_count_per_user}. We ran the Lambda functions only in the us-east-2 region, the same region as the Locust EC2 cluster; the results are as follows.

It is worth noting that AWS Lambda functions have a maximum execution time of 15 minutes. The cost mentioned above has been extrapolated to reflect the equivalent of a 1-hour cost.

Screenshot

QPS trends downward at the end of the run because the Lambda functions are being stopped.

Engineering Challenges and Solutions

Building Argos required carefully navigating several constraints imposed by AWS Lambda to ensure seamless operation under intense conditions. One significant challenge was managing data size, as Lambda imposes a hard limit of 20MB for streamed responses. To address this, we adopted postcard serialization, which reduced data size by 40% compared to JSON, and further optimized with flate2 compression, shrinking the serialized data to just 41.7% of its raw postcard size. This approach allowed us to transmit large volumes of test results efficiently within Lambda’s boundaries.
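The real pipeline serializes with Rust’s postcard and compresses with flate2; the following Python sketch uses JSON plus zlib purely to illustrate the idea of shrinking result payloads to fit Lambda’s streamed-response limit (the data and sizes are illustrative):

```python
import json
import zlib

# A batch of repetitive, hypothetical test results.
results = [{"endpoint": "homescreen", "status": 200, "latency": i % 250}
           for i in range(1000)]

raw = json.dumps(results).encode()
compressed = zlib.compress(raw, level=6)  # flate2 stand-in

print(f"raw={len(raw)} bytes, compressed={len(compressed)} bytes "
      f"({len(compressed) / len(raw):.1%} of original)")
```

Structured result data is highly repetitive, so even general-purpose DEFLATE achieves large savings; a compact binary format like postcard shrinks it further before compression.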

Another hurdle was handling the termination of Lambda functions, or Arglets, which run indefinitely unless explicitly stopped. AWS Lambda doesn’t provide a direct mechanism to halt a function once invoked, so we relied on the timeout defined in the arglet.yml configuration, capped at a maximum of 900 seconds. However, in scenarios where a test fails quickly, waiting for the timeout isn’t practical — we needed a way to terminate early. To solve this, we implemented a signaling mechanism using S3 as an intermediary. Each Arglet periodically checks a preconfigured S3 bucket for a special termination file. If the file is detected, the Arglet self-terminates immediately. To make this process user-friendly, we also developed a CLI (argos server stop) that allows users to stop a test case on demand by generating the termination file in the same S3 bucket, giving developers precise control over test execution.
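The termination check reduces to polling for a marker object. In this sketch a local file stands in for the S3 object (Argos itself checks a preconfigured S3 bucket):

```python
import os
import tempfile

def should_terminate(marker_path: str) -> bool:
    """Arglet-style early-termination check: the marker's presence means stop.
    (Argos uses an object in a shared S3 bucket; a local file stands in here.)"""
    return os.path.exists(marker_path)

with tempfile.TemporaryDirectory() as d:
    marker = os.path.join(d, "TERMINATE")
    print(should_terminate(marker))   # False: test keeps running
    open(marker, "w").close()         # `argos server stop` would write the marker
    print(should_terminate(marker))   # True: the Arglet self-terminates
```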

Concurrency presented another challenge, as AWS imposes a soft limit on the number of Lambda functions that can run simultaneously in a single region. During peak testing, we often approached this limit; fortunately, AWS allows users to request a quota increase with relative ease. By scaling across multiple regions and securing higher concurrency limits, we ensured Argos could handle the massive parallelism required for events like the Super Bowl, where thousands of Arglets operated in unison to simulate millions of requests per second.

Argos for the Super Bowl: A Real-World Triumph

Since its inception in early January this year, Argos has proven to be an indispensable tool in load testing a wide array of systems, most notably our account and analytics services. Preparing these systems to handle the immense traffic anticipated during the Super Bowl was a top priority — not only for our own systems but also for the underlying databases and third-party vendors that support them. Argos quickly became the cornerstone of our testing strategy, enabling us to ensure reliability under extreme conditions.

In the initial phases, we utilized Argos as an exploratory tool to uncover performance vulnerabilities across the entire flow. These early tests, often lasting just 10–30 seconds, allowed us to rapidly identify and address weak points in the system. As we resolved bottlenecks and fine-tuned performance, we progressively extended test durations to simulate more realistic scenarios. For instance, we conducted many 15-minute tests to assess the system’s ability to sustain prolonged high loads, ensuring it could endure the relentless pressure of a Super Bowl-scale event.

Traditional load-testing tools simply couldn’t keep up with Argos’ flexibility and scalability. By harnessing thousands of Arglets, we achieved staggering results: 1.5 million rps for our analytics service and over 1 million rps for account-related traffic, including 300k rps for logins and registrations alone. These rigorous tests instilled unwavering confidence that our systems were fully equipped to manage both the “thundering herd” moment — the overwhelming surge of users at the event’s outset or during pivotal moments mid-show — and the sustained high-traffic spikes that persisted throughout. Thanks to Argos, Tubi’s services delivered a flawless performance across the entire Super Bowl, even as concurrent viewership reached an impressive peak of 15 million, with not a single P2 or higher severity incident reported. This resounding success highlighted Argos’ critical role in ensuring a seamless and uninterrupted experience for millions of viewers.

Future Horizons

Argos is poised for an exciting evolution with a series of ambitious enhancements on the horizon. One key initiative involves implementing a leaky bucket mechanism to finely tune request rates, ensuring smoother and more controlled testing scenarios. We are also embarking on a refactoring effort to achieve platform agnosticism, paving the way for seamless integration with GCP, Azure, and Cloudflare, expanding Argos’ reach beyond its current AWS foundation. Most thrillingly, we are diligently stripping away proprietary components to prepare Argos for an open-source release, inviting the global developer community to collaborate and shape its future trajectory.

Far more than just a tool, Argos represents a paradigm shift in the world of performance testing. By harmonizing simplicity — through intuitive curl-based tests — with unparalleled scalability, driven by thousands of Arglets, and a robust Data Bus powered by ClickHouse, it empowers teams to rigorously stress-test systems under the most extreme conditions with remarkable ease. The flawless performance during the Super Bowl stands as a powerful testament to its capabilities, and as we gear up for its open-source debut, Argos is well-positioned to redefine industry standards in performance testing. Stay tuned — this transformative journey is only just beginning!


Argos: a Data-Driven Distributed Load Test Tool that Powers Super Bowl was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Client Error Handling at Scale]]> https://code.tubitv.com/client-error-handling-at-scale-0f3f801d2591?source=rss----644fdce427d3---4 https://medium.com/p/0f3f801d2591 Tue, 13 May 2025 20:56:23 GMT 2025-05-14T17:55:19.879Z Arun Shankar and Gregory Lewis

Introduction

When users interact with a streaming service, they expect instant playback, seamless navigation, and uninterrupted experiences. However, in a distributed system that spans millions of devices across Tubi’s 30+ supported platforms, network failures, authentication issues, and server overloads are inevitable. How we respond to these errors can make the difference between a frustrating user experience and a smooth, resilient one.

At Tubi, we set out to rethink how client error handling is done: making it dynamic, centralized, and adaptable without needing constant app updates. This document explains the journey, challenges, and the final architecture we adopted to achieve fault-tolerant error handling across all clients.

Challenges with Previous Approaches

Before introducing a centralized solution, client error handling was fragmented and inconsistent. Each platform implemented its own rules and strategies, leading to unpredictable behaviors:

  • Hardcoded Retry Logic: Retry strategies were embedded within app code. Tuning them required submitting new app versions and waiting through store review cycles.
  • Lack of Consistency: Client apps sometimes handled the same server error (e.g., a 401 Unauthorized) differently.
  • Poor Adaptability: During high-traffic events or system-wide outages, adjusting retry strategies was slow or impossible.
  • Thundering Herd Problem: Without proper backoff strategies, clients hammered servers at the same time after failures, worsening the problem.

Our New Approach: Dynamic Client Error Handling

Our new system is built around a centralized JSON configuration, dynamically fetched by clients at runtime. It defines retry strategies, error conditions, and error actions in a standardized way. Updates to this config instantly propagate to all client apps without needing to submit new app versions or go through store review cycles.
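To make this concrete, here is a hypothetical shape for such a configuration. Tubi’s actual client-error-config.json schema is not shown in this post, so every field name and value below is purely illustrative:

```python
import json

# Illustrative-only sketch of a centralized error-handling config:
# each rule matches an error condition and declares a retry strategy.
config = {
    "version": 1,
    "rules": [
        {
            "match": {"status": 401},
            "action": "refresh_token_then_retry",
            "retry": {"max_attempts": 2, "base_delay_ms": 500},
        },
        {
            "match": {"status": 503},
            "action": "retry",
            "retry": {"max_attempts": 4, "base_delay_ms": 1000, "jitter": True},
        },
    ],
}
print(json.dumps(config, indent=2))
```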

Key goals include:

  • Centralized control, allowing teams to manage behavior from a single source of truth
  • Real-time adaptability, enabling both proactive adjustments (such as tuning retry strategies during major events to prevent retry storms) and reactive responses to emerging incidents
  • Consistency across all device types to ensure uniform behavior and simplify debugging
  • Scalability without sacrificing user experience, enabling growth while maintaining reliability and performance.

This shift laid the foundation for a resilient and scalable error handling architecture.

Architecture Overview

System Flow

Clients across multiple platforms fetch the latest version of the client-error-config.json file on app launch.

To ensure high availability and minimal latency, the configuration file is stored in an AWS S3 bucket and distributed through a Multi-CDN setup. The use of Multi-CDNs ensures that regardless of geographical location or traffic conditions, clients can reliably and quickly access the latest configuration without experiencing downtime or delays.

Decoupling configuration from app binaries and delivering it via a global CDN enables error handling rules to evolve independently of release cycles, offering greater flexibility and control.

Diagram: Clients interacting with error config via a Multi-CDN architecture

Deployment Workflow

Changes to the error handling configuration are managed through a robust, automated deployment pipeline. Whenever a new configuration is proposed, GitHub Actions are triggered to validate the structure and contents of the JSON file against a predefined schema. This automated validation step ensures that only syntactically correct and semantically valid configurations are deployed, preventing potential client-side parsing errors.

Once validated, the new configuration is uploaded to the designated AWS S3 bucket. Each JSON version adheres to its associated schema, which cannot be modified. Any change requiring a schema update must be published as a new version to ensure backward compatibility and prevent breaking client applications. As part of the deployment, we issue a cache invalidation request across the Multi-CDN layers to ensure that clients fetch the updated file on their next pull. This step purges outdated versions from edge locations, ensuring fresh content is served on the very next client request.

This automated deployment pipeline guarantees both the correctness and the rapid availability of configuration updates, minimizing human error and operational risk.
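The schema check in the pipeline can be illustrated with a minimal, hand-rolled validator. This is only a sketch: the real pipeline validates against a full JSON Schema in GitHub Actions, and the function name here is ours; the field names follow the config snippets shown later in this post.

```javascript
// Minimal shape check for client-error-config.json, mirroring the kind of
// validation the CI step performs. A real pipeline would use a full JSON
// Schema validator; this only illustrates the idea.
function validateErrorConfig(config) {
  const errors = [];
  const strategies = config.retry_strategies;
  if (typeof strategies !== 'object' || strategies === null) {
    errors.push('retry_strategies must be an object');
    return errors;
  }
  for (const [name, s] of Object.entries(strategies)) {
    for (const field of ['max_retries', 'retry_base_millis', 'retry_exponent',
                         'retry_cap_millis', 'retry_jitter_ratio']) {
      if (typeof s[field] !== 'number') {
        errors.push(`${name}.${field} must be a number`);
      }
    }
    if (s.retry_jitter_ratio < 0 || s.retry_jitter_ratio > 1) {
      errors.push(`${name}.retry_jitter_ratio must be in [0, 1]`);
    }
  }
  return errors;
}
```

Rejecting a malformed config before upload is what keeps a bad push from ever reaching client devices.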

Diagram: Deployment workflow example for client error configuration

Versioning

Maintaining compatibility across a diverse and evolving client ecosystem requires a carefully designed versioning strategy. Whenever a breaking schema change is introduced, for example, altering the structure of the JSON in ways that older clients cannot parse, a new versioned file is created, such as v2/client-error-config.json, v3/…, and so on.

Older client applications continue to fetch and process the configuration file version they were designed for, ensuring backward compatibility and stable operation. Meanwhile, non-breaking updates, those that adjust values or add backward-compatible features, are applied to the existing configuration file version without requiring a version bump. This layered versioning approach allows for seamless migration, progressive adoption of new features, and safe coexistence of multiple client generations in production.
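As a sketch of how a client might select the file version it was built against: the vN/ path layout follows the convention above, while the base URL and helper name are hypothetical, and the assumption that the first schema version lives at the root path is ours.

```javascript
const CONFIG_BASE = 'https://config.example.com'; // hypothetical CDN host

// Build the config URL for the schema version this client was compiled
// against. In this sketch, v1 clients read the unversioned root file and
// later schema versions read from vN/ subpaths.
function configUrlFor(schemaVersion) {
  if (schemaVersion <= 1) return `${CONFIG_BASE}/client-error-config.json`;
  return `${CONFIG_BASE}/v${schemaVersion}/client-error-config.json`;
}
```

Because the version is baked into the client at build time, an old app keeps fetching a file whose schema it can always parse.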

Together, this architecture enables rapid iteration, real-time adaptability, and long-term stability, which are essential traits for supporting millions of users across a fragmented device landscape.

With the configuration architecture and deployment process in place, the next key piece of the system is how we handle retries when failures occur. Robust retry strategies are essential to ensure resilience and prevent outages from escalating due to uncontrolled client behavior.

Defining Retry Strategies for Resilience

Why Exponential Backoff?

When retries are executed indiscriminately without any control or spacing between attempts, they can quickly saturate server capacity, compounding failures across the system. A sudden surge of simultaneous retries from thousands or millions of clients can overwhelm backend infrastructure, leading to cascading failures that affect even healthy components. To mitigate this risk, we implement exponential backoff, a strategy that systematically increases the delay between consecutive retry attempts. With each failure, the wait time grows exponentially, giving the backend systems sufficient breathing room to recover and preventing synchronized retry storms that could further destabilize the platform.

A typical backoff formula:

retry_delay = base_delay * (2^attempt)

Introducing Jitter

Even when using exponential backoff, there remains a critical vulnerability: if millions of devices experience failures at roughly the same time, such as during a widespread outage, they may still schedule their retries along similar exponential timelines. As a result, despite the increasing delays between attempts, a large number of clients can end up retrying simultaneously at each backoff interval, creating massive traffic spikes that overwhelm server capacity.

To address this challenge, we introduce jitter, a technique that adds randomness to the retry delay for each client. Instead of all clients waiting for the same calculated backoff duration, jitter causes each device to wait for a slightly different, randomized amount of time within a defined range. This randomness smooths out retry traffic, dispersing requests over time rather than clustering them into synchronized bursts. By staggering retries more effectively, jitter significantly reduces the risk of secondary overloads, improves system recovery times, and ensures that backend services can stabilize even under high failure volumes.

Full Jitter Formula:

retry_delay = min(base_delay * (2^attempt), max_delay) * random(0.5, 1.0)
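The formula can be transcribed directly. The function and parameter names here are illustrative; the production config expresses the same idea through the retry_* fields shown later in this post.

```javascript
// Full jitter: exponential growth capped at maxDelay, then scaled by a
// random factor in [0.5, 1.0) so simultaneous failures don't retry in lockstep.
function fullJitterDelay(attempt, baseDelay, maxDelay) {
  const capped = Math.min(baseDelay * 2 ** attempt, maxDelay);
  return capped * (0.5 + Math.random() * 0.5);
}
```

For any attempt, the delay lands somewhere in the upper half of the capped backoff window, so two clients that failed at the same instant almost never retry at the same instant.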

Retry Strategy Comparisons (per AWS Analysis)

Comparison of different retry strategies (source: AWS Architecture Blog)

Client Error Handling Configuration

Our client-error-config.json defines global and service-specific error handling rules:

Retry Strategies

To handle network failures gracefully at scale, we adopted a jittered exponential backoff model. This approach helped prevent synchronized retries that previously led to server overload during outages. Introducing a jitter range of 50–100% of the delay window proved critical in spreading out retry traffic and accelerating system recovery during high-error conditions.

"retry_strategies": {
"exp_backoff": {
"max_retries": 3,
"retry_base_millis": 1500,
"retry_exponent": 2,
"retry_cap_millis": 7500,
"retry_jitter_ratio": 0.5
}
}

HTTP Status Code Mapping

Maps specific HTTP response codes to appropriate retry strategies or error actions, allowing the client to respond intelligently to different types of server responses.

"status_codes": {
"400": { "retry_strategy": "no_retry" },
"401": { "retry_strategy": "new_token", "conditions": ["user_not_found"] },
"403": { "retry_strategy": "no_retry", "conditions": ["expired_token"] },
"500": { "retry_strategy": "exp_backoff" }
}

Service-Specific Handling

Provides customized error handling rules for critical services and endpoints, enabling finer-grained control beyond the global defaults to address unique service behaviors.

"services": {
"account": {
"routes": {
"/user_device/login/refresh": {
"status_codes": {
"401": { "retry_strategy": "sign_off" },
"403": { "retry_strategy": "sign_off" }
}
}
}
}
}

How Clients Process Errors

  1. When a network request fails, the client initiates a structured error resolution process to determine the most appropriate recovery action.
  2. It begins by identifying the most specific error handling rule available, prioritizing configurations defined at the endpoint level first, then falling back to service-level rules, and finally to global defaults if no specific match is found.
  3. Once the applicable rule is located, the client applies the corresponding retry strategy, whether it involves exponential backoff, token refresh, or other defined approaches. If the rule specifies any special error actions, such as invalidating an authentication token or signing the user off, those actions are executed immediately to maintain system integrity.
  4. Finally, before attempting a retry, the client calculates a randomized delay based on the exponential backoff parameters combined with jitter, ensuring that retry attempts are staggered and system load is minimized.
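The lookup order in step 2 can be sketched as follows, assuming the config shape from the JSON snippets above; the helper name is ours, and the no-retry default for unmapped codes is an assumption.

```javascript
// Resolve the handling rule for a failed request: endpoint-level rules win,
// then service-level rules, then global defaults. Config shape follows the
// JSON snippets in this post; unmapped codes fall back to no_retry here.
function resolveRule(config, service, route, statusCode) {
  const code = String(statusCode);
  const routeRule =
    config.services?.[service]?.routes?.[route]?.status_codes?.[code];
  if (routeRule) return routeRule;
  const serviceRule = config.services?.[service]?.status_codes?.[code];
  if (serviceRule) return serviceRule;
  return config.status_codes?.[code] ?? { retry_strategy: 'no_retry' };
}
```

With the sample config, a 401 on /user_device/login/refresh resolves to sign_off via the endpoint rule, while a 500 anywhere falls through to the global exp_backoff entry.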

Example retry delay calculation in Kotlin:

import kotlin.math.pow
import kotlin.random.Random

fun getRetryDelayMillis(attempt: Int, strategy: RetryStrategy): Long {
    // Exponential growth, capped, then jittered down into a randomized window
    val rawVal = strategy.retry_exponent.toDouble().pow(attempt) * strategy.retry_base_millis
    val maxVal = rawVal.coerceAtMost(strategy.retry_cap_millis.toDouble()).toLong()
    val minVal = (maxVal * (1 - strategy.retry_jitter_ratio)).toLong()
    return Random.nextLong(from = minVal, until = maxVal + 1)
}

And JavaScript:

const getRetryDelay = (strategy, attempt) => {
  const rawVal = strategy.retry_exponent ** attempt * strategy.retry_base_millis;
  const maxVal = Math.min(rawVal, strategy.retry_cap_millis);
  const minVal = maxVal * (1 - strategy.retry_jitter_ratio);
  return Math.floor(Math.random() * (maxVal - minVal + 1)) + minVal;
};

Error Actions

Beyond standard retry mechanisms, some error conditions require explicit corrective actions to ensure system stability and user security. One such action is invalidate_token, which instructs the client to immediately discard the current authentication token and fetch a fresh one. This helps recover gracefully from issues such as expired credentials, revoked access, or token tampering, without unnecessarily disrupting the user session.

In more severe cases where recovery is not feasible, the sign_off action is applied. This forces the client to log the user out, clear any persisted authentication data, and redirect the user to the sign-in flow. Triggering a full sign-off ensures that the application does not continue operating under invalid or compromised authentication states, preserving the integrity of both user data and backend systems.

Together, these corrective actions complement the retry strategies by providing targeted recovery paths, ensuring that clients remain resilient and that user trust is maintained even during error scenarios.

Showcase: Deriving Retry Strategies for Each Endpoint

To ensure comprehensive and consistent error handling across all Tubi client applications, we began by cataloging every network endpoint in use. Using a network proxy tool like Charles, we captured traffic from each client platform and compiled a detailed list of endpoints.

We documented these endpoints in a spreadsheet, where each row represents an endpoint and the columns denote possible server-side HTTP error codes. For each combination, we defined the appropriate retry strategy directly in the corresponding cell, whether it required exponential backoff, no retry, token refresh, or other action.

This methodical approach allowed us to standardize client behavior while preserving flexibility for service-specific nuances. The process involved deep collaboration and multiple cross-team meetings to align on a unified configuration. Ultimately, this CSV became the foundation for generating the JSON config now used across all Tubi clients.

Conclusion

By introducing a dynamic and centralized client error handling system, we have fundamentally transformed how errors are managed across platforms. The new system enables real-time adaptability without the need for app releases, allowing us to respond swiftly to emerging issues. It ensures that error responses are consistent and predictable across all client platforms, eliminating fragmented behavior. Smarter retry strategies also help significantly reduce server overloads during failure scenarios, preserving system stability under pressure. Most importantly, this approach delivers a smoother, more reliable user experience, fostering greater trust and satisfaction among our users.

Dynamic error handling is no longer a luxury — it has become a core necessity for building resilient, scalable consumer applications capable of thriving at a global scale.


Client Error Handling at Scale was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Scaling Tubi for the Super Bowl: Implementing a Multi-CDN Strategy for Web and TV Apps]]> https://code.tubitv.com/scaling-tubi-for-the-super-bowl-implementing-a-multi-cdn-strategy-for-web-and-tv-apps-1dc0ed267cdf?source=rss----644fdce427d3---4 https://medium.com/p/1dc0ed267cdf Wed, 16 Apr 2025 00:52:39 GMT 2025-04-16T01:07:51.298Z Arun Shankar and Stephen Sorensen

Introduction

Tubi’s Super Bowl LIX streaming was a historic milestone both for Tubi and the FAST (Free Ad-Supported Streaming TV) industry. While we’ve long led in on-demand streaming, this was our first live event at this scale. With 13.5 million average viewers and a peak of 15.5 million concurrents, our infrastructure faced an unprecedented test.

The biggest challenge wasn’t just handling millions of viewers — it was scaling instantly under unpredictable surges. A gradual ramp-up is one thing, but what if 10 million users opened Tubi in under a minute? That’s 166,667 requests per second — an extreme surge that could cripple even the most resilient systems.

Our top priority was ensuring seamless app launches, instant streaming, and uninterrupted playback for clients, preventing CDN overloads, authentication bottlenecks, backend failures, and playback hiccups. Achieving this required multi-CDN redundancy specifically for client delivery, along with real-time traffic management and aggressive scaling.

In this blog, we’ll break down our technical strategies — multi-CDN implementation, real-time monitoring, and load balancing — and key lessons learned from this record-breaking event.

Why Multi-CDN?

Before delving into the intricacies of our Multi-CDN strategy, it’s essential to first examine the limitations of our existing architecture. Understanding these challenges will provide valuable context for why we adopted a Multi-CDN strategy for client apps and how this approach enhanced their scalability, reliability, and performance. By identifying the constraints of our current system, we can better appreciate the critical role that a Multi-CDN setup plays in addressing issues such as traffic spikes, latency optimization, redundancy, and fault tolerance.

Previous Architecture (Single CDN)

Pre-Super Bowl WebOTT Architecture
  1. Client Requests
    Tubi apps across smart TVs, game consoles, and browsers request initial web assets (HTML, JS, CSS) from our backend upon launch. At the time, these assets were served through a single CDN.
  2. Reverse Proxy (Nginx)
    Requests route through an Nginx reverse proxy in our VPC, providing load balancing and basic security against attacks.
  3. Kubernetes Web Servers
    The proxy forwards traffic to Kubernetes web servers, which are designed to scale dynamically using Horizontal Pod Autoscaling (HPA) based on real-time traffic load. While we proactively configure pre-scaling and set generous limits — such as allowing up to 1,200 pods — even that may fall short during extreme surges, like those experienced at peak game time or during a large-scale DDoS attack. The unpredictable nature and sudden velocity of such spikes can outpace even the most aggressive autoscaling policies, creating potential bottlenecks that impact availability and responsiveness.
  4. Backend Communication (gRPC)
    Web servers communicate with backend microservices via gRPC for authentication, content and recommendations, coordinated by a control plane.
  5. Client-Side Rendering
    Servers return the final HTML and assets. Clients render the UI, allowing users to browse and stream content seamlessly.

Key Limitations of This Architecture

While this setup is effective for regular traffic patterns, it has notable limitations when trying to achieve high availability during extreme traffic surges, such as during the Super Bowl:

Scalability Bottlenecks:

  • The fixed upper limit on HPA scaling could lead to service degradation if traffic exceeds the maximum replicas. We could just increase the upper limit further, but doing so without improving the architecture would be prohibitively expensive.
  • Client apps scaling focuses on keeping users logged in without interruptions. Some platforms use Redis to manage sessions, so we set a high upper limit to handle peak traffic smoothly.
  • Kubernetes pods take time to scale, which might be too slow for sudden traffic spikes.

Single Entry Point (Nginx in VPC):

  • All requests go through the Nginx reverse proxy, making it a potential single point of failure under high load.
  • If Nginx becomes overwhelmed, users may experience delays or failures in app launch.

Latency & Geographic Distribution:

  • Since all requests are routed through our centralized VPC, users in distant geographic locations might experience higher latency.
  • There is no caching layer for API responses at the edge, leading to repeated backend calls for similar requests.

Traffic Spikes & Load Balancing:

  • The system is optimized for gradual traffic increases, not for sudden hockey-stick growth (e.g., 10M users joining in under a minute).
  • Load balancing within Kubernetes helps distribute traffic, but without CDN-level caching, backend services could get overwhelmed.

Multi-CDN Architecture

Multi-CDN Architecture for client applications
Multi-CDN Architecture for WebOTT Apps

Overview

To overcome the limitations of our existing architecture, we implemented significant changes. We removed all transient data from the initial HTML document, enabling us to serve a static, cached version of the pre-generated HTML page to all users. The client now fetches data directly from backend services. We also introduced CDN services to deliver this static content, incorporating traffic rerouting mechanisms for various failure scenarios. These changes specifically enhance performance, reliability, availability, and failover support for WebOTT applications.

Instead of routing requests directly to our origin servers, all traffic first passes through a Content Delivery Network (CDN). We use two CDN providers, AWS CloudFront and Akamai, and can quickly shift traffic to either or both if issues arise.

Requests are handled based on primary and secondary origins:

  • Primary Origin: CloudFront or Akamai routes traffic to our Nginx reverse proxy, which forwards it to WebOTT proxy pods running on Kubernetes. For most platforms, the primary origin is actually disabled, so the CDN will route requests directly to the secondary origin.
  • Secondary (or failover) Origin: If the primary origin is disabled or the CDN is unable to get a successful response from the primary origin, requests are routed to a static asset fallback hosted on AWS S3.

Backend Service Migration

Because we’ll be serving a statically-generated page to all users, we need to move all transient data out of the initial HTML document. Previously, we rendered the HTML document on the server based on the results of several gRPC API calls to several backend services. Now, we’ll skip those API calls when creating the HTML document and fetch that data directly from those backend services on the client-side instead. This was a massive undertaking, and also required each backend service to have its own Multi-CDN architecture and failover assets.

DNS Setup

To efficiently distribute traffic across multiple CDNs and enable rapid failover when needed, we have configured our DNS with the following setup:

Weighted Record Set with Low TTL

We utilize Route 53 weighted record sets with a low TTL to dynamically manage traffic distribution between CloudFront and Akamai. This enables real-time adjustments to the percentage of traffic directed to each CDN without significant DNS propagation delays. By altering the weighted distribution, we can balance load, optimize performance, or incrementally test new configurations.
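Conceptually, a weighted record set behaves like a weighted random choice made per DNS query. The model below is illustrative only: Route 53 performs this selection on its resolvers, and real weighted records also require set identifiers and can incorporate health checks.

```javascript
// Model of weighted DNS routing: pick a CDN with probability proportional
// to its weight. Route 53 does this per query; this is only an illustration.
function pickCdn(records) {
  const total = records.reduce((sum, r) => sum + r.weight, 0);
  let roll = Math.random() * total;
  for (const r of records) {
    roll -= r.weight;
    if (roll < 0) return r.name;
  }
  return records[records.length - 1].name;
}
```

Shifting weights (say, 50/50 to 90/10) changes the probability each resolver answer points at a given CDN, which, combined with a low TTL, lets the new split take effect within minutes.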

AWS Application Recovery Controller (ARC) for Failover

We have integrated AWS Application Recovery Controller (ARC) as a highly available kill switch. In the event of a major outage affecting one of our CDNs, ARC enables us to quickly redirect all traffic to the other CDN with minimal downtime. This fast and automated failover mechanism ensures uninterrupted service availability during disruptions.

# Pseudocode for managing weighted DNS entries
# Hosted Zone: tubitv.com
resource "aws_route53_record" "weighted_record_cloudfront" {
  zone_id = "<TubiZoneID>"
  name    = "tubitv.com"
  type    = "A"

  alias {
    name                   = aws_cloudfront_distribution.cf_distribution.domain_name
    zone_id                = aws_cloudfront_distribution.cf_distribution.hosted_zone_id
    evaluate_target_health = true
  }

  # Weighted routing
  weight = 50 # 50% traffic to CloudFront
}

resource "aws_route53_record" "weighted_record_akamai" {
  zone_id = "<TubiZoneID>"
  name    = "tubitv.com"
  type    = "A"

  # Not a direct alias when not using an AWS distribution;
  # typically points to an Akamai-provided IP or edge host
  # (implementation details may vary)
  weight = 50 # 50% traffic to Akamai
}

This DNS configuration plays a crucial role in maintaining a resilient, multi-CDN architecture by allowing both controlled traffic distribution and instantaneous failover capabilities.

Request Handling and Failover Mechanism

Failover Flow (Primary Origin Failure)

If the primary origin is enabled but unavailable (e.g., WebOTT servers or Nginx are down), the request is rerouted to the failover origin:

  1. The CDN detects an origin failure and triggers the failover mechanism, which changes the origin server (instead of the primary origin, it will fetch data from the AWS S3 bucket).
  2. The request is rewritten using a cloud function (CloudFront Function, Lambda@Edge, or Akamai Edgeworker) to fetch the corresponding static asset from an AWS S3 bucket.
  3. The static asset is returned to the client.

CDN Processing

Each CDN has specialized mechanisms to enable failover and request transformation.

AWS

  • Lambda@Edge: Handles request rewriting and routing logic when the primary origin is enabled but unavailable.
  • CloudFront Functions: Used for header-to-query parameter transformations, allowing native apps (e.g., AndroidTV) to pass headers as query parameters when serving static assets in failover mode. Also handles request rewriting and routing logic like the lambda, but only when the primary origin is disabled.
// Pseudocode for a CloudFront Function to handle failover rewriting
function handler(event) {
  var request = event.request;
  // Example condition: if primaryOrigin is flagged as "down"
  if (primaryOriginDown) {
    // Rewrite the path to static fallback
    request.uri = "/failsafe-index.html";
  }
  return request;
}

This snippet demonstrates how we can direct all traffic to a fallback HTML when the primary origin is unavailable.

Akamai

  • EdgeWorkers: Similar to Lambda@Edge, EdgeWorkers execute at the edge, managing request rewriting and failover logic when the primary origin is unreachable.

Header-to-Query Parameter Transformation

During failover, headers used by native apps cannot be passed to static assets. To overcome this:

  1. CloudFront Functions extract headers and convert them into query parameters.
  2. The static asset URL is modified to include these parameters.
  3. On the client side, the app reads the query parameters to reconstruct the required context.
function handler(event) {
  const request = event.request;
  const queryParams = request.querystring || {};
  const headers = request.headers || {};
  const cookies = request.cookies || {};

  const headerToQueryMap = [
    { header: 'x-android-native-version', param: 'x-android-native-version', cookie: 'o_cnav' },
    { header: 'device-deal', param: 'device-deal', cookie: 'o_dd' },
    // Add more mappings as needed
  ];

  const newQueryParams = {};

  for (const mapping of headerToQueryMap) {
    const headerValue = headers[mapping.header]?.value;
    const paramValue = queryParams[mapping.param];
    const cookieValue = cookies[mapping.cookie]?.value;

    if (headerValue && !paramValue && cookieValue !== headerValue) {
      newQueryParams[mapping.param] = headerValue;
    }
  }

  const hasNewParams = Object.keys(newQueryParams).length > 0;

  if (hasNewParams) {
    const mergedParams = { ...queryParams };

    for (const key in newQueryParams) {
      mergedParams[key] = { value: newQueryParams[key] };
    }

    const serializedParams = Object.entries(mergedParams)
      .map(([key, val]) => `${encodeURIComponent(key)}=${encodeURIComponent(val.value)}`)
      .join('&');

    const redirectUrl = request.uri + (serializedParams ? `?${serializedParams}` : '');

    return {
      statusCode: 302,
      statusDescription: 'Found',
      headers: {
        location: { value: redirectUrl },
      },
    };
  }

  // Return original request if no redirect needed
  return request;
}

This ensures a seamless transition when switching to failsafe mode.

Multi-CDN for Static Assets like JS/Images/CSS

This architecture uses multiple CDNs (AWS CloudFront and Akamai) to efficiently deliver static assets like JavaScript, images, and CSS files to client apps. Route 53 routes requests to the optimal CDN, while Lambda@Edge and Akamai Edge Workers allow for customization and optimization at the edge. The assets are stored in an S3 bucket, and the CDNs cache and serve them globally, ensuring fast, reliable delivery with reduced latency and increased availability.

Multi-CDN Architecture for JS Chunks/CSS/Fonts/Images

Building Failsafe Assets

To support failover scenarios, failsafe assets are generated at build or release time. These assets serve as the static fallback content when the system switches to failover mode.

Process for Building Failsafe Assets

At Release Time:

  • For each WebOTT release, a set of failsafe assets (HTML, JS, and CSS) is generated.
  • These assets are bundled with the primary build and uploaded to an AWS S3 bucket.
  • CDN cache is invalidated so these new assets can be served to users.

Usage in Failover Mode:

  • When the primary origin is down, the Multi-CDN strategy routes requests to the failsafe assets stored in S3.
  • The static assets ensure users continue to receive essential UI elements and fallback content, preventing a complete outage.

The Multi-CDN architecture enhances WebOTT reliability by introducing redundancy and intelligent failover mechanisms. By integrating CloudFront, Akamai, and automated failover handling, we ensure a robust, scalable, and highly available content delivery system. The addition of failsafe assets guarantees a smooth user experience even in the event of backend failures.

Benefits of CDN-Based Static Delivery

  • Ultra-fast roundtrip times as requests are served directly from CDN edge locations.
  • Global availability with reduced latency, as content is cached at geographically distributed edge servers.
  • Scalability & reliability, reducing the dependency on backend servers.

Deployment and Cache Invalidation

During a deployment:

  1. The new static assets are uploaded to S3.
  2. A CDN cache invalidation is triggered to remove outdated content.
  3. The first request after deployment fetches fresh content from the origin.
  4. Subsequent requests are fully cached at the CDN level for optimal performance.

Whitelisted Routes for Proxy Access

Certain API routes and dynamic endpoints must still be processed by backend proxy servers. To accommodate this:

  • We maintain a whitelist of routes that bypass CDN caching and always hit the proxy servers.
  • These routes are configured at the CDN level to ensure proper request routing.

This hybrid approach ensures a balance between performance (CDN caching) and dynamic content access (whitelisted proxy routes) for an optimal user experience.

Challenges We Faced

Implementing a Multi-CDN architecture came with its own set of challenges. Some of these were related to domain-level limitations, security constraints, and compatibility issues across various OTT devices. Below are the key challenges we encountered and how we addressed them.

1. Web (tubitv.com) — Apex Domain Limitations

One of the biggest challenges was incorporating a Multi-CDN strategy for the Web (tubitv.com) platform. Since tubitv.com is an apex domain (root domain), we could not add multiple CDNs using CNAME or A records.

  • DNS Restrictions: DNS allows adding CNAME records for subdomains (e.g., www.tubitv.com) but does not allow a CNAME for an apex domain, which we needed to be able to fully implement the Multi-CDN architecture for tubitv.com (without the www). At the apex, you cannot use a CNAME; you must use an “A” record to point to a web host.

Potential Workarounds Considered:

  • Use an A record with an alias: We tried to use an alias “A” record (an implementation-specific extension of DNS used by Route53) but those can only be used for AWS-Managed resources, so that wouldn’t work for pointing to the Akamai CDN.
  • IP Whitelisting for Akamai: One option was to use Akamai’s IP CIDRs or IP addresses in a whitelist. However, this was not a practical solution, as Akamai’s IP ranges change over time and would require constant maintenance.
  • Using www.tubitv.com Instead: If we served Web traffic via www.tubitv.com, we could have incorporated both CloudFront and Akamai in a Multi-CDN setup. However, this would require us to redirect all traffic from tubitv.com → www.tubitv.com, which is not ideal for SEO and user experience.
  • Final Decision: Given the limitations, we decided to stick with CloudFront as the sole CDN for the Web platform while leveraging Multi-CDN for OTT apps.

2. Viewer TLS Certificate Compatibility Issues

After deploying the Multi-CDN setup, we encountered an issue with TLS compatibility on older devices, particularly Samsung and Sony Smart TVs.

Initial Configuration:

  • We initially configured the CDN with a strict security policy using the latest TLS version for enhanced security.
  • However, after deployment, we noticed a sudden drop in Total Viewing Time (TVT) for Sony Smart TVs, indicating that users could not access the service.

Root Cause:

  • Many older Samsung and Sony TV models do not support the default cipher suites used in modern TLS certificates. The issue was not with the TLS version itself but with the specific cipher suites required for encryption. These cipher suites are not enabled by default in most modern certificates, so we had to explicitly opt into them to ensure compatibility.
  • These devices were unable to establish a secure connection, leading to app failures.

Solution & Rollback Strategy:

  • We reverted the CDN configuration for Sony devices and tested multiple TLS versions.
  • We finally solved the issue by specifying specific TLS versions and cipher suites to support to maintain compatibility while still ensuring security on both AWS & Akamai.

These challenges highlighted the complexity of deploying a Multi-CDN strategy, especially for a wide range of devices with varying levels of compatibility. Despite these hurdles, we were able to implement a solution that maximized performance while ensuring availability for all supported platforms.

Monitoring Dashboards for Multi-CDN

To ensure real-time observability and proactive issue detection, we integrated Datadog for monitoring our Multi-CDN architecture. With Datadog’s real-time analytics and alerting capabilities, we were able to track key CDN performance metrics and quickly respond to anomalies.

Key Metrics Monitored

  1. CDN Queries Per Second (QPS): This metric allows us to track the number of requests hitting our CDNs in real time. Sudden spikes or drops in QPS signal potential issues, such as traffic surges, bot attacks, or service disruptions. We have established alerting thresholds to detect unusual patterns.
  2. CDN Error Rate — 4xx (Client Errors): Monitoring 4xx errors (such as 403, 404, and 429) helps us identify misconfigurations, authentication failures, and bot rate limits. A surge in 403 errors may suggest that legitimate traffic is being blocked by WAF rules or geo-blocking policies, while a high number of 404 errors could indicate missing assets or incorrect routing configurations.
  3. CDN Error Rate — 5xx (Server Errors): 5xx errors indicate server-side failures at either the CDN level or the origin level. A high 5xx error rate signals that either the primary origin (WebOTT proxy) is down, failover to the secondary origin (S3 static assets) is malfunctioning, or misconfigurations in Akamai or CloudFront are preventing proper response delivery.
  4. CDN Cache Hit Ratio: This metric measures the percentage of requests served directly from the CDN cache rather than hitting the origin. A high cache hit ratio (approximately 90% or higher) ensures low latency, reduced origin load, and improved performance. A sudden drop in this ratio may indicate frequent cache invalidation or the need to optimize cache-control headers.
  5. Weighted Routing Between CDNs: Given our use of CloudFront and Akamai, we required visibility into the traffic distribution between the two. We established Datadog monitors to track traffic distribution based on our weighted routing configurations. If one CDN serves significantly less traffic than expected, it may indicate a misconfiguration in our routing logic or a CDN outage.
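The ratio metrics above are simple derivations from raw request counts. The sketch below shows the arithmetic; the threshold values in the example alert condition are illustrative, not our actual Datadog settings.

```javascript
// Derive the monitored ratios (cache hit ratio, 4xx/5xx error rates) from
// raw CDN request counts. Alert thresholds here are illustrative only.
function cdnHealth({ total, hits, status4xx, status5xx }) {
  return {
    cacheHitRatio: hits / total,
    errorRate4xx: status4xx / total,
    errorRate5xx: status5xx / total,
    // Example alert condition: cache hit ratio drops below ~90%,
    // or 5xx rate rises above 1% of requests.
    alert: hits / total < 0.9 || status5xx / total > 0.01,
  };
}
```

Keeping these as ratios rather than absolute counts makes the alerts meaningful both at baseline traffic and during a Super Bowl-sized surge.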

Real-time Alerts & Incident Response

  • We configured Datadog alerts to notify on-call engineers via Slack when critical thresholds were breached.
  • Dashboards were made available to engineering, operations, and SRE teams to quickly diagnose issues and perform root cause analysis (RCA).

By leveraging Datadog dashboards and alerts, we improved the reliability of our Multi-CDN strategy while ensuring high availability and performance across all platforms.

QPS Metrics for Select OTT Platforms During Peak Traffic Surge

Conclusion and Lessons Learned

Implementing a Multi-CDN architecture was critical to scaling Tubi for a high-profile, high-traffic live event like the Super Bowl. By removing transient data from the server-rendered HTML and fully leveraging CDN caching and weighted failover, we achieved the following:

  1. Enhanced Reliability: Even under unexpected surges, we could route traffic instantly to whichever CDN or origin was healthiest.
  2. Scalability and Speed: Edge caching and S3-based failover kept our origin from collapsing under tens of millions of requests.
  3. Reduced Latency: Users worldwide were served from nearby edge locations, improving load times.
  4. Operational Efficiency: With automated failover and robust monitoring, on-call teams could respond to incidents quickly without scrambles.

This Multi-CDN approach provided real redundancy, ensuring Tubi’s streaming experience remained smooth and uninterrupted for one of the world’s largest sporting events. It also paved the way for a more generalized, globally distributed infrastructure that benefits viewers every day, not just during the Super Bowl.


Scaling Tubi for the Super Bowl: Implementing a Multi-CDN Strategy for Web and TV Apps was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Optimizing Latency in Redis Online Feature Store: A Case Study]]> https://code.tubitv.com/optimizing-latency-in-redis-online-feature-store-a-case-study-1fd6a611bafa?source=rss----644fdce427d3---4 https://medium.com/p/1fd6a611bafa Mon, 03 Mar 2025 12:34:41 GMT 2025-03-03T12:34:41.009Z Photo by Carlos Muza on Unsplash

As MLOps teams scale up online data-intensive services, the need for efficient, low-latency access to feature data becomes increasingly crucial. In this blog post, we discuss our recent journey to optimize the latency of Redis in our Online Feature Store (OFS), focusing on batch query patterns and the lessons we learned along the way.

Background

We at Tubi have a movie recommendation system powered by advanced ML models. For these ML models to perform effectively, we need high-quality feature data created through robust feature engineering. This feature data is generated offline and saved to an online feature store, for which we use Redis as the primary storage mechanism.

Access patterns of feature data

During online inference of ML models, online features need to be queried and then fed as inputs to the model. Different types of feature data have different use cases, and so the query patterns differ as well. We group related feature data together and call it a feature family. You can think of feature data as a column, and a feature family as a table.

  1. An entity feature family saves features of an entity, such as a user, program, or country. The expected query pattern is: given a set of entity keys, return the corresponding rows of feature data.
  2. A context feature family saves features of an entity under a context, e.g. a program watched by a user. The query pattern is: given N context keys, return N rows of feature data.
  3. A candidate feature family saves features of candidates, e.g. programs similar to the currently watched movie. The query pattern is: given an entity key, return any number of rows of data.

A summary of the three categories of patterns is listed below.

To support these feature family types and their corresponding query patterns while maintaining low latency, different Redis data structures are used.

  1. Each row of an entity or context feature family is saved to Redis as a basic key-value pair.
  2. Each row group of a candidate feature family that shares the same entity key is saved to Redis as a hash map.

Among the three types of queries, the first and third are easy, since each translates to a single Redis command. The second, however, is challenging.
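As a sketch, the three query patterns map onto Redis commands roughly like this. Helper names and key formats are illustrative; the large fan-out of the second pattern is what makes it challenging:

```typescript
// Sketch of how the three feature-family query patterns translate to Redis
// commands, per the storage layout above. Key formats are assumptions.
type RedisCommand = string[];

// 1. Entity feature family: one key-value row per entity key, fetched in a
//    single multi-key MGET.
function entityQuery(entityKeys: string[]): RedisCommand {
  return ["MGET", ...entityKeys];
}

// 2. Context feature family: N (entity, context) keys yield N rows. Shown
//    here as one multi-key lookup; the fan-out N can be very large.
function contextQuery(contextKeys: Array<[string, string]>): RedisCommand {
  return ["MGET", ...contextKeys.map(([entity, ctx]) => `${entity}:${ctx}`)];
}

// 3. Candidate feature family: all rows sharing an entity key live in one
//    hash map, so a single HGETALL returns every candidate row.
function candidateQuery(entityKey: string): RedisCommand {
  return ["HGETALL", entityKey];
}
```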

The Challenge: Handling Batch Query Access Patterns

For a single query to a context feature family, multiple rows need to be fetched. That is why it is called a batch query access pattern.

For example, if we want to make recommendations based on a user's watch history, then the context keys are (user_id, content_id). If a user has watched 1000 movies, that is 1000 rows to fetch from Redis. To serve recommendations for 1000 concurrent users, 1 million rows have to be looked up. This fan-out effect amplifies the load from the front end to the back end.

The Optimization Journey

Initially, our approach stored each feature as a Redis hash map and used a pipeline to fetch multiple rows simultaneously. Here is an example of how this approach works using the Lettuce library in Scala:

import java.util.concurrent.TimeUnit

import io.lettuce.core.{LettuceFutures, RedisURI}
import io.lettuce.core.cluster.RedisClusterClient
import io.lettuce.core.cluster.api.async.RedisClusterAsyncCommands

// Create a RedisClusterClient
val redisUri = RedisURI.create("redis://localhost:6379")
val clusterClient = RedisClusterClient.create(redisUri)

// Obtain a connection
val connection = clusterClient.connect()
val asyncCommands: RedisClusterAsyncCommands[String, String] = connection.async()
// List of keys to fetch
val keys = List("key1", "key2", "key3", "key4", "key5")
// Begin pipelining: buffer commands locally instead of flushing each one
asyncCommands.setAutoFlushCommands(false)
// Issue HGETALL commands for each key
val futures = keys.map(key => asyncCommands.hgetall(key))
// Flush the buffered commands to Redis in a single write
asyncCommands.flushCommands()
// Re-enable auto-flushing
asyncCommands.setAutoFlushCommands(true)
// Block until all responses have arrived (timeout is illustrative)
LettuceFutures.awaitAll(1, TimeUnit.SECONDS, futures: _*)

The P99 latency of such calls is around 20–30 ms. This was a good start, but our SLA target was under 10 ms.

Initial Optimization: Pipeline vs. MGET

We implemented our first optimization by changing the Redis data structure and retrieval strategy:

  • Each row was encoded into a binary format (using Avro) and stored as plain key-value pairs in Redis.
  • Instead of using Redis pipelines, we switched to MGET for fetching multiple rows at once.

This change brought the P99 latency down to 3–4 ms. However, as our use case evolved, the fan-out factor of our queries grew significantly. For instance, while our previous feature family device_program_real_time_feature_vp had a fan-out factor of 43, our new country_program_2_program_feature had a fan-out factor of 855—approximately 20 times more rows per query.
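A minimal sketch of the revised layout, using JSON as a stand-in for Avro and an in-memory Map as a stand-in for Redis (both are assumptions for illustration only):

```typescript
// Each row is serialized to a binary blob stored under a plain key, and a
// whole batch is fetched with one MGET-style round trip. JSON stands in for
// Avro, and a Map stands in for the Redis keyspace.

type Row = Record<string, number>;
const store = new Map<string, Uint8Array>(); // simulated Redis keyspace

function putRow(key: string, row: Row): void {
  store.set(key, new TextEncoder().encode(JSON.stringify(row)));
}

// One round trip for the whole batch, like MGET key1 key2 ... keyN.
function mget(keys: string[]): Array<Uint8Array | null> {
  return keys.map((k) => store.get(k) ?? null);
}

function decodeRow(blob: Uint8Array): Row {
  return JSON.parse(new TextDecoder().decode(blob));
}

putRow("user:1:movie:9", { watch_pct: 0.8 });
const blobs = mget(["user:1:movie:9"]).filter((b): b is Uint8Array => b !== null);
const decoded = blobs.map(decodeRow);
```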

Addressing the Latency Increase

The increased fan-out resulted in a significant latency increase, with P99 latency jumping to 15 ms for the new feature family, compared to 5 ms for the old one. Our target was to bring this down to under 10 ms.

Attempt 1: Virtual Partitioning

We suspected that the increased latency was due to the use of MGET on a single Redis shard. To address this, we introduced virtual partitions, splitting the 800 rows into 10 partitions of 80 rows each, and issued concurrent MGET commands to multiple Redis shards. Unfortunately, this approach did not yield the expected reduction in latency.
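The virtual-partitioning attempt can be sketched like this, assuming a client mget call is injected (names are illustrative):

```typescript
// Split a large key batch into K virtual partitions and issue the MGETs
// concurrently. `mgetFn` is an assumed client call (e.g. an ioredis-style
// mget); the split sizes mirror the 800-row / 10-partition example above.

function partition<T>(items: T[], parts: number): T[][] {
  if (items.length === 0) return [];
  const size = Math.ceil(items.length / parts);
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

async function partitionedMget(
  keys: string[],
  parts: number,
  mgetFn: (keys: string[]) => Promise<Array<string | null>>
): Promise<Array<string | null>> {
  // e.g. 800 keys in 10 partitions → ten concurrent MGETs of 80 keys each
  const chunks = partition(keys, parts);
  const results = await Promise.all(chunks.map(mgetFn));
  return results.flat();
}
```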

Breaking Down the Latency: Where Was the Bottleneck?

To identify the root cause, we broke down the latency into two metrics:

  1. Lettuce First Byte Latency: The time to receive the first byte from Redis.
  2. Lettuce Completion Latency: The time to complete the entire request.

Below is a detailed explanation of these latency metrics, quoted from the official Lettuce documentation.

First Response Latency
The first response latency measuring begins at the moment the command sending begins (command flush on the netty event loop). That is not the time at which the command was issued from the client API. The latency time recording ends at the moment the client receives the first command bytes and starts to process the command response. Both conditions must be met to end the latency recording. The client could be busy with processing the previous command while the first bytes are already available to read. That scenario would be a good time to file an issue for improving the client performance. The first response latency value is good to determine the lag/network performance and can give a hint on the client and server performance.
Completion Latency
The completion latency begins at the same time as the first response latency but lasts until the time where the client is just about to call the complete() method to signal command completion. That means all command response bytes arrived and were decoded/processed, and the response data structures are ready for consumption for the user of the client. On completion callback duration (such as async or observable callbacks) are not part of the completion latency.

We found that the Lettuce first-byte latency was quite good, significantly less than the total 15 ms latency. This led us to conclude that the bottleneck was not on the server side, but rather on the client side. Upon analyzing our client code, we discovered that after receiving the response, we needed to deserialize the byte array, and this deserialization process was sequential instead of parallel. We suspected that this sequential deserialization was contributing to the increased latency, especially as the fan-out factor increased.

To validate this, we added new metrics to monitor deserialization performance, which showed that the deserialization latency was up to 10 ms — validating our hypothesis.

Final Solution: Parallelizing Deserialization

After validating our hypothesis, we optimized our system by parallelizing the deserialization process, leveraging Scala's parallel collections.

if (rows.size > settings.parallelDecodeThreshold) {
  // Large batches: decode in parallel, then convert back to a sequential collection
  rows.par.map { row => deserializeAvroRow(schema, row.getValue, featureFamily, features) }.seq
} else {
  // Small batches: sequential decoding avoids the parallelization overhead
  rows.map { row => deserializeAvroRow(schema, row.getValue, featureFamily, features) }
}

This code change brought the P99 latency down from 15 ms to 10 ms — meeting our target.
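For readers more familiar with TypeScript, the threshold gating in the Scala snippet above can be sketched as follows. In Node, true CPU parallelism would require worker_threads; here the parallel mapper is injected so only the gating logic itself is shown:

```typescript
// Gate between sequential and parallel decoding on batch size, mirroring
// the `rows.par` threshold check above. `parallelMap` is a stand-in for a
// real parallel executor (e.g. a worker_threads pool); an assumption here.
function decodeBatch<R>(
  rows: Uint8Array[],
  decode: (row: Uint8Array) => R,
  threshold: number,
  parallelMap: (rows: Uint8Array[], f: (r: Uint8Array) => R) => R[]
): R[] {
  // Parallel decode only pays off once the batch is large enough
  return rows.length > threshold ? parallelMap(rows, decode) : rows.map(decode);
}
```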

Latency Before and After the Change

Key Takeaways

This optimization journey taught us several valuable lessons:

  • Do Not Blindly Optimize: It’s essential to break down metrics to identify the actual bottleneck before applying optimizations.
  • Think End-to-End: Effective optimization requires looking at both server-side and client-side components.
  • Hypothesis Validation Is Key: Before making changes, validate hypotheses by adding relevant metrics to understand where the actual problem lies.
  • Collaborate Actively: Engaging with experts — in our case, AWS experts — can provide insights that are crucial for performance tuning.

This process recalls a classic hypothesis-validation loop.

Conclusion

Optimizing Redis latency in our Online Feature Store was a journey filled with challenges and lessons. By being methodical — breaking down the problem, validating hypotheses, and working through optimizations — we were able to meet our performance targets and improve our system’s responsiveness. We hope this case study provides insights for others facing similar challenges in managing feature stores at scale.


Optimizing Latency in Redis Online Feature Store: A Case Study was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>
<![CDATA[Building Error Recovery Solution Improves Tubi OTT Player Stability]]> https://code.tubitv.com/building-error-recovery-solution-improves-tubi-ott-player-stability-ba10e91c12a1?source=rss----644fdce427d3---4 https://medium.com/p/ba10e91c12a1 Thu, 19 Dec 2024 23:44:37 GMT 2024-12-19T23:44:37.190Z

Seamless playback is critical to Tubi’s ability to deliver a streaming experience that keeps users coming back. When playback errors occur, quick recovery is essential, and equally important is accurate error reporting to help engineers identify and resolve issues. Over the past year, Tubi’s Web/OTT Team has refactored the player error reporting, tracking, and handling feature while continuously exploring and enriching recovery strategies for various playback errors. This has resulted in significant improvements in both business metrics and playback performance metrics.

Overview

During video playback, the player may encounter unexpected errors due to network fluctuations, hardware performance issues, or other factors, which can interrupt normal playback and affect the user experience. Before the refactoring, the error-handling logic of the player had the following pain points:

  • When playback errors occur, error information generated by different playback adapters is directly reported, resulting in overly complex playback error statistics on the data dashboard. This complicates issue analysis and makes it difficult to understand the meaning of the error information for those not familiar with the player’s internal logic.
  • The player’s error-handling logic is scattered across different business components, making it unclear and prone to errors, which hinders quick issue resolution.
  • For some frequent playback errors, there is a lack of appropriate recovery strategies.

Preparatory refactor

The diagram below illustrates the layers involved in the playback error messaging and handling:

  • Adapter Layer — To adapt to different OTT platforms, the underlying player adapter layer includes multiple playback adapters.
  • Controller Layer — The controller layer offers slots for various functions, such as playback session management, abnormal state tracking, and log reporting.
  • UI Layer — The UI layer handles the display logic for the playback error popup.
Architecture related to playback errors

To address the aforementioned pain points, we reviewed the existing logic and carried out a refactoring process focused on playback error reporting, defining reasonable playback error metrics, and unifying the handling of playback errors. This lays a solid foundation for future production data analysis and the exploration of error recovery strategies.

Playback error reporting

We redefined the playback error reporting logic based on the level of awareness of errors by the business layer and users, making data analysis more accurate and efficient:

  • If a playback error occurs but the playback adapter can recover playback through internal automatic retries, the error will not be reported. Instead, the number of retries will be logged internally.
  • If the playback adapter’s internal automatic retries cannot restore playback (i.e., the maximum number of internal retries is reached), the error is passed up to the controller layer. At this level, strategies such as automatic retries or switching to alternative video resources at the player instance level are used to attempt recovery. The controller layer may need to handle multiple errors during a single playback session. This layer needs to log the number of errors received and detailed error information throughout the playback session, reporting this data at the end of the session.
  • If the controller layer reaches the maximum number of error-handling attempts and still cannot restore normal playback, the UI layer will display an error popup to the user, and the current playback session will be recorded and reported as a playback failure.

Statistical metrics

We use a unified enumeration type to represent all detected player errors, making it straightforward to display and analyze the distribution of different errors on the data platform. This approach also helps us map existing recovery strategies to specific playback error types and identify which errors still lack recovery strategies.

At the same time, we established corresponding statistical metrics on the dashboard based on the impact severity of different errors.

  • Playback Error Rate — In the controller layer, we calculate the proportion of playback sessions that handle errors. These errors may be resolved through retries or other strategies, returning the session to normal playback. This metric helps us identify code logic issues and anomalies in playback-related modules that may not be easily detected through business metrics.
  • Playback Failure Rate — We calculate the proportion of playback sessions where the player’s UI layer receives errors and displays error pop-ups. These errors cannot be resolved through retries or other strategies and significantly impact the user’s viewing experience. This metric directly reflects the user’s actual viewing experience, and any anomalies detected should be promptly analyzed and addressed.
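A sketch of how these two rates could be derived from per-session records; the field names are illustrative, not our actual schema:

```typescript
// Derive the two dashboard metrics from playback session records.
// Field names are assumptions for this sketch.
interface SessionRecord {
  errorCount: number;      // errors the controller layer handled
  endedInFailure: boolean; // true if the UI layer showed the error popup
}

// Proportion of sessions in which the controller handled at least one
// error; many of these still recover and play normally.
function playbackErrorRate(sessions: SessionRecord[]): number {
  if (sessions.length === 0) return 0;
  return sessions.filter((s) => s.errorCount > 0).length / sessions.length;
}

// Proportion of sessions that could not be recovered and surfaced an
// error popup to the user.
function playbackFailureRate(sessions: SessionRecord[]): number {
  if (sessions.length === 0) return 0;
  return sessions.filter((s) => s.endedInFailure).length / sessions.length;
}
```

By construction the failure rate is bounded above by the error rate, since every failed session first passed through the controller's error handling.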

Unified playback error handling

We consolidate the existing playback error-handling logic scattered across different components into a single controller for centralized processing. For some existing error-handling logic with limited recovery effectiveness, adjustments, and refactoring are carried out through experimental validation. This unified and standardized approach makes the logic clearer and enhances code reusability.

Before and after the refactoring of playback error handling logic

The following pseudocode demonstrates the main process of playback error handling after the refactoring.

receivedError = (error: ErrorData) => {
  // 1. Record error information in the playback session
  PlaybackSession.getInstance().recordError(error);

  // 2. Apply different recovery strategies based on error types
  let recoverTimesReachLimit: boolean = false;
  switch (error.code) {
    case ErrorCode.DRM_ERROR:
      recoverTimesReachLimit = this.recoverDRMError(error);
      break;
    case ErrorCode.CODEC_ERROR:
      recoverTimesReachLimit = this.recoverCodecError(error);
      break;
    case ErrorCode.NETWORK_ERROR:
      recoverTimesReachLimit = this.recoverNetworkError(error);
      break;
    // ...
    default:
      recoverTimesReachLimit = true;
      break;
  }
  // Recovery operation executed, return
  if (!recoverTimesReachLimit) return;

  // 3. Recovery attempts have reached the maximum limit; the current playback session is marked as a playback failure
  PlaybackSession.getInstance().recordFailure(error);

  // 4. Recovery attempts have reached the maximum limit; an error popup is displayed
  this.showErrorModal(error);
};

Effective recovery strategies

Through extensive experimentation and exploration, we have identified and summarized several effective recovery strategies for Web/OTT playback errors. Employing targeted strategies for different types of playback errors can effectively help restore normal playback.

Device decoding error

Some OTT devices may encounter underlying errors during playback, such as decoder initialization errors or decoding errors, due to factors like decoding performance or DRM modules, which can disrupt the user’s viewing experience.

Different OTT devices have varying requirements for parameters such as video codec, maximum resolution, and DRM type. These parameters need to be determined based on device specifications and real-device testing to ensure the appropriate resource type is requested in the production environment. We collected playback statistics to decide which type of video resource to use. In the actual production environment, we can deliver multiple video resources with different specifications within the scope of compliance, providing alternative resources as backup options.

Experimental validation has shown that when such underlying decoding errors occur, employing strategies like switching to alternative video resources or reloading the current resource can significantly improve business metrics and playback startup success rates.

Network error

In poor network conditions, timeout errors may occur when requesting playlists or fragments. Effective recovery strategies include retrying network requests a certain number of times, switching to alternative CDN resources, or downgrading to lower-resolution video resources. Additionally, some playback adapters stop the internal playback process after reaching the retry limit. In such cases, it is necessary to display relevant network error messages to the user, guiding them to improve their network conditions.
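One way to sketch this escalating policy; the action names, thresholds, and ordering are illustrative assumptions, not Tubi's exact implementation:

```typescript
// Escalating recovery policy for network errors, following the order
// described above: retry a few times, then switch CDN, then downgrade
// bitrate, and finally surface a network error to the user.
type RecoveryAction =
  | "retry_request"
  | "switch_cdn"
  | "downgrade_bitrate"
  | "show_network_error";

function nextNetworkRecovery(attempt: number): RecoveryAction {
  if (attempt < 3) return "retry_request";       // retry the request first
  if (attempt === 3) return "switch_cdn";        // then try an alternative CDN
  if (attempt === 4) return "downgrade_bitrate"; // then a lower resolution
  return "show_network_error";                   // finally, inform the user
}
```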

Buffer hole error

During playback, the player may experience a stall or audio-video desync due to a gap (also known as a “hole”) in the cached data. There are various reasons for these cache gaps, such as missing or corrupted media fragments, discontinuities in the playlist timeline, or user seeking. Some Tubi Web/OTT platforms use the open-source hls.js library as their playback adapter. In experiments on playback error recovery for media error types, we found that using the recoverMediaError method provided by hls.js can significantly reduce cache gap issues during playback. The following diagram illustrates the internal implementation logic of the recoverMediaError method in hls.js:

  • First, the HTML5 video element is unbound from the hls.js playback instance. This operation triggers the removal of event listeners from the video element, clears downloaded fragment data and cached media data, and stops network download operations and related timers.
  • Next, the video element is re-bound to the hls.js playback instance. This operation triggers the reattachment of event listeners to the video element and restarts the timers to load media fragment data.
Internal logic of the hls.js recoverMediaError method

As seen from the above logic, the hls.js recoverMediaError method helps clear cached data and restart the data loading process, effectively eliminating cache gaps that existed before the method was called, thereby restoring normal playback. Experimental data also indicates the need to set a limit on the number of times we invoke the recoverMediaError method. Excessive calls can cause the player to repeatedly clear cached data, increasing the likelihood of rebuffering.
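A sketch of such a cap; the wrapper and limit value are assumptions, and in a real player the injected callback would invoke hls.recoverMediaError():

```typescript
// Cap the number of recoverMediaError invocations so repeated cache
// clearing does not cause excess rebuffering. `recover` stands in for
// a call to hls.recoverMediaError(); the cap value is illustrative.
function makeCappedRecovery(maxAttempts: number, recover: () => void) {
  let attempts = 0;
  // Returns true if a recovery was attempted, false once the cap is hit
  return function tryRecover(): boolean {
    if (attempts >= maxAttempts) return false;
    attempts += 1;
    recover();
    return true;
  };
}
```

Once tryRecover returns false, the error would be escalated instead, e.g. by marking the session as a failure and showing the error popup.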

Recoverable error

Before the refactoring, when users frequently encountered playback stalls, we would display an error pop-up prompting them to retry playback or exit the current session. One scenario that could cause playback stalls was network instability. In some cases, the player might resume normal playback once network conditions improve. We ran experiments investigating whether it might be better to delay showing the pop-up, allowing the player to potentially recover on its own. We found that doing so significantly improved business metrics, as showing an error pop-up can prompt a user to actively exit a playback session that could have recovered automatically.

Playback error pop-up

Error recovery experiment results

Over the past year, we have conducted experiments related to playback error recovery across different Web/OTT platforms and device models. Multiple strategies have achieved significant positive results in both business metrics and playback performance metrics.

  • Business metrics such as Total View Time, User Retention, Ad Impressions, and Ad Revenue showed significant positive improvements.
  • Playback performance metrics such as Content Startup Failure Rate and Playback Failure Rate were significantly optimized.
Comparison of Content Startup Failure Rate and Playback Failure Rate between the control group (blue line) and experimental group (orange line) in playback error recovery strategy experiment A

Experience

How can we efficiently identify and validate playback error recovery strategies that are more effective in improving business metrics? We have summarized some experiences on this topic.

  • First, we need to collect the occurrence probability of each type of playback error in the production environment. Establishing a correlation between playback error data and business metrics, we can assess the impact of each error. With this data as a foundation, we can focus on playback errors that occur frequently and significantly affect business metrics, analyze their root causes, and develop corresponding recovery strategies.
Compare the difference in play time between the occurrence of Error A and the end of the playback session to determine the impact of the error on business metrics
  • It is essential to understand the underlying logic and specific impacts of each type of playback error in the code. This knowledge helps determine whether the player can recover automatically when the error occurs and whether the error significantly affects user experience and business metrics. Such insights guide us in designing and deciding on appropriate recovery strategies. Additionally, analyzing user logs during playback errors can also provide valuable information to support this process.
  • Sometimes, we set up multiple experimental groups within a single experiment to test the effectiveness of different strategies. If the results are significantly positive, we implement the best-performing strategy. While other strategies that were not implemented may have yielded relatively smaller benefits in this experiment, they still provide valuable insights. We can analyze their performance metrics and consider combining them with the already deployed strategy to form composite strategies. These can then be tested in subsequent experiments to explore further improvements in business metrics.
  • Analyzing production data, we found that the occurrence probability of each type of playback error varies significantly across different OTT platforms and device models. This may be due to differences in DRM types, internal platform implementation logic, or hardware performance. Therefore, it is important to carefully select the appropriate platforms and device models to set up A/B experiments, ensuring efficient validation of strategy effectiveness.
Playback Failure Rate of error A across different OTT models

Summary

The refactor of playback error recovery enhances the accuracy of data tracking, streamlines data analysis, and enables engineers to pinpoint issues precisely through key metrics, providing strong support for strategy optimization and iteration. The effective recovery strategies for different playback errors ensure stable player performance and deliver a smoother playback experience for Tubi users.

We’re Hiring!

Tubi’s Web/OTT Team has refactored the player error reporting, statistical analysis, and handling functionalities. Additionally, through a series of experimental validations, the team has developed effective recovery strategies for different playback errors. We will continue to dive deep into production data and explore more ways to enhance the user viewing experience and business metrics.

If you are interested in high-impact projects that enhance the user experience, join us at Tubi Engineering! Check out Tubi careers here.


Building Error Recovery Solution Improves Tubi OTT Player Stability was originally published in Tubi Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

]]>