<?xml version="1.0" encoding="UTF-8"?>


<rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:atom="http://www.w3.org/2005/Atom"
     xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
     xmlns:media="http://search.yahoo.com/mrss/">




         

    <channel>
        <title>Dropbox Tech Blog</title>
        <atom:link  href="https://dropbox.tech/feed" rel="self" type="application/rss+xml" />
        <link>https://dropbox.tech</link>
        <description>Dropbox Tech Blog</description>
        <lastBuildDate>Thu, 02 Apr 2026 10:00:00 -0700</lastBuildDate>
        <language>en</language>
        <sy:updatePeriod>hourly</sy:updatePeriod>
        <sy:updateFrequency>1</sy:updateFrequency>

        <image>
            <url>https://dropbox.tech/cms/content/dropbox/tech-blog/en-us.thumb.319.319.png?ck=1774459002</url>
            <title>Dropbox Tech Blog</title>
            <link>https://dropbox.tech/cms/content/dropbox/tech-blog/en-us.thumb.319.319.png?ck=1774459002</link>
        </image>

            
        				<item>
                        <title>Improving storage efficiency in Magic Pocket, our immutable blob store</title>
                        
            			<link>https://dropbox.tech/infrastructure/improving-storage-efficiency-in-magic-pocket-our-immutable-blob-store</link>

                            
            			<dc:creator>
                            Facundo Agriel
            			</dc:creator>
            			
            				<category>Magic Pocket</category>
							
            				<category>storage</category>
							
            				<category>Infrastructure</category>
							
                            
            			<description><![CDATA[By turning compaction into a layered, adaptive pipeline and strengthening our monitoring and controls, we made Magic Pocket more resilient to workload changes.]]></description>
            			<guid>https://dropbox.tech/infrastructure/improving-storage-efficiency-in-magic-pocket-our-immutable-blob-store</guid>
                        <pubDate>Thu, 02 Apr 2026 10:00:00 -0700</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p><a href="https://dropbox.tech/infrastructure/how-we-optimized-magic-pocket-for-cold-storage" target="_blank"><b>Magic Pocket</b></a> is the core Dropbox storage system—a custom-built, exabyte-scale blob storage system designed for durability, availability, scale, and efficiency. It holds user content, which means it must be safe, fast, and cost-effective to scale with the company. For Dropbox, storage efficiency really matters. We measure it by looking at how much total disk space we use compared to how much user data we’re actually storing.</p>
<p>Last year, we rolled out a new service that changed how <a href="https://dropbox.tech/infrastructure/increasing-magic-pocket-write-throughput-by-removing-our-ssd-cache-disks" target="_blank">data is placed across Magic Pocket</a>. The change reduced write amplification for background writes, so each write triggered fewer backend storage operations. But it also had an unintended side effect: fragmentation increased, pushing storage overhead higher. Most of that growth came from a small number of severely under-filled volumes that consumed a disproportionate share of raw capacity, and our existing compaction strategy couldn’t reclaim the space quickly enough. At exabyte scale, even modest increases in overhead translate into meaningful infrastructure and capacity costs, so bringing that number back down quickly became a priority.</p>
<p>In this post, we’ll walk through why overhead is particularly hard to control in an immutable blob store, how compaction works in Magic Pocket, and the multi-strategy approach we rolled out to drive overhead back down, even below our previous baseline.</p>

</div>
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-05f831a228" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="the-cost-of-immutability">
    <h2 class="dr-article-content__section-title">The cost of immutability</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>When users upload files to Dropbox, Magic Pocket breaks those files into smaller pieces called blobs and stores them across its storage fleet. A <b>blob</b> is simply a chunk of binary data—part or all of a user file—written to disk. Magic Pocket is an <a href="https://dropbox.tech/infrastructure/inside-the-magic-pocket" target="_blank">immutable blob store</a>, which means that once a blob is written, it is never modified in place. If a file is updated or deleted, new data is written and the old data remains until it is reclaimed by a compaction process.</p>
<p>At Dropbox scale, Magic Pocket stores trillions of blobs and processes millions of deletes each day. (A <b>delete</b> is a request to remove a blob when a file is deleted or updated.) Because data is immutable, deletes do not immediately free up disk space. Old data stays on-disk inside storage volumes. Once a volume is closed, it is never reopened. The tradeoff is that deletes leave unused space behind, and that waste grows over time unless we actively reclaim it.</p>
<p>Without reclamation, volumes gradually become partially filled, spreading live data across more disks than necessary. Left unchecked, this fragmentation significantly inflates storage overhead.</p>
<p>We address this in two steps. <b>Garbage collection</b> identifies blobs that are no longer referenced and marks them as safe to remove, but it does not free space on its own. <b>Compaction</b> performs the physical reclamation. Because volumes cannot be modified once closed, we gather the live blobs from volumes, write them into new volumes, and retire the old ones. This is how deletes eventually translate into reusable space.</p>

</div>
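<p>As a toy illustration of these two steps (all names are illustrative, not Magic Pocket's actual code): garbage collection only decides which blobs are dead, while compaction does the physical work of copying survivors into a fresh volume so the donors can be retired.</p>

```python
# Toy model of garbage collection + compaction in an immutable store.
# Names and data shapes are illustrative, not Dropbox's actual code.

def garbage_collect(volume, live_refs):
    """Keep only blobs still referenced; this frees no disk space by itself."""
    return [(blob_id, data) for blob_id, data in volume if blob_id in live_refs]

def compact(volumes, live_refs):
    """Copy live blobs from donor volumes into one new volume.
    The drained donors can then be retired and reused."""
    new_volume = []
    for vol in volumes:
        new_volume.extend(garbage_collect(vol, live_refs))
    return new_volume

# Two fragmented volumes, each holding a mix of live and deleted blobs.
vol1 = [("a", b"x"), ("b", b"y")]
vol2 = [("c", b"z"), ("d", b"w")]
live = {"a", "c"}  # "b" and "d" were deleted by users

merged = compact([vol1, vol2], live)
print([blob_id for blob_id, _ in merged])  # ['a', 'c']
```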
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        


        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/magic-pocket-compaction/diagram/Diagram%201.png/_jcr_content/renditions/Diagram%201.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/magic-pocket-compaction/diagram/Diagram%201.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="138034ce-a84d-4a6c-9dd5-946f734dd0df:Diagram 1.png" data-trackable="true" height="2080" width="2880"/>
    

            <figcaption class="dr-typography-t5 dr-color-ink-60 dr-image-rte"><p style="text-align: center;">The compaction lifecycle of volume 1, which is compacted along with a donor volume (volume 2). Volume 1 is then eligible for reuse.</p>
</figcaption>
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Compaction controls the waste created by deletes. But fragmentation isn’t the only factor that affects storage overhead—durability does too. To protect against hardware failures, we store data redundantly either as full copies or as encoded fragments distributed across different machines, so data can be recovered after disk or server failures. One approach is replication, which keeps multiple full copies of each blob and increases storage use proportionally. In Magic Pocket, we use <b>erasure coding</b> for nearly all data. Erasure coding splits data into fragments and adds a small number of parity fragments (extra pieces that let us reconstruct the original data if part of it is lost). It provides the same level of fault tolerance as replication, but with significantly less additional storage.</p>
<p>Redundancy affects overhead, but fragmentation determines how efficiently that space is used. A useful way to think about this is to ask what percentage of a volume contains active data. If a volume is half full of live data, we are effectively using twice the storage needed for that data. If only ten percent is live, we are using about ten times the space required. Without continuous compaction, disk capacity would eventually be exhausted even if the data redundancy scheme—how we store extra copies or fragments to protect against failures—never changed. Keeping storage overhead low in an immutable system therefore requires both efficient redundancy and constant consolidation of fragmented space.</p>

</div>
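<p>To make the arithmetic concrete, here is a small sketch using a hypothetical 8+4 erasure coding scheme (these are not Magic Pocket's actual parameters). Redundancy sets the baseline raw bytes per logical byte, and a volume's fill fraction multiplies that baseline:</p>

```python
# Illustrative storage overhead arithmetic. The 8+4 scheme and fill
# fractions below are hypothetical, not Magic Pocket's real parameters.

def redundancy_factor(data_frags, parity_frags):
    """Raw bytes written per logical byte under erasure coding."""
    return (data_frags + parity_frags) / data_frags

def effective_overhead(data_frags, parity_frags, fill_fraction):
    """Raw bytes consumed per live byte, accounting for fragmentation."""
    return redundancy_factor(data_frags, parity_frags) / fill_fraction

# 3x replication vs. a hypothetical 8-data + 4-parity erasure coding scheme:
print(redundancy_factor(1, 2))  # replication: 3.0
print(redundancy_factor(8, 4))  # erasure coding: 1.5

# The same scheme on a half-full volume doubles the effective overhead:
print(effective_overhead(8, 4, 0.5))         # 3.0
# ...and a 10%-full volume costs roughly ten times a full one:
print(round(effective_overhead(8, 4, 0.1)))  # 15
```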
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="the-incident-that-forced-a-rethink">
    <h2 class="dr-article-content__section-title">The incident that forced a rethink</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Earlier this year, we uncovered an issue with a new service that performs on-the-fly erasure coding, which we’ll refer to as the Live Coder service. It rolled out gradually over several months to new regions. The problem, which went unnoticed for weeks, was that volumes created through this path were severely under-filled. In the worst cases, less than five percent of their allocated capacity contained live data.</p>
<p>In practical terms, that meant live data was spread across far more volumes than intended. Instead of densely packing blobs together, we were creating many mostly empty volumes. Because volumes are fixed in size, each under-filled volume consumed the same disk allocation as a full one. The result was a sharp increase in fragmentation and a corresponding rise in storage overhead.</p>
<p>We saw early signs that this was impacting our effective replication factor, a signal that more raw storage was being consumed per live byte than expected. But identifying the root cause required significant investigation. Once we understood what was happening, we also needed to design recovery mechanisms capable of bringing overhead back down efficiently. The existing compaction strategy continued to make progress, but it was not designed to handle a long tail of severely under-filled volumes at this scale.</p>
<p>This incident exposed a limitation in our steady-state approach. It forced us to rethink how compaction should work when the distribution of live data shifts, and to develop new strategies capable of reclaiming space faster and more effectively.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="what-steady-state-compaction-looks-like">
    <h2 class="dr-article-content__section-title">What steady-state compaction looks like</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>In normal operation, before the incident, the distribution of data across volumes was relatively stable. Most volumes were already highly filled, and deletes accumulated gradually. In that steady state, compaction’s job was to continuously consolidate small amounts of fragmentation and keep storage overhead bounded.</p>
<p>For years, our baseline compaction strategy, which we call L1, worked well in this environment. It treats compaction as a packing problem: move live data from one or more partially filled donor volumes into a host volume that has enough free space. Over time, as donor volumes are drained of their live data, they become empty and can be removed.</p>
<p>L1 selects a host volume that is already highly filled, then chooses donor volumes whose live bytes fit into the host’s available space, and finally, writes them into a new volume. The selection logic is simple and fast, and it keeps placement risk and metadata updates bounded. However, each compaction run is relatively expensive. It may read tens of GiB across the host and donors but typically produces only a single new densely packed volume. On average, fewer than one full volume is reclaimed per run, since only donors are fully drained.</p>
<p>This approach works well when most volumes are close to full. But the incident changed that distribution. We saw overhead concentrated in a long tail of severely under-filled volumes. L1 continued to make progress, but it could not compact those volumes quickly enough. Its core assumption, that most volumes are highly filled, no longer held. To address this, we introduced two new compaction strategies, L2 and L3, each designed to handle different parts of the volume fill distribution.</p>

</div>
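<p>A rough sketch of the L1 selection logic described above, under assumed data structures and thresholds (the production planner is more involved): pick an already dense host, then greedily choose donors whose live bytes fit in the host's remaining space.</p>

```python
# Sketch of an L1-style packing step: top off a dense host volume with live
# bytes from sparse donors. Data shapes and thresholds are hypothetical.

def plan_l1_run(volumes, capacity, host_min_fill=0.8):
    """Return (host, donors) for one compaction run, or None if no host."""
    hosts = [v for v in volumes if v["live"] / capacity >= host_min_fill]
    if not hosts:
        return None
    host = max(hosts, key=lambda v: v["live"])
    free = capacity - host["live"]
    donors = []
    # Greedily drain the sparsest volumes that still fit in the host's free space.
    for v in sorted(volumes, key=lambda v: v["live"]):
        if v is host:
            continue
        if v["live"] <= free:
            donors.append(v)
            free -= v["live"]
    return host, donors

volumes = [
    {"id": 1, "live": 90},  # dense: candidate host
    {"id": 2, "live": 4},   # sparse donors
    {"id": 3, "live": 5},
    {"id": 4, "live": 60},  # too big to fit in the host's free space
]
host, donors = plan_l1_run(volumes, capacity=100)
print(host["id"], [d["id"] for d in donors])  # 1 [2, 3]
```

Note the asymmetry the post describes: only the donors (volumes 2 and 3) are fully drained, so a run this shape reclaims fewer than one full volume's worth of raw space on average.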
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="a-better-way-to-reclaim-space">
    <h2 class="dr-article-content__section-title">A better way to reclaim space</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>When the distribution shifted, the limitations of L1 became clear. It was designed to top off already dense volumes, not to quickly reclaim a large population of severely under-filled ones. We needed a strategy that could reclaim space faster by combining multiple sparse volumes into a single near-full destination. </p>
<p>When we examined the distribution of live data across volumes, we found that most of the wasted space was concentrated in a distinct subset of volumes that were less than half full. L1 wasn’t designed for this pattern; it topped off already dense volumes rather than aggressively consolidating sparse ones. Instead of incrementally packing donors into a host, L2 groups under-filled volumes together and selects combinations whose live data can nearly fill a new destination volume. Reclaiming several sparse volumes at once allows the system to recover space far more quickly.</p>

</div>
<div class="dr-code-container aem-GridColumn aem-GridColumn--default--12">




<div class="dr-code-container--title"></div>
<div class="dr-code-container-inner">

    <button class="dr-code-container__copy-button dr-button dr-typography-t17">
        Copy
    </button>
    <pre class="dr-code-container__pre"><code class="dr-code-container__code dr-typography-t5 ">Inputs:
  volumes[] with LiveBytes
  maxVolBytes (destination volume capacity)
  maxVolumesToUse (count cap)
  granularity (scaling factor)
1) Scale live bytes and capacity by granularity to shrink the DP table.
2) DP over (i = volume index, k = count, c = capacity), keeping max packed bytes.
3) Track choices in a parallel "choice" table for reconstruction.
4) Backtrack from the best (k, capacity) to recover the selected volumes.</code></pre>


</div>
<div class="dr-code-container-rte"></div>
</div>
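<p>The packing steps above can be sketched as a small Python program. This is simplified relative to the production planner: it keeps one representative packing per DP state instead of backtracking through a separate choice table, and all names are illustrative.</p>

```python
# Simplified L2 planner: choose up to max_volumes volumes whose combined
# live bytes come as close to the destination capacity as possible without
# exceeding it. A bounded-knapsack DP sketch with illustrative names.

def plan_l2_run(live_bytes, capacity, max_volumes, granularity=1):
    # 1) Coarsen byte counts to shrink the DP state space. Rounding live
    #    bytes *up* guarantees the coarsened plan never overpacks.
    scaled = [-(-b // granularity) for b in live_bytes]
    cap = capacity // granularity

    # 2) DP over (count, used capacity). Each state keeps one representative
    #    packing, standing in for the pseudocode's choice-table backtrack.
    states = {(0, 0): ()}
    for i, b in enumerate(scaled):
        for (k, c), chosen in list(states.items()):  # snapshot: use i at most once
            key = (k + 1, c + b)
            if k < max_volumes and c + b <= cap and key not in states:
                states[key] = chosen + (i,)

    # 3) The best packing is the reachable state with the most used capacity.
    (_, c_best), chosen = max(states.items(), key=lambda kv: kv[0][1])
    return list(chosen), c_best * granularity

# Five under-filled volumes (live bytes) packed into a 100-unit destination:
chosen, packed = plan_l2_run([40, 35, 30, 20, 10], capacity=100, max_volumes=3)
print(chosen, packed)  # [0, 1, 3] 95
```

Raising `granularity` shrinks the state space at the cost of slightly looser packings, which mirrors the compute/quality tradeoff described below.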
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Under the hood, L2 is a bounded packing problem solved with dynamic programming. For each run, we choose a limited set of volumes whose combined live bytes come as close as possible to the destination capacity without exceeding it. To keep this practical at production scale, we cap how many source volumes can be used in one run and coarsen byte counts to reduce the search space. This keeps compute and memory bounded while still producing tight packings. </p>
<p>In practice, we tuned granularity, batch size, and planner concurrency to balance packing quality against compute and memory cost. Those settings allowed L2 to run efficiently in production while still producing tight packings.</p>
<p>In testing, the results were strong. With data shaped to resemble production distributions, L2 consistently produced near-full volumes. In production, it reduced compaction overhead two to three times faster than L1. In cells where L2 was enabled, overhead returned to sustainable levels within days, and over the course of a week, compaction overhead was thirty to fifty percent lower compared to cells running L1 alone. (Cells, in this instance, refer to independent units of the storage system that manage their own data.)</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="cleaning-up-the-sparsest-volumes">
    <h2 class="dr-article-content__section-title">Cleaning up the sparsest volumes</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>L2 was effective at targeting the middle of the distribution, where volumes were under-filled but still dense enough to combine efficiently. But it was less effective at quickly reclaiming the sparsest volumes, or those with only a small fraction of live data remaining. These volumes formed the extreme tail of the distribution and required a different approach.</p>
<p>While iterating on compaction strategies, we returned to the Live Coder service. Its original purpose was to write data directly into erasure-coded volumes, bypassing the initial replicated write path. Although it isn’t ideal for latency-sensitive traffic, it’s well suited for background workflows where throughput matters more than immediacy.</p>
<p>Compaction is, in effect, a constrained form of re-encoding: take live data from one set of volumes and produce a new, durable volume. L3 builds on that idea by using Live Coder as a streaming pipeline. Instead of packing volumes together in a bounded batch, L3 continuously feeds the remaining live blobs from severely under-filled volumes into Live Coder and allows it to accumulate and encode them into new volumes over time. Once a source volume’s live data has been drained, it can be reclaimed immediately.</p>
<p>This strategy focuses on volumes that aren’t good candidates for L1 or L2. Under-filled volumes occur naturally as donors are partially drained, and they can accumulate quickly during failure modes like the incident described earlier. By prioritizing the sparsest volumes first, L3 minimizes the amount of data that needs to be rewritten per reclaimed volume and accelerates recovery of fragmented space.</p>
<p>L3 does introduce tradeoffs. Because it writes live data into entirely new volumes, every blob it moves has to be rewritten, which means new identifiers and additional metadata updates. That extra bookkeeping creates load on storage and <a href="https://dropbox.tech/infrastructure/panda-metadata-stack-petabyte-scale-transactional-key-value-store" target="_blank">metadata systems</a>. With the number of under-filled volumes we observe in steady state, that additional load is tolerable, and rate limits are in place to prevent overwhelming those systems.</p>

</div>
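<p>A minimal sketch of the sparsest-first draining idea, with a hypothetical stand-in for the Live Coder pipeline (blob counts stand in for live bytes; none of these names are Dropbox's actual code):</p>

```python
# Sketch of L3-style draining: feed live blobs from the sparsest volumes
# into a streaming encoder that emits new, densely packed volumes.
# Volume shapes and the "encoder" here are hypothetical stand-ins.

def drain_sparsest(volumes, capacity, new_volume_size):
    """Drain volumes in order of live fraction; return the new volumes and
    the ids of source volumes that are now reclaimable."""
    new_volumes, current, reclaimed = [], [], []
    # Sparsest first: the least data rewritten per reclaimed volume.
    for vol in sorted(volumes, key=lambda v: len(v["blobs"]) / capacity):
        for blob in vol["blobs"]:
            current.append(blob)           # stream blob into the encoder
            if len(current) == new_volume_size:
                new_volumes.append(current)  # encoder seals a full volume
                current = []
        reclaimed.append(vol["id"])        # source fully drained: reclaim now
    if current:
        new_volumes.append(current)
    return new_volumes, reclaimed

volumes = [
    {"id": "v1", "blobs": ["a", "b", "c"]},  # 30% full
    {"id": "v2", "blobs": ["d"]},            # 10% full: drained first
]
new_vols, reclaimed = drain_sparsest(volumes, capacity=10, new_volume_size=4)
print(reclaimed)  # ['v2', 'v1']
print(new_vols)   # [['d', 'a', 'b', 'c']]
```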
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="operational-tuning-and-safeguards">
    <h2 class="dr-article-content__section-title">Operational tuning and safeguards</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>To prevent compaction from competing with user traffic, we rate-limit the pipeline and keep traffic local to each cell rather than sending it across data centers. Together, L1, L2, and L3 form a layered strategy: L1 maintains steady state, L2 consolidates moderately under-filled volumes, and L3 drains the sparsest tail, thereby reclaiming space quickly without destabilizing the fleet.</p>
<p>Rolling out L2 and L3 wasn’t just about improving packing efficiency. We also had to ensure the system could absorb the additional work without creating new bottlenecks. Compaction touches storage, compute, metadata systems, and network bandwidth, so increasing its aggressiveness requires careful controls.</p>
<p>One of the most sensitive levers is the host eligibility threshold, which determines when a volume qualifies for compaction. If the threshold is too high, too few volumes are eligible and overhead rises. If it’s too low, we spend compute and I/O reclaiming very little space. We replaced static tuning with a dynamic control loop that adjusts the threshold based on fleet signals. When overhead rises, the system raises the threshold to prioritize higher-yield compactions. When overhead stabilizes, it lowers the threshold to stay responsive to deletes without over-compacting.</p>
<p>Candidate ordering is another important tuning lever. Choosing which volumes to compact first can speed up space reclamation, but it can also increase metadata work because more blobs may need to be rewritten. We tailor the ordering to each strategy. L1 stays conservative and limits how many donor volumes it touches to keep placement risk and metadata load low. L2 benefits from more aggressive grouping because denser packings reclaim more space per compaction run. L3 focuses on the sparsest volumes first, since draining them typically requires rewriting relatively little data per volume.</p>
<p>The final step was enabling L1, L2, and L3 to run concurrently without interfering with one another. Each strategy targets a different part of the volume distribution: L1 maintains steady state among highly filled volumes, L2 consolidates moderately under-filled volumes into dense destinations, and L3 drains the sparsest volumes. We enforce clear eligibility boundaries between strategies and rate-limit each path to protect downstream services. We also constrain traffic locality so compaction remains within a cell and avoids stressing cross-cluster bandwidth.</p>
<p>Together, these safeguards allow the system to adapt to workload shifts while keeping metadata pressure, network traffic, and compute utilization within safe limits.</p>

</div>
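<p>The threshold control loop described above can be sketched as follows; the step size, bounds, and trend signal are invented for illustration and are not Dropbox's actual tuning:</p>

```python
# Sketch of a dynamic control loop for the host eligibility threshold.
# The gain (step) and bounds are made-up numbers for illustration.

def adjust_threshold(threshold, overhead_trend, step=0.01, lo=0.05, hi=0.50):
    """Nudge the eligibility threshold based on a fleet overhead signal.
    overhead_trend > 0 means overhead is rising week over week."""
    if overhead_trend > 0:
        # Rising overhead: raise the bar to prioritize higher-yield compactions.
        return min(hi, threshold + step)
    # Stable or falling: relax to stay responsive to deletes without over-compacting.
    return max(lo, threshold - step)

t = 0.20
for trend in (1, 1, -1):  # two periods of rising overhead, then stable
    t = adjust_threshold(t, trend)
print(round(t, 2))  # 0.21
```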
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="what-we-learned">
    <h2 class="dr-article-content__section-title">What we learned</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>This project reinforced that compaction can’t rely on a single heuristic. L1 worked well in steady state because most volumes were already close to full, and only a small number were partially filled at any given time. When that distribution shifted and a large group of very sparsely filled volumes accumulated, L1 couldn’t recover overhead quickly enough. Splitting the problem across multiple strategies gave us coverage across the full range of volume fill levels: L1 maintains steady state for mostly full volumes, L2 consolidates moderately under-filled volumes, and L3 focuses on the sparsest volumes.</p>
<p>We also learned that manual tuning doesn’t scale. The host eligibility threshold is too sensitive to manage by hand, especially at exabyte scale. Moving to a dynamic control loop tied to fleet signals made overhead more stable and reduced the need for constant intervention. Candidate ordering and rate limits must also be tuned with awareness of downstream systems, particularly metadata services.</p>
<p>Operationally, metadata capacity turned out to be one of our biggest constraints. Not every compaction move has the same metadata cost. In L1 and L2, many blobs can stay under the same volume identity, so only donor blobs need location rewrites. In L3, blobs are written into brand-new volumes, so most blobs need new location entries. So it wasn’t enough to pack volumes efficiently; we also had to control how much rewriting we triggered. By limiting how much work L2 does in a single run, routing the sparsest volumes through L3, and keeping traffic local to each cell, we were able to reclaim space without overwhelming our metadata, storage, or network systems.</p>
<p>Finally, this work showed us that we needed better visibility into how compaction was performing. We added metrics to track how much data Live Coder is producing, how full volumes are across the fleet, and how storage overhead changes week over week. We also put monitoring in place to warn us early if compaction starts to fall behind. The goal is to catch shifts in how data is distributed before overhead rises too far so that we can respond proactively instead of scrambling to recover later.</p>
<p>Storage overhead directly determines how much raw capacity we need in order to store the same amount of live user data. Even small changes in overhead materially affect hardware purchases and fleet growth. By turning compaction into a layered, adaptive pipeline and strengthening our monitoring and controls, we made Magic Pocket more resilient to workload changes and better positioned to keep storage growth predictable over time.</p>
<p><i>Acknowledgments: Tommy Dean (contributions to L2 strategy) and Lisa Kosiachenko (contributions on automation)</i></p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/march/magic-pocket-compaction/headers/Infrastructure-Optimizing compaction in Magic Pocket-1440x305-light.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/march/magic-pocket-compaction/headers/Infrastructure-Optimizing compaction in Magic Pocket-1440x305-light.png" medium="image">
                    <media:title type="html">Improving storage efficiency in Magic Pocket, our immutable blob store</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>Reducing our monorepo size to improve developer velocity</title>
                        
            			<link>https://dropbox.tech/infrastructure/reducing-our-monorepo-size-to-improve-developer-velocity</link>

                            
            			<dc:creator>
                            Facundo Agriel, Ishan Mishra
            			</dc:creator>
            			
            				<category>developer velocity</category>
							
            				<category>AI</category>
							
            				<category>Developers</category>
							
            				<category>Infrastructure</category>
							
            				<category>storage</category>
							
                            
            			<description><![CDATA[Monorepos will continue to grow as products evolve, but growth doesn’t have to mean friction.]]></description>
            			<guid>https://dropbox.tech/infrastructure/reducing-our-monorepo-size-to-improve-developer-velocity</guid>
                        <pubDate>Wed, 25 Mar 2026 10:00:00 -0700</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>At Dropbox, almost every product change flows through a single place: our server monorepo. A monorepo is a single, shared Git repository that contains many services and libraries used across the company. Instead of splitting code across dozens of smaller repositories, we keep a large portion of our backend infrastructure in one place. That architecture makes cross-service development easier, but it also means the repository sits at the center of nearly everything we build. </p>
<p>Building <a href="https://dropbox.tech/machine-learning/building-dash-rag-multi-step-ai-agents-business-users" target="_blank">AI-powered features</a> at Dropbox often requires small changes across ranking systems, retrieval pipelines, <a href="https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash" target="_blank">evaluation logic</a>, and UI surfaces. All of that work moves through the same engineering loop: pull the latest code, build and test it, get it reviewed, merge it, and ship it. Over time, we began to notice that this loop was getting slower. Our monorepo had grown to 87GB; downloading a full copy of the codebase (or “cloning” the repository) took more than an hour, and many continuous integration (CI) jobs were repeatedly paying that cost. We were also approaching GitHub’s 100GB repository size limit, which introduced real operational risk.</p>
<p>In this post, we’ll share how we reduced the repository from 87GB to 20GB (a 77% reduction), cutting the time required to clone the repository to under 15 minutes. We’ll also explain what was driving the growth and what we learned about maintaining a large monorepo at scale.</p>

</div>
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-97d860d01d" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="when-repository-size-becomes-a-real-problem">
    <h2 class="dr-article-content__section-title">When repository size becomes a real problem</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>To understand why repository size matters, it helps to look at how engineers actually work. The first time someone sets up their development environment, they clone the repository, meaning they download a full copy of the codebase and its history to their machine. After that initial setup, daily work is less intensive. Engineers fetch and pull incremental updates rather than redownloading everything. But that first clone is unavoidable, and when the repository reached 87GB, it regularly took more than an hour.</p>
<p>That cost didn’t just affect onboarding. Many continuous integration jobs—automated build and test workflows that run on every code change—begin from a fresh clone. That meant our CI pipelines were repeatedly incurring the same overhead. Internal systems that synchronize the repository were also handling significantly more data than before, which increased the likelihood of timeouts and degraded performance.</p>
<p>At the same time, the repository was growing steadily, typically by 20 to 60MB per day, with occasional spikes above 150MB. At that rate, we were on track to hit the GitHub Enterprise Cloud (GHEC) 100GB repository size hard limit within months. The issue wasn’t simply that we had a large codebase. The growth rate itself didn’t match what we would expect from normal development activity, even at Dropbox’s scale. That suggested the problem wasn’t just what we were storing, but how it was being stored.</p>
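<p>A back-of-the-envelope projection, using only the figures quoted above and purely for illustration, shows how little headroom remained:</p>

```python
size_gb, limit_gb = 87, 100
headroom_mb = (limit_gb - size_gb) * 1024  # ~13GB of room below the hard limit

for rate_mb_per_day in (20, 60):  # the typical daily growth range
    months = headroom_mb / rate_mb_per_day / 30
    print(f"{rate_mb_per_day} MB/day -> ~{months:.0f} months of headroom")
```

<p>Even at the low end of the range, and before accounting for 150MB spike days, the limit was a matter of months, not years.</p>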

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="when-compression-backfires">
    <h2 class="dr-article-content__section-title">When compression backfires</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>At first, we looked for the usual causes of repository bloat: large binaries, accidentally committed dependencies, or generated files that didn’t belong in version control. None of those explained what we were seeing. The growth pattern pointed somewhere less obvious: Git’s delta compression.</p>
<p>Git doesn’t store every version of every file as a complete copy. Instead, it tries to save space by storing the differences between similar files. When multiple versions of a file exist, Git keeps one full version and represents the others as deltas, or “diffs,” against it. In most repositories, this works extremely well and keeps storage efficient.</p>
<p>The issue was how Git decides which files are similar enough to compare. By default, it uses a heuristic based on only the last 16 characters of the file path when pairing files for delta compression. In many codebases, that’s good enough. Files with similar names often contain related content. Our internationalization (i18n) files, however, followed this structure:</p>
<p><span class="dr-code" title="undefined"> i18n/metaserver/[language]/LC_MESSAGES/[filename].po</span></p>
<p>The language code appears earlier in the path, not in the final 16 characters. As a result, Git was often computing deltas between files in different languages instead of within the same language. A small update to one translation file might be compared against an unrelated file in another language. Instead of producing a compact delta, Git generated a much larger one.</p>
<p>Routine translation updates were therefore creating disproportionately large pack files. Nothing about the content was unusual. The problem was the interaction between our directory structure and Git’s compression heuristic. Once we understood that mismatch, the rapid growth of the repository finally made sense.</p>
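<p>To make the mismatch concrete, compare the trailing 16 characters of a few translation-file paths. This is a deliberate simplification of Git’s name-hash heuristic (which skips and weights certain characters), and the language codes and filename below are hypothetical, but the collision it demonstrates is the real one:</p>

```python
def delta_group_key(path: str) -> str:
    """Simplified stand-in for Git's default heuristic, which keys
    delta candidates on roughly the last 16 characters of the path."""
    return path[-16:]

# Hypothetical paths following the i18n structure from the post
paths = [
    "i18n/metaserver/de/LC_MESSAGES/activity.po",
    "i18n/metaserver/ja/LC_MESSAGES/activity.po",
    "i18n/metaserver/pt_BR/LC_MESSAGES/activity.po",
]

# The language segment sits outside the 16-character window, so every
# path collapses to the same key and files from *different* languages
# become delta candidates for one another.
print({delta_group_key(p) for p in paths})  # a single suffix for all three
```

<p>Because the suffix <span class="dr-code" title="undefined">AGES/activity.po</span> is identical across languages, Git sees these unrelated files as ideal delta partners.</p>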
<h3>Testing a fix locally</h3>
<p>Once we suspected that delta pairing was the root cause, we looked for ways to influence how Git grouped files during compression. We found an experimental flag called <span class="dr-code" title="undefined">--path-walk</span> that changes how Git selects candidates for delta comparison. Instead of relying on the last 16 characters of a path, it walks the full directory structure, which keeps related files closer together.</p>
<p>We ran a local repack—essentially asking Git to reorganize and recompress the objects in the repository—using this flag. The results were immediate. The repository shrank from the low-80GB range to the low-20GB range. That confirmed our hypothesis: the issue wasn’t the volume of data, but how it was being packed.</p>
<p>However, that success exposed a new constraint. GitHub told us that <span class="dr-code" title="undefined">--path-walk</span> was not compatible with certain server-side optimizations they rely on, including features like bitmaps and delta islands that make cloning and fetching fast. Even though the fix worked locally, it wouldn’t work in production.</p>
<p>We needed a solution that achieved the same size reduction while remaining compatible with GitHub’s infrastructure. That meant working within the parameters GitHub could safely support, rather than relying on an experimental client-side flag.</p>
<h3>Why we couldn't do this alone</h3>
<p>Our local experiments proved that better packing could dramatically reduce the repository size. But there was a critical limitation: you can’t repack a repository locally, push it to GitHub, and expect those improvements to persist.</p>
<p>GitHub constructs transfer packs dynamically on the server based on what each client is missing. That means the server’s own packing strategy determines clone and fetch sizes. Even if a local mirror is perfectly optimized, GitHub will rebuild the pack during transfer using its own configuration. To permanently reduce repository size and improve performance, the repack had to be executed on GitHub’s servers.</p>

</div>
<div class="dr-code-container aem-GridColumn aem-GridColumn--default--12">




<div class="dr-code-container--title"></div>
<div class="dr-code-container-inner">

    <button class="dr-code-container__copy-button dr-button dr-typography-t17">
        Copy
    </button>
    <pre class="dr-code-container__pre"><code class="dr-code-container__code dr-typography-t5 html">$ git clone --mirror git@github.com:dropbox-internal/server.git server_mirror
performance: 2795.152366000 s

$ du -sh server_mirror
84G     server_mirror

$ git repack -adf --depth=250 --window=250
performance: 31205.079533000 s (~9h)

$ du -sh server_mirror
20G     server_mirror</code></pre>


</div>
<div class="dr-code-container-rte"></div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>We shared our findings with GitHub Support and worked with them on a solution that would be compatible with their infrastructure. Instead of relying on experimental flags, they recommended a more aggressive repack using tuned window and depth parameters. These settings control how thoroughly Git searches for similar objects and how many layers of deltas it allows. Higher values increase compute time during repacking but can significantly improve compression.</p>
<p>We tested the approach on a mirrored clone of the repository. The repack took roughly nine hours to complete, but the result was clear: the repository shrank from 84GB to 20GB. Because this method aligned with GitHub’s server-side optimizations, it could be executed safely in production.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="rolling-it-out-without-breaking-anything">
    <h2 class="dr-article-content__section-title">Rolling it out without breaking anything</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Repacking a repository changes how billions of objects are physically organized on disk. It doesn’t alter the contents of the code, but it does change the structure underlying every clone, fetch, and push. Given how central the monorepo is to our development workflow, we treated this like any other production infrastructure change.</p>
<p>Before touching the live repository, we created a test mirror and had GitHub perform the repack there first. We monitored fetch duration distributions, push success rates, and API latency to ensure the new pack structure didn’t introduce regressions. The mirror dropped from 78GB to 18GB, and while there was minor movement at the tail of fetch latency, it was well within the tradeoff we were willing to make for a fourfold size reduction. We didn’t observe stability issues.</p>
<p>With that validation in place, GitHub rolled out the production repack gradually over the course of a week. They updated one replica per day, beginning with read-write replicas and reserving buffer time at the end of the week in case a rollback was needed. This phased approach ensured that if anything unexpected surfaced, they could revert safely.</p>
<p>The final result was substantial. The repository shrank from 87GB to 20GB, and clone times dropped from over an hour to under 15 minutes in many cases. New engineers no longer begin onboarding with a long wait. CI pipelines start faster and run more reliably. Internal services that synchronize the repository are less prone to timeouts. And by moving well below GitHub’s 100GB limit, we reduced the risk of platform-level performance degradation during high-traffic periods.</p>
<p>Just as importantly, the system remained stable throughout the rollout. Fetch duration, push success rates, and API latency all stayed within expected ranges. The improvements held without introducing new operational risk.</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/monorepo/diagrams/Diagram%201%20(2).png/_jcr_content/renditions/Diagram%201%20(2).webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/monorepo/diagrams/Diagram%201%20(2).png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="481bc29b-fc95-4475-9bb1-4dd7665c66bc:Diagram 1 (2).png" data-trackable="true" height="1560" width="2880"/>
    

            <figcaption class="dr-typography-t5 dr-color-ink-60 dr-image-rte"><p style="text-align: center;">Project data size dropped significantly and has remained stable since.</p>
</figcaption>
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="what-we-learned">
    <h2 class="dr-article-content__section-title">What we learned</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Beyond the size reduction itself, this project reinforced a few broader lessons about maintaining large-scale infrastructure. The following three mattered most:</p>
<p><b>Growth isn’t just about commit volume<br />
 </b>When we first noticed the repository ballooning, the instinct was to look at what was being added: large files, unused dependencies, generated artifacts. But the root cause had nothing to do with the content of our commits. It was about how our directory structure interacted with Git’s compression heuristics. Our i18n paths encouraged Git to compute deltas across different languages rather than within the same language. Routine translation updates were therefore creating oversized pack files. The growth was structural, not behavioral.</p>
<p>Tools embed assumptions. When your usage patterns diverge from those assumptions, performance can degrade quietly over time. In our case, Git’s 16-character path heuristic worked as designed. It just didn’t work well with our repository structure. Understanding those internal mechanics was what allowed us to diagnose the issue correctly.</p>
<p><b>Some fixes require working with your platform provider<br />
 </b>We were able to identify the root cause and even validate a fix locally. But because GitHub determines how repositories are packed and transferred, a local repack wasn’t enough. The solution had to align with GitHub’s server-side infrastructure.</p>
<p>That meant bringing clear data to GitHub, testing collaboratively, and working within supported parameters. When your system depends on a managed platform, some problems live at the boundary between your code and theirs. Having strong relationships and a shared debugging process makes a meaningful difference.</p>
<p><b>Treat repo health like production infrastructure<br />
 </b>A repository repack changes the physical structure of billions of objects. Even though the code itself doesn’t change, every engineer and every automated system interacts with that underlying structure. We approached this project the same way we would approach any production infrastructure change: test on a mirror, measure real-world impact, roll out gradually, and maintain a rollback path.</p>
<p>Repositories can feel like passive storage, something that simply grows over time. At scale, they are not passive. They are critical infrastructure that directly affects developer velocity and CI reliability. As part of this work, we built a recurring stats job that tracks key health indicators for the monorepo and feeds them into an internal dashboard. It monitors things like overall repository size, how quickly that size is growing, how long a fresh clone takes, and how storage is distributed across different parts of the codebase. If growth starts accelerating again or clone times begin creeping up, we'll see it early rather than discovering it when engineers start feeling the pain. Monitoring growth trends and investigating anomalies early is part of running a healthy engineering organization.</p>
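<p>The core of such a stats job needs nothing more than <span class="dr-code" title="undefined">git count-objects -v</span>. The sketch below is hypothetical, not our internal job; it only shows where a size metric like on-disk pack size can come from:</p>

```python
import subprocess

def parse_count_objects(output: str) -> dict:
    """Parse `git count-objects -v` output: one "key: value" pair per line."""
    return dict(line.split(": ", 1) for line in output.strip().splitlines())

def repo_pack_size_kib(repo_path: str) -> int:
    """On-disk pack size for one repository; `size-pack` is reported in KiB."""
    out = subprocess.run(
        ["git", "-C", repo_path, "count-objects", "-v"],
        capture_output=True, text=True, check=True,
    ).stdout
    return int(parse_count_objects(out)["size-pack"])

# A recurring job could append repo_pack_size_kib(".") to a time series
# and alert when the day-over-day delta exceeds a threshold.
```

<p>Recording the same number on a schedule turns anomalous growth into a visible trend change rather than a surprise.</p>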

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="whats-next">
    <h2 class="dr-article-content__section-title">What’s next</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Reducing the repository from 87GB to 20GB had an immediate impact on how we build. New engineers can get started in minutes instead of waiting through a lengthy initial clone. CI pipelines spin up faster and run more reliably. Teams working on AI features—where progress often comes from many small, iterative changes across multiple services—feel that improvement in every development cycle.</p>
<p>The investigation also led to structural changes designed to prevent the same issue from resurfacing. We updated our i18n workflow to align more closely with how Git’s packing algorithm groups files, reducing the likelihood of pathological delta pairing in the future. Just as importantly, we now have better visibility into repository growth trends and a clearer understanding of what “normal” looks like.</p>
<p>More broadly, this project gave us a repeatable playbook. When growth accelerates unexpectedly, we know how to investigate at the compression layer, how to validate fixes safely, and how to work across platform boundaries when necessary. Monorepos will continue to grow as products evolve, but growth doesn’t have to mean friction. With the right tooling and discipline, it can remain invisible to the engineers who rely on it every day.</p>
<p><i>Acknowledgments: Samm Desmond, Genghis Chau</i></p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/march/monorepo/headers/Infrastructure-Reducing monorepo size-375x150-dark.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/march/monorepo/headers/Infrastructure-Reducing monorepo size-375x150-dark.png" medium="image">
                    <media:title type="html">Reducing our monorepo size to improve developer velocity</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>How we optimized Dash&#039;s relevance judge with DSPy</title>
                        
            			<link>https://dropbox.tech/machine-learning/optimizing-dropbox-dash-relevance-judge-with-dspy</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon
            			</dc:creator>
            			
            				<category>LLM</category>
							
            				<category>models</category>
							
            				<category>Dash</category>
							
            				<category>DSPy</category>
							
                            
            			<description><![CDATA[We used DSPy to turn prompt engineering for our relevance judge into a measurable, automated optimization loop, improving task performance, cost, and how reliably it works in production.]]></description>
            			<guid>https://dropbox.tech/machine-learning/optimizing-dropbox-dash-relevance-judge-with-dspy</guid>
                        <pubDate>Tue, 17 Mar 2026 10:00:00 -0700</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p><a href="https://www.dash.dropbox.com/" target="_blank"><b>Dropbox Dash</b></a> brings your files, messages, and team’s knowledge together in one place, so you can ask questions and get useful answers that are actually grounded in your company’s context. Under the hood, that experience relies heavily on one deceptively simple capability: reliably judging which results are relevant to a query at scale. Relevance judges are used across multiple pipelines like ranking, training data generation, and offline evaluation. Without systematic optimization, they can become a primary source of regressions, cost blowups, and loss of trust as models change.</p>
<p>Making a relevance judge work in production is harder than it looks. A prototype might lean on a state-of-the-art model, but real systems have latency and cost budgets, which usually means migrating to smaller or cheaper models. The catch is that prompts often don’t transfer cleanly across models. We ran into this while scaling our <a href="https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash" target="_blank">LLM-as-a-judge work</a>: manual prompt tuning got us to a functioning judge, but quality plateaued early and every model swap—or even a small prompt edit—risked regressions in unexpected cases. </p>
<p>To address prompt brittleness and scale up relevance label generation for the long tail of candidates, we brought in <a href="https://dspy.ai/" target="_blank">DSPy</a>. DSPy is an open-source framework for systematically optimizing prompts against a measurable objective, turning a manual, fragile process into a repeatable optimization loop. In this article, we’ll show how we defined that objective, used DSPy to adapt our judge across models, and made the judge both cheaper and more reliable in production.</p>

</div>
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-3a32c92190" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="how-to-measure-agreement-with-humans">
    <h2 class="dr-article-content__section-title">How to measure agreement with humans</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Before we can improve a relevance judge, we need a clear definition of what “good” means. At its core, the judge’s job is straightforward: given a query and a document, it assigns a relevance score from 1 to 5, where 5 indicates a perfect match and 1 indicates no meaningful connection to the query and user intent. To evaluate how well the judge performs, we compare its scores to those assigned by human annotators performing the same task.</p>
<p>In our evaluation dataset, humans are shown a query and a candidate document and asked to rate its relevance on that same 1–5 scale. They also provide a short explanation describing why they chose that score. These human judgments serve as our reference point. For more details on the annotation process, see our LLM-as-a-judge <a href="https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash" target="_blank">blog</a>. (Dropbox conducts these reviews with limited, non-sensitive internal datasets; no customer data is reviewed by humans as part of this process.)</p>
<p>We then measure how far the model’s ratings deviate from the human ratings using normalized mean squared error (NMSE), a metric that summarizes the model’s average disagreement with humans as a single number. If a human assigns a 5 and the model assigns a 4, that’s a small disagreement; if the human assigns a 5 and the model assigns a 1, that’s a much larger one. NMSE captures those differences across the entire dataset by computing the average squared gap between the model’s score and the human score, scaled to a 0–100 range. An NMSE of 0 indicates perfect agreement, while higher values indicate worse alignment.</p>
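<p>A minimal sketch of such a metric (the exact normalization we use isn’t spelled out here, so the scaling below, where the worst possible per-example gap maps to 100, is an assumption):</p>

```python
def nmse(human_scores, model_scores, lo=1, hi=5):
    """Mean squared disagreement with human ratings, normalized so that
    0 means perfect agreement and 100 means maximal disagreement."""
    worst = (hi - lo) ** 2  # largest possible squared gap per example
    mse = sum((h - m) ** 2 for h, m in zip(human_scores, model_scores)) / len(human_scores)
    return 100 * mse / worst

print(nmse([5, 4, 3], [5, 4, 3]))  # perfect agreement -> 0.0
print(nmse([5, 5], [1, 1]))        # maximal disagreement -> 100.0
```

<p>Squaring the gap is what makes a 5-versus-1 disagreement count far more heavily than a 5-versus-4 one.</p>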
<p>We also account for structural reliability. The judge’s output is formatted as JSON; if the model returns broken JSON or fails to follow the expected structure, that output cannot be parsed and therefore cannot be used. In those cases, we treat the response as fully incorrect. These formatting failures aren’t cosmetic: if the output cannot be read, examples may be dropped, batches can fail, and evaluation metrics become unreliable.</p>
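<p>In code, that policy is simple: anything that fails to parse or validate yields no usable score. A hypothetical sketch (the <span class="dr-code" title="undefined">score</span> field name is a placeholder, not our real schema):</p>

```python
import json

def parse_judge_output(raw: str):
    """Return the judge's 1-5 score, or None if the output is structurally
    unusable: broken JSON, a missing field, or an out-of-range value.
    Callers count None as a fully incorrect response."""
    try:
        score = int(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None
    return score if 1 <= score <= 5 else None

print(parse_judge_output('{"score": 4}'))  # a usable judgment
print(parse_judge_output('{"score": 4'))   # truncated JSON -> None
```

<p>Folding structural failures into the metric this way keeps “returns valid JSON” inside the optimization objective rather than treating it as a separate concern.</p>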
<p>Taken together, this framework gives us a clear and measurable objective: minimize disagreement with human relevance judgments while ensuring that outputs remain consistently usable in production systems. That’s the objective <a href="https://dropbox.tech/machine-learning/vp-josh-clemm-knowledge-graphs-mcp-and-dspy-dash" target="_blank">DSPy optimizes against</a>.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="adapting-our-relevance-judge-for-large-scale-use">
    <h2 class="dr-article-content__section-title">Adapting our relevance judge for large-scale use</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Our best-performing relevance judge was built on the most powerful proprietary model at the time (<a href="https://openai.com/index/introducing-o3-and-o4-mini/" target="_blank">OpenAI’s o3</a>). It produced high-quality scores and aligned closely with human ratings, but it was expensive to run at scale. As Dash grew, we needed to score orders of magnitude more query–document pairs. Running the most expensive model for every judgment wasn’t sustainable. We wanted to move to a lower-cost, open-weight model that we could run at scale.</p>
<p>We chose <b><a href="https://huggingface.co/openai/gpt-oss-120b" target="_blank">gpt-oss-120b</a></b>, an open model that offered a strong balance between cost and performance. In simple terms, it was much cheaper to run, but still capable of following complex instructions. The problem was that our carefully tuned prompt for o3 did not transfer cleanly. When we applied it to the cheaper model, quality dropped under our evaluation metric. Manual prompt rewriting could eventually recover performance, but it would require weeks of iteration and regression chasing. Instead of starting over by hand, we used DSPy to systematically adapt the judge to the new model.</p>
<h3>How DSPy helped us adapt the judge</h3>
<p>We already had everything needed to define the problem clearly. The task was fixed: given a query and a document, assign a relevance score from 1 to 5. The dataset was fixed: human-annotated examples with ratings and explanations. And the metric was fixed: NMSE, which measures how far the model’s ratings deviate from human ratings.</p>
<p>DSPy allows you to define that setup—task, data, and metric—and then systematically search for prompt variants that improve performance on that metric. We used DSPy’s GEPA optimizer (a method that iteratively improves prompts by analyzing where the model disagrees with humans and generating feedback) to adapt and optimize the relevance-judging program for a specific target model—in this case, gpt-oss-120b.</p>
<p>Rather than treating evaluation as a single score, GEPA generates structured feedback for each example where the model disagrees with a human annotator. In our case, we combined the size and direction of the gap with the human explanation and the model’s reasoning, producing concrete signals about what went wrong and why.</p>
<p>This feedback powers the DSPy reflection loop. The prompt is evaluated, its failure modes are surfaced in plain language, the prompt is revised, and the cycle repeats—all while directly optimizing against the human-alignment metric defined earlier. Instead of trying to infer improvements from a single number, the system can respond to specific patterns, such as underweighting recency relative to the human explanation or overvaluing keyword matches. To make this more concrete, here is a simplified version of how we construct that textual feedback:</p>

</div>
<div class="dr-code-container aem-GridColumn aem-GridColumn--default--12">




<div class="dr-code-container--title"></div>
<div class="dr-code-container-inner">

    <button class="dr-code-container__copy-button dr-button dr-typography-t17">
        Copy
    </button>
    <pre class="dr-code-container__pre"><code class="dr-code-container__code dr-typography-t5 python">diff = predicted_rating - expected_rating
direction = "higher" if diff > 0 else "lower"
feedback_parts = [
    f"Predicted rating {int(predicted_rating)} but expected {int(expected_rating)}.",
    f"Model rated {abs(diff):.0f} point(s) {direction} than the expected human rating.",
]

# Include human explanation if available
if gold.explanation:
    feedback_parts.append(f"Human rationale: {gold.explanation}")

# Include model's explanation for comparison
if pred.explanation:
    feedback_parts.append(f"Model's reasoning: {pred.explanation}")

feedback_parts.append(
    "Remember: when adapting the prompt, avoid overfitting to specific "
    "example(s). Do not include exact examples or keywords from them in the "
    "prompt. Also ensure you do not change the basic parameters of the task "
    "(e.g. changing the rating range to be anything but 1-5). Try to add a "
    "general rule to an execution plan to rate similar documents in the future."
)

feedback = "\n".join(feedback_parts)</code></pre>


</div>
<div class="dr-code-container-rte"></div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>There were important caveats. In early experiments, we observed that the optimizer could overfit by copying specific keywords, usernames, or verbatim document phrases directly into prompts. That behavior improved performance on the training examples but did not generalize. To address this, we added explicit guardrails to forbid direct inclusion of example-specific content. We also found that candidate prompts sometimes modified key task parameters, such as changing the rating scale from 1–5 to 1–3 or 1–4. Additional constraints ensured that the task definition remained stable throughout optimization.</p>
<p>With this setup in place, we could move beyond intuition and measure the impact directly. Because the task, dataset, and metric were fixed, we could compare the optimized prompt to our original manually tuned prompt under identical conditions. That gave us a clear view of what changed and by how much.</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/dspy/diagrams/Diagram%201.png/_jcr_content/renditions/Diagram%201.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/dspy/diagrams/Diagram%201.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="304f6a81-e9ea-4960-ac52-5944ee0a78ac:Diagram 1.png" data-trackable="true" height="1800" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Comparing the best-performing DSPy-optimized prompt to the original manually written prompt, we reduced NMSE by 45 percent (from 8.83 to 4.86). That means the judge’s scores tracked human ratings much more closely, increasing our confidence in using it for evaluation and training signals. Model adaptation time dropped from one to two weeks of manual iteration to one to two days. That allowed us to swap in newly released models with less regression risk and keep the judge aligned with evolving product needs.</p>
<p>Because the optimized judge could run on a much cheaper model than our production o3 judge, we were also able to label 10–100 times more data at the same cost. That increased coverage and statistical power, enabled larger experiments, and reduced the risk of downstream models overfitting to a small evaluation set. Those results showed that DSPy could preserve quality while dramatically reducing cost. </p>
<p>However, optimizing for cost and human alignment still leaves an important question: can the judge behave reliably when its outputs are consumed programmatically in automated pipelines? In Dash, the relevance judge doesn’t run in isolation. It sits inside systems that score large candidate sets, generate training data, and run offline simulations. That means its outputs aren’t just read by people; they’re parsed and acted on by other components. This introduces a second requirement: operational reliability.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="improving-operational-reliability">
    <h2 class="dr-article-content__section-title">Improving operational reliability</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>When we talk about judge quality, it’s easy to focus only on how closely the model’s scores match human ratings. But in practice, the judge also has to consistently produce JSON outputs that downstream systems can read and use. </p>
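<p>A minimal sketch of the kind of validation a downstream consumer can apply is shown below. The field names and the 1–5 range are illustrative assumptions, not Dash’s actual response schema; any response that fails to parse or violates the schema is rejected rather than silently coerced.</p>

```python
import json

def parse_judgment(raw: str):
    """Parse a judge response into (rating, explanation), or None if malformed.

    The 'rating'/'explanation' field names and the 1-5 range are
    illustrative assumptions, not the production schema.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # not valid JSON at all
    rating = data.get("rating") if isinstance(data, dict) else None
    # Reject booleans (a subclass of int in Python) and out-of-range values
    if isinstance(rating, bool) or not isinstance(rating, int) or not 1 <= rating <= 5:
        return None
    return rating, data.get("explanation", "")
```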
<p>To stress test this dimension of reliability, we introduced <a href="https://huggingface.co/google/gemma-3-12b-it" target="_blank"><b>gemma-3-12b</b></a>, a much smaller and cheaper model. Smaller models reduce cost and enable broader scaling, but they are more brittle about formatting and instruction-following. By adapting our judge to a significantly smaller model, we could measure and directly optimize what was effectively the system’s weakest link: whether a low-cost judge could produce valid, machine-readable outputs consistently enough to be usable in Dash’s pipelines.</p>
<p>In the baseline configuration, more than 40 percent of gemma-3-12b’s responses were malformed JSON. Under our evaluation rules, those responses were treated as fully incorrect. This meant that even before considering alignment with human ratings, the judge was unreliable from an operational standpoint. After DSPy optimization, malformed outputs dropped by more than 97 percent, and NMSE improved substantially:</p>

</div>
<div class="dmep-plank-frame-child c04-4-table-component aem-GridColumn aem-GridColumn--default--12">


    


<section class="table__wrapper dr-container--optimizing-dropbox-dash-relevance-judge-with-dspy dr-container--machine-learning  " data-component="table" data-component-instance="c21ffd0d-a15f-4e5d-848f-958004504a4c/content/responsivegrid/table">
  <div class="table__container">
    <div class="table__content">
          
<table width="100%" cellspacing="0" cellpadding="1" border="1">
<tbody><tr><th scope="col" width="25"><b>Version</b></th>
<th scope="col" width="25"><b>NMSE</b></th>
<th scope="col" width="25"><b>Valid Response Format</b></th>
<th scope="col" width="25"><b>Invalid Response Format</b></th>
</tr><tr><td>Original Prompt (Baseline)</td>
<td>46.88</td>
<td>498</td>
<td>358</td>
</tr><tr><td>DSPy prompt (MIPROv2)</td>
<td>17.26</td>
<td>847</td>
<td>9</td>
</tr></tbody></table>


    </div>
  </div>
</section></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>This result showed that DSPy was not only improving alignment with human judgments, but <i>also</i> strengthening structural reliability. Even a smaller, weaker model could become operationally dependable when optimized against the right objective.</p>
<p>At the same time, this experiment reinforced another benefit of the approach: iteration speed. Although gemma-3-12b was ultimately too weak for our highest-quality production judge paths, DSPy allowed us to reach that conclusion quickly and with measurable evidence. Instead of prolonged debate or manual trial and error, we could test the model directly against our evaluation framework and make a confident decision.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="incrementally-improving-our-o3-model">
    <h2 class="dr-article-content__section-title">Incrementally improving our o3 model</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>One finding emerged across our explorations: DSPy let us control the scope of changes, from small prompt edits to broader adjustments. When adapting to a new, cheaper model (like gpt-oss-120b or gemma-3-12b), we were comfortable with full prompt rewrites, prioritizing broad exploration and end-to-end optimization. But when the target was our production o3 judge—already strong and widely depended on—the constraint flipped. Our goal was to make targeted improvements without destabilizing behavior relied on across multiple pipelines.</p>
<p>When it came to optimizing the o3-based judge, we weren’t starting from scratch. We already had a high-performing baseline. Large prompt rewrites were too risky; even small wording changes could shift behavior in corner cases, and the blast radius was high. So instead of rewriting the prompt end-to-end, we limited changes to a small, predefined set of safe edits.</p>
<p>We introduced an instruction library layer to make prompt improvement more targeted and easier to control. When we found cases where the judge’s score differed substantially from the human rating, humans wrote short explanations describing what the judge misunderstood and what it should have paid attention to instead. We then distilled those explanations into single-line instruction bullets: small, reusable “rules of thumb” the model can follow. In this setup, the optimization module is responsible only for selecting the best instruction bullets. DSPy can’t rewrite the entire prompt from scratch; instead, its job is to choose which bullets to include (for example, by selecting common themes of errors) and how to combine them, so the prompt grows by assembling the most helpful additional guidance rather than being constantly rewritten.</p>
<p>This turned optimization into something closer to “small PRs with tests” than a large-scale refactor: improvements were incremental, regressions were easier to diagnose, and we could keep the baseline behavior stable while still pushing agreement upward.</p>
<p>For example, if a disagreement was explained as “the document is older than a year, so it’s less relevant for this query,” we translated that into a bullet like: “Documents older than a year should be rated at least one point lower unless they are clearly evergreen.” DSPy could then learn whether including that bullet improved alignment on the eval set without unintended side effects.</p>
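<p>Conceptually, the optimizer’s job then reduces to a search over subsets of the bullet library. The sketch below is a toy greedy analogue of that search, assuming a caller-supplied scoring function that evaluates a candidate prompt against the eval set; the real system uses DSPy’s optimizers rather than this loop.</p>

```python
def assemble_prompt(base_prompt, bullets, selected):
    """Append only the selected instruction bullets to a fixed base prompt."""
    lines = [base_prompt, "", "Additional guidance:"]
    lines += [f"- {bullets[i]}" for i in selected]
    return "\n".join(lines)

def greedy_select(base_prompt, bullets, score_fn):
    """Greedily add bullets while each addition improves the eval score.

    A toy stand-in for DSPy's optimizer: score_fn(prompt) -> float is
    assumed to measure alignment with human labels on the eval set.
    """
    selected = []
    best = score_fn(assemble_prompt(base_prompt, bullets, selected))
    improved = True
    while improved:
        improved = False
        for i in range(len(bullets)):
            if i in selected:
                continue
            score = score_fn(assemble_prompt(base_prompt, bullets, selected + [i]))
            if score > best:  # keep a bullet only if it strictly helps
                best, selected, improved = score, selected + [i], True
    return selected, best
```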
<p>We can see the cumulative effect of these incremental changes in the evaluation results below:</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/dspy/diagrams/Diagram%202.png/_jcr_content/renditions/Diagram%202.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/march/dspy/diagrams/Diagram%202.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="cda51445-ce53-4c7f-8f48-8df318aa60b0:Diagram 2.png" data-trackable="true" height="1800" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Each step represents a small, testable change, but together they produce a substantial improvement over the initial prompt.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="conclusion">
    <h2 class="dr-article-content__section-title">Conclusion</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>In Dash, relevance scoring is a core capability that shapes ranking, training data generation, and offline simulation. Because it sits at the center of multiple pipelines, even small changes in how we score relevance can ripple outward. If every new model or prompting idea requires manual prompt surgery, progress becomes slow and risky.</p>
<p>With DSPy, we define the objective—alignment with human relevance judgments—and systematically optimize toward it. With the task and dataset held fixed, we can swap in new models and adapt them quickly, with measurable evidence instead of intuition. The workflow becomes less about rewriting prompts and more about improving against a clear metric. Just as importantly, DSPy lets us choose how to improve depending on our risk tolerance. We can run full end-to-end optimization when exploring new, cheaper models, or apply constrained, incremental updates when stability matters for production systems like o3.</p>
<p>In a system like Dash, where relevance scoring touches ranking, training data generation, offline simulation, and cost–latency tradeoffs, prompt optimization can’t be a one-off effort. DSPy turns it into a repeatable loop: define the task, measure against human labels, optimize, and ship changes with confidence as models evolve.</p>
<p><i>Acknowledgments: This work was made possible by close collaboration across Dropbox. We’d like to thank Eider Moore, Mingming Liu, Stella Xiang, Sean Chang, Prasang Upadhyaya, Hans Sayyadi, and Josh Clemm for their thoughtful reviews, technical feedback, and help shaping both the system and the story.</i></p>
<p><i>We’re also grateful to the DSPy community for their engagement and support. In particular, we‘d like to thank Isaac Miller, Drew Breunig, Lakshya A. Agrawal, and Omar Khattab for their guidance, discussions, and responsiveness as we applied DSPy to real production systems at Dropbox. </i></p>
<p><i>Dropbox hosted a Bay Area DSPy Meetup at our San Francisco office on Wednesday, March 18, 2026, bringing together developers building real-world, in-production AI systems. Dropbox engineers shared how we’re using LLM judges and DSPy to optimize prompts and improve reliability in production. Head <a href="https://www.dropbox.com/scl/fi/cnqquakbkw68rxz461apm/Optimizing-LLM-Relevance-Judges-at-Scale.pdf?rlkey=pgu56mg1tlgv5pdx0xd30p8wk&e=1" target="_blank">here to view our presentation</a> from the event.</i></p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/march/dspy/headers/Machine Learning-Optimizing Dash relevancy judge with DSPy-375x150-dark.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/march/dspy/headers/Machine Learning-Optimizing Dash relevancy judge with DSPy-375x150-dark.png" medium="image">
                    <media:title type="html">How we optimized Dash&#039;s relevance judge with DSPy</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>Using LLMs to amplify human labeling and improve Dash search relevance</title>
                        
            			<link>https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon
            			</dc:creator>
            			
            				<category>LLM</category>
							
            				<category>models</category>
							
            				<category>Search</category>
							
            				<category>Machine Learning</category>
							
            				<category>Dash</category>
							
            				<category>RAG</category>
							
                            
            			<description><![CDATA[How we train Dash's search ranking models with a mix of human and LLM-assisted labeling.]]></description>
            			<guid>https://dropbox.tech/machine-learning/llm-human-labeling-improving-search-relevance-dropbox-dash</guid>
                        <pubDate>Thu, 26 Feb 2026 09:00:00 -0800</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>When someone uses <b><a href="https://www.dash.dropbox.com/" target="_blank">Dropbox Dash</a></b> to search or ask a question, it follows a retrieval-augmented generation (RAG) pattern: our AI first retrieves relevant company-specific context through enterprise search, then uses that context to ground the response it generates. Rather than responding solely from general knowledge, Dash incorporates information that already exists within an organization.</p>
<p>When a user submits a query, Dash first interprets the underlying information need and determines how to retrieve relevant content. Search returns a set of candidate documents, and a large language model (LLM) analyzes the most relevant results to generate an answer. Because there are millions (and, in very large enterprises, billions) of documents in the enterprise search index, Dash can pass along only a small subset of the retrieved documents to the LLM. This makes the quality of search ranking—and the <b>labeled relevance data</b> used to train it—critical to the quality of the final answer.</p>
<p>Search results in Dash are ordered by a relevance model that assigns a score to each document based on how well it matches the query. Like most modern ranking systems, this model is trained rather than hand-tuned. It learns from examples of queries paired with documents, annotated with human relevance judgments that define what high-quality search results look like. These judgments are labeled examples in which people evaluate how well a document answers a given query.</p>
<p>In this story, we explain how we train Dash's search ranking models with a mix of <b>human and LLM-assisted labeling</b>—starting with a small amount of internal, human-labeled data, and then amplifying those efforts with LLMs to produce relevance labels at scale.</p>

</div>
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-728b5aba97" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="dash-search-relevance-models">
    <h2 class="dr-article-content__section-title">Dash search relevance models</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Dash’s ranking model is trained using machine learning techniques such as XGBoost rather than manually tuned rules. It learns from labeled examples of query–document pairs, where each document is evaluated based on how well it satisfies a given query. Over time, the model adjusts how it weighs different signals to reduce ranking mistakes (for example, cases where less useful documents are placed ahead of more useful ones). This framing leads directly to a core challenge: generating enough high-quality relevance labels to train the model effectively.</p>
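<p>The “ranking mistakes” the model minimizes can be made concrete as misordered pairs. Here is a toy illustration (not Dropbox’s training code): given model scores and graded relevance labels for one query’s results, count the pairs where a less relevant document is scored at or above a more relevant one.</p>

```python
def misordered_pairs(scores, labels):
    """Count ranking mistakes among one query's results.

    Returns (mistakes, comparable_pairs), where a mistake is a pair in
    which the more relevant document (higher graded label) does not
    receive a strictly higher model score. Documents with equal labels
    are not comparable and are skipped.
    """
    mistakes = comparable = 0
    for i in range(len(labels)):
        for j in range(len(labels)):
            if labels[i] > labels[j]:
                comparable += 1
                if scores[i] <= scores[j]:
                    mistakes += 1
    return mistakes, comparable
```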
<h3>Where relevance labels come from</h3>
<p>Training a relevance model requires examples that show what “good” and “bad” search results look like. In practice, those relevance labels can be created in several ways. One approach infers relevance from user behavior, such as clicks or skipped results. Another relies on humans manually assigning relevance scores to query and document pairs. And a third approach uses LLMs to generate relevance judgments directly.</p>
<p>This story focuses on the latter two approaches: direct <b>human labeling</b> and <b>LLM-based evaluation</b>. Signals from user behavior can still be helpful, but on their own they tend to be incomplete, influenced by existing rankings, and unevenly distributed. In practice, they work best as a supplement to labeled data rather than a replacement for it.</p>
<p>For the purposes of this article, relevance is treated as a graded score on a 1–5 scale. A score of 5 means the result closely matches what the user is trying to find, while a score of 1 means it isn’t useful enough to show. Importantly, relevance isn’t a fixed property of a document; it depends on the specific query, the user’s context, and the moment the search is made. </p>
<p><b>Human labeling<br />
 </b>Historically, search engine providers relied on teams of human judges, typically third-party vendors, to label large datasets for model training. This approach had clear advantages. Human judges could systematically evaluate full result sets for each query, ensuring comprehensive and consistent relevance coverage in a way that user feedback—which is often sparse and biased—cannot.</p>
<p>However, the drawbacks are substantial. Human labeling is expensive and difficult to scale. Judges can be inconsistent and require ongoing training, which is made more difficult by the diversity of content types that need to be rated (for example, comparing a Slack message to a Jira ticket or a Salesforce contact record can require very different contextual understanding and judgment). In practice, it is also nearly impossible for humans to directly evaluate sensitive or proprietary customer data.</p>
<p><b>LLM evaluation<br />
 </b>LLMs provide an alternative mechanism for producing relevance judgments at scale. Compared to human annotators, LLMs are significantly cheaper, more consistent, and capable of evaluating much larger candidate sets across languages. They can also analyze customer content within defined compliance boundaries.</p>
<p>At the same time, LLMs are not general intelligence systems. Their performance depends heavily on both the <i>quality</i> of the underlying model and the <i>clarity</i> and precision of the instructions provided. As a result, LLM-generated relevance judgments must be evaluated and calibrated carefully before they are used for training. In practice, using LLMs for relevance evaluation requires a structured process that combines automation with human oversight.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="combining-llm-evaluation-with-human-review">
    <h2 class="dr-article-content__section-title">Combining LLM evaluation with human review</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>To scale relevance evaluation without sacrificing quality, Dash pairs automation with human judgment. Before deploying an LLM to generate relevance labels at scale, its performance is validated against a small, high-quality set of human-labeled examples. (Human review is conducted by Dropbox with limited, non-sensitive internal datasets; no customer data is reviewed by humans as part of this process.)</p>
<p>A small group of human evaluators labels a dataset that is orders of magnitude smaller than what would be required for full training. These labels are used to tune the LLM prompt and model parameters. Once performance meets quality thresholds, the LLM is deployed to generate hundreds of thousands—or even millions—of relevance labels used to train Dash’s relevance model. In this setup, the LLM acts as a force multiplier for human effort: humans teach the LLM, and the LLM generates large-scale training data in return.</p>
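<p>The quality gate in this loop can be sketched as a simple agreement check. Mean absolute error and the threshold below are illustrative assumptions; the actual metric and quality bar used for Dash are not specified here.</p>

```python
def ready_to_scale(llm_ratings, human_ratings, max_mae=0.5):
    """Decide whether an LLM labeler agrees closely enough with a small
    human-labeled validation set to be deployed for large-scale labeling.

    Mean absolute error and the 0.5 threshold are illustrative
    assumptions, not Dash's production quality bar.
    """
    if len(llm_ratings) != len(human_ratings) or not human_ratings:
        raise ValueError("Need equal-length, non-empty rating lists")
    mae = sum(abs(l - h) for l, h in zip(llm_ratings, human_ratings)) / len(human_ratings)
    return mae <= max_mae
```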

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%201.png/_jcr_content/renditions/Diagram%201.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%201.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="ec0fc622-af57-4b8b-9d2f-48db1fbee62e:Diagram 1.png" data-trackable="true" height="1840" width="2880"/>
    

            <figcaption class="dr-typography-t5 dr-color-ink-60 dr-image-rte"><p style="text-align: center;">Human labeling effort is multiplied 100x to allow deeper and more representative training datasets</p>
</figcaption>
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Using LLMs directly at query time to replace traditional ranking models is not currently feasible due to context window limitations and latency constraints. Instead, Dash uses LLMs offline to generate high-quality training data. In this role, the LLM functions as a teacher for smaller, more efficient relevance models that can operate at production scale.</p>
<h3>Evaluating LLM relevance judgments</h3>
<p>Improvement always starts with evaluation. Performance is measured, a change is made to the model or its instructions, and results are measured again to determine whether the system is moving in the right direction.</p>
<p>A chess engine operates under the same principle. As Garry Kasparov describes in <i>Deep Thinking</i>, reflecting on his historic match against IBM’s Deep Blue, the engine explores possible move sequences to a fixed depth and evaluates each resulting position. Poor evaluations prune entire branches of the search tree, while strong evaluations preserve promising lines of play. The overall strength of the system depends critically on the quality of the evaluation function.</p>
<p>Gradient-based optimization follows a similar pattern. Rather than enumerating the entire parameter space, machine learning algorithms compute a gradient and take incremental steps in the direction it indicates. Progress depends entirely on whether the evaluation signal accurately reflects improvement.</p>
<p>For an LLM acting as a relevance judge, evaluation follows the same logic. Dash compares LLM-generated relevance ratings with human judgments, rewarding exact matches on a 1–5 relevance scale and applying penalties for disagreement. Small differences incur small penalties, while large mismatches incur substantially larger ones. This behavior is captured using mean squared error (MSE), where the error ranges from 0 for exact agreement to 16 for the maximum possible disagreement.</p>
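<p>As a concrete illustration, this behavior can be sketched in a few lines (a minimal example, not Dash&#8217;s actual implementation):</p>

```python
# Scoring an LLM relevance judge against human labels on the 1-5 scale
# using mean squared error (MSE). Exact agreement scores 0; the largest
# possible disagreement (1 vs. 5) scores (5 - 1)^2 = 16.

def relevance_mse(human_labels, llm_labels):
    """Mean squared error between two equal-length lists of 1-5 ratings."""
    assert len(human_labels) == len(llm_labels)
    errors = [(h, l) for h, l in zip(human_labels, llm_labels)]
    return sum((h - l) ** 2 for h, l in errors) / len(errors)

# Example: one exact match, one off-by-one, one maximal disagreement.
print(relevance_mse([3, 4, 5], [3, 5, 1]))  # (0 + 1 + 16) / 3 ≈ 5.67
```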

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%202.png/_jcr_content/renditions/Diagram%202.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%202.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="36f51d7c-6541-4f9f-b8e9-5431fd7579a3:Diagram 2.png" data-trackable="true" height="1840" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<h3>Document sampling for LLM evaluation</h3>
<p>At scale, not all evaluation data is equally informative. To improve LLM accuracy efficiently, Dash focuses evaluation effort on the cases most likely to surface errors. Training samples are biased toward situations where mistakes are more likely, since these offer the greatest opportunity for learning. Dash identifies such cases by analyzing discrepancies between user behavior and LLM-predicted relevance.</p>
<p>Examples include users clicking on documents the LLM rated as low relevance, or consistently skipping documents the LLM rated as highly relevant. These discrepancies are prioritized for human review and prompt refinement. (These processes are, again, limited to small internal datasets that do not include customer data.) The process is repeated iteratively until major sources of error are addressed or improvements plateau.</p>
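<p>In spirit, the discrepancy mining looks something like the sketch below. The record fields are hypothetical stand-ins for illustration, not Dash&#8217;s actual schema:</p>

```python
# Surface query-document pairs where user behavior disagrees with the LLM's
# relevance rating, so human review effort goes where errors are most likely.
# Field names ("llm_score", "clicked") are illustrative only.

def find_discrepancies(records):
    """records: list of dicts with 'llm_score' (1-5) and 'clicked' (bool)."""
    flagged = []
    for r in records:
        clicked_low = r["clicked"] and r["llm_score"] <= 2        # clicked despite low rating
        skipped_high = not r["clicked"] and r["llm_score"] >= 4   # skipped despite high rating
        if clicked_low or skipped_high:
            flagged.append(r)
    # Review the largest apparent disagreements first.
    implied = lambda r: 5 if r["clicked"] else 1
    return sorted(flagged, key=lambda r: abs(r["llm_score"] - implied(r)), reverse=True)

candidates = find_discrepancies([
    {"llm_score": 1, "clicked": True},    # clicked, but rated irrelevant
    {"llm_score": 5, "clicked": False},   # skipped, but rated highly relevant
    {"llm_score": 3, "clicked": True},    # no strong disagreement
])
# candidates contains the first two records only
```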

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="evaluating-relevance-with-additional-context">
    <h2 class="dr-article-content__section-title">Evaluating relevance with additional context</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Accurate relevance evaluation often depends on context that isn’t explicitly present in the query or document text. Without this context, even well-trained models can make systematic errors. In many cases, a query and a document alone are insufficient to make a reliable relevance judgment. Additional context about internal terminology, acronyms, or organizational knowledge may be required.</p>
<p>For example, within Dropbox, the term “diet sprite” refers to an internal performance management tool rather than a soft drink, a distinction that can be difficult for LLMs to infer without additional context. Acronyms present similar challenges, as they often have multiple meanings across organizations or even within the same company. Human evaluators typically resolve this ambiguity by running additional searches or consulting internal tools.</p>
<p>To automate this process, Dash provides LLMs with tools that allow them to research query context <i>before</i> assigning relevance labels. Once the LLM understands the user’s intent, it can apply consistent, context-aware relevance labeling across large candidate result sets, often going deeper than human evaluators would in practice.</p>
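<p>A toy version of this research-then-label flow might look as follows. The glossary lookup stands in for the real research tools, and the prompt format is purely illustrative:</p>

```python
# Hypothetical sketch of "research before labeling": resolve internal
# terminology via a lookup tool before asking the judge model for a rating.
# GLOSSARY and the prompt wording are stand-ins, not real Dash components.

GLOSSARY = {"diet sprite": "an internal performance management tool"}

def expand_query_context(query):
    """Attach known internal meanings of terms appearing in the query."""
    notes = [f"'{term}' means: {meaning}"
             for term, meaning in GLOSSARY.items() if term in query.lower()]
    return "\n".join(notes) or "(no additional context found)"

def build_judge_prompt(query, document):
    context = expand_query_context(query)
    return (f"Context about internal terminology:\n{context}\n\n"
            f"Query: {query}\nDocument: {document}\n"
            "Rate relevance from 1 (unrelated) to 5 (exactly what was sought).")
```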

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%203.png/_jcr_content/renditions/Diagram%203.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%203.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="1ea348aa-d079-4f1c-b34c-241ddf4e704e:Diagram 3.png" data-trackable="true" height="1840" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<h3>Prompt optimization</h3>
<p>As evaluation scales, prompt quality starts to matter much more. Prompt optimization ends up looking a lot like how human guidelines are developed: You review cases where the model gets relevance wrong, adjust the instructions or add missing context, and then test again. This is harder than it sounds. Small prompt changes can cause unexpected regressions, and consistency becomes harder to maintain as prompts grow longer and more complex.</p>
<p>Meta-prompting frameworks such as <a href="https://dspy.ai/" target="_blank">DSPy</a> can help manage this complexity. (DSPy is a library for programmatically optimizing LLM prompts against defined evaluation targets.) Given a clear objective and a small set of human-labeled examples, DSPy can automatically refine prompts to better match human judgments. This makes it possible to reuse the same optimization approach across different evaluation tasks and model configurations, rather than treating each case as a one-off.</p>
<p>The chart below shows how the mean squared error (MSE) for the LLM-based relevance evaluator improved over time, driven by prompt refinement, the use of a reasoning-optimized model, incorporation of query context, and automated optimization with DSPy.</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%204.png/_jcr_content/renditions/Diagram%204.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/diagrams/Diagram%204.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="b5655b90-f0a9-4e67-932b-c4126e700f94:Diagram 4.png" data-trackable="true" height="1800" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="conclusion">
    <h2 class="dr-article-content__section-title">Conclusion</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>The relevance labeling approach described here is not limited to document search or tied to a specific model or evaluation framework. What matters is the underlying pattern: starting with a small amount of high-quality human judgment, using that judgment to calibrate LLM-based evaluation, and then scaling relevance labeling in a way that remains measurable, auditable, and correctable over time.</p>
<p>Because LLM-generated labels are grounded in human-reviewed reference data, they can be continuously monitored, stress-tested, and re-calibrated as models, prompts, and product requirements change. This grounding establishes a stable evaluation baseline that makes regressions detectable and improvements measurable, even as the surrounding system evolves.</p>
<p>As Dash expands to support additional content types—such as images, videos, messages, and chat—the evaluation problem becomes more complex. Each domain encodes relevance differently, and surface-level similarity is often insufficient. Human-calibrated LLM evaluation provides a shared mechanism for adapting relevance judgments across modalities without rebuilding labeling pipelines or redefining evaluation criteria from scratch.</p>
<p>Even as models improve, human grounding remains a structural requirement. Prompts drift, models change, and product expectations shift. A persistent, human-reviewed reference set anchors evaluation over time, allowing LLMs to scale judgment without eroding correctness. In short, LLMs make it possible to apply human judgment consistently and at scale, rather than replacing it.</p>
<p><i>Acknowledgments: Eric Wang, Hans Sayyadi, Josh Clemm, Mingming Liu, Andrew Yates, Marta Mendez, Jun Sun, Jay Frank, Angela Li</i></p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/headers/Machine Learning-Human Evaluations-1440x305-light.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/february/llm-and-human-labeling/headers/Machine Learning-Human Evaluations-1440x305-light.png" medium="image">
                    <media:title type="html">Using LLMs to amplify human labeling and improve Dash search relevance</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>How low-bit inference enables efficient AI</title>
                        
            			<link>https://dropbox.tech/machine-learning/how-low-bit-inference-enables-efficient-ai</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon,Hicham Badri,Appu Shaji
            			</dc:creator>
            			
            				<category>models</category>
							
            				<category>quantization</category>
							
            				<category>AI</category>
							
            				<category>Machine Learning</category>
							
            				<category>Dash</category>
							
            				<category>inference</category>
							
                            
            			<description><![CDATA[Making products like Dropbox Dash accessible to individuals and businesses means tackling new challenges around efficiency and resource use.]]></description>
            			<guid>https://dropbox.tech/machine-learning/how-low-bit-inference-enables-efficient-ai</guid>
                        <pubDate>Thu, 12 Feb 2026 10:00:00 -0800</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>In just the past few years, large machine learning models have made incredible strides. Today&#8217;s models are remarkably capable, achieving impressive results across a range of applications, from software engineering and scientific research to content creation and data analysis. With the arrival of models like <a href="https://www.kimi.com/blog/kimi-k2-5.html" target="_blank">Kimi-K2.5</a> and <a href="https://z.ai/blog/glm-5" target="_blank">GLM-5</a>, the pace of progress shows no sign of slowing down. (Kimi-K2.5 has an impressive 1 trillion parameters, roughly 50% more than the DeepSeek V3 model family released just last year.) And as these models continue to grow in size and capability, so does the demand for memory, computing power, and energy.</p>
<p>One of the most effective ways teams are addressing these constraints is through low-bit inference, a set of techniques widely adopted across the industry that make AI models faster and cheaper to run by reducing how much memory and compute they need when serving real user requests. At Dropbox, products like <a href="https://www.dash.dropbox.com/" target="_blank">Dropbox Dash</a> rely on various models to deliver fast, reliable, and cost-effective AI-powered search and understanding across vast amounts of user content. Making this possible requires careful attention to model efficiency, hardware utilization, and latency constraints. And making this technology accessible to individuals and businesses means tackling new challenges around efficiency and resource use.</p>
<p>In this article, we’ll dive into the current landscape of low-bit compute for efficient inference. We’ll cover the different types of quantization, why and when they’re needed, and the key optimization challenges required to deploy advanced AI models in production.</p>

</div>
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-eb38c73633" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="-the-cost-of-running-modern-models">
    <h2 class="dr-article-content__section-title"> The cost of running modern models</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>At Dropbox, almost all the models used in-house are attention-based architectures used for tasks like understanding text, images, videos, and audio—core capabilities behind <a href="https://dash.dropbox.com/" target="_blank">Dash’s</a> ability to search, summarize, and reason over large collections of user content. As these models grow in size and complexity, efficiently serving them in production becomes a central challenge for delivering responsive user experiences. In attention-based models, most of the compute comes from repeated matrix multiplications in two main parts of the model:</p>
<p>The first is the <b>linear layers</b>, which apply learned matrix transformations to the model&#8217;s internal representations. These include: </p>
<ul>
<li>The layers used within attention blocks, the components that determine how different parts of the input relate to one another</li>
<li>MLP layers, which further process and refine those representations</li>
<li>The model’s final output stage, where those representations are converted into a concrete result, such as a prediction or response</li>
</ul>
<p>The second is the <b>attention mechanism</b> itself, where the model evaluates relationships across the input to determine which information is most relevant, a step that significantly increases compute cost with longer context sizes.</p>
<p>On GPUs, these matrix multiplications are handled by specialized hardware. NVIDIA GPUs use Tensor Cores, while AMD GPUs use Matrix Cores. These dedicated processors are accessed through matrix multiply-accumulate (MMA) instructions and are designed specifically to accelerate matrix operations—the heavy-duty math that underpins large-scale linear algebra in neural networks—delivering substantial performance gains compared to executing the same work on general-purpose CUDA Cores.</p>
<p>One notable property of these cores is their scaling behavior. As numerical precision is reduced, these cores can perform more matrix operations per second, typically resulting in higher FLOPS (floating point operations per second, or how much math the hardware can do in a given time). In practice, halving the precision often allows these cores to roughly double throughput. This scaling behavior plays a key role in improving both performance and efficiency when running large-scale AI workloads.</p>
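<p>A back-of-the-envelope sketch makes this concrete. A dense matmul of an <span class="dr-code">m&#215;k</span> matrix by a <span class="dr-code">k&#215;n</span> matrix costs about <span class="dr-code">2mnk</span> floating-point operations, so if peak FLOPS roughly doubles when precision halves, the same matmul finishes in about half the time. The peak numbers below are hypothetical placeholders, not measured figures:</p>

```python
# Illustrative arithmetic only: how Tensor Core FLOPS scaling with precision
# translates into matmul runtime. Peak-FLOPS values are made up for the demo.

def matmul_flops(m, n, k):
    """A dense (m x k) @ (k x n) matmul costs ~2*m*n*k FLOPs."""
    return 2 * m * n * k

def est_seconds(m, n, k, peak_flops):
    """Idealized runtime at a given sustained FLOPS rate."""
    return matmul_flops(m, n, k) / peak_flops

fp16_peak = 1e15                 # hypothetical peak at 16-bit precision
fp8_peak = 2 * fp16_peak         # ~2x throughput when precision halves
t16 = est_seconds(4096, 4096, 4096, fp16_peak)
t8 = est_seconds(4096, 4096, 4096, fp8_peak)
assert abs(t16 / t8 - 2.0) < 1e-9   # halving precision ≈ halving runtime
```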

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%201.png/_jcr_content/renditions/Diagram%201.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%201.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="3a157dcd-e5e0-4635-a324-f5d58d05bc85:Diagram 1.png" data-trackable="true" height="1440" width="2880"/>
    

            <figcaption class="dr-typography-t5 dr-color-ink-60 dr-image-rte"><p style="text-align: center;">Fig. 1: Tensor Core dense matrix multiplication performance (FLOPs) across different NVIDIA RTX 6000 variants and data precisions</p>
</figcaption>
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Lowering numerical precision is accomplished through <b>quantization</b>, a technique that reduces the number of bits used to represent numerical values. By quantizing tensors, for example, from 16-bit to 8-bit or 4-bit, the memory footprint is reduced because each element requires fewer bits. This is typically done by rescaling the data to fit within a smaller representable range. For instance, 8-bit quantization maps values to 256 bins, restricting each tensor element to one of these discrete levels while approximating the original floating-point values. Quantization to lower than 8 bits typically requires an additional process called <b>bitpacking</b>, where multiple low-bit elements are combined into a native data type such as <span class="dr-code" title="undefined">uint8</span> or <span class="dr-code" title="undefined">int32</span>, since 4-bit formats are not natively supported.</p>
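<p>The two steps described above, rescaling into a smaller representable range and bitpacking sub-byte values, can be illustrated in miniature. This is a toy sketch, not a production kernel; real implementations also track per-group scales and zero-points:</p>

```python
# Step 1: affine-quantize floats into 256 bins (8-bit).
# Step 2: bitpack 4-bit values (0-15) two-per-byte, since there is no
# native 4-bit data type.

def quantize_u8(xs):
    """Map a list of floats onto 0-255 integer bins."""
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / 255 or 1.0            # guard against constant input
    q = [round((x - lo) / scale) for x in xs]
    return q, scale, lo                        # dequantize: x ≈ q * scale + lo

def pack_u4(values):
    """Pack pairs of 4-bit ints (0-15) into bytes, low nibble first."""
    assert len(values) % 2 == 0 and all(0 <= v <= 15 for v in values)
    return bytes(lo | (hi << 4) for lo, hi in zip(values[::2], values[1::2]))

def unpack_u4(packed):
    return [n for b in packed for n in (b & 0x0F, b >> 4)]

q, scale, zero = quantize_u8([-1.0, 0.0, 0.5, 1.0])   # q holds 0-255 ints
assert unpack_u4(pack_u4([1, 15, 7, 0])) == [1, 15, 7, 0]  # 2 bytes, not 4
```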
<p>Lowering precision not only improves speed and memory usage, but also improves energy efficiency, since lower-bit data requires less power for both memory transfer and computation. For instance, with FP4 support, Blackwell offers <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/" target="_blank">significant energy savings compared to the H100</a>.</p>
<p>There have also been attempts to explore lower bits such as <a href="https://arxiv.org/abs/2310.11453" target="_blank">binary</a> and <a href="https://arxiv.org/abs/2402.17764" target="_blank">ternary</a> weights (restricting weights to two or three discrete levels), which would offer even more theoretical energy efficiency. However, this form of quantization isn’t well suited for modern GPUs because it can’t fully leverage Tensor/Matrix Cores. Although there have been experimental efforts to explore <a href="https://arxiv.org/html/2502.16473v1" target="_blank">custom hardware or specialized accelerators</a> tailored to such schemes, this approach hasn’t yet seen broad industry adoption as a result of limited ecosystem support and model quality concerns. In short, while lower precision can dramatically improve efficiency, real-world gains depend on how well those formats are supported by existing hardware and software ecosystems.</p>
<p>In the following section, we examine different quantization configurations and highlight their key trade-offs when deployed on modern GPUs.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="understanding-quantization-formats">
    <h2 class="dr-article-content__section-title">Understanding quantization formats</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Quantization is not a single technique, but a family of approaches that differ in how numerical values are represented, scaled, and executed on hardware. These design choices directly affect model accuracy, performance, and how efficiently modern GPUs can accelerate inference. As a result, quantization formats are closely tied to the capabilities and constraints of the underlying hardware. </p>
<p>In practice, these differences matter because Dropbox runs a diverse set of AI workloads—such as multimedia understanding—across multiple generations of hardware, each with distinct performance characteristics. Some workloads are highly latency sensitive, prioritizing fast per-request execution, while others are throughput oriented and optimized for processing large volumes of data efficiently. Quantization formats influence how well a model can adapt to these constraints, determining whether computation is bound by software overhead, memory bandwidth, or specialized hardware units like Tensor Cores. Framing quantization through this lens helps clarify why different formats unlock different tradeoffs across our stack, and why no single approach is optimal for every workload we deploy.</p>
<p>With the introduction of the <a href="https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf" target="_blank">MXFP</a> microscaling format, which standardizes low-bit data types with native hardware support, quantization methods for large language models can be broadly grouped into two categories: pre-MXFP formats, which rely on explicit dequantization and software-managed scaling, and MXFP formats, which move these operations directly into Tensor Core hardware. The sections below walk through both approaches, highlighting how they differ in practice and why those differences matter for real-world inference workloads.</p>

</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<h3>Pre-MXFP formats</h3>
<p>Prior to the introduction of MXFP, quantization primarily relied on integer data types for sub-byte formats. Common configurations included A16W4 (16-bit activations, 4-bit weights) for weight-only quantization, and either integer or floating-point formats for activations, such as A8W8 (8-bit activations, 8-bit weights). While 8-bit formats can often be applied with little or no calibration, sub-byte weight quantization generally requires calibration or more advanced algorithms to maintain model quality. For example, A16W4 relies on techniques such as <a href="https://arxiv.org/abs/2306.00978" target="_blank">AWQ</a> or <a href="https://dropbox.tech/machine-learning/halfquadratic-quantization-of-large-machine-learning-models" target="_blank">HQQ</a>—quantization methods designed to preserve model quality at low bit widths—while lower-bit formats like A16W3, A16W2, and <a href="https://github.com/microsoft/BitNet" target="_blank">BitNet</a> require increasingly sophisticated quantization-aware training methods to achieve acceptable accuracy.</p>
<p>When activations and weights use different data types, the typical approach is to explicitly dequantize the lower-bit tensors to match the higher-precision format before performing the matrix multiply-accumulate (MMA) operation. This strategy can improve performance in memory-bound scenarios, where reducing data movement is the primary concern. However, in compute-bound workloads, the additional dequantization step can offset these gains and even slow execution due to the extra arithmetic involved.</p>
<p>This trade-off is especially visible in weight-only quantization, which reduces data transfer but adds dequantization work without reducing the cost of the matrix multiplications themselves. The choice between activation quantization (such as A8W8) and weight-only quantization (such as A16W4) ultimately depends on the characteristics of the inference workload. Weight-only quantization often performs better in local deployments with smaller batch sizes and reasoning-heavy tasks, where memory bandwidth is a limiting factor. In contrast, activation quantization tends to be more effective for large-context prefills and high-throughput serving scenarios, where compute becomes the dominant bottleneck.</p>
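<p>The explicit dequantize-then-multiply pattern described above can be sketched in a few lines of NumPy. This is an illustrative simulation only (symmetric per-group 4-bit weights; the function names are ours), not a production kernel: a real implementation would pack two 4-bit values per byte and fuse the dequantization into the GEMM.</p>

```python
import numpy as np

def quantize_w4_symmetric(w, group_size=64):
    """Symmetric 4-bit weight quantization with per-group scales.
    Each contiguous group of `group_size` elements along the input
    dimension shares one scale; values are rounded into [-8, 7]."""
    out_dim, in_dim = w.shape
    wg = w.reshape(out_dim, in_dim // group_size, group_size)
    scale = np.abs(wg).max(axis=-1, keepdims=True) / 7.0
    wq = np.clip(np.round(wg / scale), -8, 7).astype(np.int8)
    return wq, scale

def dequant_matmul(x, wq, scale):
    """Explicit dequantization before the matmul: the extra multiply
    saves memory traffic, not compute, which is why A16W4 helps in
    memory-bound but not compute-bound regimes."""
    w = (wq * scale).reshape(wq.shape[0], -1)  # back to full precision
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)
x = rng.standard_normal((4, 256)).astype(np.float32)
wq, scale = quantize_w4_symmetric(w)
y_q = dequant_matmul(x, wq, scale)  # approximates x @ w.T
```

<p>Even in this toy form, the structure makes the trade-off visible: the quantized weights occupy roughly a quarter of the memory, but the matmul itself still runs at full precision.</p>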

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%202.png/_jcr_content/renditions/Diagram%202.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%202.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="98afad31-040a-4713-9d71-e531f3f36502:Diagram 2.png" data-trackable="true" height="2080" width="2880"/>
    

<figcaption class="dr-typography-t5 dr-color-ink-60 dr-image-rte"><p style="text-align: center;">Fig. 2: A8W8 vs. A16W4 decoding performance across various batch sizes. A8W8 tends to outperform A16W4 in more compute-bound scenarios. A16W4 tends to perform worse than 16-bit matrix multiplication due to the additional cost of explicit dequantization.</p>
</figcaption>
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Popular methods such as <a href="https://arxiv.org/abs/2306.00978" target="_blank"><b>AWQ</b></a> and <a href="https://github.com/dropbox/hqq" target="_blank"><b>HQQ</b></a> rely on <b>linear quantization with grouping</b>, a design that balances efficiency with accuracy. In symmetric linear quantization, dequantization is expressed as a simple scaling operation. A more flexible variant, asymmetric linear quantization, introduces an additional offset, allowing dequantization to be implemented as a fused multiply-add operation that maps efficiently to modern GPU hardware.</p>
<p>Grouping further improves accuracy by assigning shared parameters to small blocks of tensor elements rather than individual values. These groups typically consist of contiguous elements of size 32, 64, or 128. While simple, this approach substantially reduces quantization error at low-bit widths and has become a core component of most practical low-bit quantization schemes.</p>
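<p>Concretely, the decomposition in Fig. 3 can be sketched with NumPy. This is a minimal illustration of asymmetric linear quantization with grouping (function names are ours, not from AWQ or HQQ); dequantization reduces to the fused multiply-add W ≈ (Wq − z) · s.</p>

```python
import numpy as np

def quantize_asym(w, bits=4, group_size=64):
    """Asymmetric linear quantization with grouping (cf. Fig. 3):
    W is decomposed into a low-bit tensor Wq plus per-group
    floating-point scales s and zero-points z, so that
    dequantization is the fused multiply-add  W ~ (Wq - z) * s."""
    qmax = 2**bits - 1
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    s = (hi - lo) / qmax               # one scale per group
    z = np.round(-lo / s)              # one zero-point per group
    wq = np.clip(np.round(g / s + z), 0, qmax).astype(np.uint8)
    return wq, s, z

def dequantize(wq, s, z, shape):
    return ((wq - z) * s).reshape(shape)  # fused multiply-add form

w = np.random.default_rng(1).standard_normal((64, 128)).astype(np.float32)
wq, s, z = quantize_asym(w)
w_hat = dequantize(wq, s, z, w.shape)
```

<p>Because s and z are shared per group of 64 values rather than per element, the metadata overhead is only a fraction of a bit per weight, while quantization error stays local to each group.</p>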

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%203.png/_jcr_content/renditions/Diagram%203.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%203.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="8c6a32b1-c171-4fef-ae0a-f492a4f529a4:Diagram 3.png" data-trackable="true" height="1560" width="2880"/>
    

            <figcaption class="dr-typography-t5 dr-color-ink-60 dr-image-rte"><p style="text-align: center;">Fig. 3: Linear quantization overview where a matrix W is decomposed into Wq (low-bit tensor) and additional floating-point scales (s) and zero-points (z)</p>
</figcaption>
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>On the activation side, two 8-bit approaches are commonly used: channel-wise quantization and per-block quantization. Channel-wise quantization is straightforward and efficient, making it well suited for on-the-fly inference. The required rescaling can be applied directly after matrix multiplication, allowing for a highly efficient implementation on modern GPUs.</p>
<p>Per-block quantization, popularized by systems such as <a href="https://arxiv.org/abs/2403.12422" target="_blank">JetFire</a> and <a href="https://github.com/deepseek-ai/DeepSeek-V3" target="_blank">DeepSeek V3</a>, takes a more fine-grained approach. By dividing tensors into small tiles and assigning an independent scale to each block, this method limits the impact of outliers and reduces quantization error. It is particularly effective in quantization-aware training, where preserving pre-training accuracy is critical, while still delivering practical Tensor Core speedups.</p>
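<p>The per-block idea can be sketched in NumPy as follows. This is illustrative only (a 32×32 tile and symmetric int8; real systems choose tile shapes per tensor type): each tile carries its own scale, so a single outlier inflates the scale of just one block rather than a whole row or tensor.</p>

```python
import numpy as np

def quantize_per_block(x, tile=32):
    """Per-block (tiled) 8-bit quantization: each tile-by-tile block
    gets an independent scale, limiting the blast radius of outliers."""
    rows, cols = x.shape
    xt = x.reshape(rows // tile, tile, cols // tile, tile)
    scale = np.abs(xt).max(axis=(1, 3), keepdims=True) / 127.0
    xq = np.clip(np.round(xt / scale), -127, 127).astype(np.int8)
    return xq, scale

def dequantize_per_block(xq, scale, shape):
    return (xq * scale).reshape(shape)

x = np.random.default_rng(2).standard_normal((64, 64)).astype(np.float32)
x[0, 0] = 50.0  # a single outlier only degrades its own 32x32 tile
xq, s = quantize_per_block(x)
x_hat = dequantize_per_block(xq, s, x.shape)
```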
<p>Beyond linear quantization, several <b>non-linear approaches</b>, including <a href="https://arxiv.org/abs/2402.04396" target="_blank">QuIP#</a> and <a href="https://arxiv.org/abs/2402.15319" target="_blank">GPTVQ</a>, have explored alternative representations to push precision even lower. While these methods can achieve higher accuracy at very low bit widths, they face practical challenges. Linear 4-bit quantization already delivers <a href="https://arxiv.org/abs/2411.02355" target="_blank">strong accuracy</a> and can often be applied on the fly using techniques such as HQQ, avoiding expensive offline quantization passes. In addition, deploying non-linear formats efficiently requires <b>custom fused kernels</b> and deep integration into inference frameworks. Even then, low-bit weights must still be converted into a form compatible with Tensor Cores, making linear quantization both simpler and more practical on current GPU architectures.</p>
<p>Quantization techniques are also well-suited for optimizing the attention module. Methods such as <a href="https://arxiv.org/abs/2407.08608" target="_blank">Flash Attention 3</a> and <a href="https://github.com/thu-ml/SageAttention" target="_blank">Sage Attention</a> use 8-bit quantization to accelerate attention-related matrix multiplications, improving throughput and memory efficiency with minimal impact on model accuracy.</p>

</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<h3>MXFP formats</h3>
<p>The <a href="https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf" target="_blank">MXFP microscaling format</a> introduces a new standard for low-bit data types that fundamentally changes how quantized models run on modern GPUs. Unlike earlier formats, MXFP provides <b>native hardware support for quantization</b>, allowing Tensor Cores to operate directly on quantized activations, weights, and their associated scaling factors in a single fused operation. In contrast, pre-MXFP approaches required explicit dequantization steps before or after matrix-multiply-accumulate (MMA) operations, adding overhead and limiting achievable performance.</p>
<p>MXFP quantizes both activations and weights using a micro-scaling approach, similar in spirit to methods like <b>AWQ</b> and <b>HQQ</b> discussed earlier, but implemented directly in hardware. It uses symmetric quantization with a fixed block size of 32 and applies shared scaling factors stored in the E8M0 format. MXFP also supports mixed-precision MMA operations on some hardware, such as MXFP8 × MXFP4, giving practitioners flexibility to balance performance and accuracy. For example, activations can use MXFP8, MXFP6, or MXFP4 while the weights can remain in MXFP4. A breakdown of the MX types is shown in the table below (source: <a href="https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf" target="_blank">Open Compute Project</a>, <i>OCP Microscaling Formats (MX) Specification, Version 1.0, </i>Table 1).</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%204.png/_jcr_content/renditions/Diagram%204.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/diagrams/Diagram%204.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="4e904feb-ec81-4112-ba37-af51483eb324:Diagram 4.png" data-trackable="true" height="1920" width="2880"/>
    

            <figcaption class="dr-typography-t5 dr-color-ink-60 dr-image-rte"><p style="text-align: center;">Fig. 4: MX dtype breakdown</p>
</figcaption>
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>The E8M0 format for the scales represents positive powers of two in the range [2⁻¹²⁷, 2¹²⁷]. The scales are typically quantized as follows: <span class="dr-code" title="undefined">scale = weight.amax(axis=1, keepdim=True) / max_val</span>. As a result, scale values are effectively limited to values at or below 1, and extremely small magnitudes are rarely needed. In many cases, values as small as 2⁻¹⁵ are sufficient to capture near-zero weights. This observation suggests that scales could theoretically be represented with fewer bits than E8M0, although doing so would introduce additional complexity.</p>
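<p>The scale computation described above can be sketched as follows. This is an illustrative NumPy version, not any library's exact recipe (rounding conventions vary across implementations; here the ideal scale is rounded up to the next power of two so no element overflows the FP4 E2M1 range, whose maximum magnitude is 6.0).</p>

```python
import numpy as np

def e8m0_scale(w, max_val=6.0, block=32):
    """Power-of-two (E8M0-style) shared scales for blocks of 32.
    The ideal scale amax / max_val is rounded up to a power of two
    so it can be stored as a bare 8-bit exponent; rounding up keeps
    every scaled element within [-max_val, max_val]."""
    g = w.reshape(-1, block)
    ideal = np.abs(g).max(axis=1, keepdims=True) / max_val
    e8m0 = np.exp2(np.ceil(np.log2(ideal)))  # exponent-only scale
    return e8m0, ideal

w = np.random.default_rng(3).standard_normal((8, 32)).astype(np.float32)
scale, ideal = e8m0_scale(w)
# The gap between `scale` and `ideal` is the cost of constraining
# scales to powers of two.
```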
<p>While E8M0 offers hardware-friendly implementation and flexibility, constraining scale values strictly to powers of two leads to a noticeable accuracy drop when using MXFP4. Fortunately, this loss can largely be mitigated through simple post-training adjustments, restoring most of the original model quality, as we demonstrated in <a href="https://dropbox.github.io/fp4_blogpost/" target="_blank">our blog post</a>.</p>
<p>To address remaining numerical limitations, NVIDIA introduced <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/" target="_blank">NVFP4</a> as an alternative to MXFP4. NVFP4 uses a smaller group size of 16 rather than 32 and employs E4M3 FP8 scaling factors, providing higher precision for scale representation. Because FP8 has a relatively large minimum representable value, a global per-tensor floating-point multiplier is applied to normalize the scaling range, achieving improved numerical stability.</p>
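<p>A sketch of this two-level scaling idea (our simplified reading of the scheme, not NVIDIA's exact recipe; rounding of the local scales to E4M3 itself is omitted): the global multiplier maps the largest per-group scale onto E4M3's maximum representable magnitude of 448.</p>

```python
import numpy as np

FP4_MAX = 6.0     # E2M1 max magnitude
E4M3_MAX = 448.0  # FP8 E4M3 max magnitude

def nvfp4_scales(w, group=16):
    """Two-level NVFP4-style scaling: one fp32 per-tensor multiplier
    normalizes the per-group scales into E4M3's range, then each
    group of 16 elements stores one 8-bit (E4M3) scale."""
    amax = np.abs(w.reshape(-1, group)).max(axis=1, keepdims=True)
    ideal = amax / FP4_MAX                 # fp32 per-group scale
    global_scale = ideal.max() / E4M3_MAX  # per-tensor multiplier
    local = ideal / global_scale           # fits within (0, 448]
    return global_scale, local

w = np.random.default_rng(4).standard_normal((32, 64)).astype(np.float32)
global_scale, local = nvfp4_scales(w)
```

<p>The smaller group size of 16 and the finer-grained FP8 scales are what give NVFP4 its improved numerical stability relative to MXFP4's power-of-two E8M0 scales.</p>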
<p>Although MXFP4 and NVFP4 are standardized formats, their implementation depends on the GPU architecture. Different compute capabilities rely on different Tensor Core instructions. For example, sm_100 architectures use the <span class="dr-code" title="undefined">tcgen05.mma</span> instruction, while sm_120 architectures use <span class="dr-code" title="undefined">mma.sync</span>, both incorporating the <span class="dr-code" title="undefined">block_scale</span> modifier. As a result, kernels compiled for sm_100 are not portable to sm_120 due to these instruction-level differences. While most of the mainstream AI software stack remains focused on server-grade GPUs like the B200 and B300, there has been significant recent progress toward improving portability of low-bit workloads. Notably, <a href="https://github.com/triton-lang/triton/pull/8494" target="_blank">Triton</a> has introduced support for MXFP on sm_120 devices, enabling greater flexibility and cross-device compatibility for <a href="https://github.com/dropbox/gemlite/" target="_blank">low-bit Triton kernels</a>.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="looking-forward">
    <h2 class="dr-article-content__section-title">Looking forward</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>In this article, we explored several quantization techniques that are widely adopted across the industry to accelerate AI workloads. These approaches unlock substantial gains in efficiency and throughput, making it possible to deploy increasingly large and capable models within practical hardware, cost, and energy constraints.</p>
<p>At Dropbox, these considerations are central to how we build and operate products like Dash. Dash relies on large-scale models for experiences such as conversational AI, multimodal search, document understanding, and speech processing, all of which must meet strict latency, reliability, and cost requirements. To satisfy these constraints in production, we already employ a range of quantization strategies to optimize model deployment and fully utilize modern accelerators. The techniques discussed here reflect the kinds of trade-offs we evaluate when deciding how and where to run models across our infrastructure.</p>
<p>Despite the progress, important limitations remain. In real-world deployments, adoption of formats such as MXFP and NVFP is still evolving, and support for FP4 quantization remains incomplete across popular frameworks and model stacks. For example, many open-source runtimes don’t yet provide full support across different GPU architectures, and FP4 models are not yet widely available.</p>
<p>As hardware continues to evolve and the industry pushes toward lower-bit compute, these challenges will only become more pronounced. In our view, making low-bit inference viable for production systems like Dash will require tighter software design, more mature framework support, and new quantization techniques that preserve model quality at scale. We view this as an active area of exploration, one that will directly shape how we deliver fast, reliable, and efficient AI-powered experiences to Dropbox users in the years ahead.</p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/headers/MachineLearning-Low-Bit Inference-375x150-dark.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/february/low-bit-inference/headers/MachineLearning-Low-Bit Inference-375x150-dark.png" medium="image">
                    <media:title type="html">How low-bit inference enables efficient AI</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>Insights from our executive roundtable on AI and engineering productivity</title>
                        
            			<link>https://dropbox.tech/culture/insights-from-our-executive-roundtable-on-ai-and-engineering-productivity</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon,Hicham Badri,Appu Shaji,Craig Wilhite
            			</dc:creator>
            			
            				<category>Culture</category>
							
            				<category>AI</category>
							
            				<category>developer productivity</category>
							
            				<category>Events</category>
							
                            
            			<description><![CDATA[From Claude Code to Cursor, we're big adopters of AI coding tools at Dropbox. The early results have been promising, but there are still a lot of open questions about how to work with these tools most effectively and where they can have the most impact. To push this conversation forward, we hosted an executive roundtable at our San Francisco studio. Here's how it went.]]></description>
            			<guid>https://dropbox.tech/culture/insights-from-our-executive-roundtable-on-ai-and-engineering-productivity</guid>
                        <pubDate>Wed, 11 Feb 2026 09:00:00 -0800</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Improving engineering productivity is crucial to the work we do at Dropbox. The more quickly we can deliver high-quality features to our customers, the more value they can get from our products. This rapid iteration has been key to developing tools like <a href="http://dropbox.com/dash" target="_blank">Dropbox Dash</a>, context-aware AI that connects to all your work apps, so you can search, ask questions about, and organize all your content.</p>
<p>In the process of building Dash, we’ve become big adopters of AI tools in our own work, from Claude Code to Cursor. The early results have been promising, but there are still a lot of open questions about how to work with these tools most effectively and where they can have the most impact. To push this conversation forward, <a href="https://dropbox.tech/culture/ai-adoption-productivity-dropbox-cto-ali-dasdan" target="_blank">Dropbox CTO Ali Dasdan</a> hosted an executive roundtable on December 11, 2025, at our San Francisco studio. We brought together a small group of technology leaders from top companies for an afternoon of open discussion, idea-sharing, and a deep dive into the evolving world of engineering productivity and AI. Here’s how it went.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="how-dropbox-is-accelerating-progress-with-ai">
    <h2 class="dr-article-content__section-title">How Dropbox is accelerating progress with AI</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Adopting AI tooling for the sake of AI is meaningless; it must be tied to tangible business results. As we navigate this shift, we’ve had to ask ourselves: Which approach is the right one? What existing processes need to be upgraded in light of AI workflows? To kick off the event—and show attendees how we’ve been thinking through these questions at Dropbox—Uma Namasivayam, Senior Director of Engineering Productivity, took a closer look at our own experimentation, adoption, and enablement cycle to accelerate engineering productivity with AI.</p>
<p>We started by working with Dropbox leadership to gain buy-in and establish the importance of AI tooling, and together made AI adoption a company-level priority. This turned AI from a grassroots experiment into an urgent organizational priority, and helped everyone get aligned. Teams were now empowered to experiment with tooling, and we reduced the overhead associated with getting contracts approved to pilot new tooling at Dropbox. </p>
<p>In our experimentation, Dropbox saw impact across the entire software development life cycle, from code review and documentation to debugging and testing. Like other large organizations, Dropbox has its own unique challenges. Off-the-shelf AI tools don’t always fit our scale constraints—we have a very large, multi-language monorepo—so we’ve had to be deliberate about where to adopt, where to extend, and where to build our own capabilities. For example, Dropbox built our own AI tooling that listens for failed builds on pull requests and uses our AI platform to propose fixes to them.</p>
<p>As a result of our efforts, most Dropbox developers are now using at least one AI tool in their workflows. We track pull request (PR) throughput per month, per engineer as a core metric. You can see how users who are engaging more with AI coding tools have an outsized impact on the code shipped, measured by PR throughput per month.</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-one.png/_jcr_content/renditions/cto-roundtable-one.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-one.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="A graph showing pull request throughput per month, per engineer." data-aem-asset-id="107192d5-ff1f-4374-8cc7-99ec27414ed3:cto-roundtable-one.png" data-trackable="true" height="988" width="1600"/>
        
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>We also closely monitor the sentiment of engineers internally regarding AI tooling. As strong positive sentiment increases, we’re seeing the share of negative sentiment go down.</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-two.png/_jcr_content/renditions/cto-roundtable-two.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-two.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="A graph showing the impact of AI on developer productivity has grown more positive over time, according to surveys of Dropboxer engineers." data-aem-asset-id="e425378a-2b3c-46cb-9a07-aadbd55115d0:cto-roundtable-two.png" data-trackable="true" height="742" width="1200"/>
        
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Most importantly, developers feel less friction using AI to accelerate their work because we’ve made it easier to adopt tooling according to what they feel works best for their team.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="-the-executive-roundtable">
    <h2 class="dr-article-content__section-title">The executive roundtable</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>The heart of the evening was a roundtable discussion designed to cross-pollinate ideas across different industries. To facilitate this, we divided attendees into three cohorts, rotating the groups for each question so that every leader could learn from three different peer groups.</p>
<p>The discussion centered around three core pillars:</p>
<ol>
<li><b>Measuring impact.</b> What are the top three ways attendees are measuring AI-driven engineering productivity gains and what are the top three ways of measuring the resulting business impact?</li>
<li><b>Leadership alignment.</b> What are three ways of aligning with company leadership on the progress and pace of AI deployment and use for productivity?</li>
<li><b>The human element.</b> What are the top three ways attendees are recruiting, evaluating, and growing their workforce for AI competency and productivity? What lessons can be applied to make non-developers more productive?</li>
</ol>
<p>Following the structured session, the conversation continued over a cocktail hour, where leaders shared further insights into the commitment to craft required to lead in the age of AI.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="what-we-learned-and-whats-next-">
    <h2 class="dr-article-content__section-title">What we learned, and what’s next </h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>The overarching themes that emerged from the roundtable discussions centered around the following:</p>
<ol>
<li><b>Balance.</b> Productivity gains must be carefully balanced against potential trade-offs in quality and long-term maintenance costs.</li>
<li><b>The role of leadership.</b> Management, particularly technical leadership, is pivotal in establishing and enforcing effective AI usage norms.</li>
<li><b>Formalization.</b> Formalizing AI competency within career frameworks signals a long-term commitment to its strategic importance.</li>
</ol>
<p>Still, there are a number of open questions, such as: If AI is giving us more capacity, where is that capacity actually going? For Dropbox, this capacity is currently being channeled into areas like addressing tech debt, executing migrations, and improving reliability. </p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-three.png/_jcr_content/renditions/cto-roundtable-three.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-three.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="A graph showing the different categories in which PR throughput has increased since we introduced AI tools for developers." data-aem-asset-id="f83d193c-0eca-48c9-a43b-cb5234977e1c:cto-roundtable-three.png" data-trackable="true" height="672" width="2394"/>
        
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>However, a key challenge remains in effectively connecting these productivity gains to tangible business outcomes—a challenge also voiced by many attendees during the roundtable. Therefore, the focus for 2026 will be on mapping productivity directly to specific outcomes, extending operational rigor beyond engineering teams, and ultimately driving end-to-end product velocity.</p>
<p>A huge thank you to everyone who made the trip to our San Francisco studio and contributed to such a memorable event. If you missed out this time, keep an eye on our <a href="https://events.dropbox.com/" target="_blank">events page</a> for future opportunities to connect!</p>
<p style="text-align: center;">~ ~ ~ </p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-1440x305-light.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/february/cto-roundtable/cto-roundtable-1440x305-light.png" medium="image">
                    <media:title type="html">Insights from our executive roundtable on AI and engineering productivity</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>Engineering VP Josh Clemm on how we use knowledge graphs, MCP, and DSPy in Dash</title>
                        
            			<link>https://dropbox.tech/machine-learning/vp-josh-clemm-knowledge-graphs-mcp-and-dspy-dash</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon,Hicham Badri,Appu Shaji,Craig Wilhite,Josh Clemm
            			</dc:creator>
            			
            				<category>context engine</category>
							
            				<category>Dash</category>
							
            				<category>Leadership</category>
							
            				<category>LLM</category>
							
            				<category>knowledge graphs</category>
							
            				<category>MCP</category>
							
            				<category>DSPy</category>
							
            				<category>RAG</category>
							
                            
            			<description><![CDATA[Engineering VP Josh Clemm deep-dives into how we think about knowledge graphs, indexes, MCP, and prompt optimization using tools like DSPy.]]></description>
            			<guid>https://dropbox.tech/machine-learning/vp-josh-clemm-knowledge-graphs-mcp-and-dspy-dash</guid>
                        <pubDate>Wed, 28 Jan 2026 10:00:00 -0800</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p><i>I was recently a guest speaker in Jason Liu’s online course on RAG offered by the education platform Maven. I did some mini deep-dives into what we’ve been doing at Dropbox with knowledge graphs; how we’re thinking about indexes, MCP, and tool calling in general; some of the work we do with LLM as a judge; and how we use prompt optimizers like DSPy. This is an edited and condensed version of my talk. Visit Maven to <a href="https://maven.com/p/7d35a7/dropbox-knowledge-graphs-prompt-optimizers-and-mc-ps" target="_blank">watch the full video and hear my Q&A</a> with Jason and his students. — Josh Clemm, vice president of engineering for Dropbox Dash</i></p>
<p style="text-align: center;">~ ~ ~</p>
<p>I don't know about you, but I probably have about 50 tabs open right now—and at least another 50 accounts for other SaaS apps. It’s completely overwhelming. It means your content is all over the place, and that makes it very, very hard to find what you're looking for. The good news is we have all these amazing LLMs coming out every day that can tell you about <a href="https://x.com/drewhouston/status/1981440222484451444" target="_blank">quantum physics</a>. But the bad news is they don’t have access to your content. All of your work content is proprietary. It's within your walled garden. It means most LLMs can’t help when it comes to your work.</p>
<p>That’s why we’ve been building <a href="https://dash.dropbox.com/" target="_blank">Dropbox Dash</a>. It doesn't just look at your Dropbox content. It connects to all your third-party apps and brings it into one place, so you can search, get answers, and do the agentic queries that you want to do at work.</p>
<p>Here’s a brief primer on our tech stack and how Dash works.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="the-context-engine-that-makes-dash-possible">
    <h2 class="dr-article-content__section-title">The context engine that makes Dash possible</h2>
</div>
</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/yHmKaGFo.png/_jcr_content/renditions/yHmKaGFo.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/yHmKaGFo.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="190351d6-b756-40f4-9692-5cc2fb482cac:yHmKaGFo.png" data-trackable="true" height="900" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>First, we have our <b>connectors</b>. This is where we build custom crawlers and connect to all these different third-party apps. It’s not easy: every app has its own rate limits, its own unique API quirks, its own ACL and permission system, and so on. But getting that right is essential, and getting all that content in one place is the goal.</p>
<p>Next, we do a lot of <b>content understanding</b>—and in certain cases, enrich the content itself. First we normalize the many different files that come in and get them into a format like markdown. Then we extract key information: titles, metadata, links. We also generate different embeddings.</p>
<p>For documents, this is fairly straightforward. Just grab the text, extract it, throw it in the index, and you're done. Images require media understanding. CLIP-based models are a good start, but complex images need true multimodal understanding. Then you get to PDFs, which might have text <i>and</i> figures and more. Audio clips need to be transcribed. And finally you get to videos. What if a client has a video like <a href="https://www.youtube.com/watch?v=E8WaFvwtphY" target="_blank">this very famous scene from <i>Jurassic Park</i></a>? How would you find it later? There's no dialogue, so you can't rely on pure transcription. This is where you'd need a multimodal model to extract certain scenes, generate understanding for each one, and then store that.</p>

</div>
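The per-media-type understanding steps above can be sketched as a simple dispatch: route each incoming file to the right normalizer and emit a common markdown-plus-metadata shape. This is a minimal illustration, not Dash's actual internals; all names and the stubbed model calls are hypothetical.

```python
# Illustrative sketch of per-media-type content normalization.
# The model-call stubs at the bottom stand in for real extraction services.
from dataclasses import dataclass, field

@dataclass
class NormalizedDoc:
    title: str
    markdown: str                               # normalized body text
    links: list = field(default_factory=list)
    metadata: dict = field(default_factory=dict)

def normalize(doc: dict) -> NormalizedDoc:
    """Route each incoming file to the right understanding step."""
    kind = doc["mime_type"].split("/")[0]
    if kind == "text":
        body = doc["bytes"].decode("utf-8")     # plain text extraction
    elif kind == "image":
        body = caption_image(doc["bytes"])      # multimodal captioning
    elif kind == "audio":
        body = transcribe(doc["bytes"])         # speech-to-text
    elif kind == "video":
        # Scene-level understanding: describe each extracted scene.
        body = "\n".join(caption_image(s) for s in extract_scenes(doc["bytes"]))
    else:
        body = extract_text(doc["bytes"])       # e.g. PDF text + figures
    return NormalizedDoc(title=doc.get("title", ""), markdown=body)

# Hypothetical stubs standing in for the model calls described in the text.
def caption_image(b): return "[image description]"
def transcribe(b): return "[transcript]"
def extract_scenes(b): return []
def extract_text(b): return "[extracted text]"
```

In practice each branch would call a different model or service, but the output contract stays uniform so everything downstream (chunking, embedding, indexing) is media-agnostic.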
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-24ea3f3b31" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>After we understand the incoming content, we take it a step further and model all these pieces of information together as a graph. Meetings may have associated documents, associated people, transcripts, or prior notes. Building that cross-app intelligence is essential to providing better context for our users. This is where we start building the knowledge graph bundles I'll talk about in more depth later.</p>
<p>From there, all that information (embeddings, chunks, contextual graph representations) flows into our <b>highly secure data stores</b>. Today we use both a lexical index—using BM25—and then store everything as dense vectors in a vector store. While this allows us to do hybrid retrieval, we found BM25 was very effective on its own with some relevant signals. It’s an amazing workhorse for building out an index.</p>
<p>Finally, we apply multiple ranking passes on any retrieved results so they are personalized and ACL’d to you.</p>

</div>
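The hybrid retrieval described above can be sketched in miniature: a lexical BM25 pass and a dense-vector pass, blended into one ranking. A real deployment would use a search engine and an ANN vector store; this toy version, with an assumed blend weight `alpha`, just shows the shape of the combination.

```python
# Toy hybrid retrieval: BM25 lexical scores blended with cosine similarity
# over dense vectors. Parameters (k1, b, alpha) are illustrative defaults.
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the tokenized query with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = [0.0] * N
    for term in query:
        df = sum(1 for d in docs if term in d)
        if df == 0:
            continue
        idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
        for i, d in enumerate(docs):
            tf = d.count(term)
            scores[i] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(d) / avgdl))
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_rank(query_tokens, query_vec, docs_tokens, doc_vecs, alpha=0.7):
    """Blend lexical and semantic scores; alpha weights the BM25 side."""
    lex = bm25_scores(query_tokens, docs_tokens)
    sem = [cosine(query_vec, v) for v in doc_vecs]
    blended = [alpha * l + (1 - alpha) * s for l, s in zip(lex, sem)]
    return sorted(range(len(docs_tokens)), key=lambda i: -blended[i])
```

The blend weight is exactly the kind of knob the offline ranking experiments mentioned later let you tune against a relevance metric.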
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/NXkT_VUD.png/_jcr_content/renditions/NXkT_VUD.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/NXkT_VUD.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="50460fde-0e9e-467e-b88a-e069a4851bd9:NXkT_VUD.png" data-trackable="true" height="902" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Altogether, this is what we call our <b>context engine</b>. And once you have that, you can introduce APIs on top of it and <a href="https://dropbox.tech/machine-learning/building-dash-rag-multi-step-ai-agents-business-users" target="_blank">build entire products like Dash</a>.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="why-we-chose-index-based-retrieval">
    <h2 class="dr-article-content__section-title">Why we chose index-based retrieval</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Okay, but why build an index? Why did we even go down this route in the first place? Well, there's a bit of a choose-your-fighter kind of mentality in the world right now between federated retrieval and indexed retrieval. The difference is very classic software engineering. Are you going to process everything on the fly? That’s federated retrieval. Or are you going to try to pre-process it all at ingestion time? That's index-based retrieval. And there are pros and cons to each approach.</p>
<p>Federated retrieval is very easy to get up and running. You don't have to worry about storage costs. The data is mostly fresh. You can keep adding more MCP servers and new connectors. But there are some big-time weaknesses here. You're at the mercy of all these different APIs and MCP servers, which differ in speed, quality, and ranking. You’re also limited in what you can access. You can access <i>your</i> information, but you probably don’t have access to company-wide connectors—meaning you can’t access content that’s shared across the whole company. And you have to do a lot of work on the fly in post-processing. Once the data comes back, you have to merge information and potentially re-rank it. And if you're using a lot of chatbots today with MCP, you're going to see that token count go up and up. It takes a lot of tokens to reason over this amount of information.</p>
<p>On the flip side, with index-based retrieval, you <i>do</i> now have access to those company connectors. And because you have time on your side, you can pre-process that content and create really interesting enriched data sets that don't exist on their own. You can also do a lot more offline ranking experiments. You can try different methods to improve your recall, and it’s very, very fast. But it's also a ton of work—and a lot of custom work. This is not for the faint of heart. You have to write a lot of custom connectors. At ingestion time, you're going to have freshness issues if you don't handle rate limits well. It can also be extremely expensive to host this information, and then you have to decide how to store it. Am I using a vector database, like classic RAG from many years ago? Am I going the BM25 route? Do I want to do hybrid? Do I want to do a full graph RAG, which is what we ended up going with? There are a lot of decisions you have to make.</p>

</div>
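The on-the-fly work that federated retrieval pushes to query time can be reduced to a skeleton: fan out to every source in parallel, then merge and re-rank the heterogeneous results. Everything here (source callables, the scoring function) is hypothetical, just to make the trade-off concrete.

```python
# Skeleton of the federated flow: parallel fan-out, then merge + re-rank.
# Each "source" is any callable that returns a list of (doc, score) hits.
from concurrent.futures import ThreadPoolExecutor

def federated_search(query, sources, rerank):
    """Query every connected source in parallel, then merge and re-rank."""
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(lambda s: s(query), sources))
    merged = [hit for hits in result_lists for hit in hits]
    # Each source ranks differently, so a shared re-ranking pass is required
    # before results can be presented as one list.
    return sorted(merged, key=rerank, reverse=True)
```

With an index, the equivalent call is a single lookup against pre-processed content; the merge and re-rank work has already been paid for at ingestion time.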
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/JmCHl8Y4.png/_jcr_content/renditions/JmCHl8Y4.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/JmCHl8Y4.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="9f2e4f5b-91c7-453c-adbb-89ced11372f2:JmCHl8Y4.png" data-trackable="true" height="873" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="making-mcp-work-at-dropbox-scale">
    <h2 class="dr-article-content__section-title">Making MCP work at Dropbox scale</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Now what about MCP? There was a lot of hype when MCP burst onto the scene about a year ago. Everybody was talking about it: “You don't need any of these APIs anymore, you just add MCP to your agent.” Sounds great, right? But there are some major challenges with how MCP is typically implemented.</p>
<p>MCP tool definitions, in particular, take up <a href="https://dropbox.tech/machine-learning/how-dash-uses-context-engineering-for-smarter-ai" target="_blank">valuable real estate in your context window</a>. We’re noticing quite a bit of degradation in the effectiveness of our chat and agents (very classic context rot). So with Dash, we're trying to cap things to about 100,000 tokens, but those tool definitions fill up quickly. Tool results are an even bigger problem, especially if you're doing retrieval: you're getting a lot of content back, and it will immediately fill up that context window. It’s also incredibly slow. If you’re using MCP with some agents today, even a simple query can take up to 45 seconds—whereas with the raw index, you're getting all the content back within seconds. </p>
<p>Here are some of the ways we’ve solved for that:</p>
<ul>
<li>We've got our index, and we can wrap it in a single tool. Let's call it a super tool. Instead of 5-10 different retrieval tools, we just have one, which helps a ton with cleaning things up overall. </li>
<li>Modeling data within knowledge graphs can significantly cut our token usage as well, because you're really just getting the most relevant information for the query. </li>
<li>Tool results come back with a huge amount of content, so we choose to store them locally rather than put them in the LLM context window. </li>
<li>Finally, we use a lot of sub-agents for very complex agentic queries, and have a classifier pick the sub-agent with a much narrower set of tools.</li>
</ul>

</div>
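The first and third ideas above can be sketched together: one retrieval tool in front of the index, with full results parked in a local store and only a handle plus a short preview returned into the LLM context. Every name here is illustrative, not Dash's actual API.

```python
# Sketch of the "super tool" pattern: one tool, results stored locally,
# only a reference + preview enters the context window.
import uuid

RESULT_STORE = {}  # local store for full tool results, outside the prompt

def super_search(query: str, top_k: int = 5) -> dict:
    """Single retrieval tool replacing many per-source tools."""
    results = index_lookup(query, top_k)   # hits the unified index
    handle = str(uuid.uuid4())
    RESULT_STORE[handle] = results         # keep the bulk out of the prompt
    return {
        "handle": handle,                  # agent fetches details on demand
        "preview": [r["title"] for r in results],
    }

def fetch_result(handle: str, i: int) -> dict:
    """Sub-agents pull full content only when they actually need it."""
    return RESULT_STORE[handle][i]

def index_lookup(query, top_k):
    """Hypothetical stand-in for the real index API."""
    return [{"title": f"doc-{n}", "body": "..."} for n in range(top_k)]
```

The context window then carries only titles and a handle; a sub-agent with the narrow `fetch_result` tool expands exactly the documents it needs.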
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/QmiszUPw.png/_jcr_content/renditions/QmiszUPw.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/QmiszUPw.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="0aab861f-cefd-4d50-8765-3157d873a4f4:QmiszUPw.png" data-trackable="true" height="899" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="our-approach-to-knowledge-graphs">
    <h2 class="dr-article-content__section-title">Our approach to knowledge graphs</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>The next question that comes up a lot: are knowledge graphs worth it? Well, let’s look at how a knowledge graph works.</p>
<p>You start by modeling the relationships across these various apps. For example, say you've got a calendar invite. It might have attachments, meeting minutes, a transcript. Of course, it also has all the attendees, and maybe there's even an associated Jira project. Every app we connect with has its own concept or definition of people, so coming up with a canonical ID for who someone is is very, very impactful for us overall. Being able to model something like that is incredibly powerful. You can go view somebody's profile on Dash today, but it also helps a ton in relevance and retrieval.</p>

</div>
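A toy version of that canonical-ID idea: each connected app has its own notion of a user, and we resolve them to one canonical person record. Here the merge key is (naively) a normalized email address; real entity resolution is far fuzzier, and all names below are hypothetical.

```python
# Toy entity resolution: merge per-app user records into canonical people,
# keyed (naively, for illustration) by normalized email.
def canonical_person_id(records):
    """Merge per-app user records that share a normalized email."""
    people = {}  # canonical key -> merged record
    for rec in records:
        key = rec["email"].strip().lower()
        person = people.setdefault(key, {"ids": {}, "names": set()})
        person["ids"][rec["app"]] = rec["app_user_id"]  # per-app identities
        person["names"].add(rec["name"])                # name variants seen
    return people

# Example: the same person as seen by a calendar app and an issue tracker.
records = [
    {"app": "calendar", "app_user_id": "u-1", "email": "Jason@Example.com", "name": "Jason"},
    {"app": "issues",   "app_user_id": "77",  "email": "jason@example.com", "name": "Jason L."},
]
```

Once the two records collapse into one person, a query like "talks from Jason" can expand to every per-app identity in `ids` before retrieval runs.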
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/sc791hed.png/_jcr_content/renditions/sc791hed.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/sc791hed.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="182ec97b-b049-4a10-9f79-9a5ef1d101ac:sc791hed.png" data-trackable="true" height="899" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Say that I want to find all the past <a href="https://dropbox.tech/machine-learning/how-dash-uses-context-engineering-for-smarter-ai" target="_blank">context engineering</a> talks from Jason. But who <i>is</i> Jason? How do you know that? Well, if you have this graph—this people model—you can fetch that and add it to the context, without having to do a ton of different retrieval overall. Fantastic. We use normalized discounted cumulative gain (NDCG) a lot to score the results we retrieve, and just by adding this people-based retrieval we saw some really nice wins.</p>
<p>The architecture itself is complicated. I won't talk a ton here, but it's important to realize we're not just storing a one-to-one mapping of source doc to end doc. We do want to derive and create more unique characteristics. And the other key insight here is we're not storing these graphs in a graph database. We did experiment with that. The latency and query patterns were a challenge, and so was figuring out hybrid retrieval. And so we ended up building these graphs in a more unique way. We’re staging it more asynchronously, we're building out these relationships, and then we create these knowledge bundles. So again, it's not necessarily a graph, but think of it almost like an embedding—like a summary of that graph. And it becomes these little contexts that contain all this information. And with that context, we actually just send it on through the exact same index pipeline that we have for all the other content. So things get chunked, and embeddings get generated for both lexical and semantic retrieval.</p>
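<p>To make the knowledge bundle idea concrete, here is an illustrative sketch. The function names and the toy chunker are hypothetical stand-ins, not our actual pipeline:</p>

```python
# Illustrative sketch only: names and structures here are hypothetical, not
# Dash's actual pipeline. The idea is that a person's subgraph gets flattened
# into a text "knowledge bundle" and then indexed like any other document.

def build_knowledge_bundle(person, edges):
    """Summarize a person's relationships into one indexable text blob."""
    lines = [f"Profile: {person}"]
    for relation, target in edges:
        lines.append(f"{person} {relation} {target}")
    return "\n".join(lines)

def chunk(text, max_chars=80):
    """Toy chunker standing in for the shared indexing pipeline."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

edges = [
    ("authored", "Context Engineering in Dash (talk)"),
    ("works on", "Dash retrieval"),
    ("reports to", "VP of Engineering"),
]
bundle = build_knowledge_bundle("Jason", edges)
chunks = chunk(bundle)
# Each chunk would then get lexical and semantic embeddings, exactly like
# chunks from ordinary source documents.
print(len(chunks), "chunks from bundle")
```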

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/YxJn9XDG.png/_jcr_content/renditions/YxJn9XDG.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/YxJn9XDG.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="81197e04-1b8d-4d6d-9844-1b448d4805fc:YxJn9XDG.png" data-trackable="true" height="899" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="using-an-llm-to-judge-relevancy">
    <h2 class="dr-article-content__section-title">Using an LLM to judge relevancy</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Alright, we've indexed all this content. We've got content understanding. We've done a ton of work on trying to model these relationships. But did the retrieval quality actually improve? How do we know? </p>
<p>Take Google search, for example. You have your 10 blue links, and the audience for those results is humans. If your results are high quality, the humans will tell you by clicking. You can quickly get some amazing signal this way. The model is either working or it isn’t.</p>
<p>In the world of chat, you're still retrieving results, but it's not for the human. It's for this large language model. And so you no longer have those humans to help you out. So what do you do? That's where you want to use <a href="https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash" target="_blank">LLMs as a judge</a>. Broadly speaking, what you're trying to do is judge how relevant a piece of information is on a scale of, say, one to five, and then use that to improve over time.</p>
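<p>A minimal sketch of what an LLM-as-a-judge harness can look like. The prompt wording, the SCORE parsing, and the stubbed model call are all illustrative assumptions, not our production judge:</p>

```python
# Sketch of an LLM-as-a-judge harness. `call_llm` is a stand-in for whatever
# model client you use; the prompt and 1-5 rubric are illustrative.

JUDGE_PROMPT = """Rate how relevant the document is to the query on a 1-5 scale.
5 = directly answers the query, 1 = unrelated. Explain briefly, then end with
a line of the form: SCORE: <n>

Query: {query}
Document: {document}
"""

def judge_relevance(query, document, call_llm):
    response = call_llm(JUDGE_PROMPT.format(query=query, document=document))
    # Scan from the end so the explanation text can't shadow the final score.
    for line in reversed(response.splitlines()):
        if line.strip().upper().startswith("SCORE:"):
            score = int(line.split(":", 1)[1].strip())
            return min(max(score, 1), 5)  # clamp malformed output into range
    raise ValueError("judge returned no SCORE line")

# Stubbed model so the sketch runs without an API key.
fake_llm = lambda prompt: "The document covers the query topic well.\nSCORE: 4"
print(judge_relevance("what is RAG?", "An intro to retrieval-augmented generation.", fake_llm))  # prints 4
```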
<p>Humans can still help here. Sometimes they give you thumbs up and thumbs down on the quality of your results. You can also bring in human evaluators to help you. When we started these experiments, we asked ourselves: How closely can we get our judge to match what a human will do? And so we had a bunch of our engineers label a ton of documents to see how much disagreement there was between the human and the LLM as a judge. The first prompt for our judge wasn’t bad—an 8% disagreement rate—but the lower, the better.</p>
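<p>Measuring that disagreement is straightforward once you have paired labels. Here is a toy version using an exact-match miss rate on invented data:</p>

```python
# Toy disagreement measurement between human labels and judge scores.
# "Disagreement" here is a simple exact-match miss rate on the 1-5 grade;
# a real analysis could use tolerance bands or correlation instead.

def disagreement_rate(human_scores, judge_scores):
    pairs = list(zip(human_scores, judge_scores))
    misses = sum(1 for h, j in pairs if h != j)
    return misses / len(pairs)

# Invented labels for 25 documents, not our actual eval data.
human = [5, 4, 4, 2, 1, 3, 5, 4, 2, 1, 3, 4, 5, 2, 1, 4, 3, 5, 2, 4, 1, 3, 5, 4, 2]
judge = [5, 4, 3, 2, 1, 3, 5, 4, 2, 1, 3, 4, 5, 2, 2, 4, 3, 5, 2, 4, 1, 3, 5, 4, 2]
print(f"{disagreement_rate(human, judge):.0%} of labels disagree")  # prints "8% of labels disagree"
```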
<p>Next, we continued to refine the prompt. You know, classic prompt tuning like “provide explanations for what you're doing.” And sure enough, disagreements went down. Then, we just upgraded the model itself to OpenAI’s o3. It's a reasoning model, far more powerful, and guess what? Disagreements with the humans went down further. </p>
<p>Finally, a big problem with using an LLM as a judge in a work context is that it doesn't know things like acronyms. If I were to say, “What is RAG?”—and hopefully it knows what RAG is—what if it hasn’t been trained on that? Sometimes, the judge needs to go get that context. And so, this is a little tongue-in-cheek, but we call this RAG as a judge. It can't just be using pre-computed information. Sometimes it has to go fetch some context itself. And with that, we dropped disagreements even further. </p>
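<p>Here is an illustrative sketch of that retrieval step: the judge looks up unfamiliar acronyms and injects the definitions into its own prompt before scoring. The glossary, helper names, and stubbed model call are hypothetical:</p>

```python
# "RAG as a judge" sketch: before scoring, the judge fetches definitions of
# unfamiliar terms (acronyms here) and adds them to its own context. The
# glossary and `call_llm` stub are illustrative, not Dash's system.

GLOSSARY = {
    "RAG": "retrieval-augmented generation",
    "NDCG": "normalized discounted cumulative gain",
}

def expand_context(query):
    """Fetch definitions for acronyms the judge may not know."""
    found = {}
    for token in query.replace("?", "").split():
        if token.isupper() and token in GLOSSARY:
            found[token] = GLOSSARY[token]
    return found

def judge_with_retrieval(query, document, call_llm):
    definitions = expand_context(query)
    context = "\n".join(f"{k} means {v}." for k, v in definitions.items())
    prompt = (f"Context:\n{context}\n\nRate the document's relevance to the "
              f"query from 1 to 5.\nQuery: {query}\nDocument: {document}")
    return call_llm(prompt)

# Stub that just proves the fetched definition reached the judge's prompt.
probe = judge_with_retrieval("What is RAG?", "doc", lambda p: p)
print("retrieval-augmented generation" in probe)  # prints True
```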

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/_6BAuy4w.png/_jcr_content/renditions/_6BAuy4w.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/_6BAuy4w.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="875d3cbd-4ee9-4977-b9e1-cbc09ca765a7:_6BAuy4w.png" data-trackable="true" height="899" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="prompt-optimization-with-dspy">
    <h2 class="dr-article-content__section-title">Prompt optimization with DSPy</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>There's a growing community around prompt optimizers, and one of the technologies in particular we've been using is DSPy. It optimizes your prompts, searching for the wording that produces the most accurate results against a set of evals. So by bringing in DSPy, we got even better results overall.</p>
<p>It might be impossible to get to zero disagreements. Even humans—multiple humans—will disagree on the relevance set. But we're going to keep grinding on this. And even if we can't get to zero, we're actually quite pleased with some of the results we're getting with DSPy.</p>
<p>One thing to note: We saw some really interesting emergent behavior happening with DSPy. Instead of simply asking it what the improvements could be, we noticed we could turn the different disagreements into bullet points and then have DSPy try to optimize the bullets themselves. So if there were multiple disagreements, it would try to reduce those disagreements overall. We started to create this really nice flywheel and ended up getting some nice results.</p>
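<p>To give a feel for what a prompt optimizer automates, here is a deliberately tiny version of the loop: score candidate instruction bullets against a labeled eval set and keep the best combination. Everything here, including the stub judge and the eval data, is invented for illustration; DSPy's search is far more sophisticated:</p>

```python
# Toy illustration of what a prompt optimizer automates: score candidate
# instruction variants against a labeled eval set and keep the best one.
import random

EVALS = [  # (document description, expected judge score) pairs, invented for the demo
    ("exact answer to the query", 5),
    ("mentions the topic in passing", 3),
    ("unrelated document", 1),
]

CANDIDATE_BULLETS = [
    "Penalize documents that only mention the topic in passing.",
    "Ignore document length.",
    "Treat exact answers as a 5.",
]

def fake_judge(instructions, text):
    """Stand-in judge: one targeted instruction fixes its miscalibration."""
    base = {"exact": 5, "mentions": 4, "unrelated": 1}
    score = next(v for k, v in base.items() if k in text)
    if "in passing" in instructions and "mentions" in text:
        score -= 1  # the penalize-bullet corrects the over-scored middle case
    return score

def accuracy(instructions):
    return sum(fake_judge(instructions, x) == y for x, y in EVALS) / len(EVALS)

# Sample bullet combinations and keep the one that best matches the labels.
best = max(
    (" ".join(random.sample(CANDIDATE_BULLETS, k)) for k in (1, 2, 3) for _ in range(5)),
    key=accuracy,
)
print(f"best instruction accuracy: {accuracy(best):.2f}")  # prints 1.00
```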

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/KKgSMlQW.png/_jcr_content/renditions/KKgSMlQW.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/slides/KKgSMlQW.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="d7bcab44-2939-449a-b2fc-26484d3fe1b6:KKgSMlQW.png" data-trackable="true" height="901" width="1600"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>There are some other benefits of DSPy. So first, obviously, is prompt optimization. It helped us quite a bit in our LLM-as-a-judge area. Again, that's a prime place to think about DSPy right now, because LLMs as a judge have crystal-clear rubrics and evals. You know exactly what the outcome should be. You just need to have the ultimate prompt, and it’s really good for that. We're going to start to experiment with DSPy across our entire stack. We have over 30 different prompts today throughout the Dash stack, whether that's in the ingest path, LLMs as a judge, some offline evals, or our online agentic platform approach.</p>
<p>The next one is prompt management at scale. I mentioned we've got about 30 prompts overall, and at any given time we might have 5 to 15 different engineers tweaking these prompts and trying to get more improvements. And it's a little silly if you think about it. You've got this text string that you've checked into your code repository; but then there's an edge case, this chat session didn't work. So you go in and fix it, but then something else breaks, and it becomes a bit of a game of whack-a-mole. And so it's very powerful to just define things in a more programmatic way and let these tools spit out the actual prompt themselves. It just works better at scale.</p>
<p>And the last really great benefit we like is around model switching. So, every model out there is a bit unique. They have their own quirks, and there are always different ways to prompt them. And anytime you bring in a new model, you have to spend a bunch of time optimizing the prompt again. But with DSPy, you just plug the model in, define your goals, and out comes a prompt that works. So you can do this model switching far more rapidly—and this is really beneficial for modern agentic systems, because you don't just have one giant LLM. You're going to have a planning LLM, you're going to have all these smaller sub-agents, and those sub-agents might be very narrowly focused. You probably want to pick a model that's highly tuned to that particular task, so having something like a prompt optimizer is really powerful.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="make-it-work-then-make-it-better">
    <h2 class="dr-article-content__section-title">Make it work, then make it better</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>To wrap things up, here are some key takeaways:</p>
<ul>
<li>We do find the index is superior. It is a lot of work, so don’t approach this lightly. Understand that you have to build up quite a bit of infrastructure and different data pipelines to get this working. Thinking through your data, storage, how you want to index, the retrieval—it's a lot of work, but worth it at scale.</li>
<li>Cross-app intelligence absolutely does work. You want to create those relationships. You want to be able to bring in the org chart whenever you're adding different prompts. But it also isn't easy. If I knew the exact prompts everybody was going to ask 10 times a day, I would go build a more optimal bundle of that knowledge and store that, so it's very, very fast and accurate. You just don't have that benefit all the time.</li>
<li>On the MCP side, we highly recommend you limit tool usage. Instead, try to think about super tools. Explore tool selection. Potentially have sub-agents with limits on tool calls. Really guard your context window. </li>
<li>Investing in effective LLM judges is incredibly important. A lot of times that initial prompt is all people do. They're like, “Alright, good. Done. It's good enough.” But if you can grind that down and get the accuracy to improve, it really lifts all boats—and you're going to see some really nice outcomes across the board.</li>
<li>Prompt optimizers do work at scale. They work at any scale, but they’re absolutely essential at scale.</li>
</ul>
<p>My final, overall takeaway is the classic software engineering concept of: make it work, then make it better. A lot of the techniques and things I've described here are things that we've been doing over the last few years with a big engineering team working on this day in and day out. If you're just getting started, absolutely invest in those MCP tools and everything on the real-time side. And then, over time, as you start to see what your customers are doing and you start to get some more scale, look for opportunities to optimize overall.</p>
<p style="text-align: center;">~ ~ ~ </p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/headers/Machine Learning-Knowledge Graphs MCP and DSPy with Josh Clemm-375x150-dark.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2026/january/maven/headers/Machine Learning-Knowledge Graphs MCP and DSPy with Josh Clemm-375x150-dark.png" medium="image">
                    <media:title type="html">Engineering VP Josh Clemm on how we use knowledge graphs, MCP, and DSPy in Dash</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>Inside the feature store powering real-time AI in Dropbox Dash</title>
                        
            			<link>https://dropbox.tech/machine-learning/feature-store-powering-realtime-ai-in-dropbox-dash</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon,Hicham Badri,Appu Shaji,Craig Wilhite,Josh Clemm,Jason Shang,Artem Nabirkin
            			</dc:creator>
            			
            				<category>LLM</category>
							
            				<category>AI</category>
							
            				<category>Machine Learning</category>
							
            				<category>Dash</category>
							
                            
            			<description><![CDATA[The feature store is a critical part of how we rank and retrieve the right context across your work.]]></description>
            			<guid>https://dropbox.tech/machine-learning/feature-store-powering-realtime-ai-in-dropbox-dash</guid>
                        <pubDate>Thu, 18 Dec 2025 10:00:00 -0800</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p><a href="https://www.dash.dropbox.com/" target="_blank"><b>Dropbox Dash</b></a> uses AI to understand questions about your files, work chats, and company content, bringing everything together in one place for deeper, more focused work. With tens of thousands of potential work documents to consider, both search and agents rely on a ranking system powered by real-time machine learning to find the right files fast. At the core of that ranking in Dash is our <b>feature store</b>, a system that manages and delivers the data signals (“features”) our models use to predict relevance. </p>
<p>To help users find exactly what they need, Dash has to read between the lines of user behavior across file types, company content, and the messy, fragmented realities of collaboration. Then it has to surface the most relevant documents, images, and conversations when and how they’re needed. The feature store is a critical part of how we rank and retrieve the right context across your work. It’s built to serve features quickly, keep pace as user behavior changes, and let engineers move fast from idea to production. <i>(For more on how feature stores connect to context engineering in Dash, check out our </i><a href="https://dropbox.tech/machine-learning/how-dash-uses-context-engineering-for-smarter-ai" target="_blank"><i>deep dive on context engineering right here</i></a><i>.)</i></p>
<p>In this post, we’ll walk through how we built the feature store behind Dash’s ranking system, why off-the-shelf solutions didn’t fit, how we designed for speed and scale, and what it takes to keep features fresh as user behavior changes. Along the way, we’ll share the tradeoffs we made and the lessons that shaped our approach.</p>

</div>
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-7e9aae63c1" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="our-goals-and-requirements">
    <h2 class="dr-article-content__section-title">Our goals and requirements</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Building a feature store for Dash wasn’t just a matter of picking something off the shelf, and there are a few reasons why. For one, our infrastructure is split across two very different worlds: an on-premises ecosystem designed for low-latency service-to-service communication, and a Spark-native cloud environment where feature engineering and large-scale data processing happens. This split ruled out standard cloud-native feature stores and forced us to find a way to bridge both systems without slowing down development velocity.</p>
<p>On top of that, Dash’s search ranking system brought its own scaling challenge. A single user query doesn’t just pull up one document. Instead, it triggers our ranker to evaluate many files, each requiring dozens of behavioral and contextual features. What starts as one search quickly fans out into thousands of feature lookups across interaction history, metadata, collaboration patterns, and real-time signals. Ultimately, our feature store had to handle those kinds of massive parallel reads while still meeting strict, <b>sub-100ms</b> latency budgets.</p>
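<p>A rough sketch of that fan-out, with a dict standing in for the online store. The point is that reads are batched per shard of documents rather than issued one feature at a time; all numbers and names are illustrative:</p>

```python
# Back-of-envelope sketch of the fan-out: one query ranks many candidate
# documents, each needing dozens of features, so lookups are batched and
# parallelized rather than issued one key at a time.
from concurrent.futures import ThreadPoolExecutor

FEATURES_PER_DOC = 40
ONLINE_STORE = {(f"doc{i}", f"feat{j}"): float(i * j)
                for i in range(200) for j in range(FEATURES_PER_DOC)}

def fetch_batch(doc_ids):
    """One batched read per shard of documents, not one read per feature."""
    return {d: [ONLINE_STORE[(d, f"feat{j}")] for j in range(FEATURES_PER_DOC)]
            for d in doc_ids}

def get_features(candidates, shard_size=50):
    shards = [candidates[i:i + shard_size] for i in range(0, len(candidates), shard_size)]
    results = {}
    with ThreadPoolExecutor(max_workers=8) as pool:
        for part in pool.map(fetch_batch, shards):
            results.update(part)
    return results

candidates = [f"doc{i}" for i in range(200)]
features = get_features(candidates)
print(len(candidates), "docs ->", sum(len(v) for v in features.values()), "feature values")
```

<p>Even this toy version shows how one search turns into thousands of feature values, which is why batching and parallelism matter for staying inside a sub-100ms budget.</p>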
<p>Relevance also depends on speed and capturing user intent in real time. If a user opens a document or joins a Slack channel, that signal should show up in their next search—within a few seconds—which meant building an ingestion pipeline that could keep up with user behavior at scale.</p>
<p>Finally, we had to reconcile two very different computation patterns. Some features naturally fit real-time streaming, while others depend on batch processing of historical data. We needed a unified framework that could support both efficiently, thereby reducing cognitive load for engineers and giving them a faster path from idea to production-ready features.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="designing-our-hybrid-feature-store">
    <h2 class="dr-article-content__section-title">Designing our hybrid feature store</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>After surveying the feature store landscape—Feast, Hopsworks, Featureform, Feathr, Databricks, and Tecton—<a href="https://feast.dev/" target="_blank">Feast</a> stood out for two reasons. First, its clear separation between feature definitions and infrastructure concerns meant our machine learning engineers could focus purely on writing PySpark transformations rather than the serving, storage, or orchestration complexity. Second, Feast’s modular architecture and extensive adapter ecosystem made it straightforward to integrate with our existing infrastructure. (An adapter refers to a Feast-provided interface that integrates its framework with different backend systems.) Its <a href="https://aws.amazon.com/dynamodb/" target="_blank">AWS DynamoDB</a> adapter was particularly crucial, allowing us to leverage Dynovault—our in-house DynamoDB-compatible storage solution—to meet latency requirements while lowering costs.</p>
<p>Our Feast-based architecture combines three key components, each optimized for its role.</p>
<p><b>Feast</b> gave us the orchestration layer and serving APIs, but we swapped out its Python online serving path for our own Go service so we could actually hit the concurrency and latency numbers we needed. </p>
<p><b>Cloud-based storage</b> took care of the heavy lifting of offline indexing and storage, while <b>Spark </b>jobs handled feature ingestion and computation. </p>
<p><b>Dynovault</b> handled the instant feature lookups needed for each search query. Co-located with inference workloads and leveraging Dropbox’s <a href="https://docs.aws.amazon.com/whitepapers/latest/hybrid-cloud-with-aws/example-dropboxs-hybrid-cloud-architecture.html" target="_blank">hybrid cloud infrastructure</a>, Dynovault avoids the delay of public internet calls and reliably delivers ~20ms client-side latency while balancing cost and geographic scalability. </p>
<p>Around this core architecture, we added observability through job failure monitoring, freshness tracking, and data lineage visibility. The result is a streamlined experience: engineers choose a data source, write PySpark transformations, and request features where needed, while the infrastructure abstracts away offline and online data management, pipeline orchestration, low-latency serving, and data freshness guarantees.</p>
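<p>Here is a minimal, self-contained sketch of that workflow: define a feature view over a source, materialize it to an online key-value store, and read features at serving time. The class and function names are illustrative stand-ins for the Feast, Spark, and Dynovault pieces described above:</p>

```python
# Minimal sketch of the developer workflow the post describes. All names are
# illustrative; the real system uses Feast definitions, Spark jobs for
# transformation, and Dynovault as the online store.
from dataclasses import dataclass
from typing import Callable

@dataclass
class FeatureView:
    name: str
    entity_key: str
    transform: Callable[[list[dict]], dict]  # offline rows -> {entity: value}

online_store: dict[tuple[str, str], float] = {}

def materialize(view, rows):
    """Batch job: run the transformation and write results to the online store."""
    for entity, value in view.transform(rows).items():
        online_store[(view.name, entity)] = value

def get_online_features(view_names, entity):
    """Serving path: point reads against the online store."""
    return {v: online_store.get((v, entity)) for v in view_names}

clicks = FeatureView(
    name="doc_click_count",
    entity_key="doc_id",
    transform=lambda rows: {r["doc_id"]: float(sum(1 for x in rows if x["doc_id"] == r["doc_id"])) for r in rows},
)
events = [{"doc_id": "a"}, {"doc_id": "a"}, {"doc_id": "b"}]
materialize(clicks, events)
print(get_online_features(["doc_click_count"], "a"))  # prints {'doc_click_count': 2.0}
```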

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2025/december/diagrams/Diagram%201%202.png/_jcr_content/renditions/Diagram%201%202.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2025/december/diagrams/Diagram%201%202.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="c43b2db3-3a0a-45b8-ac48-0f3404d3c9c0:Diagram 1 2.png" data-trackable="true" height="1840" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="making-search-fast">
    <h2 class="dr-article-content__section-title">Making search fast</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>With the architecture in place, the next challenge was meeting Dash’s sub-100ms latency requirements. Feature retrieval sits directly on the critical path of search and LLM answer generation, so even small delays compound quickly at scale and degrade Dash’s snappy search retrieval experience.</p>
<p>Our initial feature-serving implementation was built in Python using the Feast SDK. While parallelism helped at moderate scale, profiling revealed that CPU-bound JSON parsing and Python’s Global Interpreter Lock became the dominant bottlenecks under higher concurrency. Moving to multiple processes temporarily improved latency, but introduced coordination overhead that limited scalability.</p>
<p>To remove these constraints, we rewrote the feature serving layer in <a href="https://go.dev/" target="_blank">Go</a>. Using lightweight goroutines, shared memory, and faster JSON parsing, the Go service delivers true concurrency without the coordination costs we hit in Python. Today, it handles thousands of requests per second while adding only ~5–10ms of processing overhead on top of Dynovault’s client latency, consistently achieving p95 latencies in the ~25–35ms range.</p>
<p>This shift allowed us to meet Dash’s latency targets reliably and ensured that feature serving wouldn’t become the limiting factor as search traffic and feature complexity continued to grow.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="keeping-features-fresh">
    <h2 class="dr-article-content__section-title">Keeping features fresh</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Speed matters only when the data itself is fresh. Stale features can lower ranking quality and hurt user experience, so our feature store had to reflect new signals as soon as possible, often within minutes of user actions.</p>
<p>The challenge was scale. Many of Dash’s most important features depend on large joins, aggregations, and historical context, which makes fully real-time computation impractical. We needed an ingestion strategy that balanced freshness with reliability, without overwhelming our infrastructure or slowing development. To do that, we built a three-part ingestion system.</p>
<p><b>Batch ingestion</b> handles complex, high-volume transformations built atop the medallion architecture (a layered data model that organizes data from raw to refined stages). Rather than rewriting every feature on each run, we added intelligent change detection so only modified records are written to the online store. This reduced write volumes from hundreds of millions to under one million records per run and cut update times from more than an hour to under five minutes.</p>
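<p>The change-detection idea can be sketched as follows. This is an illustrative Go toy, not the production pipeline (which runs this logic at Spark scale): hash each record's serialized feature payload, compare against the previous run's snapshot, and emit only rows whose content actually changed. All names and payloads here are hypothetical.</p>

```go
// Hypothetical sketch of batch change detection: write only records
// whose feature payload hash differs from the previous run's snapshot.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

type Record struct {
	Key     string
	Payload string // serialized feature values
}

// digest fingerprints a payload so equality checks are cheap to store.
func digest(payload string) string {
	sum := sha256.Sum256([]byte(payload))
	return hex.EncodeToString(sum[:])
}

// changedRecords returns only the records that are new or whose
// payload hash differs from the previous run.
func changedRecords(batch []Record, prev map[string]string) []Record {
	var out []Record
	for _, r := range batch {
		if prev[r.Key] != digest(r.Payload) {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	prev := map[string]string{
		"doc:1": digest(`{"clicks":10}`),
		"doc:2": digest(`{"clicks":3}`),
	}
	batch := []Record{
		{"doc:1", `{"clicks":10}`}, // unchanged: skipped
		{"doc:2", `{"clicks":4}`},  // modified: written
		{"doc:3", `{"clicks":1}`},  // new: written
	}
	fmt.Println(len(changedRecords(batch, prev))) // prints 2
}
```

<p>When only a small fraction of records change per run, the online store sees writes proportional to the delta rather than the full table, which is how write volumes drop from hundreds of millions of records to under a million.</p>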
<p><b>Streaming ingestion</b> captures fast-moving signals such as collaboration activity or content interactions. By processing unbounded datasets in near-real time, it ensures features stay aligned with what users are doing in the moment.</p>
<p><b>Direct writes</b> handle lightweight or precomputed features by bypassing batch pipelines entirely. For example, relevance scores produced by a separate LLM evaluation pipeline can be written directly to the online store in seconds instead of waiting for the next batch cycle.</p>
<p>Together, these approaches allow Dash to keep feature values fresh without forcing all computation onto a single ingestion path, maintaining ranking quality while scaling to real-world usage.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="what-we-learned">
    <h2 class="dr-article-content__section-title">What we learned</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Building a feature store at Dropbox scale reinforced a few hard-earned lessons about systems design. On the serving side, Python’s concurrency model became a limiting factor for high-throughput, mixed CPU and I/O workloads. Even with careful parallelism, the Global Interpreter Lock capped performance for CPU-bound work like JSON parsing, and moving to multiple processes introduced new coordination bottlenecks. Rewriting the serving layer in Go allowed us to remove those tradeoffs and scale concurrency more predictably.</p>
<p>On the data side, infrastructure changes mattered, but understanding access patterns mattered just as much. By recognizing that only 1–5% of feature values change in a typical 15-minute window, we were able to dramatically reduce write volumes and ingestion time. This shift turned hour-long batch cycles into five-minute updates, improving freshness without increasing system load.</p>
<p>These optimizations came together in a hybrid architecture that balances flexibility and performance: Feast for orchestration and consistency, Spark for large-scale computation, and Dynovault for low-latency online serving. Rather than relying on a single vendor solution, this approach let us tune each layer to its strengths while keeping training and serving aligned.</p>
<p>Ultimately, this work underscored the value of a middle path between building everything from scratch and adopting off-the-shelf systems wholesale. By combining open source foundations with internal infrastructure and tailoring them to real constraints, we were able to build a feature store that fits the needs of Dash today and, ultimately, can evolve with it in the future.</p>
<p><i>Acknowledgments: Special thanks to all current and past members of the AI/ML Platform and Data Platform teams for their contributions, as well as our lovely machine learning engineers who spin up the magic with the tooling we build. </i></p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2025/december/headers/MachineLearning-FeatureStore-1440x305-light.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2025/december/headers/MachineLearning-FeatureStore-1440x305-light.png" medium="image">
                    <media:title type="html">Inside the feature store powering real-time AI in Dropbox Dash</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>Building the future: highlights from Dropbox’s 2025 summer intern class</title>
                        
            			<link>https://dropbox.tech/culture/highlights-from-dropbox-2025-summer-intern-class</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon,Hicham Badri,Appu Shaji,Craig Wilhite,Josh Clemm,Jason Shang,Artem Nabirkin,Dropbox Team,Ameya Bhatawdekar
            			</dc:creator>
            			
            				<category>Dash</category>
							
            				<category>Search</category>
							
            				<category>AI</category>
							
                            
            			<description><![CDATA[The Dropbox Intern Program is thoughtfully designed to cultivate growth, spark innovation, and build lasting connections.]]></description>
            			<guid>https://dropbox.tech/culture/highlights-from-dropbox-2025-summer-intern-class</guid>
                        <pubDate>Wed, 26 Nov 2025 11:00:00 -0800</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>This summer, the Emerging Talent team proudly welcomed 43 interns to Dropbox as part of our 2025 Camp Dropbox Intern Program. Representing 27 colleges and universities—including six international institutions in Canada, Poland, and Ireland—this year’s cohort brought a wealth of diverse perspectives and experiences. Of the group, 28 interns joined our Engineering teams, and over the course of 12 weeks (May through September), they immersed themselves in meaningful work, continuous learning, and our <a href="https://blog.dropbox.com/topics/company/virtual-first-2024-a-year-of-learning-and-innovation" target="_blank">Virtual First culture</a>.</p>
<p>The <a href="https://jobs.dropbox.com/teams/emerging-talent" target="_blank">Dropbox Intern Program</a> is thoughtfully designed to cultivate growth, spark innovation, and build lasting connections. Interns benefited from more than 6,000 hours of dedicated one-on-one mentorship, tackled high-impact projects aligned with team and company goals, and explored hands-on applications of AI. Many of these projects supported the development of <a href="https://www.dash.dropbox.com/" target="_blank"><b>Dropbox Dash</b></a>, our AI-powered universal search product. Robust programming—including Virtual First events, ERG activities, and the in-person Emerging Talent Summit—created further opportunities for connection and community. By the end of the summer, these interns had made meaningful contributions across our engineering organization.</p>
<p>Below, our interns share what they worked on this summer, from big technical wins to moments of creativity, collaboration, and growth that shaped their time at Dropbox.</p>

</div>
<div class="experiencefragment aem-GridColumn aem-GridColumn--default--12">
<div id="experiencefragment-03052b03f5" class="cmp-experiencefragment cmp-experiencefragment--dash-cta-for-tech-blog">

    



<div class="xf-content-height">
    


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="responsivegrid aem-GridColumn aem-GridColumn--default--12">


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="a08-html-embed c17-plain-html aem-GridColumn aem-GridColumn--default--12">

<style type="text/css"> 

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019c63112e617513c94_AtlasGrotesk-Medium-Web-vfl38XiTL.woff2') format('woff2');
font-weight: 500;
font-style: normal;
font-display: swap;
}

@font-face {
font-family: 'AtlasGrotesk';
src: url('https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/65dce019711b648fd1ccd24a_AtlasGrotesk-Regular-Web-vflk7bxjs.woff2') format('woff2');
font-weight: 400;
font-style: normal;
font-display: swap;
}
.xf-content-height {margin: 0;}
#cta { font-family: AtlasGrotesk,sans-serif; font-size: .900rem; text-decoration: none; background: #f7f5f2; line-height: 1.69; box-sizing: border-box;}
#cta-box { padding: 15px 20px 15px 20px; }
#cta-hed {font-weight: 500;}
#cta-indent {border-left: 5px solid #1e1919; padding-left:20px;}
#cta a:link, #cta a:visited  {text-decoration: none;}
#cta p { margin: 5px 0px 0px 0px; }

.dr-theme-dark #cta {background: #000;}
.dr-theme-dark #cta-box {border: 1px solid; border-bottom: 0;}
.dr-theme-dark #cta-indent {border-left: 5px solid #f7f5f2;}
.dr-theme-dark .button {background: #000;}

.button {
    background-color: #1e1919;
    color:  #f7f5f2;
    height: 2.5rem;
    padding: 10px 5px 10px 20px;
    font-size: 1rem;
    font-weight: 500;
    line-height: 1.2;
    transition: all .3s;
}

.button:hover { background-color: #0061ff; }

img {vertical-align: middle; padding: 0px 1px 2px 0px;}

.c17-plain-html {margin-bottom: 50px}

</style>

<div id="cta">
<div id="cta-box">
<div id="cta-indent">

<p id="cta-hed"><img src="https://cdn.prod.website-files.com/65dcd70b48edc3a7b446950e/670692ee7692f74d4834e4f4_Frame%201400006055.svg" loading="lazy"> Dropbox Dash: AI that understands your work
</p>

<p>Dash knows your context, your team, and your work, so your team can stay organized, easily find and share knowledge, and keep projects secure, all from one place. And soon, Dash is coming to Dropbox.</p>

</div>
</div>

<a href="https://dash.dropbox.com/?utm=blogs" target="_blank"><div class="button">Learn more →</div></a>

</div>
</div>

    
</div>
</div>

    
</div>

</div></div>

    
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>“I tackled the <b>Dropbox file history tracking system</b>. As an engineering intern working in a large production database for the first time, I learned the importance of strongly tested, verifiable code and thoughtful system design. I really aligned with the core <a href="https://jobs.dropbox.com/blogs/dropbox-values" target="_blank">Dropbox values</a> of Be Worthy of Trust and Keep It Simple at the software level. This solution simplifies our <a href="https://dropbox.tech/infrastructure/meet-chrono-our-scalable-consistent-metadata-caching-solution" target="_blank">metadata infrastructure</a>, significantly reduces operational costs, and shows how thoughtful refactoring of legacy systems can deliver both technical elegance and substantial business value.”</p>
<p>—<i>Rhea Rai, Filesystem Data</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f4bb.png?version=8.0.0" height="16" alt="laptop computer" title="laptop computer"/></p>
<p>“During my internship with the ML Platform team, I worked on a system that <b>monitors the health of ML model deployments</b>. By integrating with internal inference services, AI Sentinel gives <a href="https://dropbox.tech/machine-learning" target="_blank">machine learning</a> engineers real-time operational visibility they previously had to gather manually. The result is greater deployment confidence and faster iteration cycles, ensuring reliable ML model deployments that power Dash’s intelligent features at scale.”<br />
<i>—Ben Juntilla, ML Platform</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/2728.png?version=8.0.0" height="16" alt="sparkles" title="sparkles"/></p>
<p>“I worked on <b>reducing front end latency</b> in <a href="https://dropbox.tech/infrastructure/increasing-magic-pocket-write-throughput-by-removing-our-ssd-cache-disks" target="_blank">Magic Pocket</a>. Elevated PUT latencies during scheduled disk restarts can delay updates in workflows like <a href="https://dash.dropbox.com/features/connected-apps" target="_blank">Dash connectors</a>, leaving users with outdated or missing content. To address this, I built a cache to track storage health and added a filtering option to skip degraded volumes. This health-aware routing reduces slow writes and gives operators greater control, ensuring Dropbox delivers timely, accurate search results.”<br />
<i>—Albert Joon Sung, Storage Core</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f468-1f4bb.png?version=8.0.0" height="16" alt="man technologist" title="man technologist"/></p>
<p>“I worked on an AI-powered tool built on top of our internal migration platform that <b>automates code migrations</b>. Developers can launch auto-migration jobs on selected folders for specific migration types. Successful runs open a pull request automatically; otherwise, you can run the command locally and submit changes manually. The tool is fully customizable via the CLI or as part of an automated workflow. With it, I completed two major migrations.”<br />
—<i>Ahmed Ibrahim, Web Developer Experience</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f6e0.png?version=8.0.0" height="16" alt="hammer and wrench" title="hammer and wrench"/></p>
<p>“I built tools that give <a href="https://dropbox.tech/machine-learning" target="_blank">machine learning</a> engineers access to the most up-to-date information in the <b>Dash</b> <b>persistence store</b>. With this, downstream teams can train models on fresher data and pull in additional metadata fields from third-party systems without waiting for the Connector Platform team to redownload or repackage anything.”<br />
—<i>Eddie Ormseth, Connector Platform</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f50e.png?version=8.0.0" height="16" alt="magnifying glass tilted right" title="magnifying glass tilted right"/></p>
<p>“I worked on expanding the <b>unified search platform</b> (USP) to support more than 20 languages. The USP powers search across Dropbox products like <a href="https://dropbox.tech/application/how-dropbox-replay-keeps-everyone-in-sync" target="_blank">Replay</a>, and my project integrated a language detection pipeline into both indexing and retrieval. This enables accurate, efficient multilingual search without the overhead of traditional solutions. By delivering native language support ahead of <a href="https://blog.dropbox.com/topics/news/fall-2025-release-dropbox-dash-context-aware-ai-teammate" target="_blank">this year’s Dash launch</a>, my work helps Dropbox scale globally, improve the developer experience, and unlock richer search for international customers, bringing us closer to our vision of an AI-first, universally accessible search platform.”<br />
—<i>Rishi Peddakama, Retrieval Platform</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f9e0.png?version=8.0.0" height="16" alt="brain" title="brain"/></p>
<p>“I explored <b>advanced anomaly detection techniques</b> for our <a href="https://dropbox.tech/infrastructure/monitoring-server-applications-with-vortex" target="_blank">Vortex2</a> metrics system. Traditional static alerting can miss sudden, meaningful shifts in data that don’t cross predefined thresholds (or trigger too often when changes are expected). To address this, I developed adaptive detection methods that adjust to evolving patterns. These improvements streamline alert creation and reduce alert fatigue, enhancing the on-call experience. By accounting for seasonality, the new anomaly detection functions also enable faster response times and improve the overall developer experience.”<br />
—<i>Yonatan Ginsburg, Metrics</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f440.png?version=8.0.0" height="16" alt="eyes" title="eyes"/></p>
<p>“I developed a seamless <b>document preview experience</b> within Dropbox Dash, allowing users to quickly <a href="https://dash.dropbox.com/resources/how-to-use-ai-to-review-documents" target="_blank">view file content</a> without leaving their search context. This enhancement supports the Dropbox mission to accelerate workflows by <a href="https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash" target="_blank">reducing context switching</a> and increasing engagement. I built interactive UI components, integrated PDF viewing, and implemented dynamic follow-up features linking to AI-powered chat.”<br />
<i>—Francesca Venditti, Find & Discover</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f9f1.png?version=8.0.0" height="16" alt="brick" title="brick"/></p>
<p>“This summer on the Analytics Platform team, I worked on <b>optimizing large-scale Databricks queries and ETL pipelines</b> to reduce compute cost and latency. I developed an optimization recommendation system that flagged high-cost query patterns, expensive table-column filters, and under-allocated compute resources, complete with actionable sourcing information. I also prototyped and documented an Airflow pipeline to migrate a 500 TB mobile events log to liquid clustering, paving the way for broader adoption of modern data layout techniques.”<br />
—<i>Sanjith Udupa, Analytics Platform</i></p>
<p style="text-align: center;"><img src="https://paper.dropboxstatic.com/static/img/ace/emoji/1f680.png?version=8.0.0" height="16" alt="rocket" title="rocket"/></p>
<p>“I built an <b>extensible AI web-automation agent</b> for Dropbox. I also connected Dropbox backend APIs via searchFile and uploadFile actions to fetch and upload files, using <a href="https://dropbox.github.io/" target="_blank">open-source foundations</a>. By keeping tool sets small and modular, developers can quickly compose reliable, task-specific automations, like form filling or proofreading. As demand for automating repetitive web tasks continues to grow, integrating <a href="https://help.dropbox.com/organize/dropbox-automations" target="_blank">automation tools</a> into Dash will significantly improve the user experience.”<br />
<i>—Alan Zhu, Conversational AI</i></p>
<p><i>Responses have been lightly edited for length and clarity. </i></p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2025/november/intern-spotlight/headers/Culture-SummerInternSpotlight2025-1440x305-light.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2025/november/intern-spotlight/headers/Culture-SummerInternSpotlight2025-1440x305-light.png" medium="image">
                    <media:title type="html">Building the future: highlights from Dropbox’s 2025 summer intern class</media:title>
                </media:content>
       		 </item>
                    
        				<item>
                        <title>How Dash uses context engineering for smarter AI</title>
                        
            			<link>https://dropbox.tech/machine-learning/how-dash-uses-context-engineering-for-smarter-ai</link>

                            
            			<dc:creator>
                            Facundo Agriel,Ishan Mishra,Eric Wang,Dmitriy Meyerzon,Hicham Badri,Appu Shaji,Craig Wilhite,Josh Clemm,Jason Shang,Artem Nabirkin,Dropbox Team,Ameya Bhatawdekar,Sean-Michael Lewis
            			</dc:creator>
            			
            				<category>models</category>
							
            				<category>Search</category>
							
            				<category>Machine Learning</category>
							
            				<category>AI</category>
							
            				<category>Dash</category>
							
                            
            			<description><![CDATA[Building effective, agentic AI isn’t just about adding more; it’s about helping the model focus on what matters most.]]></description>
            			<guid>https://dropbox.tech/machine-learning/how-dash-uses-context-engineering-for-smarter-ai</guid>
                        <pubDate>Mon, 17 Nov 2025 11:00:00 -0800</pubDate>
                            
                        <content:encoded><![CDATA[


<div class="aem-Grid aem-Grid--12 aem-Grid--default--12 ">
    
    <div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>When we first built Dash, it looked like most enterprise search systems: a traditional <a href="https://dropbox.tech/machine-learning/building-dash-rag-multi-step-ai-agents-business-users" target="_blank">RAG pipeline</a> that combined <a href="https://dropbox.tech/machine-learning/selecting-model-semantic-search-dropbox-ai" target="_blank">semantic and keyword search</a> across indexed documents. It worked well for retrieving information and generating concise answers. But as teams began using Dash for <a href="https://dropbox.tech/machine-learning/practical-blueprint-evaluating-conversational-ai-at-scale-dash" target="_blank">more than just finding content</a>—for example, asking it to interpret, summarize, and even act on what it found—we realized that retrieval alone wasn’t enough. The natural progression from “what is the status of the identity project” to “open the editor and write an executive summary of the projects that I own” required Dash to evolve from a search system into an agentic AI.</p>
<p>That shift introduced a new kind of engineering challenge: deciding what information and tools the model actually needs to see to reason and act effectively. This has been popularized as <b>context engineering</b>, the process of structuring, filtering, and delivering just the right context at the right time so the model can plan intelligently without getting overwhelmed. We started thinking about how these ideas applied inside Dash itself, including how the model planned, reasoned, and took action on a request. Instead of simply searching and summarizing results, it now plans what to do and carries out those steps.</p>
<p>At the same time, adding tools into Dash’s workflow created new tradeoffs around how context is managed. Precision in what you feed the model is critical in any RAG system, and the same lesson applies to agentic systems. Supplying the model with only the most relevant context, and not just more of it, consistently leads to better results. Below, we’ll walk through how we’ve been building better context into Dash.</p>

</div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="-engineering-context-precision-in-dash">
    <h2 class="dr-article-content__section-title">Engineering context precision in Dash</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>As Dash gained new capabilities—like contextual search and assisted editing—we noticed something unexpected: more tools often meant slower, less accurate decision making. A “tool” here is any external function the model can call, such as search, look-up, or summarization. Each new capability expanded the model’s decision space, creating more choices and room for confusion. Even well-designed tools made the model spend more time deciding how to act instead of acting. The problem wasn’t broken tools; it was too many good ones. In human terms, Dash was facing analysis paralysis.</p>
<p>The <a href="https://modelcontextprotocol.io/docs/getting-started/intro" target="_blank">Model Context Protocol</a> (MCP), an open standard for defining and describing the tools a server provides, helps with this by outlining what each tool does and what inputs it takes. But as we experimented with MCP servers, we ran into limitations. Each tool we added came with its own description and parameters, which all have to fit inside the model’s context window (the space it uses to read and reason about information). In practice, these definitions also consume a significant number of tokens, a resource that directly impacts both cost and performance. Further, we noticed that the overall accuracy of Dash degraded for longer-running jobs. The tool calls were adding a lot of extra context. We were seeing similar patterns of what’s been popularized as <a href="https://research.trychroma.com/context-rot" target="_blank">context rot</a>.</p>
<p>This led us to rethink context. Building effective, agentic AI isn’t just about adding more; it’s about helping the model focus on what matters most. In Dash, that means curating context so the model can make faster, better decisions through three core strategies:</p>
<ul>
<li>Limit the number of tool definitions in the context</li>
<li>Filter context to only what’s relevant</li>
<li>Introduce specialized agents for tasks that demand deeper reasoning</li>
</ul>
<p>Our principle is simple: better context leads to better outcomes. It’s about giving the model the right information, at the right time, in the right form.</p>
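<p>The first two strategies can be sketched together: score each tool definition against the request and pass the model only the top matches, shrinking the token footprint of the context. This is a toy illustration, not Dash's implementation; the tool names are invented and the lexical scorer stands in for whatever routing a production system would use (embeddings, rules, or learned classifiers).</p>

```go
// Hypothetical sketch of tool-definition filtering: rank tools by
// relevance to the request and expose only the top k to the model.
package main

import (
	"fmt"
	"sort"
	"strings"
)

type Tool struct {
	Name, Description string
}

// relevance is a toy lexical score: how many words of the request
// appear in the tool's description. A real system would use embeddings
// or routing rules instead.
func relevance(t Tool, request string) int {
	score := 0
	desc := strings.ToLower(t.Description)
	for _, w := range strings.Fields(strings.ToLower(request)) {
		if strings.Contains(desc, w) {
			score++
		}
	}
	return score
}

// selectTools keeps at most k tools, ranked by relevance, so only
// their definitions consume space in the model's context window.
func selectTools(tools []Tool, request string, k int) []Tool {
	sort.SliceStable(tools, func(i, j int) bool {
		return relevance(tools[i], request) > relevance(tools[j], request)
	})
	if len(tools) > k {
		tools = tools[:k]
	}
	return tools
}

func main() {
	tools := []Tool{
		{"search", "search indexed documents and messages"},
		{"summarize", "summarize a document"},
		{"calendar", "look up calendar events"},
	}
	for _, t := range selectTools(tools, "summarize the identity project document", 2) {
		fmt.Println(t.Name)
	}
}
```

<p>Every definition the model never sees is context budget reclaimed for the request itself, which is the lever behind both the cost and the accuracy gains described above.</p>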

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        


        
         
        <img src="https://aem.dropbox.com/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Dropbox_Connector_Flow_v05.gif" fallbackimage="https://aem.dropbox.com/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Dropbox_Connector_Flow_v05.gif" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-trackable="true"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<h3>Limit the number of tool definitions in the context</h3>
<p>Our first insight was that giving the model too many options for calling tools led to worse results. Dash connects to many of the apps our customers use to get work done, and each of those apps provides its own retrieval tools, such as search, find by ID, or find by name. </p>
<p>Although we have the <a href="https://learn.dropbox.com/video-library/dash-in-dropbox-search" target="_blank">Dash Search</a> index—our server-based search index that stores and manages documents and messages for fast and reliable retrieval—we did experiment with using other tools for retrieval. For example, Dash might need to consult Confluence for documentation, Google Docs for meeting notes, and Jira for project status to service one request. In our experiments with those other retrieval tools, we found that the model often needed to call all of them, yet didn’t do so reliably.</p>

</div>
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Diagram%201.png/_jcr_content/renditions/Diagram%201.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Diagram%201.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="3673c5da-2bb6-4a88-93b4-10235a597bef:Diagram 1.png" data-trackable="true" height="2320" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>We solved this by replacing all of those retrieval options with a single, purpose-built tool backed by the Dash <b>universal search index</b>. Instead of expecting the model to understand and choose between dozens of APIs, we created one interface that handles retrieval across all services. The key idea is simple: Giving the model one consistent way to retrieve information makes its reasoning clearer, its plans more efficient, and its context use more focused.</p>
<p>These learnings also influenced our design of the <a href="https://github.com/dropbox/mcp-server-dash" target="_blank">Dash MCP server</a>, which brings Dash’s retrieval to MCP-compatible apps like Claude, Cursor, and Goose with just one tool. It connects to the systems people already use and securely searches inside their apps. By keeping descriptions lean, more of the context window stays focused on the user’s request.</p>
<h3>Filter context to only what’s relevant</h3>
<p>Our next insight was that not everything retrieved from multiple APIs is actually useful for the task at hand. When we tried pulling data from several tools at once, we still needed a way to rank and filter the results so that only the <i>most</i> relevant information reached the model.</p>
<p>We built the Dash index to combine data from multiple sources into one unified index, then layered a <b>knowledge graph</b> on top to connect people, activity, and content across those sources. (A knowledge graph maps relationships between these sources so the system can understand how different pieces of information are connected.) These relationships help rank results based on what matters most for each query and each user. As a result, the model only sees content our platform has already determined to be relevant, which makes every piece of context meaningful. Building the index and graph in advance means Dash can focus on retrieval at runtime instead of rebuilding context, which makes the whole process faster and more efficient.</p>

</div>
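The ranking idea above can be sketched as a simple blend of a text-match score with a knowledge-graph affinity score, keeping only the top-k results for the model. The scores, weight, and field names below are illustrative assumptions, not Dash's actual ranking function.

```python
from dataclasses import dataclass

@dataclass
class Result:
    doc_id: str
    text_score: float   # lexical/semantic match with the query, in [0, 1]
    graph_score: float  # knowledge-graph affinity (people, activity, content), in [0, 1]

def rank_and_filter(results: list[Result], k: int = 3, w_graph: float = 0.4) -> list[Result]:
    """Blend both scores and return only the top-k results.

    Only these k results reach the model's context; everything else
    is filtered out before the prompt is assembled.
    """
    def blended(r: Result) -> float:
        return (1 - w_graph) * r.text_score + w_graph * r.graph_score
    return sorted(results, key=blended, reverse=True)[:k]
```

Note how the graph score can promote a document with a weaker text match when the graph says it is closely tied to the user, which is exactly the kind of relevance signal a flat index lacks.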
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Diagram%202.png/_jcr_content/renditions/Diagram%202.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Diagram%202.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="a5e860e9-38de-4e74-9094-4eb486536900:Diagram 2.png" data-trackable="true" height="3840" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>The key lesson is that everything retrieved shapes the model’s reasoning, so relevance is critical to guiding it efficiently. Sending only what’s essential improves both performance and the quality of the entire agentic flow.</p>
<h3>Introduce specialized agents for complex tasks</h3>
<p>Our third discovery was that some tools are so complex that the model needs extra context and examples to use them effectively. We saw this firsthand as we continued to expand the Dash Search tool. Query construction turned out to be a difficult task on its own. It involves understanding user intent, mapping that intent to index fields, rewriting queries for better semantic matching, and handling edge cases such as typos, synonyms, and implicit context.</p>
<p>As the search tool grew more capable, the model needed more instruction to use it correctly. Those details started to take up a significant portion of the context window, leaving less room for reasoning about the overall task. In other words, the model was spending more of its attention on how to search than on what to do with the results.</p>
<p>We solved this by moving search into its own agent. The main planning agent decides when a search is needed and delegates the actual query construction to a specialized agent with its own prompt. This separation allows the main agent to stay focused on planning and execution while the search agent handles the specifics of retrieval. The key lesson is that when a tool demands too much explanation or context to be used effectively, it’s often better to turn it into a dedicated agent with a focused prompt.</p>

</div>
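The delegation pattern can be sketched as follows. The prompts and the string-rewriting logic are hypothetical stand-ins for the LLM calls; the structure is what matters: the planner decides <i>whether</i> to search, and a separate agent with its own focused prompt owns <i>how</i> the query is built.

```python
# Illustrative prompt for the specialized agent; in a real system this
# would guide an LLM call rather than the toy rewrite below.
SEARCH_AGENT_PROMPT = (
    "You construct search queries: map user intent to index fields, "
    "expand synonyms, and handle typos and implicit context."
)

def search_agent(user_request: str) -> str:
    """Specialized agent: turn user intent into a concrete index query."""
    # Stand-in for an LLM call guided by SEARCH_AGENT_PROMPT.
    terms = user_request.lower().replace("find", "").strip()
    return f"index_query:{terms}"

def planner(user_request: str) -> list[str]:
    """Main agent: plan the task, delegating retrieval when needed."""
    steps = []
    if "find" in user_request.lower():           # planner only decides *whether* to search
        steps.append(search_agent(user_request)) # sub-agent owns query construction
    steps.append("synthesize_answer")
    return steps
```

Because the query-construction instructions live only in the search agent's prompt, the planner's context stays free for reasoning about the overall task.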
<div class="image c04-image aem-GridColumn aem-GridColumn--default--12">
<div class="dr-image image cq-dd-image  ">
    <figure class="dr-margin-0 dr-display-inline-block">
        
            
    

        

        
        
        

        
        
        

        <!--optimized image webp-->
        

        

        
         
        <img src="/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Diagram%203.png/_jcr_content/renditions/Diagram%203.webp" fallbackimage="/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/diagrams/Diagram%203.png" onerror="window.failedAttempts=0;this.setAttribute('src',this.getAttribute('fallbackimage'));window.failedAttempts++;if(window.failedAttempts == 1)this.onerror=null" aria-hidden="false" alt="" data-aem-asset-id="a20471bd-cef3-4378-9793-8466422ab7c0:Diagram 3.png" data-trackable="true" height="2360" width="2880"/>
    

            
        
    </figure>
</div></div>
<div class="section aem-GridColumn aem-GridColumn--default--12">
<div class="dr-article-content__section" id="-looking-forward">
    <h2 class="dr-article-content__section-title"> Looking forward</h2>
</div>
</div>
<div class="text parbase aem-GridColumn aem-GridColumn--default--12">
<p>Context engineering for agentic AI systems is still an emerging discipline. While the strategies we’ve outlined—retrieval consolidation, relevant context filtering, and specialized task agents—work well for our use cases, we’re continuing to learn and iterate. As we continue to build the best tools for knowledge workers, we’ve found that the Dash index is a powerful resource for managing relevant context and helps us use other tools more effectively.</p>
<p>The work we’ve shared here focuses on one piece of the puzzle: Learning how to trim context down to what really matters, both in tool selection and retrieval. But context is expensive in more ways than one. It affects cost, speed, and how much attention a model can give to the task at hand. We’ve found that leaner contexts don’t just save resources; they also make the model smarter.</p>
<p>Next, we’re turning these lessons toward other parts of Dash’s context, like user and company profiles, as well as long- and short-term memory. We think there’s even more performance to unlock by refining these areas, especially as we experiment with smaller and faster models.</p>
<p>Although our discussion centered on retrieval-based tools, action-oriented tools exhibit many of the same limitations. MCP continues to serve as a robust protocol, but effective scaling depends on reducing tool proliferation, investing in specialized agents, and enabling the LLM to generate code-based tools when appropriate, an approach that parallels our consolidation of retrieval tools into the Dash retrieval system. We’ve covered how <a href="https://dropbox.tech/machine-learning/building-dash-rag-multi-step-ai-agents-business-users" target="_blank">Dash uses code-based tools</a> in a previous blog post, and we see that other companies are <a href="https://www.anthropic.com/engineering/code-execution-with-mcp" target="_blank">approaching this problem with a similar mindset</a>.</p>
<p>Moving forward, our focus is on making context even more efficient so the model can spend its attention where it matters most.</p>
<p><i>Acknowledgments: Rene Schmidt, Josh Clemm, Marta Mendez, Nishchal Arya, Roland Hui, Noorain Noorani, Tony Xu</i></p>
<p style="text-align: center;">~ ~ ~</p>
<p><i>If building innovative products, experiences, and infrastructure excites you, come build the future with us! Visit </i><a href="https://jobs.dropbox.com/" target="_blank"><i>jobs.dropbox.com</i></a><i> to see our open roles.</i></p>

</div>

    
</div>
]]></content:encoded>
            			
                <media:thumbnail url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/headers/Machine Learning-Dash Context Engineering-1440x305-light.png" />
                <media:content url="https://dropbox.tech/cms/content/dam/dropbox/tech-blog/en-us/2025/november/context-engineering/headers/Machine Learning-Dash Context Engineering-1440x305-light.png" medium="image">
                    <media:title type="html">How Dash uses context engineering for smarter AI</media:title>
                </media:content>
       		 </item>
                    
            
    </channel>
</rss>
