Ideally, we’d like data to be pristine. Most data practitioners accept trade-offs in data quality rather than spending copious amounts of time addressing underlying data issues. For finance reporting, the margin of error is much smaller.
Our data space covers reporting on shipments handled by Vinted’s network partners. We have oversight over carrier invoices for financial reporting, and we corroborate them against cost expectations based on shipments created and contractual terms. The data we receive from carriers comes in all shapes and sizes: think CSV, JSON, Parquet, and Excel files. This requires us to be flexible while keeping a close eye on any schema or format changes which may inadvertently affect the accuracy and completeness of the data.
Post-migration, we shifted left in our testing approach. We implemented several tests close to the source: not null, accepted values, and expression validations. One example is an accepted values test for cost descriptions, since they determine whether a positive or negative sign is applied to the invoice amount. Despite keeping only essential tests as errors (as opposed to tests that merely raise warnings), our pipeline was regularly blocked by upstream errors.
The reality hit hard: our pipelines went from consistently high daily success to a significantly lower rate within two months. The variety of data we received changed much more frequently than we thought. Our initial attempts to stay on top of our input files were too strict.
We went back to the drawing board and asked ourselves - what is just enough? Getting bombarded by alerts and failing pipelines in the morning was eroding stakeholder confidence. Data quality that doesn’t arrive on time defeats its own purpose. We needed our pipelines to be available for consumption at the start of the working day.
Financial accounting principles are a helpful tool for determining how to deal with data quality. We’re interested in two concepts: materiality, and the qualitative characteristics of financial information. The latter can also be understood through the Informational Quality Framework.
Materiality relates to the significance of information within a company’s financial statements. If a transaction or business decision is significant enough to warrant reporting to investors or other users of the financial statements, that information is “material” to the business and cannot be omitted.
We revised our processes. The most crucial dates for financial reporting were at the start of the month and in the middle of the month, when financial reporting information is downloaded and reported on. At all other times of the month, the daily pipeline serves analytical purposes. In other words, localised errors with a new incoming invoice would not be significant enough to influence someone’s decision-making process. Therefore, we could afford to be less strict on localised errors.
Data quality can be viewed through the lens of the Informational Quality Framework:
In our context, this would mean:
The framework also highlights other qualities such as consistency, accessibility, and interpretability. The reference to the paper is included below.
With these quality dimensions defined, we still faced a practical challenge: how do we translate abstract concepts like “materiality” into concrete testing decisions? We needed a systematic way to determine which data quality issues truly mattered for business decisions versus those that were just “nice to have” perfect. This led us to develop a framework that combined business impact assessment with frequency patterns.
Setting guidance on materiality in view of accuracy and completeness grounded our assessment of the potential impact to the team. After several months of observing schema evolution under an overly strict testing regime, we also had a sense of how frequently exceptions occurred.
This allowed us to conceptualise the risk-based testing framework, based on the issues’ impact and frequency. The framework helped us reduce daily pipeline failures while maintaining critical data quality checks.
We took a more cautious approach. High impact risks should be avoided at all costs. For low impact but high frequency exceptions, we reduce the risk by monitoring them more closely. Alerts are triggered when there are exceptions, and there are tests that run daily. In cases where the exception doesn’t happen often and is unlikely to have high impact, we accept the risk. We monitor them in weekly reviews, without the need to trigger an alert every day.
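The resulting decision table is small enough to write down explicitly. A hypothetical encoding (the names and wording are ours, for illustration):

```python
# Hypothetical encoding of the risk-based testing framework:
# impact x frequency decides whether a failure blocks the run,
# alerts daily, or is only reviewed weekly.
ACTIONS = {
    ("high", "high"): "block pipeline",
    ("high", "low"):  "block pipeline",
    ("low", "high"):  "daily alert, don't block",
    ("low", "low"):   "weekly review, no alert",
}

def action_for(impact: str, frequency: str) -> str:
    return ACTIONS[(impact, frequency)]

print(action_for("high", "low"))   # e.g. duplicate tracking IDs
print(action_for("low", "high"))   # e.g. a new, unmapped cost type
```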
These are some examples:
We only kept high impact tests in the main run. By default, dbt build runs all tests; we opted to exclude a substantial number of tests that look out for low impact silent failures. This made the main run leaner while preserving the critical checks.
We tag tests and use exclusion flags to build models and run tests.
At the test level, in the model’s configuration:
```yaml
data_tests:
  # Missing amounts break financial reconciliation
  - not_null:
      column_name: invoice_amount_eur
      config:
        tags: highimpact_highfrequency
  # Duplicate tracking IDs corrupt cost allocation
  - unique:
      column_name: shipment_tracking_id
      config:
        tags: highimpact_lowfrequency
  # New cost types appear occasionally, can be mapped later
  - accepted_values:
      column_name: cost_description
      values: [delivery, return, surcharge, fuel_surcharge]
      config:
        tags: lowimpact_highfrequency
  # Occasional date parsing errors, rarely material
  - expression:
      expression: "invoice_date >= '2020-01-01'"
      config:
        tags: lowimpact_lowfrequency
```
In dbt_project.yml:
```yaml
models:
  vgo_finance:
    +meta:
      excluded_tests:
        # This way, only high impact tests run
        - tag:lowimpact_highfrequency
        - tag:lowimpact_lowfrequency

sources:
  vgo_finance:
    +meta:
      excluded_tests:
        # This way, only high impact tests run
        - tag:lowimpact_highfrequency
        - tag:lowimpact_lowfrequency
```
Our dbt project is split up into different tasks in Airflow. We have internal orchestration that reads the +meta.excluded_tests configuration and turns it into --exclude flags when calling the dbt CLI. See this article on our Airflow set-up: Orchestrating Success.
Stock dbt does not interpret meta in this way, so if you do not have similar tooling you should pass --exclude directly to dbt test / dbt build, as in:
```shell
# Main pipeline run - only critical tests
dbt build --exclude tag:lowimpact_highfrequency tag:lowimpact_lowfrequency

# Daily monitoring run - low-impact high-frequency tests for alerting
dbt test --select tag:lowimpact_highfrequency
```
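For readers with similar in-house orchestration, the translation from +meta.excluded_tests to CLI flags is straightforward. A minimal sketch, assuming the YAML has already been loaded into a dict (function and variable names here are hypothetical, not our actual code):

```python
# A loaded dbt_project.yml fragment (what yaml.safe_load would return).
project_config = {
    "models": {
        "vgo_finance": {
            "+meta": {
                "excluded_tests": [
                    "tag:lowimpact_highfrequency",
                    "tag:lowimpact_lowfrequency",
                ]
            }
        }
    }
}

def build_dbt_command(config: dict, project: str) -> list:
    """Turn +meta.excluded_tests into --exclude flags for the dbt CLI."""
    meta = config["models"][project].get("+meta", {})
    excluded = meta.get("excluded_tests", [])
    cmd = ["dbt", "build"]
    if excluded:
        cmd += ["--exclude", *excluded]
    return cmd

print(build_dbt_command(project_config, "vgo_finance"))
```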
This was an exercise of change management. In order to obtain a mandate to prioritise these tasks, we needed to first have a consensus that there was a problem. Thereafter, while gathering requirements, we wrote an RFC and took action.
The key is recognising that data quality issues are usually rooted in process deficiencies. The people who interact with the process need to be engaged as an essential part of making it succeed.
First principles thinking is a problem-solving method that breaks complex issues down into their most fundamental, foundational truths, rather than reasoning by analogy or convention.
Inversion thinking is a problem-solving technique that flips challenges upside down by focusing on how to avoid failure rather than solely on how to achieve success.
On data quality issues, we zoomed in on testing at the source to ensure completeness, focusing on key pieces of information like the invoice amount, the date, and the cost description. We imagined how our stakeholders would check for these issues, and asked them directly: what would they want to know about the information, and what could help them resolve it as quickly as possible?
This is an iterative journey, and expectations are built with time. When we started alerting on source issues, the situation we wanted to avoid was not catching them at all. The result was an alert, but it wasn’t obvious what someone needed to do with it until they looked at a Google Sheet. Our expectation evolved: no solution should be unactionable; in other words, looking at an alert should immediately tell someone what to do with it.
The key insight: actionable alerts build trust, while noisy alerts erode it. We learned this the hard way when our Slack channel went from essential updates to a source of alert fatigue.
We will keep revising, and learning. For now, we have something that balances alerts (preventing fatigue) while ensuring quality and building trust.
We’ve also shared our work at the Forward Data Conference. Do check it out!
Some prior research referenced is:
Our story begins…
We have a huge codebase of a few hundred Gradle modules, which has collected some good code and some legacy code over the past 14 years. We adopted the dependency injection idea from the beginning: first the Dagger version released by Square, then the second, fully-static version released by Google. A couple of years later we adopted dagger.android, and the idea of having subcomponents per fragment looked fantastic back then (spoiler alert: it is not). Later, a simpler yet more powerful DI idea arrived as the Hilt framework, but by then it was too late to redo all the fragments.
After modularization gained momentum and the module count grew rapidly, we began to envy Hilt’s way of installing dependencies instead of providing them via large Dagger modules. But it was hard to justify to the business the time spent rewriting the code.
Until one day we found Anvil, a Kotlin compiler plugin that brings the Hilt idea of contributing dependencies via annotations. It was also faster, thanks to its Dagger factory generation. We eagerly began adopting it, even migrating Fragments from the Android Injector to constructor injection.
But technology moves fast. Kotlin released K2, and since Anvil was a Kotlin compiler plugin, adopting K2 would have required huge effort; later, Anvil moved to maintenance mode.
So even after upgrading Kotlin to 2.x, we were still stuck on K1 and 1.9 language features, without incremental compilation. K1 support nearing its end added pressure. We had several options: Hilt, kotlin-inject, and … Metro.
Metro was built using lessons learned from other DI frameworks, bringing together many solid ideas. It supports Kotlin idioms well and is fast, consistent, and easy to learn. However, it was difficult to justify switching at the time—at first glance, it seemed too risky to rely on a brand-new framework.
Metro has a major migration advantage that other frameworks don’t: robust, feature-rich interoperability with popular DI solutions like Dagger, Anvil, and kotlin-inject. That level of compatibility is something its competitors lack. In fact, Metro was the quickest path for us to adopt K2, since other frameworks would have required migrating more business code—adding not only time, but also risk.
We were evaluating all the options, but Metro was growing fast and the direction of its growth aligned with our needs closely. This has further solidified our choice.
Not gonna lie, the ride was not easy. We decided to migrate everything at once, without using any of the interoperability options, while keeping the scope and graph structure. The first obvious step was to mass-replace imports and annotation names.
```diff
- import javax.inject.Inject
+ import dev.zacsweers.metro.Inject
```
A funny thing about javax.inject.Inject: a lot of libraries “leak” it, so the IDE always suggests it in autocomplete. At some point we had to set up a separate validating KSP processor just to fail the build when an @Inject annotation from the wrong library appeared in source code, since that can lead to subtle, hard-to-catch bugs. Later we were able to remove it from the compile classpath completely, which solved the autocomplete problem.
We also removed @JvmSuppressWildcards annotations, as they are no longer needed.
A harder task was replacing @ContributesMultibinding, since Metro splits it into two annotations: @ContributesIntoSet and @ContributesIntoMap.
But no worries, Metro will let you know if you make a mistake!
```diff
- @ContributesMultibinding(FragmentComponent::class)
+ @ContributesIntoMap(FragmentScope::class)
  @ViewModelKey(AddressPluginViewModel::class)
  class AddressPluginViewModel @Inject constructor() : ViewModel()
```
boundType (and many other cases) can be fixed by regexp magic.
Pro tip: write the script instead of manually doing mass replace, it will help later doing upstream merges.
```
boundType = (.*)::class
⬇️
binding = binding<$1>()
```
The other half was tricky. Do you remember Android Injectors from dagger.android I mentioned earlier?
We still have more than 100 fragments left…
But there is nothing code generation would not solve!
We made a crude implementation that generates graph extensions from similar annotations (we took some shortcuts here).
From this code:
```kotlin
// Container only for Android Injector contributions
@InjectorModule(ActivityScope::class)
abstract class LegacyFragmentsModule {
    @FragmentScope
    @ContributesAndroidInjector(modules = [LegacyModule::class])
    abstract fun contributesLegacyFragment(): LegacyFragment
}
```
We generated this:
```kotlin
@FragmentScope
@GraphExtension(
    FragmentScope::class,
    bindingContainers = [LegacyModule::class]
)
public interface LegacyFragmentInjectorGraph {
    // Still using member injection
    public fun inject(instance: LegacyFragment)

    @ContributesTo(ActivityScope::class)
    @GraphExtension.Factory
    public interface Factory {
        fun create(
            @Provides instance: LegacyFragment
        ): LegacyFragmentInjectorGraph
    }
}

@Inject
@ContributesIntoMap(
    ActivityScope::class,
    binding = binding<InstanceInjector<Fragment>>()
)
@ClassKey(LegacyFragment::class)
public class LegacyFragmentInjector(
    private val graphFactory: LegacyFragmentInjectorGraph.Factory,
) : InstanceInjector<Fragment> {
    override fun inject(instance: Fragment) {
        graphFactory.create(instance as LegacyFragment).inject(instance)
    }
}
```
… and voilà, another big chunk of code was done! The rest was easier. We took advantage of our existing KSP-powered code generation: we already had custom codegen for a lot of things, since we needed to generate boilerplate for Anvil (yes, we generate boilerplate for everything). Changing the codegen was not hard, mostly imports and annotation names, and boom, another couple hundred cases were fixed!
```kotlin
@ContributesFragment
class InfoFragment @Inject constructor() : Fragment()

@ContributesViewModel
class InfoViewModel @Inject constructor() : ViewModel()
```
Not everything was so smooth.
We learned the hard way what it means to adopt a 0.x tool.
We found quite a few cases where the compiler crashed with a StackOverflowError, or the generated code was too slow.
We began on version 0.7.x and finished on 0.9.2 (0.9.3 was broken for us).
Most of the problems arose from MemberInjector, which we don’t recommend using.
Moreover, being a compiler plugin, Metro does not output much into build directories the way Anvil and Dagger used to. At first this makes debugging a bit harder, but once we got accustomed to the rich diagnostic reports, hidden behind a Metro Gradle plugin property, debugging became much easier.
Another big problem was constant upstream changes: a few dozen developers produce a lot of changes daily, which turned into a 30-minute conflict-resolution ceremony every day. We strategically chose to do the migration around the holiday season.
After summing everything, we have zero regrets. Of course, it was a bumpy ride, but a worthy one. We learned so much about compilers, how to do mass migrations, how to make good codegen. Big thanks to Zac Sweers, who put a lot of effort into fixing problems in a timely manner, and we hope that our small contributions to Metro will help others to have a smoother migration.
Two months after migrating and solving all the issues, we were able to enable that juicy K2, and a bit later incremental compilation (a topic that deserves an article of its own). The results look quite good for us. Apart from getting rid of the existential dread caused by the fact that K1 support will be dropped sooner or later, we got some solid CI build-time improvements! For our large codebase, they look as follows:
| Build Scenario | Metro | Dagger/Anvil | Reduction | Savings |
|---|---|---|---|---|
| Best Case build; most tasks are cached | 3m 23s | 4m 33s | 25.64% | 1m 10s |
| Worst Case build; no tasks are cached | 24m 12s | 27m 05s | 10.65% | 2m 53s |
| Worst Case Release build; no tasks are cached | 37m 43s | 40m 09s | 6.06% | 2m 26s |
| ABI change in a core module that all feature modules depend on | 15m 46s | 17m 22s | 9.21% | 1m 36s |
These stats were recorded with Metro 0.9.x.
Metro continues to grow and improve, including the code it generates and, with it, build times; had we measured with the latest version, the results would likely be even better.
Our local build times also improved greatly; incremental compilation is no joke. For the sake of brevity, we won’t include them here.
To sum it all up: Metro consolidated the best practices of other popular frameworks while leaving the not-so-best ones aside. It allowed us to enable K2 and immediately see significant build-time improvements, and it unlocked incremental compilation, which means builds will keep getting faster.
The migration process, even in a big codebase with plenty of lingering legacy, and even without using any interoperability capabilities, was interesting, and about as challenging as such circumstances warrant.
Developer satisfaction and confidence around dependency injection have also increased with the arrival of Metro. It’s easier to reason about one DI framework than two, especially when that framework is Kotlin-first and Kotlin-centric.
Keeping the Lights On (KTLO) is an essential, yet often taxing, part of a platform engineer’s role. It represents the routine operational work required to keep the business running and the platform stable. For our team, this primarily involves maintenance on our search engine, Vespa - ranging from version upgrades and service restarts to draining traffic from nodes for hardware replacements. In the O’Reilly book Platform Engineering, the authors recommend that “KTLO work should account for no more than 40% of your team’s workload. Any more than that and you risk burning out your team”. I couldn’t agree more. While necessary, KTLO tasks are often labor-intensive and repetitive rather than intellectually challenging.
As Vinted grows, our infrastructure must follow. We have transitioned from managing a hundred nodes to over a thousand, and without intervention, the KTLO “tax” accrues exponentially. We faced a binary choice: scale the team linearly by hiring or scale our efficiency by reducing the manual burden.
Our maintenance wasn’t fully manual; we relied on Bash scripts and Knife commands. This was sufficient for a few dozen nodes, but as the Vespa search engine became our default solution for search problems, our node count exploded. We reached a tipping point: we were no longer managing a single deployment, but dozens of unique deployments with varying maintenance needs. Our existing tooling simply couldn’t keep up with this complexity.
As we hit this limit, the flaws in script-based automation became clear:
To understand Temporal, you have to stop thinking about “running a script” and start thinking about “durable execution.” In a traditional environment, when you run a script to restart a Vespa search engine node, the state lives in the memory of the process running that script. If your laptop closes, the CI/CD runner times out, or the network blips, that state is lost. You’re left wondering: Did the node upgrade? Temporal changes this by acting as a fault-tolerant state machine. It records every successful step of your code in a backend database. If the execution is interrupted, Temporal simply spins it back up on a different worker and resumes from the last successful “event,” with all its variables and local state intact.
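The “resume from the last successful event” behaviour can be illustrated with a toy replay loop. This is a deliberate simplification of the idea, not the Temporal SDK, and the step names are invented:

```python
# Toy durable execution: completed steps are recorded in a history, and a
# restarted run replays the history instead of redoing the work.
history = {}  # persisted step results (stands in for Temporal's event log)

def durable_step(name, fn):
    if name in history:          # already ran: replay the recorded result
        return history[name]
    result = fn()                # first run: execute and record the step
    history[name] = result
    return result

calls = []

def upgrade_workflow():
    durable_step("drain",   lambda: calls.append("drain")   or "drained")
    durable_step("upgrade", lambda: calls.append("upgrade") or "upgraded")
    durable_step("restore", lambda: calls.append("restore") or "restored")

upgrade_workflow()   # first execution runs every step
upgrade_workflow()   # a "restarted" execution replays, doing no new work
print(calls)         # each side effect happened exactly once
```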
As our Vespa footprint grew to a thousand nodes, we could consider building an event-driven system - where one service would emit a “Node Down” event and another service would react. However, we realized that events are often the wrong abstraction for complex maintenance.
Here is why we chose Temporal’s orchestration over traditional events or scripts:
By leveraging Temporal, we automated far more than just routine maintenance, extending our orchestration to include new cluster provisioning and other recurring operational tasks. Today, upgrades are scheduled automatically twice a month, during weekdays. The system intelligently accounts for public holidays and traffic surges, ensuring we are online to respond if issues arise. We’ve integrated Slack for real-time progress reporting and the Temporal UI for deep-visibility monitoring, backed by a robust alerting suite for stalled or failed workflows. This automation has transformed our daily operations:
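The holiday-aware part of that scheduling can be sketched as a date shift past weekends and blocked days (the holiday list and dates below are illustrative, not our production calendar):

```python
from datetime import date, timedelta

HOLIDAYS = {date(2024, 12, 25), date(2024, 12, 26)}  # illustrative

def next_upgrade_day(target: date) -> date:
    """Shift a target upgrade date forward past weekends and holidays."""
    day = target
    while day.weekday() >= 5 or day in HOLIDAYS:
        day += timedelta(days=1)
    return day

# Two target dates per month; each is nudged to the next working day.
print(next_upgrade_day(date(2024, 12, 25)))  # Christmas Day -> Fri the 27th
print(next_upgrade_day(date(2024, 12, 14)))  # Saturday -> Monday the 16th
```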
What follows is the story of that shift - and why the most interesting engineering challenges are still ahead of us.
Many of us grew up professionally inside the monolith. Everything lived in one place. Data was easy to reach. Consistency was immediate. Debugging meant reading a single flow and knowing exactly where things went wrong.
In mid-2020, as our traffic increased to 150k requests per second at peak hours, that simplicity turned into a constraint. Some endpoints triggered hundreds of database queries. Others spanned dozens of logical databases. Latency between regions created unpredictable behavior. The entire platform lived inside one large failure domain, which meant any issue could cascade much further than it should.
The monolith didn’t just slow us down - it held us back from being truly global.
Our first step wasn’t to break the monolith apart. It was to understand it. A few engineers introduced Domain-Driven Design as a way to map responsibilities and expose natural boundaries inside the application. This quickly became more than a technical exercise. It gave teams clarity about what they owned. It highlighted places where responsibilities were tangled together. It made development faster simply because people weren’t stepping on each other’s toes anymore. Eventually, it guided how we restructured teams and how we planned the future of the architecture.
DDD didn’t give us all the answers, but it gave us the vocabulary to find them.
It took us at least two years to understand the domains, as we identified almost 300 of them. But once we understood the domains, another problem became obvious: the entire system relied on synchronous communication. And while that worked fine in a single region, it didn’t survive real-world distributed conditions.
Every synchronous call added latency. Every tight integration increased fragility. And any workflow that needed data across regions suffered from unpredictable delays.
Shifting to business events changed that. Instead of expecting a remote service to respond in real time, services could publish state changes and let other domains react whenever they were ready.
For multi-step workflows, we are introducing Saga-style orchestration. Instead of trying to fake distributed transactions, we are embracing compensations, retries, and eventual completion. These ideas required new habits, new coding patterns, and a new mindset - but they let us operate reliably across geographic boundaries.
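A Saga in this style runs each step and, on failure, executes the compensations for the already-completed steps in reverse order. A generic sketch of the pattern, not our internal tooling (step names are invented):

```python
log = []

def run_saga(steps):
    """steps: list of (name, action, compensation). Returns True on success."""
    done = []
    for name, action, compensate in steps:
        try:
            action()
            done.append((name, compensate))
        except Exception:
            # Undo completed steps in reverse order instead of
            # pretending we had a distributed transaction.
            for _, undo in reversed(done):
                undo()
            return False
    return True

def fail():
    raise RuntimeError("remote region unavailable")

ok = run_saga([
    ("reserve", lambda: log.append("reserve"), lambda: log.append("unreserve")),
    ("charge",  lambda: log.append("charge"),  lambda: log.append("refund")),
    ("ship",    fail,                          lambda: log.append("cancel")),
])
print(ok, log)  # compensations ran in reverse: refund, then unreserve
```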
This was the moment the platform started to behave more like a distributed system and less like a stretched monolith.
In the next part, we will move from why the architecture had to change to how it operates at a global scale. We will look at the concrete decisions behind our multi-region model, why we centralized writes while distributing reads, and how events and projections make that possible. This is where the platform stops being just distributed in theory and starts delivering predictable performance worldwide.
Operating across continents forces you to rethink even the assumptions that once felt foundational. One of the biggest decisions we made was choosing not to shard our primary data models across regions. Instead, we embraced a model where all writes happen in the primary site and read-only projections are replicated around the world. This gives us a single source of truth for writes while still providing fast, local reads to users regardless of geography.
It wasn’t an easy choice. It meant accepting that global consistency would always lag by at least a few moments and that network behavior would play a real role in how fresh data appears in different regions. But it also avoided the complexity and operational cost of a fully sharded, multi-writer system - complexity that becomes hard to justify unless the business absolutely demands it.
What this approach gave us, though, was predictable behavior under load. We started building our projections to tolerate replication delays, survive partial failures, and recover automatically when regions fall behind. Most importantly, we are learning to evaluate every domain and feature through a new lens: how does its read path behave globally, how sensitive is it to freshness, and what happens when the network isn’t cooperating?
Today, this model allows us to serve hundreds of thousands of requests per second worldwide while keeping write logic centralized and robust. Eventual consistency is being built into how the system works, not an edge case we try to hide. And that clarity is making our platform both more resilient and easier to evolve as we plan to expand to more regions.
One of the biggest improvements came from separating how we write data from how we read it. Features like feeds, search, and listing pages need fast, region-local access. Depending on remote services simply isn’t an option once you operate across continents.
The answer was to build read-optimized data projections generated directly from our event streams. Each team could decide what their projection looked like, how it should be optimized, and where it should live geographically. This reduced cross-team dependencies and made performance far more predictable.
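A read projection is essentially a fold over the event stream, shaped however the owning team needs. A minimal sketch, assuming versioned events and last-write-wins semantics (the field names are illustrative):

```python
# Illustrative events as they might arrive from the stream, possibly out of
# order across regions; each carries a version for last-write-wins.
events = [
    {"item_id": 1, "version": 1, "title": "Jacket", "price": 20},
    {"item_id": 2, "version": 1, "title": "Boots",  "price": 45},
    {"item_id": 1, "version": 3, "title": "Jacket", "price": 15},
    {"item_id": 1, "version": 2, "title": "Jacket", "price": 18},  # late event
]

def build_projection(events):
    """Fold events into a read-optimized view, ignoring stale versions."""
    view = {}
    for e in events:
        current = view.get(e["item_id"])
        if current is None or e["version"] > current["version"]:
            view[e["item_id"]] = e
    return view

view = build_projection(events)
print(view[1]["price"])  # 15: the late version-2 event did not win
```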
By mid-2026, these projections should power many of our most visible features. They’re a key part of our strategy for low latency and global scale.
None of this would have been possible if we hadn’t shifted how teams think about building software. The monolith encouraged a mindset where consistency was free, and data lived everywhere. Distributed systems demand the opposite. And we are talking about ~50 teams.
Teams learned to design for retries, compensations, idempotency, and partial failure. They had to build experiences that hold together even when some events arrive late or in a different order than expected. And perhaps most importantly, they had to take real ownership of domain behavior from end to end and not just code paths.
Conversations changed. Instead of debating individual endpoints, teams talk about flows, boundaries, event lifecycles, data freshness, and recovery. This shift has made our architecture stronger, and it has made our engineering culture stronger too.
We are now ready to run a hybrid architecture built around services, events, and globally replicated projections. The foundations are in place, but we’re very much in the middle of the journey. A significant part of our work today is focused on strengthening the platform itself: improving our async tooling, defining clear standards for how projections and consumers should be built, and making sure our infrastructure can sustain the traffic patterns we’re seeing — and the ones we know are coming.
We’re still refining the rules for how events flow through the system, how projections handle late or conflicting updates, and how consumers recover after interruptions. A lot of energy is going into making the development experience smoother: better local tooling, more predictable event schemas, cleaner testing patterns, and clearer guidelines for how domains should emit and react to events. At the same time, the infrastructure side is evolving to support larger volumes, faster replication, and better observability across regions.
There’s plenty left to do. Some domains still need to be extracted. Some projections need to be redesigned for scale. Some need to be designed from scratch. Our event propagation paths can get faster, and our recovery mechanisms can become more automated. The long-term goal is to reach a point where operating a distributed, event-driven system feels no more complicated to an engineer than working inside the monolith once did, but with all the resilience, clarity, and global performance benefits of the new world.
We’ve built the basic shape of the platform we want. Now we’re tuning it, scaling it, and making it something teams can rely on with full confidence as the company keeps growing.
If you’ve spent enough years in engineering, you can tell when a team is solving real problems versus rearranging abstractions. The work we’re doing sits firmly in the first category. We’re building systems that have to hold together across continents, under real traffic (we have already reached 300k requests per second, and it is growing steadily), in environments where eventual consistency, replication delays, and partial failures aren’t theoretical edge cases - they’re everyday constraints we have to design for.
You’d be joining a group of people who care deeply about getting the fundamentals right. The problems are complex in a way that rewards good engineering instincts: modeling domains cleanly, designing robust asynchronous flows, understanding how events propagate through a large system, and building projections that remain fast and correct under load. There’s room here for engineers who enjoy thinking holistically, who appreciate clarity in domain boundaries, and who like improving the systems that everyone else will depend on for years.
You’d have influence, not in the “we have a committee for that” sense, but in the way where well-reasoned ideas actually shape how the platform evolves. If you see a gap in our tooling, you can fix it. If you find a better pattern for consumers, you can drive its adoption. If you notice a weakness in our replication or event flows, you can help redesign them. This is the kind of environment where senior engineers don’t just write code - they leave fingerprints on the architecture.
And perhaps most importantly: we’re not done. The foundations are in place, but many of the hardest challenges are still ahead. We’re scaling across new markets, pushing more traffic through the system, and tightening the guarantees we provide while keeping the developer experience simple. If you’re the kind of engineer who enjoys working on systems that matter, who wants real ownership, and who’s motivated by building the kind of platform that other teams can stand on with confidence, we’d love to talk to you.
You could make a meaningful impact here - not someday, but immediately.
When we started migrating Vinted’s data infrastructure to the cloud, we set out to create a decentralized way of working. The idea was simple: teams know their data best, so they should be fully empowered to build, own, and operate their pipelines without a central platform team getting in the way.
In that early phase, this worked reasonably well. Teams were moving fast, experimenting, and shipping value. They orchestrated their pipelines independently, inside their own domain. But as the platform grew, reality caught up with us: handling dependencies between decentralized teams requires a sophisticated solution.
In practice, teams were constantly using each other’s data assets. A marketing model would rely on product events; a finance report depended on operational data; a machine learning feature set pulled from three different domains. The business logic was naturally cross‑cutting, but our orchestration model pretended that domains were islands. This led to a subtle but very real problem: coordination moved from code into endless meetings.
We were left with a pretty hefty task to solve: how do we make sure that these domains naturally fit together, so as to complete Vinted’s puzzle of data pipeline orchestration?
The goal of our decentralized setup was to let teams work autonomously, without constantly leaning on a central data platform. They got their own infrastructure in GCP, their own dbt project, and were expected to run their own pipelines on an Airflow instance provided by the data platform team.
At the same time, we didn’t have Airflow experts scattered across the organization, and we didn’t want to create that requirement. Asking every team to hand‑craft DAGs and become fluent in Airflow would distract them from doing what they do best: creating impactful data models that positively influence Vinted.
In fact, the “classic” way to run dbt with Airflow leans into that idea: keep Airflow simple and let dbt handle the complexity. You schedule a small number of tasks, often just a dbt run executed as a Bash command, and dbt resolves the full dependency graph internally. Airflow doesn’t try to mirror dbt’s model-level lineage; it just triggers the job and reports whether it succeeded.
We followed the same philosophy, but adapted it to our scale and constraints. A single end-to-end execution was simple, but it didn’t fit our cost profile: if something failed late in the run, the easiest recovery was often to rerun the whole job, which meant recomputing (and paying for) a lot of already-finished work, including some very large tables. To keep retries cheaper and failures more contained, we split execution into layers: one Airflow task per dbt “layer”. Airflow would call dbt with “run the staging layer,” “run the fact layer,” “run the mart layer,” and dbt would take care of the rest. Inside each job, dbt figured out the dependency graph within that layer and executed the models in the right order.
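As a rough sketch, the per-layer invocation could look like the following. The tag-based selectors are an assumption about project layout, not necessarily how our dbt project is organized:

```python
# Hypothetical sketch of "one Airflow task per dbt layer".
# The tag-based selectors are an assumption about project layout.
LAYERS = ["staging", "fact", "mart"]

def dbt_command_for_layer(layer: str) -> str:
    """Build the shell command one Airflow task would run.

    dbt resolves the model-level dependency graph *within* the layer;
    Airflow only sequences the layer tasks.
    """
    return f"dbt run --select tag:{layer}"

commands = [dbt_command_for_layer(layer) for layer in LAYERS]
```

Each command would be handed to something like a BashOperator, with the layer tasks chained in order.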
This kept things approachable, but it had sharp edges. If an unrelated staging model broke, the entire staging layer task would fail and everything downstream would be blocked. The figure shows an example:
The dbt lineage shows that mrt_items should complete without problems even if int_orders fails. However, because Airflow doesn’t have this granularity, it never even starts the jobs downstream of the intermediate layer. Data that didn’t actually depend on the broken model still arrived late. This was only the beginning of our troubles.
This “dbt handles the graph, Airflow runs the job” approach works extremely well at small scale. However, once you’re dealing with thousands of models spread across ~20 teams, the lack of model-level transparency becomes a real operational problem, especially when dependencies cross team boundaries. When something breaks, it’s no longer obvious what’s actually blocking what, which missing dependency caused the failure, or who owns the upstream piece that needs fixing. In a decentralized setup, that ambiguity is expensive: tracing issues turns into detective work, and responsibility becomes harder to pinpoint. The same lack of visibility forces teams to wait for an entire upstream pipeline to finish, because they can’t reliably tell when the specific piece they depend on is actually done. You’d hear questions like:
We had successfully decentralized ownership, but we had accidentally introduced fragility in the hand‑offs between teams.
We believe decentralized teams are what we need to scale, as Vinted grows. So we needed a way to remove the cognitive load of “orchestration trivia” from domain teams, especially around cross‑domain dependencies.
The key design goal was this: Let teams think in terms of data models and lineage, not in terms of pipeline scheduling and cross‑pipeline wiring.
To get there, we focused on two things:
We already had the perfect source of truth for dependencies: the dbt manifest. It knows which model depends on which, how data flows through the domain, and where the boundaries between sources and transforms lie.
So we built a DAG generator that:
By unfolding the lineage into a task‑per‑model structure, we gained granularity and flexibility. Suddenly, we weren’t just running “the staging layer”, we were running concrete, addressable units of work that mapped directly to dbt models. That opened the door to do something much more powerful across domains.
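A stripped-down sketch of the idea, using a hand-written stand-in for dbt’s manifest.json (the model names echo the lineage example above; the real generator emits Airflow tasks rather than returning a plain mapping):

```python
# Sketch of a task-per-model DAG generator driven by the dbt manifest.
# The dict below is a simplified stand-in for dbt's manifest.json; the real
# generator emits Airflow tasks instead of returning a mapping.
manifest = {
    "nodes": {
        "model.shop.stg_orders": {"depends_on": {"nodes": []}},
        "model.shop.int_orders": {"depends_on": {"nodes": ["model.shop.stg_orders"]}},
        "model.shop.mrt_items": {"depends_on": {"nodes": []}},
    }
}

def build_tasks(manifest: dict) -> dict:
    """Return one task per dbt model, mapped to its upstream tasks."""
    tasks = {}
    for node_id, node in manifest["nodes"].items():
        task_id = node_id.split(".")[-1]  # e.g. "stg_orders"
        tasks[task_id] = [
            dep.split(".")[-1]
            for dep in node["depends_on"]["nodes"]
            if dep.startswith("model.")
        ]
    return tasks

tasks = build_tasks(manifest)
# mrt_items has no upstream tasks, so a failure in int_orders no longer blocks it.
```

With tasks at model granularity, a broken model only blocks its own downstream models instead of an entire layer.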
With task‑per‑model pipelines in place, the next step was to actually wire domains together. Practically, that meant setting up sensors that could wait on upstream work in other domains and only move forward when the right data was ready. Conceptually, the problem is simple (“don’t start this until that has finished”), but at platform scale the implementation details matter: who do you wait on, how do you express that, and how do you keep those relationships from turning into spaghetti?
Airflow already has an opinionated way to model cross-DAG dependencies: Airflow Assets. They’re event-driven, first-class citizens, and looked like the perfect fit for connecting domains without tight scheduling coordination.
Unfortunately, we found ourselves running into a hard limitation quite early on: Airflow Assets operate at the DAG level. An Airflow Asset update can trigger an entire downstream DAG, but we needed something more precise. Our pipelines are owned end-to-end by domains, and we wanted to keep that ownership boundary intact: upstream domains shouldn’t be “starting” other teams’ pipelines, and downstream domains shouldn’t have to understand (or care) how upstream work is split across DAGs or tasks. What we needed was task-level unblocking inside larger pipelines: resume this specific unit of work as soon as that specific upstream unit is ready.
We found a more fitting candidate in the ExternalTaskSensor. It lets a task in one pipeline wait for the completion of a specific task in another DAG, exactly the fine-grained dependency we were after. However, this came with two obvious downsides. First, if teams wired sensors by hand, we’d end up with a fragile web of hard-coded references that’s difficult to validate, painful to refactor, and easy to break silently. Second, the mechanism is polling-based and timeout-driven, and in real life upstream tasks sometimes finish after a downstream sensor has already timed out, turning “just rerun it” into an operational habit.
So we set out to enrich this candidate to ensure we solve both downsides. To achieve this, we built an Asset Registry: a central catalog of all tasks and their relationships. It knows which domain, pipeline, and dbt model each task belongs to, and how tasks depend on each other across domains. We use it in CI/CD to validate that upstream references are valid and to attach metadata like “when should my data be available?” and “which task should I poll for completion?”. This metadata is collected automatically, as it is already available in the dbt manifests.
For engineers, this means they don’t wire pipelines together directly. They simply say “this model depends on that model,” and the combination of DAG generator and Asset Registry turns that into concrete task‑level dependencies, distributed amongst decentralized data pipelines, using ExternalTaskSensor behind the scenes. This effectively solved the wiring problem. One down, one to go.
To solve the timeout problem, we use the registry too. When an upstream task completes, even if it’s late, we look up all downstream sensors that depend on it (including the ones that have already timed out) and mark them as satisfied via a completion callback. Downstream pipelines then continue automatically, without manual restarts.
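A toy sketch of that reconciliation idea (the class and method names are invented for illustration; the real registry also stores domain, pipeline, and dbt-model metadata):

```python
# Toy sketch of the completion callback: when an upstream task finishes
# (even late), all downstream sensors waiting on it -- including ones that
# already timed out -- are marked satisfied. Names are invented for illustration.
class AssetRegistry:
    def __init__(self) -> None:
        self._downstream = {}   # upstream task -> sensor ids waiting on it
        self.satisfied = set()

    def register_dependency(self, upstream_task: str, sensor_id: str) -> None:
        self._downstream.setdefault(upstream_task, []).append(sensor_id)

    def on_task_complete(self, upstream_task: str) -> list:
        """Completion callback: satisfy every dependent sensor, timed out or not."""
        sensors = self._downstream.get(upstream_task, [])
        self.satisfied.update(sensors)
        return sensors

registry = AssetRegistry()
registry.register_dependency("finance.int_orders", "marketing.wait_for_int_orders")
unblocked = registry.on_task_complete("finance.int_orders")
```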
From an engineer’s perspective, this complexity is invisible. They don’t restart stuck runs, chase timing mismatches between teams, or track who depends on what in their heads. The platform reconciles dependencies as tasks complete and makes the behavior transparent and deterministic: you can always ask the registry who depends on a task and why something did or didn’t run.
Not only does our approach solve the dependency issues, it also sheds light on a complex and decentralized data landscape. In an intricate web of domain dependencies, it can become tricky to understand who depends on the data assets you own. This created a risky environment for our engineers to make changes to their own assets. Upon introducing breaking changes, like altering a schema, they were tasked with finding out which domains were using the asset. Often this resulted in many back-and-forths and meetings that could otherwise have been avoided.
Our Asset Registry unlocks the ability in CI/CD to understand which model is going to be changed, and which teams depend on said model. We can simply collect these scenarios and post them in the body of the PR the engineer is working on. By adding the Slack channel, we provide a simple and effective way to understand who to reach out to.
A standardized DAG generator has become one of the most valuable pieces of our platform. Because every pipeline is created through this generator, we effectively hide DAG authoring from users and constrain them to a small, curated set of building blocks. Under the hood, those map to a limited number of Airflow operators and patterns we control, which means we only need to test and maintain a narrow surface area instead of a zoo of custom DAGs.
The trade-off is that Airflow has a huge ecosystem of operators and built-in features, and our generator only exposes a small subset of them. Sometimes that means we can’t use a capability straight out of the box, or we have to reimplement parts of it inside the generator to keep the interface consistent. Still, the leverage we get from standardization is worth it.
The interface for users stays stable: they describe models and dependencies in the same way, regardless of what’s happening underneath. This gives us freedom to change the generator’s output when we need to. If we want to tweak retries, swap an operator, or adopt a new Airflow feature, we update the generator and regenerate the DAGs. Teams don’t have to manually configure anything in their pipelines.
This approach really paid off when we upgraded to Airflow 3. We adapted the generated DAG structure and operators, rolled out the new generator, and were done. For engineers, the migration was almost invisible; for us, it was a controlled platform change instead of a manual cleanup of dozens of hand‑written DAGs.
And for most of our engineers, that’s exactly how it should be: they think in terms of data, while the platform quietly does its job in the background.
We had the privilege to present this solution in more detail at the Airflow Summit in Seattle, at PyData Amsterdam, and on Astronomer’s The Data Flowcast. Please find the links here if the above piqued your interest!
The low-recall aspect of keyword-based search is challenging when dealing with content that is both very visual and multilingual. Returning “no results” even when there are relevant items is a missed business opportunity. To address the challenge, the initial experimentation with dense embeddings-based retrieval can be traced back to internal hackathons as early as 2022.
To prove the business value and to maximise learning, we started by filling low-recall search sessions with dense retrieval matches. Even a small increase in recall translated into improved search metrics, which gave us more confidence in the technique.
Attempts to include dense retrieval matches in all search sessions started in the spring of 2024. Once the technical approach and business value were proven in one market with one dominant language, it was time to work on scaling the solution.
Some 50 AB tests later, dense retrieval is fully enabled. Getting there required numerous model improvements and a good deal of engineering wizardry, which are covered in the remainder of this post.
At its core, our system is a Two-Tower Model:
The frozen pre-trained multilingual CLIP model is our base for further fine-tuning. We train a “projection head” with GELU activations and LayerNorm. That not only complements CLIP’s general-purpose knowledge with our search domain specifics, but also makes training fast and efficient.
An item’s representation is a fusion of signals, including its image and metadata (brand, color, category, price, etc.):
These vectors are concatenated and passed through a final fusion layer. This allows the model to learn the complex interactions between all features (e.g., how a specific brand relates to a specific image style).
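A toy illustration of that fusion step, with made-up dimensions and a fixed averaging function standing in for the learned fusion layer:

```python
# Toy illustration of signal fusion: per-signal embeddings are concatenated,
# then mapped through a fusion layer. Dimensions are made up, and the fixed
# averaging "layer" stands in for the learned network.
image_emb = [0.1] * 4   # stand-in for the CLIP image embedding
brand_emb = [0.2] * 2   # learned brand embedding
color_emb = [0.3] * 2   # learned color embedding

fused_input = image_emb + brand_emb + color_emb  # concatenation

def fusion_layer(vec, out_dim):
    """Stand-in for the learned fusion layer: here just a fixed linear map."""
    mean = sum(vec) / len(vec)
    return [mean] * out_dim

item_vector = fusion_layer(fused_input, out_dim=3)
```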
Including the textual item information in the model turned out to be challenging, but it remains a promising direction for future improvements; perhaps a much larger training dataset or different incorporation techniques are worth a try.
The two towers are trained together using contrastive learning. The model learns to pull positive (relevant) query-item pairs together while pushing negative (irrelevant) pairs apart. For every (query, positive_item) pair, we force the model to distinguish it from 7,000–10,000 random negative items.
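A minimal, pure-Python sketch of this contrastive objective with in-batch negatives (a 2x2 similarity matrix instead of thousands of negatives; the temperature value is illustrative):

```python
import math

# Minimal in-batch contrastive (InfoNCE-style) loss sketch. sim[i][j] is the
# similarity of query i and item j; the positive pair sits on the diagonal,
# every other item in the batch acts as a negative. Temperature is illustrative.
def contrastive_loss(sim, temperature=0.05):
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[i] - log_denom)  # cross-entropy with target i
    return total / len(sim)

# Positives more similar than negatives -> low loss:
good = contrastive_loss([[0.9, 0.1], [0.2, 0.8]])
# Positives less similar than negatives -> high loss:
bad = contrastive_loss([[0.1, 0.9], [0.8, 0.2]])
```

Training pulls the diagonal (positive) similarities up and pushes the off-diagonal (negative) ones down, which drives the loss toward zero.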
To get this training to converge, we use a full recipe of such practices:
With the same training recipe, we saw further model improvements by scaling the dataset 10x, to more than 100 million positive pairs.
For deployment, converting the item model to ONNX introduced complexities in handling categorical feature preprocessing. Our solution is to integrate the preprocessing logic directly into the ONNX graph by using a masking technique to filter and manage out-of-vocabulary (OOV) inputs.
Removing irrelevant nearest neighbors requires picking an arbitrary similarity threshold. That threshold not only needs manual tuning combined with further AB-test adjustments, but also has to be re-tuned after every model retraining. Ideally, it could instead be adjusted per query at query time, based on some estimated recall value.
The entire implementation is within a Vespa application package.
In the high-level architecture diagram above, we see that the query and feeding containers in the Vespa stateless layer are physically separated. The separation allows resources to be added individually and on demand. Query cluster nodes run the query model configured as a standard Vespa embedder. The feed cluster nodes contain the item tower, invoked from a custom Vespa document processor that calls a model evaluator and adds the result to the document update operation. Both query and feed nodes communicate with the same content cluster, whose nodes each contain an HNSW index over a portion of the dataset. All model files and configuration are managed within a single Vespa application package.
It is also worth mentioning that initially calculating image embeddings for just the primary photo of every item took about one month. The image embeddings weigh around ~1 TB of memory (10^9 items * 512 dimensions * 2 bytes per dimension); the item embeddings weigh half as much.
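The ~1 TB figure follows directly from the arithmetic:

```python
# Back-of-the-envelope arithmetic behind the ~1 TB figure.
items = 10**9           # item count
dims = 512              # embedding dimensions
bytes_per_dim = 2       # 2-byte (16-bit) values

image_embedding_tb = items * dims * bytes_per_dim / 10**12
print(image_embedding_tb)  # → 1.024
```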
Combining the fast HNSW search with filtering is a major performance challenge. This combination requires extensive optimization, even in a high-performance system like Vespa, to stay within a tight latency budget. Below we list such optimizations roughly sorted by complexity.
The worst-case scenario for ANN search has high tail latencies because a lot of vector comparisons are needed. One simple approach to lower the latency is to make HNSW smaller by splitting the data into more content nodes and executing more searches in parallel. Currently, we have 30 nodes per content group.
Vinted has some markets connected, e.g. members in Spain can see French items. Markets that are transitively connected form a logical bloc. We leveraged that to create smaller HNSW indices per bloc. This was achieved by creating 3 separate Vespa indices that are deployed to the same content cluster.
This trick not only made HNSW searches more manageable, but as a side effect, it also sped up all search requests. However, the indexing and querying now need to be routed to the correct indices.
Vespa dynamically decides which nearest neighbor search strategy to apply based on the estimated hit ratio. Thresholds can be tuned to balance latency vs. resource utilization. During benchmarking we discovered that the sweet spot was to set ranking.matching.approximateThreshold to a value that translates to ~1M documents per index per content node, with 8 threads per search request.
Note that there is no perfect threshold.
Search requests have a tight latency budget of 500 ms. Even with the tuned thresholds, we’d still get some timeouts due to how the approximate nearest neighbor search is executed. The worst part is that such failures end in 0 results sessions. Knowing that the exact nearest neighbor search doesn’t have such a failure mode we implemented a retry strategy:
ranking.matching.approximateThreshold=1 (forcing the exact strategy) with a timeout of 150 ms. The sequence diagram below explains the flow.
Typically, a Vespa timeout means that the summary fetching was skipped. Such responses are mostly unusable as the document data is not in the response. To “recover” such requests, we leveraged the match-features to return a document ID as a tensor. The strategy eliminated most of the timeouts.
The key requirement was to integrate the dense retrieval into the existing setup. This means:
It might be hard to believe, but all of the above happened for real user queries.
A query zx750 with a brand filter has ~300 lexical matches, and with the tuned distance threshold the nearest neighbor search matches another ~7,000 items. The problem with those 7,000 items is that they are mostly random things, and the lexical matches are spread across those ~7,300 hits. The poor matches are due to the BERT-based tokenizer producing the tokens [CLS] z ##x ##75 ##0 [SEP], which tend to have high similarity with random items. If the brand filter is removed, we get the expected ~450 matches (~300 lexical and ~150 nearest neighbor).
It turns out that the dynamic switching between the exact and approximate nearest neighbor search strategies introduces these failure modes. It boils down to how the targetHits param is handled in each strategy. Each nearestNeighbor query clause has a targetHits param. However, it is only a target, not a limit! Meaning that Vespa is free to add nearest neighbor matches, and it does so when the execution strategy is the exact nearest neighbor. For example, when a filter is added to the search request, surprisingly, there can be more matches than without the filter. Also, targetHits is per content node, so the total limit is targetHits * number_of_content_nodes. Not ideal. Note that the approximate nearest neighbor execution strategy returns exactly targetHits per content node.
This behaviour originates in the matching phase, so the earliest place we can implement a workaround is later, in the ranking phases.
With the consistency requirements and the technical limitation in mind, there were only a few paths to take:
Taking the blue pill and ignoring the problem (1) was not really an option, as it would mean either killing a promising project or leaving the expensive feature as a mere filler for low-result queries. Issuing two requests (2) felt unattractive, as it would introduce enough complexity (e.g. around faceting) to make the codebase toxic to the extent that nobody would dare touch it.
We decided to proceed with taking the red pill and creatively pushing the complexity down to Vespa (3). We believe that reading Vespa documentation and following a ranking profile is easier than reading a hacky implementation code.
Reciprocal rank fusion, a.k.a. RRF, is a Vespa ranking feature available in the global-phase. It is defined as rrf_score = 1.0 / (k + rank). The score depends on an arbitrary parameter k and the rank (position) within a virtual list of search results ranked by some ranking feature, e.g. nearest neighbor similarity.
If you squint, when k=0 the rrf_score is simply the reciprocal of a position. We can find the number that represents the N-th position: rrf_score@N = 1/N, e.g. rrf_score@160 = 1/160 = 0.00625.
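In code, the correspondence between score and position is:

```python
# The RRF score and its "position" reading at k=0.
def rrf_score(rank, k=0.0):
    return 1.0 / (k + rank)

# With k=0 the score is the reciprocal of the position, so a score threshold
# corresponds directly to a position cut-off:
assert rrf_score(160) == 1 / 160 == 0.00625
```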
All documents matched by the nearestNeighbor query clause have a non-zero itemRawScore rank feature. However, to determine whether a document is matched only by nearestNeighbor, we also need to know that it was not matched by any other query clause.
The YQL conceptually looks like this:
SELECT *
FROM documents
WHERE filters AND (lexical_match OR interpretations OR nearestNeighbor)
If a document is a lexical match then textSimilarity(text_field).queryCoverage > 0 must be true.
By interpretations we mean query rewrites into filters on metadata, e.g. the query `red dress` becomes the filter color_id=10 AND category_id=1010. The complication is that there might be multiple interpretations, and we need to know whether the document is a full match of at least one of them. There is no easy way to calculate such a condition. A workaround is to introduce a boolean attribute field whose value is always false.
field bool_const type bool {
indexing: attribute
attribute: fast-search
rank: filter
}
Then rewrite each interpretation, AND’ing it with the condition bool_const=FALSE (which always evaluates to true, since the stored value is always false), e.g.
color_id=10 AND category_id=1010 AND bool_const=FALSE
Now, if the rank feature attributeMatch(bool_const) > 0 then the document is fully matched by at least one interpretation. To distinguish between interpretations a field per interpretation would be needed.
Finally, we can identify if the document is only a nearest neighbor match:
function match_on_nearest_neighbor_only() {
expression {
if(itemRawScore(dense_retrieval) > 0
&& textSimilarity(text_field).queryCoverage == 0
&& attributeMatch(bool_const) == 0,
1,
0
)
}
}
Matches from all content nodes are available in the container node in the global ranking phase. It supports a rank-score-drop-limit parameter that can be used to remove matched documents whose score is lower than some constant numeric value. This feature was contributed to Vespa.
The trick to filter out documents is to change the score of some matches to a value lower than the rank-score-drop-limit.
A nice thing is that this parameter can be passed as an HTTP request parameter to have a control per request or for AB testing.
Here is the implementation that limits the nearest neighbor matches count using the tricks described above:
global-phase {
expression {
if(reciprocal_rank(itemRawScore(dense_retrieval), 0) >= 1.0 / query(nn_hits_limit)
|| match_on_nearest_neighbor_only == 0,
relevanceScore,
-100000.0)
}
rerank-count: 20000
rank-score-drop-limit: -100000.0
}
It works like this: when the document is either a nearest neighbor match up to the query(nn_hits_limit) position, or it is not only a nearest neighbor match, the document gets the same score as calculated in the previous ranking phases (i.e., no change). Otherwise, the score is set to -100000.0 (far outside the range of normal scores). We take 20,000 documents (effectively “all”) to rerank. After reranking, the documents left at -100000.0 are filtered out by the rank-score-drop-limit.
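A plain-Python simulation of that decision rule (the hit structure, with nn_rank as the position among nearest neighbor matches and nn_only marking documents matched only by the nearestNeighbor clause, is invented for illustration):

```python
# Plain-Python simulation of the global-phase rule above. The hit structure
# (nn_rank = position among nearest neighbor matches, or None; nn_only =
# matched only by the nearestNeighbor clause) is invented for illustration.
DROP_LIMIT = -100000.0

def global_phase(hits, nn_hits_limit):
    for hit in hits:
        within_limit = hit["nn_rank"] is not None and hit["nn_rank"] <= nn_hits_limit
        if not (within_limit or not hit["nn_only"]):
            hit["score"] = DROP_LIMIT            # sentinel score
    return [h for h in hits if h["score"] > DROP_LIMIT]  # rank-score-drop-limit

hits = [
    {"score": 3.0, "nn_rank": 1,    "nn_only": True},   # NN match inside the limit
    {"score": 2.0, "nn_rank": 500,  "nn_only": True},   # NN-only, past the limit
    {"score": 1.0, "nn_rank": None, "nn_only": False},  # lexical match
]
kept = global_phase(hits, nn_hits_limit=160)
```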
Yet another complication is that every result ordering has to deal with the same issue of limiting the number of nearest neighbor matches. The required sorting has to be implemented with ranking profiles because global-phase prevents using the order by clause. Here is an example rank profile for the “newest first” sorting:
rank-profile dense_retrieval_newest inherits dense_retrieval_global_phase {
match-phase {
attribute: first_visible_at
order: descending
}
first-phase {
expression: attribute(first_visible_at)
}
}
Another way to see this complication is as an opportunity: secondary sorting can now use clever scoring that might include personalization, whereas secondary sorting with order by is limited to indexed attribute values.
Using the global-phase reranking as a filter added quite a bit of work to the query container nodes, which increased latencies. To compensate, we experimented with a newer JVM, as Vespa ships with a relatively old OpenJDK 17.
In the past, we’ve had great results with GraalVM when running Elasticsearch. Why not try it with Vespa? And it turns out that it works pretty well! The tail latencies dropped by double-digit percentages. Later we’ve also configured ZGC so that JVM garbage collection pauses would cause fewer timeouts.
Using the combination of GraalVM and ZGC not only helped with this project but also proved to help with other ultra low-latency use cases.
Even though the current implementation gets the job done, we are not happy. The logic is not particularly complicated, but it has many components, and therefore it is easy to get lost. We’ve added plenty of integration tests that prevent introducing bugs when changing something remotely related.
The worst part of it is that additional requirements can be implemented only by adding even more complexity. To get rid of this complexity, either the consistency requirements need to change, or performance of ANN should be significantly improved, or the model is improved so that all items that pass distance thresholds are relevant.
This reminds us of the Peak of Complexity model introduced by the Java architect Brian Goetz. We’ve just passed the complexity peak, and we’re in the virtuous collapse phase where simplification leads to even more simplification. When looking into parts of the solution, it is not uncommon to hear questions like “what took you so long?”
The dense retrieval is a significant improvement to the item search. The benefits were proved in lots of AB tests. However, as it is typical for search, good work leads to even more work: the model can get better, the integration into the ranking can be improved, nuance can be introduced when and to what extent the dense retrieval is applied, resource utilization can be optimized, etc.
Even though a lot of work is ahead of us, we are proud of what has been achieved. The overall architecture is relatively simple and contained, which allows for team autonomy. Due to multiple performance optimizations, we’ve reached a <0.02% error rate. The techniques mastered for this feature have laid the foundations for other advancements such as image search and advanced personalization.
We hope that this long blog post was interesting and sheds some light on what it takes to build a significant search feature on a billion-scale e-commerce dataset.
By selecting only match-features, Vespa can apply a new optimisation, which can be a lifesaver when data is frequently redistributed.
When ranking, Vespa sometimes requires information that cannot easily be stored within the documents being ranked. For example, statistical cross features between the user and the document, such as a counter of how many interactions the user has had with documents of a given category, can’t be stored within an item itself. Instead, you need to pass them as parameters via ranking features. An important question is: where do you store and fetch that information from?
When Redis became a bottleneck for this task, we decided to try Vespa itself. Why?
And of course, such tasks require single-digit millisecond latency.
Initially, using Vespa for these use cases worked well and looked like a great success. However, the proverbial honeymoon ended when the number of schemas (50+) and the update rate (500M+ per hour) skyrocketed. We noticed that tail (p99+) latencies sometimes jumped to 100 ms+, seemingly out of nowhere, though the jumps often correlated with high feeding bursts (i.e. they occurred right after a burst). Such high latencies are unacceptable when the latency budget is 50 ms.
We noticed that the spike in tail latencies always happened after a burst of feeding requests. E.g. features that are recalculated hourly/daily for all Vinted users (i.e. 100+ million records) create such bursts. When latencies started spiking more and more frequently, it was a signal to have a closer look to see what the cause was.
The diagram above shows that a spike in p99 latency occurred after a feeding burst.
After inspecting the logs during such a latency spike, there were multiple records such as the following:
Docsum fetch failed for 36 hits (64 ok hits), no retry
The log says that some document summaries have failed.
After a guru meditation session at the Vilnius office sauna (which is intended for exactly this type of work), we concluded that data moving around the cluster was causing problems for document summary fetching.
This theory was quickly confirmed by checking the dashboard on data redistribution.
With this evidence, the problem to solve was clear. But first, we need to familiarise ourselves with how Vespa executes queries.
This is the typical query execution flow:
The diagram above shows that a request first comes to a Vespa container node. It is then scattered to all content nodes (typically over the network) of an available content group, and responses with local Top-K hits are gathered back in the container node. For the global Top-K matching documents, a .fill() request is then sent to the relevant content nodes to fetch the document summaries (i.e. document data or calculated values).
When a data redistribution is ongoing during query handling, it might happen that between the query execution and .fill() (which typically takes a couple of milliseconds), a document is moved from one content node to another (or the content node is down, or some other unexpected situation that happens in distributed systems occurs). To handle such a situation, Vespa queries all known content nodes for the summary data, potentially doing multiple retries.
Typically, summary fetching takes ~1 ms, but we’ve seen summary fetching taking ~100 ms.
A small nuance to note about the query execution flow is that, with the first response from content nodes, the matchfeatures can be returned.
Match features are rank features added to each hit in the matchfeatures field. The feature was added to Vespa in 2021. The values can be either floating point numbers or tensors (but not strings, booleans, etc.). Typically, they are useful for recording the feature values used in scoring, for further ranking optimisation.
A clever trick to encode non-numeric data, e.g. a string label, is to convert it into a mapped tensor. If you squint a little, the mapped tensor looks like a regular JSON object.
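For example, a label can round-trip through a mapped-tensor-shaped dict like this (a sketch of the encoding idea only, not Vespa’s API):

```python
# Round-tripping a string label through a mapped-tensor-shaped dict.
# This sketches the encoding idea only; it is not Vespa's API.
def encode_label(label):
    """A mapped tensor with a single cell keyed by the label itself."""
    return {label: 1.0}

def decode_label(cells):
    """Recover the label: the key of the (single) non-zero cell."""
    return next(k for k, v in cells.items() if v != 0.0)

tensor = encode_label("MY_LABEL_VALUE")
assert decode_label(tensor) == "MY_LABEL_VALUE"
```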
By knowing the problem and being familiar with matchfeatures, we can draft a workaround for summary fetching.
Luckily, we’ve already thought about such an optimisation! What if everything we needed could be fetched with the select match-features from …? Summary fetching would then not be required, and .fill() could be eliminated.
The enthusiasm led to a quick proof of concept; however, the benchmarks surprisingly showed no improvement at all. This led to an inspection of the query trace, in which we found that the summary was being fetched! This was confirmed with the metrics on the Vespa side, on docsum operations.
It was high time to roll up our sleeves and do some open source work. The feature was released with Vespa 8.596.7.
Open source contributions take time (the review, accept, release, adopt cycle can take weeks), but we needed a solution quickly. Vespa is extremely flexible, and we could alter the platform ourselves with a plugin: add a bundle JAR file into the components/ directory and configure the search chain.
Let’s explore the Vespa application setup. First, we need to create a rank profile that encodes data into tensors.
schema doc {
    document doc {
        field my_feature type string {
            indexing: attribute
        }
    }

    rank-profile fields inherits unranked {
        function my_feature() {
            expression: tensorFromLabels(attribute(my_feature))
        }
        match-features {
            my_feature
        }
    }
}
Then, we need to specify the fields rank profile when querying. As a bonus, we can disable the query cache (it only helps with summary fetching, which we are skipping) and ask for the short form of tensors:
{
  "yql": "select matchfeatures from doc where true",
  "ranking": "fields",
  "ranking.queryCache": false,
  "presentation.format.tensors": "short-value"
}
The response looks like:
{
  "root": {
    "id": "toplevel",
    "relevance": 1.0,
    "fields": { "totalCount": 1 },
    "coverage": {
      "coverage": 100,
      "documents": 1,
      "full": true,
      "nodes": 1,
      "results": 1,
      "resultsFull": 1
    },
    "children": [
      {
        "id": "index:content/0/c4ca42388ce70a10b392b401",
        "relevance": 0.0,
        "source": "doc",
        "fields": {
          "matchfeatures": {
            "my_feature": {
              "MY_LABEL_VALUE": 1.0
            }
          }
        }
      }
    ]
  }
}
Third, the match-features need to be converted into a usable form, either in a custom searcher or in your application.
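A minimal sketch of the application-side conversion, assuming the one-hot mapped-tensor encoding and the field names from the example response:

```python
# Sketch: converting match-features from a Vespa response back into
# usable values, outside a custom searcher. Assumes the short-form
# one-hot mapped-tensor encoding of a string label.

def extract_labels(response: dict, feature: str) -> list:
    """Pull the string label back out of each hit's matchfeatures."""
    labels = []
    for hit in response.get("root", {}).get("children", []):
        tensor = hit["fields"]["matchfeatures"][feature]
        # The mapped tensor is one-hot: the key with the largest
        # cell value is the original string label.
        labels.append(max(tensor, key=tensor.get))
    return labels

response = {
    "root": {
        "children": [
            {"fields": {"matchfeatures": {"my_feature": {"MY_LABEL_VALUE": 1.0}}}}
        ]
    }
}
print(extract_labels(response, "my_feature"))  # ['MY_LABEL_VALUE']
```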
Without summary fetching, the query execution is much simpler.
In the diagram above, one network round-trip is eliminated when compared to the typical query execution. Also, this eliminates all the potential summary fetching problems because documents are findable even during data redistributions.
When the solution was deployed, we immediately noticed a drop in tail latencies: p99 latencies fell from ~9 ms to about 3 ms. But most importantly, the latency spikes during data redistribution were gone!
Currently, the mean query latency with ~7.5k RPS per container node is around 430 microseconds.
The max latencies (pro tip: always monitor max latencies) are typically ~20 ms. The occasional ~200 ms spikes are due to packet loss in the network layer, not anything Vespa-specific.
Even though the optimisation is nice, the journey is not yet finished. There are other ways to get even more out of Vespa. Here are several ideas:
- Add sddocname to the response; however, sddocname is filled only on receiving the summary.

This new trick of selecting only the matchfeatures, available since Vespa 8.596.7, eliminates not only a network round-trip, but also the problems and latencies associated with summary fetching. In our setup, the overhead of converting attributes into tensors and transmitting slightly more data over the network was negligible. Of course, this optimisation is not a silver bullet for all use cases, but when summary fetching is problematic, it really helps.
Kudos to the team for this great work! And thanks to everyone who helped!
A fun fact is that the initial hypothesis for the latency spikes was JVM garbage collector pauses. However, after switching to the generational ZGC, the latency spikes were still there. The garbage collector is almost never the root cause.
This post summarizes our findings, shows how to spot and fix the issue, and gives tips to prevent performance drops from unintended clocksource changes. Fixing this simple config can lead to immediate, measurable improvements in throughput and system efficiency.
Some “identical” servers were running slower, with increased latency and CPU usage. We needed to understand why.
A clocksource is how the Linux kernel keeps track of time (“read the clock!”).
🔎 We dug in and noticed a pattern:
Example log snippet:
Apr 15 18:22:57 srv kernel: TSC synchronization [CPU#0 -> CPU#8]:
Apr 15 18:22:57 srv kernel: Measured 120 cycles TSC warp between CPUs, turning off TSC clock.
Apr 15 18:22:57 srv kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed
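A hedged sketch of how such lines could be detected automatically, e.g. in a log-scanning job; the marker strings are taken from the snippet above:

```python
# Sketch: scanning kernel log lines (e.g. from `dmesg` or journald)
# for signs that the kernel marked TSC unstable. The marker strings
# come from the log snippet above; extend them for your kernels.

TSC_MARKERS = (
    "Marking TSC unstable",
    "turning off TSC clock",
)

def tsc_went_unstable(log_lines) -> bool:
    """Return True if any line indicates the TSC was disabled."""
    return any(marker in line for line in log_lines for marker in TSC_MARKERS)

log = [
    "Apr 15 18:22:57 srv kernel: Measured 120 cycles TSC warp between CPUs, turning off TSC clock.",
    "Apr 15 18:22:57 srv kernel: tsc: Marking TSC unstable due to check_tsc_sync_source failed",
]
print(tsc_went_unstable(log))  # True
```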
We’re not 100% sure. It could be hardware or firmware quirks, or the server being powered off for long periods.
Objective: How much does the clocksource really matter for high-throughput workloads? (Benchmarked via Envoy to Redis proxying.)
On a dedicated server, we ran separate instances of a custom Go benchmark app: one aimed at each Envoy. These apps continuously sent SET and GET commands to Redis at a constant rate, while steadily increasing the number of goroutines at regular intervals, resulting in a steadily growing Redis command RPS over time.
Envoy’s Redis metrics were collected every 10 seconds using a standalone Prometheus server.
HPET = Performance Killer for High-Throughput Workloads. Stick with TSC whenever possible - otherwise, expect increased latency and higher CPU usage.
When does it happen?
The following screenshots show a timeline of the selected clocksource. Periods with no color in a server’s timeline indicate that the server was offline.
# See available clocksources
cat /sys/devices/system/clocksource/clocksource0/available_clocksource
# See current clocksource
cat /sys/devices/system/clocksource/clocksource0/current_clocksource
# Change current clocksource
echo "tsc" | sudo tee /sys/devices/system/clocksource/clocksource0/current_clocksource
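A small sketch of how this could be watched continuously, e.g. from a monitoring agent; the sysfs path is standard Linux, while the check itself is an illustrative assumption, not something from this post:

```python
# Sketch: a check a monitoring agent could run to alert when a server
# silently falls back from TSC to HPET. The sysfs path is the standard
# Linux location; the alerting policy here is an assumption.

CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"

def parse_clocksource(raw: str) -> str:
    """Normalise raw sysfs file contents (strip whitespace/newline)."""
    return raw.strip()

def needs_alert(current: str, expected: str = "tsc") -> bool:
    """Alert when the active clocksource is not the expected one."""
    return parse_clocksource(current) != expected

def check() -> bool:
    """Read the live clocksource from sysfs and decide whether to alert."""
    with open(f"{CLOCKSOURCE_DIR}/current_clocksource") as f:
        return needs_alert(f.read())
```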
As of 12:01 on November 11, 2024, there are 1,000,000,000 documents in the search index. And that number keeps on growing!
Figure 1: Searchable documents count
When I joined the Vinted search team in late 2019 there were about 100M searchable documents. An order of magnitude growth in absolute numbers in just 5 years!
This significant event of leaping into the billion scale was surprisingly uneventful. The diagrams below show that the search system's CPU utilization is low and under control, and that it correlates with the incoming number of requests per second. All in all, the systems run smoothly without breaking a sweat.
Figure 2: Vespa Search node CPU utilization
Figure 3: Requests per second (RPS)
With a constant stream of search requests of various types coming in, our search engine, Vespa, maintains mean latencies well below 20 milliseconds in the data layer.
Figure 4: Mean query latency
Thus, the effort to migrate the search engine from Elasticsearch to Vespa has already paid dividends. According to our internal benchmarking, the current setup could serve the same requests without any changes even if we doubled the number of searchable documents.
Even though the current search system gets the job done by searching through mountains of documents, our ambitions are bigger. We are constantly improving the search system so that every member query is served the best deals. At the same time, cool new features such as semantic (a.k.a. vector) search and reverse image search are in the works. Initial testing looks promising, and we can't wait to release them to our members.
At the same time, Vinted has set ambitious business goals to rapidly expand in terms of both geography and content verticals. This will stretch our search technology to its limits. However, the search team is confident and excited about those future challenges because of our solid engineering foundations.
With all that said, kudos to the team for constantly shipping mountains of great work! We are certain that this is not the last chapter in the Vinted Search Scaling story. Stay tuned!