<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[OLX Engineering - Medium]]></title>
        <description><![CDATA[OLX is a global leader in facilitating trade. It builds leading marketplace ecosystems enabled by tech, powered by trust, and loved by its customers. - Medium]]></description>
        <link>https://tech.olx.com?source=rss----761b019b483f---4</link>
        <image>
            <url>https://cdn-images-1.medium.com/proxy/1*TGH72Nnw24QL3iV9IOm4VA.png</url>
            <title>OLX Engineering - Medium</title>
            <link>https://tech.olx.com?source=rss----761b019b483f---4</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 14 Apr 2026 12:43:35 GMT</lastBuildDate>
        <atom:link href="https://tech.olx.com/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Empowering Innovation Through Structure]]></title>
            <link>https://tech.olx.com/empowering-innovation-through-structure-93c15614f6d3?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/93c15614f6d3</guid>
            <category><![CDATA[innovation]]></category>
            <category><![CDATA[organization]]></category>
            <category><![CDATA[collaboration]]></category>
            <category><![CDATA[structure]]></category>
            <category><![CDATA[learning]]></category>
            <dc:creator><![CDATA[Sara Mendes]]></dc:creator>
            <pubDate>Wed, 18 Mar 2026 16:01:03 GMT</pubDate>
            <atom:updated>2026-03-18T16:01:03.054Z</atom:updated>
            <content:encoded><![CDATA[<p>Our Work Organization Framework Journey</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*4pmfxOz_m53IItAGrU8_bw.jpeg" /><figcaption>Photo by <a href="https://www.pexels.com/photo/gray-trunk-green-leaf-tree-beside-body-of-water-762679/">Daniel Watson</a> @pexels</figcaption></figure><h3>What Makes Cross-functional Collaboration Actually Work?</h3><p>In today’s cross-functional environments, the most impactful initiatives rarely live within the boundaries of a single team or person. Whether we’re building new product features, improving infrastructure, or tackling technical debt, success depends on seamless collaboration across multiple teams and roles, including engineering, product, design, analytics, and beyond.</p><blockquote>Think about when you’re making a big dinner for friends. You might be the main cook, but you need someone to run to the store for ingredients that you forgot, maybe your roommate to chop vegetables, someone to keep track of the timing so nothing burns, and someone to set the table. When everyone knows their role and does their work at the right time, you end up with a perfect meal.</blockquote><p><strong><em>The magic happens when everyone’s working in sync.</em></strong></p><p>But I get that when someone mentions ‘frameworks’ or ‘structure,’ our eyes might glaze over, thinking about red tape and endless meetings. But here’s the thing: <strong>good structure isn’t about slowing you down, it’s actually what lets a group of people create something amazing together that none of us could pull off on our own</strong>.</p><p>At OLX, we see structure as:</p><ul><li>A clear path from “Hey, I have this crazy idea” to “Holy cow, we actually built it.”</li><li>Everyone knowing their role, so you’re not wondering if that task is yours or why three people are doing the same thing.</li><li>Being able to see what’s happening without chasing down updates or playing detective.</li><li>Communication that actually works, meaning no more “I thought you told Sarah, but Sarah thought you told Mike.”</li><li>Real checkpoints where we can high-five our wins and course-correct when needed.</li></ul><p><strong><em>Any intentional structure is better than no structure at all.</em></strong></p><p>From a human psychology and team dynamics perspective, here’s why structure matters:</p><ul><li><strong>Clarity reduces anxiety</strong>: when people know what we expect of them, they work with confidence and can actually focus on doing great work.</li><li><strong>Visibility builds trust</strong>: when everyone can see what’s happening, trust naturally follows.</li><li><strong>Accountability drives results</strong>: when it’s crystal clear who owns what, things actually get done. 
No more “I thought you were handling that” moments.</li><li><strong>Recognition motivates innovation</strong>: when people see their work making a difference, they start generating their own ideas to make things even better.</li><li><strong>Predictability enables planning</strong>: when your team knows what to expect, you can allocate time and resources as a human, not a psychic.</li></ul><h3>The Cost of Missing Structure</h3><p>Yet without proper structure, cross-functional collaboration often becomes a source of frustration rather than a source of innovation.</p><p>Here’s what happens:</p><ul><li><strong>Great ideas get lost in communication gaps</strong>: Someone has a brilliant idea in a meeting, everyone gets excited, and then it just disappears into the void.</li><li><strong>Priorities misalign between teams</strong>: Your team thinks X is the priority, but the other team is laser-focused on Y, and suddenly you’re working against each other.</li><li><strong>Dependencies become bottlenecks:</strong> You’re stuck waiting for Team A to finish their part, but they’re waiting for Team B, who’s waiting for Team A; it’s a circular waiting game.</li><li><strong>Projects lose momentum when ownership isn’t clear</strong>: Three weeks pass, and nobody’s really sure who’s supposed to be driving this thing forward.</li><li><strong>People lose visibility into the work they’ve contributed</strong>: You put in tons of work, but when the project wraps up, it’s like your contribution never existed: no one sees it, no one mentions it, and you’re left wondering if it even mattered.</li></ul><p>It doesn’t really matter whether you create a process that fits your team’s style or try something completely different. <strong>The real difference comes when everyone knows the ‘rules of the game’: people can focus on being innovative instead of constantly figuring out what they’re supposed to do or who they should talk to.</strong></p><h3>What Made Us Stop and Think</h3><p>By 2024, at OLX Motors, we had a classic case of “too much of everything, not enough of anything.” Everyone was launching initiatives left and right, but most were half-baked from the start, so we decided to rethink how we tackled Product, Data, Marketing, and Tech (PDMT) initiatives.</p><p>We noticed that while individual teams were doing solid work, we weren’t always connecting the dots between projects and teams. Sometimes great ideas start strong but lose steam along the way, or people are brought into projects too late, which hurts prioritization.</p><p>And nobody, not even managers, could tell you what was actually happening across the organization. Someone would ask, “Hey, what are all the things our teams are working on?” and we’d all just kind of scratch our heads and start making lists on the spot.</p><p>Planning for the future was nearly impossible when we couldn’t even track what was on people’s plates right now, let alone determine whether any of it connected to our actual business goals.</p><p>So we thought, <strong>“What if we could create a better way to organize how we work together?”</strong></p><p>That’s how our <strong>Work Organization Framework (WOF)</strong> was born, and it’s really helped us get more out of our collaboration and maintain momentum.</p><figure><img alt="Work Organization Framework logo, a cute white dog coming out of a round gear."
src="https://cdn-images-1.medium.com/max/358/1*upi7i6JsVhyhkl5NLV7Keg.png" /><figcaption>Image from an AI generator</figcaption></figure><h3>Our Challenge</h3><p>Before implementing WOF, our teams were <strong>experiencing the exact “cost of missing structure”</strong> we talked about earlier. We had all those classic symptoms:</p><ul><li>Great ideas were getting lost in the shuffle, with no central place to track progress</li><li>Many initiatives started with unclear requirements and priorities</li><li>Lack of sustained support forced teams to abandon projects</li><li>We missed opportunities to align initiatives with our strategic goals and OKRs</li><li>Managers struggled to see the full scope of their team members’ impact beyond day-to-day teamwork</li><li>Teams couldn’t effectively plan future work, considering their actual capacity constraints</li></ul><h3>Meet our Work Organization Framework</h3><p>The WOF was designed as a <strong>structured</strong>, <strong>centralized</strong> approach to managing PDMT initiatives across our Motors organization. At its core, the framework addresses three fundamental needs:</p><ol><li><strong>Visibility</strong> — Making all initiatives transparent and trackable, while giving us a clear view of each team’s actual capacity and current commitments. This enables us to make smarter prioritization decisions rather than piling more work onto already stretched teams.</li><li><strong>Alignment</strong> — Make sure we’re all rowing in the same direction to support our strategic goals.</li><li><strong>Accountability</strong> — Know who’s responsible for what.</li></ol><p>Taking this into account, our WOF starts by operating through two key roles:</p><ul><li><strong>Champions — Think of them as connectors: </strong>people from different teams/roles who become the go-to person for specific areas (managers, leads, staff). For example, if you’re working on something related to the Development Experience area, there are Champions in that space who can tell you what’s already happening, help you avoid duplicating work, and keep everyone connected.</li><li><strong>Drivers — These are the people who actually make things happen: </strong>If Champions are the connectors, Drivers are the ones pushing initiatives forward day to day. They’re the ones at the operational level, also finding collaborators who can help them achieve it. They coordinate cross-functional work and report status.</li></ul><p>We split our Champions and Drivers across <strong>five areas</strong>, so people can <strong>focus</strong> and <strong>build</strong> depth in their <strong>specific domain</strong> rather than juggling everything.</p><p>These areas represent the key organizational functions that cover our main activities and initiatives. 
Each area addresses an important aspect of how we operate and achieve our strategic objectives.</p><ul><li><strong>Governance</strong>: Emphasizes the strategic direction and oversight of Product, Data, Marketing, and Tech (PDMT) efforts, aligning them with business goals.</li><li><strong>People Management &amp; Culture: </strong>Focuses on the human aspect of PDMT teams, emphasizing growth, engagement, and a positive work culture.</li><li><strong>User Experience &amp; Quality</strong>: Focuses on maintaining high product standards across all user-facing aspects, including user experience, accessibility, performance, testing practices, and regulatory compliance, ensuring our products deliver value while meeting quality expectations.</li><li><strong>Operational Excellence</strong>: Focuses on the performance and reliability of software in production environments.</li><li><strong>Development Experience: </strong>Focuses on improving the day-to-day experience of our developers to make everyone’s life easier.</li></ul><h4>How Things Actually Work: From Idea to Done</h4><p>Here’s what happens when someone has an idea or gets assigned to figure something out:</p><ol><li><strong>Someone from any role spots an opportunity </strong>— The Driver identifies an opportunity or receives an assignment.</li><li><strong>Their manager gets involved</strong> — No rogue projects here. The manager looks at it and says, “Okay, this seems legit. Let me connect you with the Champion.”</li><li><strong>The Champion does a sanity check</strong> — Strategic alignment and priority confirmation: “Does this actually help us hit our goals? Are we already working on something similar? Is this the right time?”</li><li><strong>Time for the Driver to dig deeper </strong>— fill in the documentation, check who needs to be involved, and define what success looks like. This step is where we stop and think before we start building.</li><li><strong>Plan it out </strong>— Create JIRA tickets, identify dependencies, and map out who’s doing what, the timeline, and tracking mechanisms. How long will this take? Who do we need? When can we get this done?</li><li><strong>Build the thing</strong> — People start actually working on it. The Driver tracks progress and unblocks people when they get stuck.</li><li><strong>Did it work?</strong> — We check if we actually achieved what we set out to do, not just “shipped something”; we do an impact assessment.</li></ol><h4>What We’ve Achieved</h4><p><strong>The framework has made a real difference across our organization</strong>, though we’re still learning and improving as we go.
The impact has been noticeable:</p><figure><img alt="Infographic with three columns describing benefits of an initiative management system: for managers — cross-functional project visibility, understanding how work connects to larger goals, improved capacity planning, and clearer performance recognition; for individual contributors — a structured proposal path, clearer collaboration across teams, visible leadership opportunities, and better project outcomes; for the organization — a complete view of initiatives, improved resource allocation, strat" src="https://cdn-images-1.medium.com/max/704/1*SjR8_iyh6F1VxWmHehFyqQ.png" /><figcaption>Illustrations from<a href="https://unsplash.com/pt-br/@denustudios?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText"> Denustudios</a> and <a href="https://unsplash.com/pt-br/@vectorelements?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">VectorElements</a> in<a href="https://unsplash.com/pt-br/ilustra%C3%A7%C3%B5es/mulher-com-cabelo-roxo-trabalhando-no-laptop-pela-janela-mBLxBNLfI20?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText"> Unsplash</a></figcaption></figure><h3>Real WOF in Action: The Accessibility Monitoring Story</h3><p>Let me show you how this actually works with a real example. Last year, we needed to tackle accessibility compliance across all our platforms. Not exactly the most exciting topic, but critical for meeting legal requirements and serving all our users.</p><h4>The Challenge</h4><p>Every team was handling accessibility differently. We had no way to catch issues before they hit production, and with 10+ teams building user interfaces, we were constantly fixing them after they’d already reached users.</p><h4>How WOF Made It Happen</h4><ol><li>The Drivers (Gustavo di Mambro and Victoria Sordo, from two different teams) identified the opportunity and worked with their managers to get it prioritized.</li><li>The Champion helped align this with the strategic goals of the User Experience &amp; Quality area.</li><li>Cross-team collaboration happened naturally. Instead of trying to convince 10+ teams one at a time, we had a clear process for coordinating the rollout.</li></ol><h4>What They Built</h4><p>An automated system that checks every page for accessibility issues, integrated right into our testing pipeline. Now, when developers push code, they get instant feedback if they’ve introduced new accessibility problems. No more discovering issues weeks later.</p><h4>The Results</h4><p>10 teams adopted the tooling, and we saw immediate improvements. For example, our Listing Page went from 120 accessibility violations on desktop to just 43, and from 104 to 30 on mobile.</p><h4>How WOF Changed the Game</h4><p>Before the framework, this would have been a six-month struggle to get buy-in from different teams, with unclear ownership, and it probably would have died when priorities shifted. With WOF, it took one quarter to build, pilot, and roll out across the organization.</p><p>That’s the impact we’re talking about, technical work that actually moves the business forward, with clear ownership and measurable results.</p><p><strong>As of December 2025, we’ve completed 29 initiatives across our areas, have 16 in progress, and 4 in the planning stages. 
In total, 37 people have been involved in these initiatives.</strong></p><h3>How We Keep Getting Better — Learning Through Retrospectives</h3><p>Despite the framework’s significant impact on initiatives outside the teams, we decided to run a retrospective with Champions and Drivers to identify opportunities for improvement.</p><p>We conducted the session with 5 Champions and 5 Drivers using a two-step approach:</p><ul><li>First, we identified what was working well and should continue, and areas needing improvement.</li><li>Then, we defined specific improvement actions and assigned ownership for each.</li></ul><h4>What We Discovered</h4><p>During our retrospective, the Champions and Drivers painted a clear picture of our framework’s performance. Here’s what emerged:</p><p><strong>The Wins: Five Game-Changing Improvements</strong></p><p>Our Champions and Drivers were clear about what’s working:</p><figure><img alt="Infographic with five green cards highlighting outcomes of an initiative management approach: Template Magic — easy initiative templates remove the need to start from scratch; Operational Freedom — operational work extracted from individual packs so teams focus on core work; See Everything, Miss Nothing — cross-pack visibility of engineers and initiatives; Breaking Down Barriers — greater cross-team collaboration; The “Why” Factor — strategic alignment helps teams understand purpose and increase" src="https://cdn-images-1.medium.com/max/677/1*A-WXv4olsXlvZXSFvG7CIw.png" /></figure><p><strong>The Growth Areas: Six Opportunities to Strengthen Our Framework</strong></p><p>While celebrating our successes, our Champions and Drivers also identified exciting opportunities to make our framework even stronger:</p><figure><img alt="Infographic with six red cards describing organizational challenges: Connecting the Dots — people switch between tools to piece together project status; Getting Crystal Clear on Roles — clarifying ownership between Champions and Drivers; Bridging the Process Gaps — standardizing execution and defining clearer WOF guidelines; Making OKRs and Initiatives Best Friends — aligning initiatives with OKRs to avoid competing priorities; Predicting Capacity — improving forecasting to reduce last-minute wo" src="https://cdn-images-1.medium.com/max/683/1*CqNMdEXZccjCEWqd4MqR4w.png" /></figure><h3>Our Actions</h3><p><strong>From Feedback to Action: Our Five-Point Plan</strong></p><figure><img alt="Infographic with five action areas: create a public channel for anyone to propose initiatives; assign owners for each WOF area to improve strategy and accountability; clarify what counts as a WOF initiative; make Confluence the single source of truth for initiative status; and standardize communication with shared templates, consistent updates, and alignment with OKR sessions." src="https://cdn-images-1.medium.com/max/681/1*vxmfdF8OMPNXnZcuI-w_eQ.png" /></figure><p>We’re already tackling these actions to strengthen our framework, ensuring teams can confidently work on cross-functional initiatives that benefit everyone.</p><h3>Building Your Own Framework: Key Takeaways</h3><p>Here’s what we’ve learned from our WOF journey:</p><p><strong>-&gt; Measure what matters: </strong>Track both output (initiatives completed) and outcomes (strategic impact).<br>Our 29 completed initiatives drove measurable improvements across all areas. 
<strong>Here are some examples of the different areas:</strong></p><ul><li><strong>Developer Experience</strong> initiatives cut pipeline times by up to 44% (saving teams 94 hours per week) and replaced many of the different notification commands across 60+ services with a unified Notification Center that automatically sends consistent deployment updates to Slack, New Relic, OpsLevel, and Sentry.</li><li><strong>Operational Excellence</strong> initiatives migrated 45+ components to AWS Secrets Manager, achieved full EU backup compliance for all Tier 1 and 2 databases, and standardized user identification across 71 services, with 73% now using modern UUIDs instead of legacy user IDs, simplifying future integrations.</li><li><strong>Governance</strong> <strong>initiatives </strong>eliminated manual tech debt tracking by using structured Jira workflows,</li><li><strong>User Experience &amp; Quality</strong> initiatives successfully rolled out Tailwind v4 as our styling standard (with 10 engineers trained) and deployed automated accessibility testing across 10 packs, reducing violations dramatically-our main listing page dropped from 120 to 43 violations.</li></ul><p><strong>-&gt; Culture eats process for breakfast</strong>: The framework only works when people buy into the “why.” Invest time in explaining the value, not just the steps.</p><p><strong>-&gt; Iterate relentlessly</strong>: Our retrospective revealed 6 areas for improvement, even after success. Good frameworks evolve. We’ll definitely do another retrospective down the road.</p><h3>Making It All Work: Clear Roles and Responsibilities</h3><p>One thing we learned early on is that good intentions aren’t enough. You need everyone to know exactly who does what, and that came up loud and clear in our retrospective; that’s why we have built this <a href="https://en.wikipedia.org/wiki/Responsibility_assignment_matrix"><strong>RACI Matrix</strong></a> to spell out all the responsibilities, so there are no doubts. Here’s exactly who does what in our WOF process:</p><h4>Who’s Responsible for What</h4><figure><img alt="RACI-style table describing responsibilities in the process. For proposing initiatives, Owners and Champions are consulted, and the Driver/Working Group is responsible and accountable. Refinement and prioritization involve Owners consulted, Champions responsible, and the Driver responsible, with some consultation from teams and leadership. Progress reports and leadership updates are primarily owned and accountable by Owners, with Champions and Drivers responsible. PDMT is mostly informed." src="https://cdn-images-1.medium.com/max/708/1*r0fBYqN3L-JKNgdUVZa7MQ.png" /><figcaption><strong>Responsible</strong> (does the work), <strong>Accountable</strong> (final approver), <strong>Consulted</strong> (provides input), and <strong>Informed</strong> (updated on progress)</figcaption></figure><p><strong>What This Actually Means</strong></p><p><strong>-&gt; Owners </strong>set the strategic direction for their area and ensure initiatives align with broader goals. They move work forward, provide monthly updates, coordinate with other areas when work overlaps, and help teams when they get stuck. We introduced this role specifically to speed up decision-making, since Owners are part of the leadership team, teams no longer have to wait for multiple escalations to get strategic guidance or approvals.</p><p><strong>-&gt; Champions</strong> are still the connectors; they help prioritize work and keep initiatives moving in the right direction. 
They’re also our subject-matter experts who monitor the health of their areas and identify investment opportunities. Their recommendations directly shape Motors’ objectives.</p><p><strong>-&gt; Drivers</strong> are the ones actually getting things done, proposing, refining, and executing initiatives day by day.</p><p><strong>-&gt; Impacted Teams</strong> get consulted when their work might be affected, so there are no surprises.</p><p><strong>-&gt; Leadership (HoE/P, DoE/P, PDMT)</strong> gets the strategic view without drowning in operational details.</p><p>Everyone knows exactly what’s expected of them, and you’ve got clear escalation paths when coordination gets tricky.</p><h3>Final Thoughts</h3><p>Cross-functional collaboration doesn’t have to be chaotic. With intentional structure, clear roles, and continuous improvement, your individual contributors can build something that actually helps people get things done together without losing their minds.</p><p>It’s not rocket science, just <strong>clear expectations</strong>, <strong>defined roles</strong>, and the <strong>willingness to keep tweaking things</strong> when they don’t work. The framework itself isn’t magic. What matters is giving smart people the clarity they need so they can focus on building great things instead of figuring out who’s supposed to do what.</p><p>The WOF completely changed how we tackle big technical projects at Motors. We created a structure that helps instead of hurts, and visibility that doesn’t feel like micromanagement.</p><p>But honestly? The biggest win was solving that frustrating problem every manager knows: when your individual contributors do incredible cross-functional work that somehow becomes invisible. That problem’s gone. When people step up to tackle strategic projects, everyone can see exactly what they contributed.</p><p>We’re definitely still learning. Our retrospective gave us a whole list of things to improve. But the proof is in the results: <strong>people actually want to use this thing</strong>, <strong>our projects don’t die halfway through anymore</strong>, <strong>managers finally know what their teams are working on</strong>, and <strong>teams work intentionally on well-prioritized items</strong>.</p><p>The journey of innovation never ends, but with the proper framework in place, every step forward is visible, valuable, and celebrated.</p><p>Thank you for reading!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=93c15614f6d3" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/empowering-innovation-through-structure-93c15614f6d3">Empowering Innovation Through Structure</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[OLX Masterclass Frontend 2026]]></title>
            <link>https://tech.olx.com/olx-masterclass-frontend-2026-068299a2df50?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/068299a2df50</guid>
            <category><![CDATA[frontend-development]]></category>
            <category><![CDATA[olx]]></category>
            <category><![CDATA[mcps]]></category>
            <category><![CDATA[css]]></category>
            <category><![CDATA[masterclass]]></category>
            <dc:creator><![CDATA[Isaac Queiroz]]></dc:creator>
            <pubDate>Wed, 25 Feb 2026 16:16:00 GMT</pubDate>
            <atom:updated>2026-02-26T10:14:55.355Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*dEMja-75Rlree85OqrxYUQ.jpeg" /></figure><p>The always-evolving challenges of OLX create a rich environment for innovation, experimentation, and knowledge sharing. In our 2026 Frontend Masterclass, we have three sessions that will shed light on our current development flow.</p><p>Join us on March 3rd, 2026, from 6:00 pm to 9:00 pm at our OLX Offices in Poznań!</p><p>Do you want to join us and know what’s trending in the latest frontend technology? <a href="https://app.evenea.pl/event/masterclassfrontend/"><strong>Check the event information and join here!</strong></a></p><p>Can’t join it in person? <a href="https://docs.google.com/forms/d/e/1FAIpQLSeAFoynaDeryxixlrYik4RJScBwKjh9ybqe5-GL6xrxrfXPIQ/viewform"><strong>Sign in to this form to receive the online recording of the session!</strong></a></p><h3>Our Agenda</h3><p>With a broad schedule, our speakers will deliver a mix of technical presentations and demonstrations on multiple subjects.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-ugonYoR7sfmgqixHbGV4A.png" /></figure><h3><strong>MCP Servers in Action: From Figma Design to React Component in Minutes</strong></h3><p><em>by </em><a href="https://www.linkedin.com/in/karol-grabowski"><em>Karol Grabowski</em></a></p><p>Part of what changed in our day-to-day flow is how AI is always available to help us. AI assistants and agents are powerful out-of-the-box tools; we can augment our reach by configuring Model Context Protocols (MCPs) to take our setup a step further.</p><p>Join Karol to explore how the Model Context Protocol turns AI assistants into powerful development partners with direct access to your tools. This talk covers the MCP ecosystem, real OLX integrations, and includes a live coding session demonstrating the Figma-to-code workflow.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ry1WaQRp0x2MaRKwoFUPrw.png" /></figure><h3><strong>CSS is no longer “this simple” — it got _layers_, quite literally</strong></h3><p><em>by </em><a href="https://medium.com/@ComandeerPL"><em>Tomasz “Comandeer” Jakut</em></a></p><p>At the same time that we advance with AI and multiple experiments to learn from our users and make our flows as pleasant as possible, we are still styling our components with the usual suspects, right? Not always!</p><p>Join Tomasz Jakut and deep dive into CSS’s new capabilities that can redefine how we approach styling websites: from @Layer and nesting to container queries, scoped styles, and user preferences. New native features can refresh well-known techniques and make others obsolete.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SlpdkhaEl_g6t7nUo5XaUg.png" /></figure><h3><strong>One-product, three brands — Our strategy for serving different countries from a single repo</strong></h3><p><em>by Lilian Galezewska</em></p><p>We all know how OLX can scale around Europe, but have you ever questioned what is behind the websites that serve our millions of users? 
In this talk, Lilian will help us understand how a single repository can be responsible for so much!</p><p>Discover how we efficiently manage multiple websites from a single repository, balancing localisation, configuration, and design to cater to the unique needs of our international brands.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=068299a2df50" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/olx-masterclass-frontend-2026-068299a2df50">OLX Masterclass Frontend 2026</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Importance of offline evaluation to guide model choice]]></title>
            <link>https://tech.olx.com/importance-of-offline-evaluation-to-guide-model-choice-1c2be1c4599a?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/1c2be1c4599a</guid>
            <category><![CDATA[search]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[embedding]]></category>
            <category><![CDATA[recommendations]]></category>
            <category><![CDATA[open-source]]></category>
            <dc:creator><![CDATA[Tiago Cabo]]></dc:creator>
            <pubDate>Tue, 07 Oct 2025 15:11:01 GMT</pubDate>
            <atom:updated>2025-10-07T16:09:27.827Z</atom:updated>
            <content:encoded><![CDATA[<h3><em>The Importance of Offline Evaluation to Guide Model Choice</em></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/952/1*asAQeA4DzSH6LB6JZXy8rw.png" /></figure><h3>Introduction</h3><p>Recent advancements in open-source AI models make it challenging to justify the development of custom models, given the high quality of existing options. This also applies to embedding models, which are available in impressive quality.</p><p>At OLX, we utilize a model called Item2vec to generate similar item recommendations. For more details, please refer to our blog post.</p><p><a href="https://tech.olx.com/item2vec-neural-item-embeddings-to-enhance-recommendations-1fd948a6f293">Item2Vec: Neural Item Embeddings to enhance recommendations</a></p><p>In this work, we developed an embedding model that not only improved recommendations but was also used by other teams across OLX, such as the search team. For an in-depth look at this application, see our post on Hybrid Search, where we tried to balance the benefits of semantic and lexical search.</p><p><a href="https://tech.olx.com/hybrid-search-where-keywords-meet-vectors-enabling-classifieds-discovery-b7c383fe4fc4">Hybrid Search — Where Keywords Meet Vectors, Enabling Classifieds Discovery</a></p><p>This article discusses our evaluation of open-source embedding models compared to our existing internal model.</p><h3>Open Source Embeddings</h3><p>An embedding model converts inputs, such as text or images, into vectors. These vectors are then used to compute similarities, often cosine distance, where closer vectors are deemed semantically related.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/720/0*PlZLAXKW_JAfjdy5" /><figcaption><a href="https://medium.com/@bhavanishankarsunkara/exploring-similarities-cosine-sine-and-tangent-6056fa6a2c61">Cosine similarity</a></figcaption></figure><p>Many architectures types for embedding models support this, including word2vec, GloVe, RNN, LSTM, and transformers. This leads to hundreds of publicly available models, making selection a challenge. Common benchmarks play a crucial role in this process. After exploring various options, we chose the <a href="https://huggingface.co/blog/mteb">MTEB: Massive Text Embedding Benchmark</a>.</p><p>We selected this benchmark due to its widespread community adoption and extensive features, such as:</p><ul><li>56 datasets across 8 tasks</li><li>Support for up to 112 different languages</li><li>Easy extensibility via the <a href="https://github.com/embeddings-benchmark/mteb">repo</a></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/841/0*oaZcqzrCvjnW3LmA" /><figcaption><a href="https://huggingface.co/blog/mteb">MTEB Huggingface</a></figcaption></figure><h3>Benchmark MTEB</h3><p>Hugging Face provides a comprehensive leaderboard showcasing top-performing models across tasks (retrieval, classification, etc), reflecting advancements in machine learning and AI. The higher, the better. This resource offers insights into effective models for academia and industry. 
<p>The current dashboard features over 200 models, but we cannot simply choose the top one due to memory and latency constraints for internal deployment.</p><p>We also face limitations regarding modality and language. We filtered for text-only models with multilingual capabilities, which is crucial since our users speak Portuguese, Polish, Russian, Bulgarian, Ukrainian, and Kazakh. We aim to use a single model that supports all these languages.</p><p>Below are the models considered after applying these constraints.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eNLxcyYQIiAOmMZ4LAgcYQ.png" /><figcaption>Scores for considered models</figcaption></figure><p>*The models were chosen a couple of months ago, so the current score may differ.</p><h3>Offline evaluation</h3><p>Having chosen the models for our experiments, we had two options:</p><ul><li>Pick one for an A/B test to see which performs better in production.</li><li>Validate which model performed better against our data.</li></ul><p>We chose the second option and created an offline dataset for comparison. Our goal is to find similar items. This use case typically appears on the item page.</p><p>To do this, we made the following assumptions:</p><ul><li>Consecutive interactions between items indicate similarity.</li><li>Clicks are less relevant than favorites or chat replies.</li><li>More recent interactions hold greater value than past ones.</li></ul><p>Based on these assumptions, we built the following dataset.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GaWQWDP91D9762HtXCjpEA.png" /><figcaption>Dataset size by country</figcaption></figure><p>Example of pairs:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/777/1*DdfcXkPD6ygDdpzXdPIeGw.png" /><figcaption>Data example</figcaption></figure><p>Score refers to the business rule that combines multiple interactions for the same pair. In our case, this was a weighted sum, decided with business interest in mind. A higher score indicates a stronger signal. We created versions of this dataset for each language.</p><p>The results aligned with our expectations, as we noted that improvements in offline recall corresponded with the MTEB score.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/776/1*hWbrXT-lj_q6RmVSCLtuiA.png" /><figcaption>Benchmark results with internal data</figcaption></figure><p>For the rest of the countries, <strong>multilingual-e5-small </strong>was<strong> the best. </strong>Below are the results:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/775/1*ByuijF_5EW_K_V2wb6HqOg.png" /><figcaption>Benchmark results by country</figcaption></figure><p>The <strong>multilingual-e5-small</strong> model performed best across all languages, making it our choice for testing.</p><p>Note: <strong>Titles and descriptions</strong> were used as encoder input.</p><h3>Fine-tuning</h3><p>The main advantage of this setup is that building the dataset according to the BEIR scheme allows us to easily fine-tune our embedding model. Here is an outline of our approach, using the <em>MultipleNegativesRankingLoss</em> loss function.</p>
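<p>A minimal sketch of that fine-tuning setup with Sentence Transformers v3 is shown below. The training pairs and column names are illustrative placeholders, not our production schema; note that E5 models expect “query:” and “passage:” prefixes on their inputs:</p><pre>from datasets import Dataset
from sentence_transformers import (SentenceTransformer,
                                   SentenceTransformerTrainer, losses)

model = SentenceTransformer("intfloat/multilingual-e5-small")

# (anchor, positive) pairs derived from consecutive user interactions;
# within a batch, the other positives act as in-batch negatives
train_dataset = Dataset.from_dict({
    "anchor": ["query: iphone 12 mini 128gb"],
    "positive": ["passage: Apple iPhone 12 mini, 128 GB, black, like new"],
})

loss = losses.MultipleNegativesRankingLoss(model)
trainer = SentenceTransformerTrainer(
    model=model, train_dataset=train_dataset, loss=loss)
trainer.train()</pre>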
<p>We used the Sentence Transformers library to fine-tune the model. <a href="https://huggingface.co/blog/train-sentence-transformers">Training and Fine-tuning Embedding Models with Sentence Transformers v3</a></p><p>Below are the offline results after fine-tuning, using recall@5 as the metric.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/894/1*NVEvykUNfpwyfp5q5vIA6w.png" /><figcaption>Results after Fine-tuning</figcaption></figure><p>As expected, low-resource languages like Uzbek (uz) and Kazakh (kz) benefit the most from a fine-tuning stage.</p><p>We found that after 40 epochs, additional training did not yield significant improvements. In the next steps, we plan to increase the amount of training data.</p><h3>A/B test consideration</h3><p>Despite improvements in offline evaluations, we initially faced challenges. Similarity captures how closely two items align based on embeddings, but relevance takes into account user context and intent, which pure embeddings may miss.</p><p>Although the ads were similar, they were less relevant to each user compared to the existing model. This resulted from several features present in the base model but missing from the open-source version, including:</p><ul><li>Location</li><li>Price</li></ul><p>To address these problems, we added a rerank layer after the retriever. Our first attempt consisted of a weighted average of the normalized location distance and the similarity score; the weighting parameter was chosen using grid search.</p><h3>Key takeaways</h3><ul><li>The use of open-source embeddings represents a significant shift in retrieval methods. They provide teams with greater flexibility, allowing for easy model swaps.</li><li>Offline evaluation is crucial and informs model choice, but real-world user relevance requires additional business logic and online testing.</li><li>Embedding models create a shared foundation for diverse retrieval and ranking needs across teams.</li></ul><h3>Acknowledgments</h3><p>A special thanks to Mahnaz Namazizavareh and Inês Soveral for their effort in reviewing this article, and to the Recommendations, Search, and GenAI teams for all their hard work and support.</p><p>Thank you for reading! If you enjoyed this article and want to know more about OLX and our work, please visit our <a href="https://www.olxgroup.com/">website</a> or feel free to reach out to me on <a href="https://www.linkedin.com/in/tiagocabo">LinkedIn</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1c2be1c4599a" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/importance-of-offline-evaluation-to-guide-model-choice-1c2be1c4599a">Importance of offline evaluation to guide model choice</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Hybrid Search — Where Keywords Meet Vectors, Enabling Classifieds Discovery]]></title>
            <link>https://tech.olx.com/hybrid-search-where-keywords-meet-vectors-enabling-classifieds-discovery-b7c383fe4fc4?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/b7c383fe4fc4</guid>
            <category><![CDATA[hybrid-search]]></category>
            <category><![CDATA[vector-search]]></category>
            <category><![CDATA[classifieds]]></category>
            <category><![CDATA[search-engines]]></category>
            <dc:creator><![CDATA[Inês Soveral]]></dc:creator>
            <pubDate>Tue, 09 Sep 2025 15:01:44 GMT</pubDate>
            <atom:updated>2025-09-09T15:01:43.914Z</atom:updated>
            <content:encoded><![CDATA[<h3>Hybrid Search — Where Keywords Meet Vectors, Enabling Classifieds Discovery</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Rw6cGyT6zDtEvLcjD01q5g.png" /></figure><p>In the midst of a fast-paced technological revolution, where user expectations grow increasingly sophisticated, the quality of search in the classifieds space has never been more critical. In fact, users expect the search box to understand what they are looking for and produce relevant search results; if this is not the case, they will easily move on to any competitor who provides this experience.</p><p>At OLX, our search system traditionally relied on keyword matching between user queries and ad titles and descriptions — a straightforward but rigid approach. While functional, it often led to low recall or even zero results pages (ZRPs). To mitigate this, we gradually developed an extension chain logic — which will be explained in detail in later sections — to address specific edge cases and expand recall. Over time, however, this logic became increasingly complex and difficult to improve upon, calling for a disruptive solution to further evolve our search system.</p><p>In this article, we share why and how we transitioned to Hybrid Search as a retrieval strategy for our double-sided marketplace. We walk through the key changes required to support semantic search, highlight the challenges we faced, and detail the solutions put in place to overcome them. Finally, we reflect on the tangible improvements and practical benefits observed after this shift.</p><h3>What is Hybrid Search and what value does it bring?</h3><p>Despite its lack of flexibility, keyword matching remains highly effective in e-commerce. It returns results that exactly match the user’s query terms, offering clear traceability and helping users understand why specific ads appear. This is particularly useful when users know precisely what they’re looking for, and when combined with structured filters, a feature of the OLX marketplace.</p><p>However, keyword matching falls short when queries are vague, misspelled, or phrased differently from how sellers describe their listings. In such cases, vector search shines by capturing the underlying intent of a query.</p><p>Semantic search is inherently robust to synonyms, spelling errors, and phrasing variations, even being natural language friendly. These characteristics make this approach especially valuable in large or unstructured catalogs, such as OLX, where ad content is user-generated and often inconsistent.</p><p>Finally, vector search is known to improve recall and significantly reduce ZRPs, even eliminating them when no minimum similarity threshold is enforced between query and ad embeddings.</p><p>Hybrid Search combines the precision of keyword matching with the semantic depth of vector search in an ensemble-like approach, resulting in a more robust and comprehensive recall set. By leveraging the strengths of both methods, it anchors retrieval in exact matches while expanding on meaning through semantic understanding — effectively bringing out the best of both worlds.</p><p>In the context of e-commerce, Hybrid Search is the industry standard, powering the search systems of major platforms, including <em>Amazon</em>, <em>Shopify</em>, <em>Walmart</em>, <em>Airbnb,</em> and <em>Etsy</em>. 
At OLX, as a classifieds platform, our goal in adopting this approach is to boost recall by surfacing relevant ads that would previously have been missed and to decrease ZRPs, increasing the share of users who engage with ad listings.</p><h3>Building embeddings</h3><p>Our existing search architecture already included keyword matching, implemented in SOLR using BM25 scoring. To move toward Hybrid Search, we focused on adding a semantic retrieval component by generating embeddings.</p><p>We evaluated multiple options for embedding models and narrowed our focus to two multilingual candidates: mCLIP and multilingual-e5-small. Multilingual support was a non-negotiable requirement, as users often search in multiple languages within each market, and the possibility of using a single model across markets would also reduce operational complexity and maintenance overhead.</p><p>We initially considered mCLIP, a multilingual extension of OpenAI’s CLIP, a vision–language model with two main components, an image encoder and a text encoder, that learns to connect images and text in a shared embedding space. Its main innovation is that it can understand natural language prompts and relate them directly to images without requiring task-specific fine-tuning. This was a logical choice, as our original motivation was to explore the possibility of including image data in the ad representation. While that idea was quickly dropped after an unsuccessful experiment, we continued evaluating mCLIP’s text encoder on its own, using it to embed ad titles. The decision to use only titles and not ad descriptions comes from this encoder’s fixed maximum context length of 77 tokens.</p><p>Although mCLIP is known to perform reasonably well for search, its main strength lies in image–text retrieval. This led us to consider multilingual-e5-small as a more targeted alternative, as this model had also been included in an analysis of pre-trained embedding models done by the recommendations team for their use case, where it delivered the best results.</p><p>Unlike mCLIP, E5 models are specifically optimized for semantic search and retrieval tasks. Despite a smaller transformer architecture, which may result in slightly lower accuracy than larger models, the E5-small variant offers lightweight inference, making it well suited for production systems. It also supports inputs of up to 512 tokens, enabling us to embed both ad titles and descriptions.</p><p>For our initial iterations, the goal was to demonstrate the value of vector search within a hybrid architecture. To accelerate development, we used both models out of the box, with no additional fine-tuning on OLX-specific or domain data.</p>
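<p>To give a flavor of the semantic side, here is a minimal sketch of embedding a query and two ads with this model and scoring them by cosine similarity. The texts are made-up examples, and the E5 family expects “query:”/“passage:” prefixes on its inputs:</p><pre>from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-small")

query_vec = model.encode("query: sofa rozkładana", normalize_embeddings=True)
ad_vecs = model.encode(
    ["passage: Narożnik rozkładany z funkcją spania, stan bardzo dobry",
     "passage: Fotel biurowy, czarny, regulowany"],
    normalize_embeddings=True,
)

# with normalized vectors, the dot product equals cosine similarity
print(util.dot_score(query_vec, ad_vecs))</pre>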
<h3>Combining retrieval sets</h3><p>Once both keyword matching and vector search were in place, the next challenge was to effectively combine their results into a unified retrieval set. Traditionally, keyword search retrieved the top 1,000 ads based on their BM25 score, which were then passed to be re-ranked, as explained in the following section.</p><p>For Hybrid Search, we initially considered applying a similarity score threshold to vector search results, selecting only those above a certain value. The idea was to correlate similarity scores with ad relevance for each query. However, offline analysis revealed that similarity scores did not consistently separate relevant from non-relevant ads — making it infeasible to define a meaningful global threshold.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ex-acBz4awab8-xRArkfdA.png" /><figcaption>Dispersion diagram plotting vector similarity scores vs relevance</figcaption></figure><p>Given this scenario, we opted for a fixed-size allocation approach. A set number of results would be retrieved via vector search, with the remainder filled by keyword matches, ensuring the total retrieval set remained capped at 1,000 ads. We began with 35 results from semantic search, as suggested by an offline analysis balancing precision and recall, and later increased this number based on insights from online experiments.</p>
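<p>In pseudocode terms, the allocation is a simple quota-based merge. The sketch below is illustrative only; the function and the ad objects are hypothetical, not our production code:</p><pre># fixed-size allocation: reserve a quota for vector results and fill
# the remainder with keyword matches, deduplicating by ad id
def hybrid_recall(keyword_hits, vector_hits, total_cap=1000, vector_quota=35):
    merged = list(vector_hits[:vector_quota])
    seen = {ad.id for ad in merged}
    for ad in keyword_hits:
        if len(merged) &gt;= total_cap:
            break
        if ad.id not in seen:
            merged.append(ad)
            seen.add(ad.id)
    return merged</pre>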
<h3>From Two Sources to One Ranking</h3><p>As described earlier, the combined retrieval set, from both keyword and vector search, is passed to a Learning to Rank (LTR) model for final scoring.</p><p>Our LTR model is based on XGBoost and uses a range of features derived from historical user behavior, ad characteristics, and keyword-based BM25 scores. We currently operate two iterations of this model:</p><ul><li><strong>Vanilla LTR</strong>: The initial version, trained using seller contact events, which we call successful events, as the sole target signal, meaning only interactions that led to contact are considered indicators of relevance.</li><li><strong>VAS LTR</strong>: A later iteration that balances two objectives: maximizing relevance for users and ensuring visibility for Value Added Services (VAS), ads that have paid for additional exposure.</li></ul><p>For more detailed information on the LTR models, please refer to <a href="https://medium.com/olx-engineering/from-ranknet-to-lambdamart-leveraging-xgboost-for-enhanced-ranking-models-cf21f33350fb">this</a> article.</p><p>Once vector search was introduced, we ran a detailed analysis comparing the LTR feature distributions between keyword and semantic results. We found that the existing model, trained only on keyword-based retrieval, consistently assigned lower relevance scores to vector search results, even when those results were relevant. This was because the model was blind to semantic signals, burying ads with high similarity scores when there was no keyword match. This was particularly problematic for VAS ads retrieved via semantic search, which suffered from poor ranking positions.</p><p>To address this, we decided to retrain the LTR model, including the vector similarity score between query-ad pairs as a feature.</p><h3>Initial Experimentation and Results</h3><p>For all initial experiments, a smaller market was used as a low-risk way to learn and prove the value of Hybrid Search before moving on to bigger markets.</p><h4>Vector Search in Extensions</h4><p>Our first experiment with vector search was conducted in the search extensions space: a chain of fallback retrieval steps designed to address edge cases and mitigate ZRPs, as outlined in the introduction. Over time, this logic had grown increasingly complex, making further improvements difficult without a disruptive alternative.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xM987b-EAl8diljVeZPSvw.png" /><figcaption>Illustration of the original extension chain logic</figcaption></figure><p>This is where good meets great. By introducing vector search within the extension flow, independent from keyword matching and the LTR model, we could isolate and measure its standalone impact. This approach also allowed us to test its effectiveness without the overhead of full integration, offering a low-risk path to prove value.</p><p>The experiment involved adding a semantic search step to the existing extension chain, right after the expansion to nearby locations. This new chain link used mCLIP to embed both query and ad title, retrieving results based on vector similarity alone.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mYQJrMn8XbEssxy694VbGA.png" /><figcaption>Illustration of the extension chain logic with the added semantic step</figcaption></figure><p><strong>Results were overwhelmingly positive</strong>, with a more <strong>than 100% increase in successful</strong> (seller contact) <strong>events per user</strong>. The overall conclusion was that users in the experiment viewed more relevant ads, as the increase in interactions attests.</p><p>This confirmed our hypothesis: <strong>vector search can meaningfully enhance recall and relevance, particularly in scenarios where keyword matching fails</strong>. Below is an example of a query that benefited from this semantic component.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KPop_pdotOge2D7PY3UIYA.png" /></figure><h4>Hybrid Search Enablement</h4><p>Having proven the value of semantic search in the extensions layer, we moved toward implementing Hybrid Search in the main flow. This meant integrating vector search with the keyword-based retrieval set and re-ranking the combined results using our Learning to Rank (LTR) model. At the time, the model in production in this market was the Vanilla LTR, trained exclusively on keyword search data. As discussed previously, this model needed retraining to incorporate vector-specific features. However, this presented a cold start problem: there was no vector search data in the main flow to train on.</p><p>Given engineering limitations and time constraints, we opted for a pragmatic solution: introducing a linear re-ranking model to temporarily replace LTR. This approach allowed us to combine the two retrieval sets while ensuring that vector search ads had enough visibility to generate interaction data for future model training.</p><p>To minimize performance degradation, we built this linear model on top of the existing logic used before LTR — the Edismax scoring function — which linearly combined different BM25 scores with predefined weights. We preserved these weights and a recency feature, which we further boosted to protect the ranking of VAS ads, since these benefit from recency resets upon service purchases. To ensure visibility for vector results, we added the vector similarity score to the ranking function and assigned it a high weight.</p>
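<p>Conceptually, the interim ranking looked like the sketch below; the weights and feature names are placeholders chosen to illustrate the shape of the function, not the values we shipped:</p><pre># illustrative weights: legacy Edismax-style BM25 weights were preserved,
# recency was boosted to protect VAS ads, and vector similarity was
# given a high weight so semantic results stay visible
W_TITLE_BM25, W_DESC_BM25, W_RECENCY, W_VECTOR_SIM = 2.0, 0.5, 1.5, 4.0

def interim_score(f: dict) -&gt; float:
    return (W_TITLE_BM25 * f.get("title_bm25", 0.0)
            + W_DESC_BM25 * f.get("description_bm25", 0.0)
            + W_RECENCY * f.get("recency", 0.0)
            + W_VECTOR_SIM * f.get("vector_similarity", 0.0))</pre>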
<p>We then launched an A/B test using this linear re-ranking strategy, with the sole purpose of collecting data for retraining the LTR model. The results of this experiment were never expected to be strong in terms of performance: our goal was simply to “do no harm.” While the variant underperformed relative to control, the drop was smaller than anticipated. More importantly, the experiment succeeded in producing the data we needed for retraining. An offline analysis based on this data also led to an important insight: increasing the number of vector search results from 35 to 70 improved the quality of the hybrid recall set.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gKIqlj9_6bcxnyVdfV3neA.png" /><figcaption>Flow charts illustrating the differences between control and variant for the Hybrid Search enablement test</figcaption></figure><h4>Hybrid Search in the main flow</h4><p>Finally, we launched our end-goal experiment to validate Hybrid Search in the main search flow. As part of our long-term plan to adopt the VAS LTR model across all markets, we retrained this model and included it as one of the test variants. However, because the market in question had been using the Vanilla LTR up to that point, and changing the ranking model would introduce an extra variable, we also re-trained and included it as a variant. This resulted in an A/B/C test structure, allowing us to assess how Hybrid Search performs with both ranking models.</p><p>Unfortunately, the test did not reach convergence for most metrics due to traffic being split across three variants. The results were inconsistent with expectations and with what we had observed during the extensions experiment. Further investigation revealed a bug in tracking that rendered web data unreliable.</p><p>We then re-launched the experiment, this time focusing solely on the Vanilla LTR variant to avoid further traffic fragmentation. After all, our primary goal was to validate the impact of Hybrid Search, not to compare ranking models. To support this new experiment, we used Android data from the previous test (which was unaffected by the tracking bug) to retrain the Vanilla LTR model.</p><p>Due to the earlier tracking issue, we had to run two separate experiments:</p><ul><li><strong>One on apps</strong>, where the LTR was retrained on Android data as usual.</li><li><strong>Another on mobile web (mWeb)</strong>, where the LTR was also retrained on Android data, a deviation from standard practice, which typically relies on web data.</li></ul><p>The goals of these experiments differed: on apps, the goal was to validate how Hybrid Search interacts with the retrained LTR; on mWeb, the goal was primarily to collect interaction data on vector results for future LTR retraining.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xnkrvdyIdISa6TaPR0SYlw.png" /><figcaption>Flow charts illustrating the differences between control and variant for the Hybrid Search experiment</figcaption></figure><p>The app experiment showed promising results, with an approximately 3% increase in the share of users who completed a successful event, along with positive movement across most metrics related to organic results. Overall, <strong>search efficiency improved</strong>, as reflected by a reduction in the number of searches per session.</p><p>Even more encouraging were the results from mWeb, which showed a ~5% increase in both the share of users with successful events and in ad click rates. <strong>Results were consistently positive across all key performance indicators</strong>.</p><p>Given these strong outcomes, we decided to proceed with a full rollout of <strong>Hybrid Search on all platforms in this market</strong>. 🚀</p><h3>Where to next?</h3><p>Having demonstrated the value of Hybrid Search within the OLX marketplace, our next step is to <strong>expand and refine</strong> the solution. We aim to roll it out across all markets, starting with our largest.
There, we will:</p><ul><li>Continue to iterate by fine-tuning the embedding models</li><li>Experiment with more effective ways to blend vector and keyword search</li><li>Test alternative ranking strategies to further improve relevance and user experience</li></ul><p>Hybrid Search is not a finished project, but an evolving capability — one that we believe will shape the future of discovery across OLX.</p><h3>References</h3><ol><li>Büttcher, S., Clarke, C. L. A., &amp; Cormack, G. V. (2008). <em>Hybrid search: Effectively combining keywords and semantic searches</em>. In S. Bechhofer, M. Hauswirth, J. Hoffmann, &amp; M. Koubarakis (Eds.), <em>The Semantic Web: Research and Applications. ESWC 2008. Lecture Notes in Computer Science</em>, vol 5021. Springer. <a href="https://doi.org/10.1007/978-3-540-68234-9_41">https://doi.org/10.1007/978-3-540-68234-9_41</a></li><li>Chen, G., et al. (2023). <em>mCLIP: Multilingual CLIP via Cross-lingual Transfer</em>. In <em>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL 2023)</em>, pp. 10582–10600. <a href="https://aclanthology.org/2023.acl-long.728.pdf">https://aclanthology.org/2023.acl-long.728.pdf</a></li><li>Wang, L., et al. (2024). <em>Multilingual E5 Text Embeddings: A Technical Report</em>. arXiv preprint arXiv:2402.04782. <a href="https://arxiv.org/abs/2402.04782">https://arxiv.org/abs/2402.04782</a></li></ol><h3>Acknowledgments</h3><p>A special thanks to Tiago Cabo and Mahnaz Namazizavareh for their collaboration in selecting the embedding models, Jakub Krzesłowski for his support in getting these models to production, and to the Search and Ranking teams for all their hard work.</p><p>Thank you for reading! If you enjoyed this article and want to know more about OLX and our work, please visit our <a href="https://www.olxgroup.com/">website</a> or feel free to reach out to me on <a href="https://www.linkedin.com/in/inessoveral/">LinkedIn</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b7c383fe4fc4" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/hybrid-search-where-keywords-meet-vectors-enabling-classifieds-discovery-b7c383fe4fc4">Hybrid Search — Where Keywords Meet Vectors, Enabling Classifieds Discovery</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Speed with Rigor: testing smarter with group sequential design]]></title>
            <link>https://tech.olx.com/speed-with-rigor-testing-smarter-with-group-sequential-design-af962363ce90?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/af962363ce90</guid>
            <category><![CDATA[ab-testing]]></category>
            <category><![CDATA[statistics]]></category>
            <category><![CDATA[sequential-testing]]></category>
            <category><![CDATA[experimentation]]></category>
            <dc:creator><![CDATA[Gabriela Lewenfus]]></dc:creator>
            <pubDate>Tue, 05 Aug 2025 17:25:29 GMT</pubDate>
            <atom:updated>2025-08-05T17:25:29.299Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*eUljdLlpWQK9cYmdUKjBFA.png" /><figcaption>Generated by ChatGPT</figcaption></figure><h3>Introduction</h3><p>In the world of A/B testing and experimentation, it’s tempting to check results frequently and stop tests early if they seem significant (a practice known as peeking). However, this practice dramatically inflates the false positive rate, leading to misleading conclusions.</p><p>Imagine you’re running an A/B test, and after just 10 days, the results look like a clear win. You might be tempted to end the experiment early and move forward with the release of the new feature. After all, why wait another two weeks as originally planned? But here’s the catch: the more often you check the results, the higher the chance you’ll spot a “win” that isn’t real. Frequent peeking increases the risk of false positives — in other words, drawing the wrong conclusion just by chance.</p><p>To prevent peeking while ensuring statistical power, a common practice is to set a minimum sample size that must be reached before ending the experiment. However, this <strong>fixed-horizon</strong> approach can extend the experiment’s duration and delay decision-making.</p><p>Group sequential testing (GST) provides a statistical framework to monitor results at predefined intervals while maintaining control over error rates. At OLX, we migrated from fixed-horizon experiments to GST to achieve faster insights, lower costs, and greater efficiency, enabling our teams to make smarter, data-driven decisions with speed and confidence.</p><p>In this post, I’ll explain how GST works, how it can accelerate experiments while still preventing the pitfalls of peeking, and how we can even use it to stop ineffective tests early. Finally, I’ll discuss real-world trade-offs and implementation challenges.</p><h3>Peeking is a trap</h3><p>Before we dive into GST, let’s first understand what “peeking” is and why it’s so concerning. Peeking occurs when the experimenter repeatedly checks an experiment’s results and finishes it once a statistically significant effect appears. This inflates the false positive rate, leading to unreliable conclusions.</p><p>The probability of making at least one false positive across N checks can be expressed as:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ARfjjlMKctPewU_S2lyoew.png" /><figcaption>from Wassmer, G., &amp; Brannath, W. (2016)</figcaption></figure><p>Rᵢ denotes the event that we reject the null hypothesis at the i-th interim analysis, and Rᵢᶜ denotes its complement (the event that we do not reject the null hypothesis at the i-th analysis).</p><p>Each conditional probability accounts for the fact that the previous checks did not result in stopping the test earlier. These probabilities combined are not equal to just the significance level of the test, α, because the results at each analysis are correlated with each other (data is accumulated). The actual error rate inflation depends on the correlation structure of the test statistics across interim analyses, which is complex to compute directly without simulation or statistical correction methods.</p><p>The phenomenon of peeking can be observed with a simulation. In the following code, variants <em>a</em> and <em>b</em> come from the same distribution, which means that we do not expect a statistically significant difference between the two samples. Each day, 1000 new samples are collected and evaluated over 30 days.
Despite a significance level of 5%, 27.9% of the simulations yielded a significant result.</p><pre>import numpy as np<br>from scipy import stats<br><br>def sim(n=1000, days=30):<br>    # Variants a and b are drawn from the same distribution (H0 is true).<br>    a, b = [], []<br>    r = 0<br>    for i in range(days):<br>        a = np.concatenate((a, np.random.normal(1, 1, n)))<br>        b = np.concatenate((b, np.random.normal(1, 1, n)))<br>        _, p = stats.ttest_ind(a, b)<br>        if p &lt; 0.05:  # peeking: stop as soon as the test looks significant<br>            r = 1<br>            break<br>    return r<br><br>outcomes = [sim(days=30) for _ in range(1000)]<br>print(f&quot;False positive rate: {np.mean(outcomes)}&quot;)</pre><p>Output:</p><pre>False positive rate: 0.279</pre><p>If we only looked at the results once instead of 30 times, the false positive rate would still be around 5%, as expected. That’s because the fixed-horizon approach is specifically designed to prevent inflation of false positives.</p><h3>How GST avoids false positive rate inflation</h3><p>Like fixed-horizon testing, GST controls false positives using predefined stopping rules. However, unlike fixed-horizon tests, GST bases these rules on the test statistic — not on reaching a specific sample size.</p><p>Firstly, instead of evaluating the statistical test at a single point in time, GST establishes predefined checkpoints, known as stages or interim analyses. At each stage, the test statistic is compared against a predefined boundary (explained in the next section!). If it exceeds the boundary, the test can be stopped early with a conclusive result. Each stage has a pre-calculated boundary designed to ensure that the overall significance level of the test remains consistent with the predefined significance level (e.g., 5%) set in the experiment design.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/836/1*G9XvpxUbybsXMsdcEu_WDA.png" /><figcaption>In both figures, the predefined boundaries are in blue and the critical values sequentially calculated are in green. On the left, the critical value crosses the boundary on the 3rd stage and allows the experiment to stop earlier. On the right, the critical values never cross the boundary and the experiment lasts until the last stage.</figcaption></figure><h3>Efficacy Boundaries</h3><p>Efficacy boundaries are predefined statistical thresholds that tell us when we can confidently stop an experiment early because there’s strong evidence of an effect. These boundaries are usually expressed in terms of a test statistic, like a z-score. If the test statistic crosses a boundary value (let’s call it <em>uₖ</em>) at a given checkpoint, we can reject the null hypothesis and stop the test early — that is, we’ve likely found a real effect.</p><p>For those more familiar with p-values: this is equivalent to the p-value falling below a certain threshold (<em>αₖ</em>) at that same checkpoint. A high z-score corresponds to a low p-value, so crossing a z-boundary means the p-value is small enough to justify stopping.</p><p>We use z-scores here because they make boundaries easier to visualize — but conceptually, it’s the same logic as comparing p-values to a cutoff.
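</p><p>As a quick sanity check on that equivalence, here is a one-liner (using SciPy) that converts a z-boundary into the corresponding two-sided p-value threshold:</p><pre>from scipy import stats<br><br>def z_boundary_to_alpha(u_k: float) -&gt; float:<br>    # Two-sided p-value threshold equivalent to rejecting when |z| &gt;= u_k.<br>    return 2 * (1 - stats.norm.cdf(u_k))<br><br>print(round(z_boundary_to_alpha(2.81), 4))  # ~0.005, the first check in the example below</pre>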
<p>To summarise, the advantages of efficacy boundaries are:</p><ul><li>They reduce experiment duration by stopping earlier when there is strong evidence for it.</li><li>They maintain the overall significance level (e.g., 5%) despite multiple looks at the data.</li></ul><p>Efficacy boundaries are based on special formulas that make sure the overall false positive rate stays under control, even if we check the results multiple times.</p><p><strong>Example: </strong>Suppose you plan to check your experiment results at three equally spaced moments. Instead of using the standard α = 5% at each stage, you can have different stopping rules per stage:</p><ul><li>First check: p-value &lt; 0.005 to stop early.</li><li>Second check: p-value &lt; 0.014 to stop early.</li><li>Final check: p-value &lt; 0.044 (usual threshold).</li></ul><p>By construction, the sequence [0.005, 0.014, 0.044] ensures that the overall Type I error rate remains at 5%, even with multiple analyses. Let’s take a look at the O’Brien-Fleming (OBF) boundary to better understand this.</p><p><strong>O’Brien-Fleming boundary</strong></p><p>Let’s call uₖ the critical value corresponding to αₖ at stage <em>k</em>. It is defined as follows:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*azJI1xKEbGjzYkwWndssbA.png" /></figure><p>The OBF constant depends only on K (the number of stages) and the level of significance. Each critical value can be calculated from this constant. The OBF constant is calculated by solving:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*sv0GVw8z8njyj8xy1EDaug.png" /></figure><p>This expression makes it clear that, by construction, the overall significance level of the test does not exceed the pre-defined α even with multiple analyses.</p><p>The following image shows an example of a 5-stage two-sided hypothesis test in which the efficacy boundary is crossed at the 4th stage. If no boundary is crossed, the null hypothesis can’t be rejected, and the experiment runs for all 5 stages.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*CjYXSTmDB2nzz7AL3N8v9w.png" /><figcaption>Efficacy boundary crossed in the 4th stage</figcaption></figure><p>In early stages of the experiment, the p-value threshold, αₖ for stage <em>k</em>, is small, which means that only large observed effects are able to cross this boundary, whereas at the end of the experiment, when there’s more data, αₖ is closer to the significance level of the test.</p><p>The <a href="https://github.com/gabilew/group-sequential-testing/blob/400fc01d61de7a61c3dedb30f0a82f83eed7c589/group_sequential_testing/gst.py#L13">code here</a> shows how to calculate the OBF constant.</p><p>Another common approach is the Pocock boundary, which uses a constant significance threshold across all stages but has some trade-offs in terms of power.</p><h3>Incorporating Futility Boundaries</h3><p>And there’s more … We’re not done with the advantages yet!<br> Futility boundaries are another powerful feature: they’re predefined thresholds that allow you to stop an experiment early if there is little to no evidence that the test will eventually show a significant effect. In other words, they help you cut your losses and move on when the data clearly isn’t heading toward a win.
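</p><p>Conceptually, each interim analysis with both boundary types reduces to a simple decision rule. Below is a minimal sketch using the approximate five-stage OBF efficacy boundaries (two-sided, α = 0.05); the futility values are purely illustrative, since a real design derives both sets from the chosen boundary family:</p><pre># Approximate OBF efficacy boundaries for K=5 stages (two-sided, alpha=0.05).<br>EFFICACY = [4.56, 3.23, 2.63, 2.28, 2.04]<br># Purely illustrative futility boundaries; both sets meet at the final<br># stage, so a decision is forced when the experiment ends.<br>FUTILITY = [0.0, 0.4, 0.9, 1.5, 2.04]<br><br>def interim_decision(z: float, k: int) -&gt; str:<br>    if abs(z) &gt;= EFFICACY[k]:<br>        return &quot;stop for efficacy: reject H0&quot;<br>    if abs(z) &lt; FUTILITY[k]:<br>        return &quot;stop for futility: unlikely to ever reject H0&quot;<br>    return &quot;continue to the next stage&quot;<br><br>print(interim_decision(z=2.4, k=2))  # continue to the next stage</pre>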
<p>Futility boundaries save time and resources by stopping tests that are unlikely to yield a conclusive result; they reduce unnecessary exposure to ineffective treatments or changes and allow quicker decision-making when early results suggest no meaningful impact.</p><p>At each interim analysis, if the test statistic falls below the futility boundary, or equivalently if the p-value is too large (e.g., p &gt; 0.90), the experiment can be stopped due to lack of evidence.</p><p>An example of such a design is the Pampallona and Tsiatis (1994) family of boundary functions, parametrised by constants c₀ and c₁. It provides a flexible way to define futility boundaries in GST. This framework allows experimenters to control the probability of stopping early for futility while maintaining statistical rigor.</p><p>In this design, each stage has an upper and a lower critical value limiting the continuation region. Both critical values are obtained by solving the same optimisation problem. The critical values are calculated as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*61L77RA6rGcyDZh91DK9vw.png" /></figure><p>where</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_XThV46rRbT1OSzBLOntCQ.png" /></figure><p>Constants c₀ and c₁ are calculated by solving:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*l2LKrZ1rj_7J8PYfcFGsYw.png" /></figure><p>where Zₖ* − υₖ is standard normal under H1 and Zₖ* is standard normal under H0. We want to find c₀ and c₁ such that the specifications of the test (α and β) are met. In these formulas, Δ is a parameter that defines the shape of the boundaries. If Δ=0, the efficacy boundary is close in shape to the OBF boundary.</p><p>The following image shows an example of a 5-stage two-sided hypothesis test. Efficacy boundaries are shown in blue, while futility boundaries are shown in orange. The futility boundary is crossed at the 4th stage. If no boundary is crossed, the experiment lasts until the 5th stage.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/844/1*Fhl3dYh1FhbbuNzfDVB0vQ.png" /></figure><p>The key parameter, Δ, determines the shape of the futility boundary:</p><ul><li>Δ&lt;0 (Aggressive futility): Early stopping is easier if no strong effect is observed.</li><li>Δ=0 (Flat boundary): Maintains a relatively neutral stopping rule.</li><li>Δ&gt;0 (Conservative futility): Requires stronger evidence to stop early, making it less likely to terminate prematurely.</li></ul><p>Using futility boundaries reduces the risk of running pointless experiments, but setting them too aggressively can lead to stopping tests that might have shown a real effect if given more time.</p><h3>Maximum sample size</h3><p>But there’s no free lunch! Due to early stopping, GST does not have a fixed sample size: if, for instance, the experiment finishes halfway through the stages, only half of the maximum sample size is required to conclude it. However, if no boundary is crossed (the z value remains in the continuation region throughout the experiment), the experiment finishes once it reaches the maximum sample size (MSS). Unlike the minimum sample size of the classic fixed-horizon approach, the maximum sample size represents the sample size in the worst-case scenario (when there is no early stop).</p><p>Now the trade-off: due to the multiple analyses, the MSS of GST is somewhat larger than the minimum sample size of a fixed-horizon test.
We call the ratio between the MSS of GST and the fixed-horizon minimum sample size the inflation factor (I). The inflation factor is a function of the GST design parameters: K, α, and β (in the case of futility boundaries).</p><p>The inflation factor for a two-sided experiment is calculated as</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Fmhbdw3ONBkwnAzmBHNZAg.png" /></figure><p>The inflation factor for the Pampallona and Tsiatis (1994) design with a two-sided hypothesis, α=0.05, β=0.2, and Δ=0 can reach almost 1.25 when the number of stages is 20. This means that in a 20-stage design, the experiment can require at most 25% more samples than the corresponding fixed-horizon approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/855/1*kJ_s76FprVPYyaaNgNxN_Q.png" /><figcaption>Inflation factor per number of stages for the Pampallona and Tsiatis (1994) design</figcaption></figure><h3>Average sample size</h3><p>Average sample size (ASN) is the expected sample size once early stopping is taken into account. For instance, if H1 is true, one might expect to finish a five-stage Pampallona and Tsiatis (1994) design with α=0.05, β=0.2, Δ=0 when 78.8% of the minimum sample size of the fixed-horizon method is reached.</p><p>The following formula shows how ASN (under H1) is estimated:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*36vLcSEo1ofJlua2e0XpKQ.png" /></figure><p>where nₖ is the sample size at stage k, Zᵢ ∈ ψᵢ, i=1,…,k−1 represents the event of Z belonging to the continuation region at stages prior to stage k, and, consequently, Pₕ₁ is the probability under H1. ASN can also be calculated under H0.</p><p>Follow <a href="https://github.com/gabilew/group-sequential-testing/blob/main/group_sequential_testing/sample_size.py">this code</a> to calculate ASN when the alternative hypothesis is true. If we consider α and β equal to 5% and 20%, respectively, and a standardised MDE (absolute MDE divided by the standard deviation) equal to 0.5, the MSS would be 32 samples per variant, while the ASN would be 27, i.e., 84.375% of the MSS.</p><pre>asn = calculate_asn(<br>    c0=1.9892,<br>    c1=3.9055,<br>    n_analysis=4,<br>    stage_size=9,<br>    alternative=&#39;two-sided&#39;<br>  )<br>print(f&quot;Average Sample Number: {asn}&quot;)</pre><p>Output:</p><pre>Average Sample Number: 27</pre><h3>Trade-Offs and Practical Challenges</h3><p>While GST improves statistical rigor, it also introduces challenges. Implementing it requires careful planning, including estimating traffic accurately to determine when to perform interim analyses. I’ll share some lessons learned from real-world applications.</p><ol><li>Increasing the number of interim analyses raises the MSS. However, conducting more frequent analyses also increases the chances of stopping the experiment earlier, which can significantly reduce the average sample size required to reach a conclusion. This trade-off must be carefully balanced to optimize efficiency while maintaining statistical power.</li><li>The analysis above assumes data is collected at consistent intervals. However, if actual traffic deviates from expectations, statistical integrity can be affected. A sudden increase in traffic can compromise Type I error control, potentially leading to inflated false-positive rates. Conversely, a traffic drop can result in wider confidence intervals, reducing the test’s sensitivity and making it harder to detect true effects. Accurate traffic estimation is crucial to maintain the validity of sequential testing.
Fortunately, group sequential tests can be flexibly designed to accommodate unequal intervals between analyses, but the number of stages should still be estimated in advance.</li></ol><h3>Conclusion</h3><p>GST is a method that allows interim analyses without inflating the probability of false positives. By incorporating a futility boundary, the experiment can also stop early when there is strong evidence that the null hypothesis won’t be rejected if the experiment continues. This approach allows quicker decision-making and avoids wasting resources on ineffective interventions while maintaining statistical rigor.</p><p>Compared to other sequential frameworks, GST provides clear, pre-specified decision rules based on boundary functions, making it easier to communicate and interpret results. However, implementing GST comes with important trade-offs. The frequency of analyses directly impacts the MSS and the average sample size required. Additionally, accurate traffic estimation and proper experiment design are crucial.</p><p>Despite these challenges, GST remains a valuable approach for experimenters seeking a balance between speed and statistical rigor.</p><p><strong><em>References:</em></strong></p><p>Wassmer, G., &amp; Brannath, W. (2016). <em>Group Sequential and Confirmatory Adaptive Designs in Clinical Trials</em>. Springer International Publishing. ISBN: 978-3-319-32560-6.</p><p>…</p><p>Thank you so much for taking the time to read this article!<br> If you’d like to explore the code I used here, you can find it on <a href="https://github.com/gabilew/group-sequential-testing/tree/main">GitHub</a>.</p><p>Curious to learn more about what we’re building at OLX?<br> 👉 Check out our <a href="https://www.olxgroup.com/">website</a> for more.</p><p>And if you’d like to connect, feel free to reach out on <a href="https://www.linkedin.com/in/gabriela-lew/">LinkedIn</a> — I’d love to hear from you!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=af962363ce90" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/speed-with-rigor-testing-smarter-with-group-sequential-design-af962363ce90">Speed with Rigor: testing smarter with group sequential design</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Scaling recommendations service at OLX]]></title>
            <link>https://tech.olx.com/scaling-recommendations-service-at-olx-db4548813e3a?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/db4548813e3a</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[optimization]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[concurrency]]></category>
            <category><![CDATA[scalability]]></category>
            <dc:creator><![CDATA[Jordi Esteve Sorribas]]></dc:creator>
            <pubDate>Tue, 08 Jul 2025 15:03:23 GMT</pubDate>
            <atom:updated>2025-10-10T07:56:53.853Z</atom:updated>
            <content:encoded><![CDATA[<h3>Optimizing FastAPI at Scale: Lessons from OLX’s Recommendation Platform</h3><figure><img alt="Photo by Rosy Ko" src="https://cdn-images-1.medium.com/max/1024/1*3hxvKXcAJUSNQIRJgQZe-Q.jpeg" /><figcaption>Photo by <a href="https://unsplash.com/@rosy_">Rosy Ko</a></figcaption></figure><blockquote>In distributed systems, there is a motto that says ‘you are as slow as your slowest tasks’. In Python, thanks to the notorious Global Interpreter Lock (<a href="https://wiki.python.org/moin/GlobalInterpreterLock">GIL</a>), this issue is amplified: ‘your slowest task will make every other task slower’. In this article, I’ll walk you through the optimizations we made to scale a FastAPI service that now handles tens of thousands of requests per second, achieving a <strong>p99 latency under 10ms</strong>.</blockquote><h3>Introduction</h3><p>OLX is a global online marketplace that enables users to buy and sell goods and services, primarily through classified ads. We have a clear vision: to create leading marketplace ecosystems enabled by tech, powered by trust, and loved by customers. Every month, we engage 45 million app users and support over 73 million active listings. To help users seamlessly navigate this vast inventory, we’ve integrated recommendation systems across multiple touchpoints in all our platforms. These recommendations are powered by the recommendations platform, which is responsible for delivering personalized suggestions across various contexts. Most, if not all, of these are served through a dedicated recommendations service.</p><p>Over the past few months, we’ve built and scaled this system within the data team, successfully shifting the ownership from a shared backend service to a service fully owned by the team to gain greater autonomy and flexibility. The team decided to build it in Python, the go-to language for data and machine learning work and the most widely used language within both the team and the broader domain. While Python allows for rapid development and prototyping, working at scale has surfaced several challenges and trade-offs. It hasn’t been an easy journey, but it’s one that’s taught us a lot and significantly matured our infrastructure and processes.</p><h3>To Async or Not Async</h3><p>The service consumes data from various sources: ScyllaDB, DynamoDB, and internal OLX services. A key optimization we implemented was converting these I/O operations into non-blocking calls by leveraging FastAPI’s support for asynchronous programming.</p><p>FastAPI offers excellent documentation, including a dedicated <a href="https://fastapi.tiangolo.com/async/">guide to concurrency and async programming</a> in Python. While we won’t go into the weeds of Python’s concurrency models, it’s helpful to understand the key distinction between the two common approaches:</p><ul><li><strong>Threading</strong> achieves concurrency by running multiple threads. When a thread performs a blocking I/O operation, Python can release the Global Interpreter Lock (GIL), allowing other threads to execute in the meantime.</li><li><strong>Async I/O</strong> is based on an event loop, where tasks voluntarily yield control while waiting for I/O operations to complete. This enables other tasks to run without blocking the entire application.</li></ul><p>For I/O-bound workloads, async is generally the preferred approach; it’s more lightweight, avoids thread management overhead, and thus scales better.
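</p><p>As a minimal sketch of the two models (the sleeps stand in for real I/O):</p><pre>import asyncio<br>import time<br>from concurrent.futures import ThreadPoolExecutor<br><br>def blocking_io(_):<br>    time.sleep(0.1)  # blocking call; the GIL is released while sleeping<br><br>async def non_blocking_io():<br>    await asyncio.sleep(0.1)  # yields control back to the event loop<br><br># Threading: ten blocking calls overlap across worker threads.<br>with ThreadPoolExecutor(max_workers=10) as pool:<br>    list(pool.map(blocking_io, range(10)))<br><br># Async I/O: a single thread interleaves ten tasks on the event loop.<br>async def main():<br>    await asyncio.gather(*(non_blocking_io() for _ in range(10)))<br><br>asyncio.run(main())</pre>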
<p>However, async programming comes with a steeper learning curve and can introduce subtle bugs or performance issues if not used carefully.</p><p>These trade-offs become especially relevant in FastAPI. The first place you’ll encounter them is when defining path operations, where performance can hinge on how endpoints are declared: either with def (synchronous) or async def (asynchronous). This seemingly small decision has significant implications; the decision flow is encapsulated within this <a href="https://github.com/fastapi/fastapi/blob/master/fastapi/routing.py#L204">function</a> and shown in Image1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4D3TXojazWxQnDeI" /><figcaption>Image1. <a href="https://github.com/fastapi/fastapi/blob/master/fastapi/routing.py#L204">FastAPI</a> decides whether to run the path operation in the event loop or a thread pool based on function definition.</figcaption></figure><p>Declaring a route with def runs the function in a separate thread, which allows you to perform blocking I/O safely, without stalling the main event loop. However, this comes with the overhead of managing threads. <a href="https://www.starlette.io">Starlette</a>, the ASGI framework FastAPI is built upon, uses a default thread pool of <a href="https://www.starlette.io/threadpool/">40 tokens</a>. On the other hand, declaring a route with async def runs it on the event loop, and all I/O operations must be awaited. This offers better performance if all the libraries, or wrappers, used are async-compatible.</p><p>It’s also worth noting that this distinction between def and async def applies not only to route handlers but also to FastAPI’s <a href="https://fastapi.tiangolo.com/tutorial/dependencies/#what-is-dependency-injection">dependency injection</a> system. Whether a dependency is defined synchronously or asynchronously affects how it’s executed: either resolved in the thread pool or the event loop. Following the same rule, the <a href="https://github.com/fastapi/fastapi/blob/master/fastapi/routing.py#L143">response validation</a> happens either in a thread or the event loop. Therefore, understanding how many tasks are offloaded to the thread pool is important, as it’s relatively easy to reach its limit, which may require manual configuration. In reality, achieving a fully async service is tricky, especially when dealing with multiple data sources, legacy libraries, and a mix of sync and async libraries.</p><p>Our primary data source is ScyllaDB, but the official Python <a href="https://python-driver.docs.scylladb.com/stable/index.html">driver</a> does not support the asyncio protocol (i.e., the await and async def keywords). While it’s technically possible to wrap the driver’s future-style interface to make it awaitable, we chose to offload the query to a thread instead. The reason is simple: the driver internally already uses threads, so wrapping it in async didn’t offer meaningful performance gains and could potentially block the event loop.</p><p>Even after making the ScyllaDB queries non-blocking, we saw <strong>no improvement in service performance</strong>. The reason is that operations against DynamoDB were still blocking the event loop. In Python, the standard way to interact with DynamoDB is through <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/index.html">boto3</a>, which, despite numerous community <a href="https://github.com/boto/botocore/issues/458">requests</a>, is still not async-compatible.
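</p><p>For the ScyllaDB case above, the offloading pattern looks roughly like this (a sketch: blocking_query is a hypothetical stand-in for the synchronous driver call):</p><pre>import asyncio<br>import time<br><br>def blocking_query(ad_id: int) -&gt; list:<br>    time.sleep(0.05)  # stands in for a synchronous ScyllaDB driver call<br>    return []<br><br>async def query_in_thread(ad_id: int) -&gt; list:<br>    # Run the blocking call in the default thread pool so the event loop<br>    # stays free to serve other requests in the meantime.<br>    return await asyncio.to_thread(blocking_query, ad_id)</pre>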
<p>This forced us to choose between two options: either offload the blocking calls to a thread pool or adopt <a href="https://pypi.org/project/aioboto3/">aioboto3</a>, an async wrapper around boto3. After running several experiments in the wild, we found that aioboto3 offered better latency: it enables non-blocking I/O operations by leveraging Python’s asyncio event loop with minimal overhead.</p><p>The remaining data sources were other APIs. Fortunately, Python has strong support for asynchronous HTTP requests. While there’s not much to add here, it’s worth pointing to <a href="https://github.com/encode/httpx/issues/914">this GitHub issue</a> in the httpx repository, which compares the performance and design trade-offs between httpx and aiohttp.</p><p>These optimizations resulted in a <strong>latency reduction of over 40%</strong> and a <strong>more than 50% increase in throughput</strong>.</p><h3>Beyond Concurrency</h3><p>By this point, all I/O operations had been converted to non-blocking calls. However, it became clear that the event loop was still being blocked at certain points. To monitor this, we used <a href="https://newrelic.com/blog/how-to-relic/python-event-loop-diagnostics">New Relic’s event loop diagnostics</a> (Image2) and incorporated <a href="https://docs.python.org/3/library/asyncio-dev.html#debug-mode">additional logging</a> to pinpoint the bottlenecks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*K6JUHai-Z0wHWEar" /><figcaption>Image2. Time series from New Relic’s event loop diagnostics. The blue line represents the time the event loop was waiting in seconds to regain control, illustrating the event loop before and after it becomes unblocked.</figcaption></figure><h4>Optimizing Read Patterns</h4><p>One of the first optimizations we made was avoiding fanning out queries to ScyllaDB. In scenarios where we wanted to provide personalized recommendations, the typical flow involves querying the data source to fetch the most recent user interactions, and then, for each interaction, querying ScyllaDB to retrieve the most similar items. This created a fan-out pattern (Image3), one query per interaction, where even though the queries were concurrent, they were not run in parallel.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*maW8te1Eu1etksAI" /><figcaption>Image3. Diagram showing fan-out query pattern: a single request leads to multiple ScyllaDB queries based on user interactions.</figcaption></figure><p>To address this, we experimented with batching the queries into a single request to ScyllaDB. This significantly reduced overhead by minimizing network round-trips and easing the load on both the driver and the event loop. Most importantly, it helped prevent event loop stalls. As shown in Image2, this change led to a clear reduction in event loop blocking, ultimately <strong>reducing p99 latency by over 50%</strong>.</p><h4>Minimizing Data Validation: Saying Goodbye to Pydantic</h4><p>Data validation in FastAPI when using Pydantic takes place in three distinct stages: request input parameters, request output, and object creation. While this approach ensures robust data integrity, it can introduce inefficiencies. For instance, if an object is created and then returned, it will be validated multiple times: once during instantiation and again during response serialization.
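</p><p>A minimal sketch of this double validation (the model and route are hypothetical):</p><pre>from fastapi import FastAPI<br>from pydantic import BaseModel<br><br>app = FastAPI()<br><br>class Item(BaseModel):<br>    id: int<br>    title: str<br><br>@app.get(&quot;/item&quot;, response_model=Item)<br>def get_item():<br>    item = Item(id=1, title=&quot;bike&quot;)  # validation #1: on instantiation<br>    return item  # validation #2: against response_model when serializing</pre>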
<p>The situation can worsen if the object is re-created or used to construct another similar object, leading to redundant validations that consume unnecessary resources.</p><p>This redundancy might be acceptable when the objects being validated are small and simple, or when the Pydantic models themselves are lightweight. However, in the recommendation service, the system often needs to handle large volumes of data, several thousand ad IDs plus metadata. In these cases, validating the same object, or a near-identical one, more than once becomes inefficient and can hurt performance.</p><p>Image4 below shows that, for a sizeable fraction of requests, building the Pydantic object took a significant amount of time. This wasn’t just an issue for the ongoing request; it also impacted all other concurrent requests. Given our use case, where we control most of the data we consume, we decided to drop Pydantic everywhere except for the input request parameters. We switched to using dataclasses with slots, which <strong>drastically reduced latency by more than 30%</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/409/0*Kz58_CTy6vTU214w" /><figcaption>Image4. Distribution of time spent building the Pydantic model.</figcaption></figure><p>As a final note, it’s important to highlight that performance can be affected not just by the size of the data being processed, but also by overly complex Pydantic models and type annotations. For example, consider a response model defined as a Union[A, B]. In this case, FastAPI (via Pydantic) will validate first against model A and, if that fails, against model B. If A and B are deeply nested or complex, this leads to redundant and expensive validation, which can negatively impact performance.</p><h4>Latency Roulette: When Garbage Collector (GC) Enters the Game</h4><p>Even after removing Pydantic, we continued to see unexpected latency spikes during model construction. The bimodal distribution shown in Image5 was particularly puzzling. The input content, object size, and schema remained constant, yet performance varied significantly. After ruling out obvious causes, we realized that something outside our direct control might be at play: the <strong>garbage collector (GC)</strong>. Disabling it entirely wasn’t an option, but tuning its behavior was.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/486/0*lDm_pTaS32EWQdk5" /><figcaption>Image5. Distribution of time spent building the model using dataclasses. While the improvements over Pydantic (Image4) are notable, there’s still room for further optimization.</figcaption></figure><p>The GC is triggered when the number of allocations minus the number of deallocations exceeds certain thresholds for each generation. Python’s GC organizes objects into three generations (0, 1, and 2). New objects start in generation 0. If they survive a collection, they get promoted to the next generation (more details <a href="https://docs.python.org/3/library/gc.html#gc.set_threshold">here</a>). Given the bimodal distribution in the above image, we hypothesized that the default Python GC settings might be suboptimal for our high-throughput needs.</p><p>By increasing the generation 0 threshold and adjusting the thresholds for generations 1 and 2, we aimed to reduce the frequency of garbage collection, thereby lowering tail latencies without significantly increasing memory consumption.
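</p><p>The change itself is a one-liner; the thresholds shown here are the values we settled on, as discussed below:</p><pre>import gc<br><br>print(gc.get_threshold())  # CPython defaults: (700, 10, 10)<br><br># Raise gen0 to 7000 allocations and gens 1 and 2 to 20, trading a little<br># memory headroom for fewer, less frequent collection pauses.<br>gc.set_threshold(7000, 20, 20)</pre>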
<p>However, this requires careful tuning: too infrequent collections can cause p99 latencies to spike, as a larger number of objects accumulate before collection.</p><p>To test this hypothesis, we ran a series of local load tests comparing the default GC settings with optimized configurations. Our goal was to strike a balance between the number of GC runs and the duration of each run. By increasing the generation 0 threshold from 700 to 7000 objects and raising the thresholds for generations 1 and 2 from 10 to 20, we observed improved tail latencies and a reduction in the frequency of micro-pauses.</p><p>After deploying this configuration in production, we saw a <strong>20% overall latency reduction in our service</strong>. More notably, the latency for homepage recommendation requests, which return the most data, improved dramatically, with <strong>p99 latency dropping from 52ms to 12ms</strong> (Image6). Shout out to <strong>Denis Belyakov </strong>for running the load tests and finding the best GC parameters!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fOPEwR7hdt4FUWcq" /><figcaption>Image6. Latency reduction of homepage recommendations. The y-axis is latency in milliseconds and the x-axis is time.</figcaption></figure><h3>Deployment and Infrastructure</h3><p>One of the standout features of ScyllaDB is its <a href="https://www.scylladb.com/2020/10/13/making-a-shard-aware-python-driver-for-scylla-part-1/">shard-per-core awareness</a>, which allows clients to route queries directly to the relevant CPU core on the target node. To validate this behavior, we added logging to check connections across all shards and noticed that one shard was never being connected to, indicating an issue with shard connectivity (Image7).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zkS5eUhrzXZ8ovvh" /><figcaption>Image7. Snippet of the code used to monitor shard-aware connections to the Scylla cluster</figcaption></figure><p>Independent of the recommendation service, we enhanced the deployment strategy by using <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/">topology spread constraints</a> and <a href="https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/">node selectors</a>.</p><h3>Learnings and Conclusions</h3><p>In this journey, we learnt a lot about FastAPI and the Python ecosystem, ScyllaDB, and other technical dimensions. While we gained valuable technical insights and reached our goals, a few key takeaways stood out:</p><ul><li><strong>Debugging and reasoning in a concurrent world under the reign of the GIL is not easy.</strong> You might have optimized 99% of your request, but a rare operation, happening just 1% of the time, can still become a bottleneck that drags down overall performance.</li><li><strong>FastAPI and Python enable rapid development and prototyping</strong>, but at scale, it’s crucial to understand what’s happening under the hood. You’ll likely need to run multiple rounds of profiling and optimization. At that point, it’s worth questioning whether switching to a different language might be more efficient.</li><li><strong>Load test like there’s no tomorrow.</strong> You’ll uncover unexpected performance issues only when you simulate real-world traffic and usage patterns.
Internally, we used <a href="https://github.com/tsenart/vegeta">vegeta</a> to run these load tests, with great support from our SREs.</li><li><strong>The majority of these steps were executed in the presented order</strong>. However, there may be confounding variables that we overlooked; for example, retaining Pydantic while tuning the GC thresholds might have produced similar results.</li><li><strong>Start small, test, and extend</strong>. I can’t stress enough how important it is to start with a PoC, evaluate it, address the problems, and move forward. Down the line, it is very difficult to debug a fully featured service that has scalability problems.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=db4548813e3a" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/scaling-recommendations-service-at-olx-db4548813e3a">Scaling recommendations service at OLX</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlocking Flexibility in Configuration: The Power of Hydra]]></title>
            <link>https://tech.olx.com/unlocking-flexibility-in-configuration-the-power-of-hydra-3f9e69263bb4?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/3f9e69263bb4</guid>
            <category><![CDATA[hydra]]></category>
            <category><![CDATA[ránking]]></category>
            <category><![CDATA[configuration]]></category>
            <category><![CDATA[configuration-management]]></category>
            <category><![CDATA[project-organization]]></category>
            <dc:creator><![CDATA[Catarina Goncalves]]></dc:creator>
            <pubDate>Mon, 02 Jun 2025 17:02:34 GMT</pubDate>
            <atom:updated>2025-06-02T17:02:33.914Z</atom:updated>
            <content:encoded><![CDATA[<p>As software systems continue to grow in complexity, the demand for effective configuration management tools becomes not just important but essential. Our work on Learning to Rank (LTR) projects exemplifies this challenge.</p><p>At OLX we use LTR to improve the search results based on their relevance. This involves training models to rank ads based on various features, ensuring that the most relevant ones appear at the top of the list. The resulting sorting can be found under the ‘Recommended Ads’ option, the default option currently live on our OLX websites in several countries.</p><p>Given the potential to enhance user experience and increase engagement — while balancing the visibility of paid (boosted) ads — this approach is also being tested in other business units, such as Motors and Real Estate.</p><p>More projects meant increased complexity, with multiple repositories holding various configuration files. These projects were spread across different teams and repositories, leading to duplicated efforts and repeated issues for similar tasks. This growing redundancy pushed the team to centralize all LTR projects within a single repository.</p><p>We quickly recognized the need to enhance the experience for Data Scientists and Machine Learning Engineers who would work directly with this new centralized framework.</p><p>Enter <a href="https://hydra.cc/">Hydra</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YsbzuD_Ptss4uSVfGAAzjg.png" /></figure><h3>Hydra — The configuration tool</h3><p>Developed and maintained by Facebook AI Research, Hydra stands out when it comes to configuration management. It empowers developers with a sophisticated yet easy-to-use solution that simplifies even the most complex configuration setups.</p><p>With features such as hierarchical organization, runtime command-line overrides, and seamless integration with optimization tools like Optuna, Hydra offers flexibility and efficiency. These qualities made Hydra the ideal choice for our project, and we have successfully integrated it to enhance our configuration management workflow.</p><p>And don’t worry too much about a learning curve — Hydra has extensive documentation, with numerous examples, that allowed us to familiarize ourselves quickly. We’d even say it was one of Hydra’s main selling points!</p><h3>Getting Started</h3><p>To get started with Hydra, you can install it using the following command:</p><pre>$ pipenv install hydra-core</pre><h4>Basic Usage</h4><p>In your Python script, where you’d like to use Hydra configuration files, you must include the specific imports and decorators, specifying the appropriate configuration path:</p><pre>import hydra<br>from omegaconf import DictConfig, OmegaConf</pre><pre># main_script.py<br><br>@hydra.main(version_base=None, config_path=&quot;../hydra_config&quot;, config_name=&quot;base_config&quot;)<br>def example_function(cfg: DictConfig) -&gt; None:<br>  print(OmegaConf.to_yaml(cfg))<br><br>if __name__ == &quot;__main__&quot;:<br>  example_function()</pre><p>The @hydra.main decorator is what <strong>allows us to connect our code with Hydra directly</strong>. With this decorator, you specify the path to the configuration folder as well as the name of the configuration file to use. It is placed right above the function that requires the information found within the configuration files.
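</p><p>Running the script then prints the fully composed configuration. A hypothetical output, assuming a base_config similar to the examples shown below, might look like:</p><pre>$ python main_script.py<br>environment: development<br>model:<br>  estimator: XGBRanker<br>brand: olx<br>country: pt</pre>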
<p>Without the decorator, you’d need to manually load, parse, and manage the different configuration files.</p><h4>Folder Structure</h4><p>To fully leverage Hydra’s capabilities and its Hierarchical Structure — explained in more detail in the next section — we first need to establish a well-organized folder structure for our configuration setup. Key questions to consider include:</p><ul><li>What are my project’s requirements?</li><li>What should the default settings be?</li><li>How will the configuration scale with future requirements?</li><li>How can we organize shared parameters to maximize reusability?</li></ul><p>Once these questions were addressed, we designed an ideal configuration structure tailored to our project’s needs. This involved defining high-level groups — main folders organized by topics such as models, target definitions, and environment setup. Within each group, we then identified the relevant subgroups, such as the specific model configurations, target variations, and different environment settings.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DwQtX0k_ncPtvMNbVHIUjg.png" /><figcaption>Example — Modules in Configuration Folder</figcaption></figure><p>Imagine you are a Data Scientist and want to train an<strong> XGBoost Model</strong> with a simple <strong>binary target </strong>to rank <strong>OLX</strong> ads in <strong>Portugal</strong> (“pt” as the country code). With the above folder structure, you can import the desired subgroup from the presented modules, as exemplified below:</p><pre>defaults: <br>     - environment: development<br>     - model: xgboost<br>     - target: binary <br>     - input_data: olx<br>     - _self_ <br><br>brand: olx<br>country: pt</pre><p>These imports are possible due to the <strong>defaults section</strong>. This allows us to create several configuration combinations without having to duplicate code — simply by importing the existing defined configuration for each group. And if the model you want to train with is not yet defined, you can easily create a dedicated configuration file within the respective folder and import it.</p><p>You might be wondering, <em>“What else can we do with Hydra?”</em> Let’s dive in and explore the tool’s key features.</p><h3>Exploring the Tool</h3><p>Amongst Hydra’s many features, a few stood out to our team as particularly valuable. Below, we will <strong>briefly highlight and explain the ones we leverage the most</strong>.</p><h4>Hierarchical Structure</h4><p>As previously mentioned, Hydra allows you to organize and manage configuration settings hierarchically. The configuration files can be organized into groups and subgroups, creating a hierarchical structure that makes it easy to manage complex configurations. This is also known as <strong>Config Groups.</strong></p><pre>defaults: <br>     - environment: production<br>     - model: catboost<br>     - target: binary <br>     - input_data: real_estate<br>     - _self_ <br><br>brand: real_estate<br>country: ro</pre><p>However, what if we simply want to change a single field within the existing imported configuration from a specific subgroup?</p><h4><strong>Configuration Override</strong></h4><p>After importing the desired defaults, <strong>we can override specific fields defined in the imported configuration files.</strong> Continuing with the above example, where CatBoost is set as the default model, we can customize its behavior by modifying its configuration fields.
Let’s take a closer look at the CatBoost configuration file and the parameters it defines:</p><pre># catboost.yaml<br><br>estimator: CatBoostRanker<br>params:<br>  logging_level: Silent<br>  loss_function: PairLogit<br>  od_wait: 10<br>  depth: 6<br>  boosting_type: Plain<br>  bootstrap_type: Bernoulli<br>  learning_rate: 0.01<br>  random_seed: 16<br>  random_strength: 0<br>  allow_writing_files: False<br>  n_estimators: 100</pre><p>Suppose you’d like to <strong>experiment with a different loss function</strong> for your CatBoost model. To do this, you can override its value after the defaults section — making sure to respect the existing hierarchy defined in the original configuration subgroup.</p><pre>defaults: <br>     - environment: production<br>     - model: catboost<br>     - target: binary <br>     - input_data: real_estate<br>     - _self_ <br><br>brand: real_estate<br>country: ro<br><br>model: # access imported model<br>  params: <br>    loss_function: YetiRank # override default loss_function<br></pre><p>This override lets you adjust the model without having to copy all of its parameters or create a separate file just to change the loss function. But what if you don’t want to add another configuration file or modify the current one, and instead simply want to test different values for specific fields?</p><h4>Command-line Mania</h4><p>Apart from being able to override default values within each configuration file, Hydra offers <strong>runtime command-line override </strong>options that allow you to test different values for any field <strong>without having to change the configuration file or create a new one</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*JASIgI1zI0XDjr2U" /></figure><p>These overrides provide flexibility in various ways:</p><ul><li>You can modify a field’s value directly.</li><li>Append a new value using the + symbol.</li><li>Remove an existing field using the ~ symbol.</li><li>When you&#39;re unsure if a value already exists, you can use the ++ symbol. This either appends the field or overrides the existing value.</li></ul><p>This functionality makes Hydra an ideal tool for experimenting with configurations and fine-tuning setups without risking changes to the base configuration.</p><p>But what if you’re <strong>unsure about the best value for a particular parameter</strong> and <strong>want to optimize it?</strong></p><h4>Multirun — Optuna Sweep</h4><p>In that case, Hydra’s integration with Optuna offers a powerful solution for hyperparameter optimization with minimal effort. By <strong>simply adapting the configuration file, as shown below, and adding</strong> --multirun on our command line, we can find the best parameters to use in our model that match the defined criteria.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0ZPBLRIBcmTRscmTiBVu7g.png" /><figcaption>Required changes in the configuration file to use multirun</figcaption></figure><p>As you can see in the image above, we first override the <em>hydra/sweeper</em> to <em>optuna</em> (1). Then we make the desired changes to the sweeper (2), where you can specify, for each metric you are optimizing (3), the respective optimization direction (either <strong>maximize</strong> or <strong>minimize</strong>) (4), and the parameters to optimize (5).
In the example, we are optimizing for <strong>both NDCG </strong>and<strong> NDCG@10</strong>, aiming to maximize these metrics while exploring the best values for learning_rate and n_estimators within a specific range.</p><p>Once the required changes are made, you can execute the script using the <strong>--multirun</strong> flag. This enables Hydra to perform multiple runs, systematically sweeping through the parameter space defined in your configuration.</p><pre>$ python -m ltr.train --multirun -cn multi_run_example</pre><p><strong>Tip:</strong> You can use the <strong>-cn</strong> flag (shorthand for --config-name) to choose a specific configuration and override the default one at runtime, allowing for more flexibility in your experiments.</p><h3>Conclusion</h3><p>Hydra’s powerful ability to handle complex configurations with flexibility, paired with its comprehensive documentation, makes it an excellent choice for complex projects that need a dynamic way to store and change their parameters.</p><p>For our team, embracing Hydra was a no-brainer, and it has become an integral part of our workflow. We rely on it extensively to preprocess data and train our different models — easily exploring numerous approaches with different configurations.</p><p>While this article highlights only the features we use most frequently, Hydra offers many other capabilities that we engage with every time we run our scripts, as well as others that we have yet to explore!</p><p><em>Made it this far? </em><strong><em>Hope you enjoyed the read!</em></strong><em><br></em>If <strong>OLX</strong> sparks your curiosity, there’s more waiting at our <a href="https://www.olxgroup.com/">website</a>.</p><p>🤝 You can <strong>find me on </strong><a href="https://www.linkedin.com/in/catarinapcqgon%C3%A7alves/"><strong>LinkedIn</strong></a>! — always happy to connect.</p><p>References</p><p>[1] Facebook AI Research, 2023. <em>Hydra documentation</em>. [online] Available at: <a href="https://hydra.cc/docs/intro/">https://hydra.cc/docs/intro/</a>.</p><p>[2] Facebook AI Research, 2023. <em>Hydra — GitHub repository</em>. [online] GitHub. Available at: <a href="https://github.com/facebookresearch/hydra">https://github.com/facebookresearch/hydra</a>.</p><p>[3] Optuna, 2023. <em>Optuna — Hyperparameter optimization framework</em>. [online] Available at: <a href="https://optuna.org/">https://optuna.org/</a>.</p><p>[4] XGBoost. <em>XGBoost documentation</em>. Available at: <a href="https://xgboost.readthedocs.io/">https://xgboost.readthedocs.io/</a>.</p><p>[5] CatBoost. <em>CatBoost documentation</em>. Available at: <a href="https://catboost.ai/en/docs/">https://catboost.ai/en/docs/</a>.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3f9e69263bb4" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/unlocking-flexibility-in-configuration-the-power-of-hydra-3f9e69263bb4">Unlocking Flexibility in Configuration: The Power of Hydra</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[From RankNet to LambdaMART: Leveraging XGBoost for Enhanced Ranking Models]]></title>
            <link>https://tech.olx.com/from-ranknet-to-lambdamart-leveraging-xgboost-for-enhanced-ranking-models-cf21f33350fb?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/cf21f33350fb</guid>
            <category><![CDATA[rankingsystem]]></category>
            <category><![CDATA[algorithms]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[model]]></category>
            <category><![CDATA[data-science]]></category>
            <dc:creator><![CDATA[Enderson Santos]]></dc:creator>
            <pubDate>Mon, 10 Feb 2025 16:12:17 GMT</pubDate>
            <atom:updated>2025-02-10T16:12:17.396Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QXhnt_P802Oi14kl" /><figcaption>Photo by <a href="https://unsplash.com/@santesson89?utm_source=medium&amp;utm_medium=referral">Andrea De Santis</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>In the ever-evolving world of machine learning, the journey from theory to application can be both fascinating and complex. This article aims to demystify part of this journey, focusing on two significant ranking algorithms: RankNet and LambdaRank. At first glance, these algorithms might seem daunting, but we’ll break them down into simpler terms, making them accessible to everyone.</p><p>RankNet and LambdaRank are at the heart of how computers learn to prioritize or “rank” items in a way that’s useful for us (Microsoft Research, 2010). We’ll dive into how XGBoost, a powerful machine-learning library, can be used to implement these ranking models. XGBoost stands out for its speed and efficiency, making it a popular choice among data scientists.</p><h3>Ranking is not a classification problem</h3><p>There are two main reasons why ranking is a unique type of problem in machine learning, not just another way to classify things. The first reason is that ads are linked to specific search queries. Imagine that we have the two ads below, and in our training data, both are relevant ads, meaning they have a label of 1 as our target. Just by using common sense, we can see that the first ad is for someone interested in an iPhone, and the second one is for someone looking for a car. We can’t say that both ads are relevant in general; they’re only relevant based on what the user is searching for.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*9Z50RPObo7rI-2WTaIKD7w.png" /><figcaption>OLX Ads 1</figcaption></figure><p>The second reason is that even when ads are related to the same search, the order in which we present them matters. What a user sees first can affect how they feel about the next ad. So, we can’t just decide if an ad is relevant by looking at its features alone during training. We have to ask, relevant compared to what? We need to compare ads to each other to understand their relevance. For example, imagine if the first ad is not relevant and the second one is. How would the first ad’s relevance change if the second ad wasn’t there? It might seem relevant then.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cr4IYnzVm8WK4rS_Fb4l2w.png" /><figcaption>OLX Ads 2</figcaption></figure><p>In classification problems, the focus is only on the features to make a prediction, and this method is known as a pointwise approach. However, in ranking problems, we need to compare ads with each other, and this is called a pairwise approach. We’ll explore how to do this later, but first, let’s talk about how we need to prepare our dataset before training our machine learning algorithms.</p><h3>Dataset Preparation</h3><p>Because of the first reason we have just talked about, we can’t just use the features of the ads and the target to train our models. We need to link the ads to a specific search query, which means adding another piece of information to our dataset. As shown in the example below, when creating our dataset, we need a column that acts as a unique identifier for each session, called the ‘ID Session’ below. 
Then, we need to record which ads were shown in that session, their features, and how relevant they were. So, in our training dataset, we’ll have several different listings (or sessions). In each listing, we’ll include all the ads that appeared in that session, the features of each ad, and a label indicating which ads were relevant and which were not (I’ll explain later how this column of relevance/label is built). The image below is just a simplified example of one single listing with only 3 ads.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*xjngJsbJgH-YTa3XsT0bag.png" /></figure><p>Now let’s move to training.</p><h3>RankNet with XGBoost</h3><p>After organizing the dataset, we’re ready to train our models. Below, there’s a code snippet showing how to do this with the XGBoost library. Remember, we’ve already split the dataset into training and validation sets. It’s important to split these randomly by session, so that all the ads from a given session end up in the same split.</p><p>To use the RankNet implementation, simply set &quot;objective&quot;: &quot;rank:pairwise&quot; in your model&#39;s configuration, as we do below.</p><pre>from xgboost import XGBRanker<br><br># training_features must contain the &#39;ID Session&#39; column;<br># the relevance label lives in a separate &#39;Relevance&#39; column.<br>X_train = df_train[training_features]<br>y_train = df_train.pop(&#39;Relevance&#39;)<br># qid tells XGBoost which rows belong to the same session (query),<br># so that pairs are only built within a session.<br>qid_train = X_train.pop(&#39;ID Session&#39;)<br><br>X_val = df_val[training_features]<br>y_val = df_val.pop(&#39;Relevance&#39;)<br>qid_val = X_val.pop(&#39;ID Session&#39;)<br><br>eval_set = [(X_train, y_train), (X_val, y_val)]<br><br>params = {<br>    &quot;random_state&quot;: 42,<br>    &quot;booster&quot;: &quot;gbtree&quot;,<br>    &quot;learning_rate&quot;: 0.04,<br>    &quot;max_depth&quot;: 3,<br>    &quot;min_child_weight&quot;: 1,<br>    &quot;max_delta_step&quot;: 0,<br>    &quot;subsample&quot;: 0.5,<br><br>    # L2 regularization<br>    &quot;reg_lambda&quot;: 1,<br>    # L1 regularization<br>    &quot;reg_alpha&quot;: 0,<br><br>    # Learning task parameters:<br>    &quot;n_estimators&quot;: 150,<br>    &quot;base_score&quot;: 0.5,<br>    # RankNet-style pairwise objective<br>    &quot;objective&quot;: &quot;rank:pairwise&quot;,<br>    &quot;eval_metric&quot;: &quot;ndcg&quot;,<br>}<br>fit_params = {&quot;qid&quot;: qid_train, &quot;eval_qid&quot;: [qid_train, qid_val]}<br><br>estimator = XGBRanker(**params)<br>estimator.fit(<br>    X=X_train,<br>    y=y_train,<br>    eval_set=eval_set,<br>    early_stopping_rounds=50,<br>    verbose=1,<br>    **fit_params,<br>)</pre><p>Training the model isn’t too complicated, but now let’s try to understand how the model addresses the second problem we mentioned by using a pairwise approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y0p60MgzisShPvivJfYheA.png" /><figcaption>function scores for each ad</figcaption></figure><p>Imagine we have three ads for a specific session, similar to the scenario described above. The first step is to gather the features of these ads, such as click-through rate (CTR), price, etc. Then, we input these features into a ranking function, in our case, an XGBoost model, which returns a score for each ad. Additionally, we need to know beforehand which ads are relevant, meaning their true relevance.</p><h4>How to Know the True Relevance?</h4><p>True relevance can be determined by manually labeling the ads. This means a person reviews the queries and ads to decide which ads are relevant to the queries and which are not. 
More commonly, we use feedback from users, such as their interactions with the ads, clicks, and likes, as a way to determine true relevance.</p><h4>What Is the Function Doing Under the Hood?</h4><p>From this point, we need to compare each ad against the others to calculate the probability of one ad being ranked higher than another. For example, we calculate the probability of ad 1 being ranked higher than ad 2, then the probability of ad 2 being ranked higher than ad 3, and finally, the probability of ad 1 being ranked higher than ad 3. Let’s say the model scores for each ad are as follows: s1 is 0.3, s2 is 0.1, and s3 is 0.6. We calculate these probabilities using the sigmoid function. By plugging the numbers into the formula:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/790/1*QchDn698ydMzYMpB-xEBMg.png" /></figure><p>We can determine the probability of one ad being ranked higher than another. The outcomes of these calculations are shown in the picture below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*55p7ioynRm83EGoHplclMg.png" /></figure><p>Now, we can use the true relevance to build the target for each pair of ads, that is, whether one ad should be ranked higher than the other. Let’s assume that ad 1 and ad 3 are relevant (based on user interactions), meaning they have a true relevance of 1, and ad 2 is irrelevant, so its true relevance is 0. We assign a value of 1 when comparing ad 1 to ad 2 because ad 1 is relevant and ad 2 is not. We assign a value of 0 when comparing ad 1 with ad 3 because they are both relevant. Finally, we assign a value of -1 when comparing ad 2 with ad 3 because ad 2 is less relevant than ad 3.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*mHOMJZJsm_s2-nBe0r8osw.png" /></figure><p>Now, we can incorporate this information into the loss function. The formula for the loss function is:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1014/1*yxjkdeu3VE2lB4amza5soA.png" /></figure><p>From here, the model will aim to minimize the number of incorrectly ranked pairs. This is the essence of a pairwise approach. It’s interesting to note that although we use a pairwise method to optimize the loss function, when the model makes predictions — meaning, when it calculates the score — it still uses a pointwise approach.</p><h4>Is a Lack of User Interaction a Sign of Ad Irrelevance?</h4><p>You might also consider that just because a user didn’t find the second ad relevant in one particular listing, it’s not fair to conclude that the ad is irrelevant overall. This is where the beauty of the approach comes into play. The same ad might appear in different listings, and it could be relevant in those other contexts. Therefore, the model doesn’t learn from just one instance of the ad being deemed irrelevant; it learns from the ad’s performance across all listings it appears in.</p><p>This method allows the model to develop a more nuanced understanding of an ad’s relevance. It doesn’t penalize an ad too harshly for not being relevant in one specific listing but instead evaluates its relevance based on its average performance across multiple listings. 
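</p><p>To make the pairwise mechanics above concrete, here is a minimal sketch (a toy illustration, not how XGBoost implements it internally) that reproduces the numbers from the figures: the sigmoid of the score differences for s1 = 0.3, s2 = 0.1, and s3 = 0.6, plus one standard form of the pairwise cross-entropy loss, where the true pair label is 1, 0, or -1 as described above.</p><pre>import math<br><br>def p_rank_higher(s_i, s_j):<br>    # Probability that ad i should be ranked above ad j:<br>    # the sigmoid of the score difference.<br>    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))<br><br>def pair_loss(s_i, s_j, true_pair_label):<br>    # true_pair_label is 1 (i more relevant), 0 (equally relevant)<br>    # or -1 (j more relevant); map it to a target probability.<br>    p = p_rank_higher(s_i, s_j)<br>    p_target = (1 + true_pair_label) / 2<br>    return -p_target * math.log(p) - (1 - p_target) * math.log(1 - p)<br><br>s1, s2, s3 = 0.3, 0.1, 0.6<br>print(p_rank_higher(s1, s2))  # ~0.550<br>print(p_rank_higher(s2, s3))  # ~0.378<br>print(p_rank_higher(s1, s3))  # ~0.426<br>print(pair_loss(s1, s2, 1))   # ad 1 relevant, ad 2 not<br>print(pair_loss(s1, s3, 0))   # both relevant<br>print(pair_loss(s2, s3, -1))  # ad 2 less relevant than ad 3</pre><p>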
This approach ensures that the model’s learning is robust and reflects the varying contexts in which an ad might be considered relevant or irrelevant.</p><p>Now, before exploring the implementation of LambdaRank in XGBoost and understanding how it enhances the RankNet algorithm, let’s see how we evaluate the quality of the ranked listing.</p><h3>Evaluating the Quality of the Ranked Listing</h3><p>Evaluating ranking quality involves using specific metrics that can measure how well the ranking algorithm has ordered the items in terms of their relevance to the query. We can use well-known metrics like precision and recall; however, the most common metric is Normalized Discounted Cumulative Gain (NDCG). These metrics help us understand the effectiveness of the ranking by considering factors like the position of relevant items and the graded relevance of results to the query. Understanding these evaluation metrics is crucial for assessing the performance of any ranking model, including those built with LambdaRank.</p><p>To use the NDCG rank metric in Python, you can utilize the scikit-learn library, which provides tools for calculating various evaluation metrics, including NDCG. The example below assumes a DataFrame with two columns: y_true for the true relevance labels, and model_score for the scores predicted by your model. To evaluate the NDCG for all listings at the top 10 positions (or any top k positions you prefer), you would calculate the NDCG for each listing, append each NDCG score to a list, and then calculate the average NDCG at the end. Here&#39;s a simplified example of how you might do this:</p><pre>import numpy as np<br>from sklearn.metrics import ndcg_score<br><br># &#39;listing&#39; is a DataFrame with one row per session: &#39;y_true&#39; holds the<br># true relevance labels and &#39;model_score&#39; the scores predicted by the model.<br>ndcg = []<br>for _, row in listing.iterrows():<br>    ndcg.append(ndcg_score([row[&quot;y_true&quot;]], [row[&quot;model_score&quot;]], k=10))<br><br>mean_ndcg = float(np.mean(ndcg))</pre><p>The NDCG metric compares the ideal ranking with the ranking we’ve obtained and measures how far we are from the ideal rank. This means it assesses how much we deviate from placing the most relevant ads at the top. NDCG takes into account not only the order of the ads but also the relevance of each ad, giving higher importance to the ads that appear at the top of the ranking. This way, it provides a comprehensive measure of the ranking quality, emphasizing the importance of having the most relevant ads as close to the top as possible.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*zevS4oFpGqc4kVQGXl5B2Q.png" /><figcaption>From evidently.ai</figcaption></figure><p>The RankNet model works by minimizing the number of incorrectly ordered pairs, and after obtaining the final sorting order, we evaluate the quality of the listing using a ranking metric like NDCG. It would indeed be highly beneficial if we could incorporate this rank quality metric information directly into the optimization of the loss function during training. This is where LambdaRank comes into play.</p><h3>LambdaRank</h3><p>LambdaRank builds upon the foundation laid by RankNet but with a significant enhancement: it incorporates the changes in a ranking quality metric (like NDCG) directly into the loss function (Microsoft Research, 2010). This means that during the training process, LambdaRank not only learns to correctly order pairs but also optimizes for the overall quality of the ranking in terms of a specific metric. 
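</p><p>The core trick can be sketched in a few lines: for every pair of ads, compute how much NDCG would change if the two swapped positions, and use that delta to weight the pair’s RankNet gradient. The snippet below is a simplified illustration of this idea under standard DCG definitions, not XGBoost’s actual implementation:</p><pre>import math<br><br>def dcg_gain(relevance, position):<br>    # Contribution of a single ad to DCG (positions start at 1).<br>    return (2 ** relevance - 1) / math.log2(position + 1)<br><br>def delta_ndcg(relevances, ideal_dcg, i, j):<br>    # How much NDCG changes if the ads at positions i and j swap.<br>    before = dcg_gain(relevances[i], i + 1) + dcg_gain(relevances[j], j + 1)<br>    after = dcg_gain(relevances[j], i + 1) + dcg_gain(relevances[i], j + 1)<br>    return abs(after - before) / ideal_dcg<br><br># Current ranking: a relevant ad at the top, another near the bottom.<br>relevances = [1, 0, 0, 1]<br>ideal = sum(dcg_gain(r, pos + 1)<br>            for pos, r in enumerate(sorted(relevances, reverse=True)))<br><br># LambdaRank scales each pair&#39;s gradient by |delta NDCG|, so pairs<br># involving top positions get the bigger push.<br>print(delta_ndcg(relevances, ideal, 0, 1))  # swap at the top: ~0.23<br>print(delta_ndcg(relevances, ideal, 2, 3))  # swap at the bottom: ~0.04</pre><p>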
By doing so, LambdaRank directly targets improvements in the ranking metric, making the training process more aligned with the ultimate goal of producing the highest-quality ranking.</p><p>This approach allows for a more efficient and effective optimization process, as the model is directly guided by the impact of its predictions on the ranking quality metric. As a result, LambdaRank can lead to better performance in ranking tasks compared to models that do not directly optimize for these metrics.</p><p>To use the LambdaRank implementation, simply set &quot;objective&quot;: &quot;rank:ndcg&quot; in your model&#39;s configuration, and that&#39;s it. It really is that easy.</p><p>To understand the difference, let’s take a look at the image below:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gQFi-C4xPuFRaOoinkKowg.png" /><figcaption>Microsoft Research (2010). <em>From RankNet to LambdaRank to LambdaMART: An Overview</em>. Retrieved from <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf">https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf</a></figcaption></figure><p>In the scenario described, we have two different listings. On the left, there’s a relevant ad (depicted in blue) positioned at the top, which is our goal. However, this listing has 13 incorrectly ordered pairs of ads. On the right, there are only 11 incorrectly ordered pairs, but there’s no relevant ad at the top. RankNet would consider the second listing better than the first because it has fewer wrong pairs, and during the optimization phase, it would give more emphasis (indicated by the black arrow) to boosting the second relevant ad at the bottom. This is because doing so helps minimize the number of wrong pairs more effectively.</p><p>On the other hand, LambdaRank operates differently. It would give a higher boost to the first relevant ad (indicated by the red arrow) because it’s closer to the top. This approach aligns more closely with our objective of having a relevant ad in the top positions.</p><h4>But Is LambdaRank a Listwise Approach?</h4><p>One final point to note is that many people consider LambdaRank to be a listwise approach because it directly optimizes the ranking of a list of items as a whole, taking into account the entire list during training. While it’s true that LambdaRank still compares pairs of ads, which might suggest a pairwise approach, it is typically classified under listwise approaches. This classification is due to its listwise objective and the way it leverages information from the entire list during the training process. However, it’s not a pure listwise approach because it does not compare all the ads in the listing simultaneously.</p><h3>Conclusion</h3><p>In conclusion, this article began as a reflection on the questions and challenges I encountered when I first delved into the world of ranking problems. My journey from grappling with the basics of RankNet to exploring the advanced capabilities of LambdaMART and leveraging XGBoost for enhanced ranking models has been both enlightening and rewarding. 
Through this exploration, my primary aim has been to distill and share the insights and knowledge I’ve gained in a manner that is accessible and useful to others facing similar challenges.</p><p>Whether you’re just starting or looking to deepen your understanding of ranking models, I hope this article serves as a valuable resource that helps demystify some of the complex concepts in machine learning for ranking tasks. Remember, the journey through machine learning is continuous, and each challenge we encounter is an opportunity to learn, grow, and contribute to the broader community’s knowledge. If you find this article useful don’t forget to clap 👏</p><h3>References</h3><ol><li>Microsoft Research. (2010). <em>From RankNet to LambdaRank to LambdaMART: An Overview</em>. Retrieved from <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf">https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf</a></li><li>XGBoost Documentation. (n.d.). <em>Parameters</em>. Retrieved from <a href="https://xgboost.readthedocs.io/en/stable/parameter.html#:~:text=to%20each%20class.-,rank%3Andcg,-%3A%20Use%20LambdaMART%20to">https://xgboost.readthedocs.io/en/stable/parameter.html#:~:text=to%20each%20class.-,rank%3Andcg,-%3A%20Use%20LambdaMART%20to</a></li></ol><p>…</p><p>Thank you very much for reading this article, and if you want to know more about OLX, don’t forget to check our website 😃 -&gt; <a href="https://www.olxgroup.com/">https://www.olxgroup.com/</a></p><p>This article was written by Enderson Santos!</p><p>Linkedin: <a href="https://www.linkedin.com/in/enderson-santos-wf/">https://www.linkedin.com/in/enderson-santos-wf/</a><br>Instagram: <a href="https://www.instagram.com/endersonsantoswf/">https://www.instagram.com/endersonsantoswf/</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=cf21f33350fb" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/from-ranknet-to-lambdamart-leveraging-xgboost-for-enhanced-ranking-models-cf21f33350fb">From RankNet to LambdaMART: Leveraging XGBoost for Enhanced Ranking Models</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Generative AI at AWS Summit Madrid 2024: An OLX Engineer’s Perspective]]></title>
            <link>https://tech.olx.com/generative-ai-at-aws-summit-madrid-2024-an-olx-engineers-perspective-4e9fcc77ca32?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/4e9fcc77ca32</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[generative-ai]]></category>
            <category><![CDATA[genai]]></category>
            <category><![CDATA[aws-summit]]></category>
            <category><![CDATA[llmops]]></category>
            <dc:creator><![CDATA[Juan Francisco Pinera]]></dc:creator>
            <pubDate>Mon, 20 Jan 2025 16:32:19 GMT</pubDate>
            <atom:updated>2025-01-20T16:32:19.209Z</atom:updated>
            <content:encoded><![CDATA[<p>At the Madrid 2024 AWS Summit, the vibe was tremendous! IFEMA was visited by thousands of cloud aficionados, tech executives, and developers who were keen to learn about the newest advancements shaping the future of technology. I was especially eager to explore the generative AI (GenAI) field as an engineer at OLX and see how these cutting-edge tools are being used to address practical issues and deliver a better, more fulfilling experience to customers.</p><figure><img alt="Popcorn machine inside one of the partners’ stands" src="https://cdn-images-1.medium.com/max/1024/1*E8uWfVuyXjI0_RWeQoD5iA.jpeg" /><figcaption>Popcorn machine inside one of the partners’ stands</figcaption></figure><p>It was evident that GenAI was the main attraction as soon as I entered the keynote hall. As presenters revealed brand-new services, discussed real-world use cases, and painted a picture of an AI-transformed future, there was an unmistakable energy in the air. But the summit was more than just hype; it provided a valuable blend of big-picture vision and practical, actionable insights that I could take back to different teams at OLX.</p><p><strong>Bedrock: Your Serverless Gateway to Limitless Possibilities</strong></p><p>A standout of the summit was AWS Bedrock, a fully managed service that has become essential for developing and deploying GenAI applications. Its serverless approach simplifies infrastructure management, allowing developers to concentrate on innovation.</p><p>Bedrock’s selling point is its array of foundation models (FMs), including Amazon’s Titan FMs and those from AI21 Labs, Anthropic, and Stability AI, all accessible through a single interface. This seamless integration with the broader AWS ecosystem, including services like Amazon S3 for data storage and Amazon SageMaker for deploying models at scale, empowers OLX to build robust, efficient, and scalable GenAI solutions.</p><p><strong>LLMOps in Action: From Experimentation to Production-Ready AI</strong></p><p>A memorable session was “Operacionalizando aplicaciones de IA generativa usando LLMOps, con Securitas Direct” (“Operationalizing Generative AI Applications Using LLMOps, with Securitas Direct”), which went beyond showcasing GenAI’s capabilities to address the real challenges of deploying and managing these technologies in practical settings.<br>The team from Securitas Direct discussed their use of LLMOps to enhance the reliability, scalability, and ongoing improvement of their GenAI solutions. They emphasized three pillars:</p><ul><li><strong>Robust model monitoring and management:</strong> Keeping a close eye on model performance in production is crucial to catch issues early and ensure optimal results.</li><li><strong>Automated pipelines for model retraining and deployment:</strong> As new data becomes available, models must be retrained and redeployed efficiently to stay accurate and relevant.</li><li><strong>Collaboration between data scientists, engineers, and operations teams:</strong> Breaking down silos between teams is essential for building and managing successful AI solutions.</li></ul><p>At OLX, we are committed to integrating LLMOps into our GenAI projects, knowing that a solid framework is crucial for transitioning from experimental to production-ready AI. 
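</p><p>To ground the earlier point about Bedrock exposing many foundation models through a single interface: in practice, switching providers is mostly a matter of changing a model ID. Below is a minimal, illustrative sketch using boto3’s Converse API (the region, model ID, and prompt are placeholders, and valid AWS credentials with Bedrock access are assumed):</p><pre>import boto3<br><br># Placeholder region; any Bedrock-hosted foundation model can be<br># addressed through the same client and API.<br>client = boto3.client(&quot;bedrock-runtime&quot;, region_name=&quot;eu-west-1&quot;)<br><br>response = client.converse(<br>    modelId=&quot;anthropic.claude-3-haiku-20240307-v1:0&quot;,  # swap to try another FM<br>    messages=[<br>        {&quot;role&quot;: &quot;user&quot;,<br>         &quot;content&quot;: [{&quot;text&quot;: &quot;Write a short title for a used-bike listing.&quot;}]},<br>    ],<br>)<br>print(response[&quot;output&quot;][&quot;message&quot;][&quot;content&quot;][0][&quot;text&quot;])</pre><p>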
It did not take long until we started building the MVP for our internal GenAI Platform, following LLMOps best practices. Stay tuned for more updates!</p><p><strong>Inspiring Use Cases: Cross-sector GenAI</strong></p><p>The summit highlighted several innovative GenAI applications across various industries, including:</p><ul><li><em>“IA generativa: tu asistente de ventas low-cost, con Screening Eagle”</em> (“Generative AI: Your Low-Cost Sales Assistant, with Screening Eagle”): This session explored how GenAI can be used to automate and enhance sales processes, providing a powerful tool for businesses to reach new customers and increase revenue. The presentation showcased how Screening Eagle uses GenAI to generate personalized sales pitches, suggest relevant products based on customer interactions, and even automate follow-up communications, leading to a more efficient and effective sales process.</li><li><em>“IBM: Acelerando el desarrollo de aplicaciones web con GenAI”</em> (“IBM: Accelerating Web Application Development with GenAI”): This talk from IBM provided a deep dive into how developers can leverage GenAI tools and frameworks to build faster, more efficient, and more engaging web applications. The presentation showcased how IBM is using GenAI to assist with code generation, automate testing, and even personalize user interfaces, demonstrating the potential of this technology to streamline development workflows and improve the user experience.</li></ul><p>At OLX, we are also exploring GenAI&#39;s capabilities through projects like OLX Magic and Agentic Buying and Selling.</p><ul><li><strong>OLX Magic</strong> utilizes GenAI to assist buyers in finding relevant items, providing personalized recommendations, and automating parts of the purchasing process.</li><li>Similarly, <strong>Agentic Buying and Selling</strong> aims to enable continuous buying and selling capabilities, reducing the need for manual interactions and enhancing operational efficiency.</li></ul><p>These internal initiatives align with the trends and innovations discussed at the summit, reinforcing the importance of staying abreast of the latest developments in GenAI. By learning from external successes and integrating these insights into our own projects, OLX can continue to innovate and provide superior user experiences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cRi9GtNtDFFEnwsc5WdzJg.png" /><figcaption>How GenAI can improve processes across all stages of the supply chain</figcaption></figure><p><strong>OLX: Harnessing GenAI to Build a Better Marketplace</strong></p><p>Attending the AWS Summit Madrid 2024 wasn’t just about absorbing information; it was about bringing back actionable insights that can make us improve as developers and as a company at OLX. We’re excited to leverage the power of GenAI to build better products and enhance user experiences. But these could be improved even further and some ideas could be applied to our company:</p><ul><li><strong>Develop a Deeper Understanding of User Intent for Improved Search:</strong> The presentation by Clarity AI, <em>“Optimizando costes y performance en despliegue de modelos fundacionales”</em> (“Optimizing Costs and Performance in the Deployment of Foundational Models”), was incredibly valuable. They discussed the practical considerations for deploying and optimizing AI models, which are crucial for building an intelligent search engine. At OLX, we are always looking for new ways to improve the search experience and accuracy. 
Nowadays, simple keyword matching is no longer sufficient, as more advanced techniques, such as using embeddings for semantic search, are available. The integration of large language models (LLMs) into the core search flow is not straightforward and involves several challenges, including latency, costs, and specific tasks within the information retrieval pipeline. Amazon SageMaker was highlighted as a tool that can help manage these complexities by offering various inference options and optimizing costs.</li><li><strong>Empower Sellers with AI-Powered Listing Creation Tools:</strong> The <em>“IA generativa: tu asistente de ventas low-cost”</em> (“Generative AI: Your Low-Cost Sales Assistant”) session with Screening Eagle showcased some really cool ways to use GenAI to assist with tasks like content creation and product recommendations. We’re investigating how to adapt these techniques to create AI-powered tools that would lead buyers and sellers at OLX toward a smoother and more successful experience.</li><li><strong>Enhance Fraud Detection and Prevention:</strong> The session <em>“Más allá del Boom: construyendo una arquitectura de IA Generativa”</em> (“Beyond the Boom: Building a Generative AI Architecture”) was a fantastic deep dive into the architectural considerations for building robust and scalable GenAI solutions. By leveraging the power of LLMs, a system that learns and adapts to new fraudulent tactics can be created, keeping our users safe and building trust in our platform.</li></ul><p><strong>Educational Opportunities: AWS Skill Builder and Certification Programs</strong></p><p>The summit also offered extensive learning resources and certification preparation sessions, which were invaluable for attendees looking to deepen their AWS and AI expertise. These resources are crucial for anyone aiming to advance their skills in cloud and AI technologies. The AWS Skill Builder sessions, with their diverse range of topics and formats, were a must. They were targeted at learners of all levels preparing for various certifications. From introductory courses to deep dives into specific services, AWS experts were keen to help you achieve your goals and answer your questions.</p><p>The certification prep talks were also a highlight. Expert instructors provided detailed guidance on exam topics, test-taking strategies, and practical tips for increasing your chances of success.</p><p><strong>Gamification as Core:</strong><br>In the world of cloud services, keeping up with the latest products and learning how they work can be really hard. This is where gamification comes in, making learning fun and exciting. At AWS Summit Madrid 2024, many companies demonstrated how they use gamification to showcase their products and smooth out the learning curve.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*LUXyizluJOOU59Xs2fct2Q.png" /><figcaption><a href="https://d1.awsstatic.com/events/Summits/reinvent2022/CMP001_Spot-invaders-A-fault-tolerant-chaos-engineering-game.pdf">Spot Invaders: Chaos engineering gamification arcade</a></figcaption></figure><p>Cloud services have become so big and complicated that old-fashioned ways of learning just don’t work anymore. So, companies have started using games and fun activities to teach people. For example, AWS Skill Builder has games, quizzes, and challenges to make learning more fun and effective. 
<strong>Gamification has several cool benefits</strong>:</p><ul><li><em>More Engagement</em>: Turning learning into a game makes people more interested and motivated.</li><li><em>Better Memory</em>: Fun and interactive learning helps people remember things better.</li><li><em>Continuous Improvement</em>: Gamified tools give instant feedback, helping learners see where they need to improve and track their progress.</li></ul><p>Adding gamification to our learning isn’t just a trend; it’s a smart move to create a culture of continuous learning and innovation. As we enter the era of GenAI, gamification will be super important in helping our team learn the skills and knowledge they need to succeed; and products that incorporate it into their learning process may be favored over those that don’t.</p><p><strong>OLX: Embracing the GenAI Era</strong></p><p>The 2024 AWS Summit Madrid reaffirmed the direction of our strategic approach at OLX. We left the summit aligned with our plans to leverage GenAI and AWS technologies to further enhance our operations and products. AI is central to the future, and by continuing to integrate it across all stages of our processes, we are well-positioned to achieve our productivity goals efficiently.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4e9fcc77ca32" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/generative-ai-at-aws-summit-madrid-2024-an-olx-engineers-perspective-4e9fcc77ca32">Generative AI at AWS Summit Madrid 2024: An OLX Engineer’s Perspective</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Experimenting with Agile: Transforming Team Workflows]]></title>
            <link>https://tech.olx.com/experimenting-with-agile-transforming-team-workflows-e880d2805662?source=rss----761b019b483f---4</link>
            <guid isPermaLink="false">https://medium.com/p/e880d2805662</guid>
            <category><![CDATA[ways-of-working]]></category>
            <category><![CDATA[team-dynamics]]></category>
            <category><![CDATA[management]]></category>
            <category><![CDATA[experimentation]]></category>
            <category><![CDATA[agile]]></category>
            <dc:creator><![CDATA[Sara Mendes]]></dc:creator>
            <pubDate>Wed, 08 Jan 2025 16:32:31 GMT</pubDate>
            <atom:updated>2025-01-08T16:32:31.771Z</atom:updated>
            <content:encoded><![CDATA[<p>Discover how agile experimentation improved delivery, fostered collaboration, and aligned our team for success.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*cyW8gWQmtm-1Sjh3uhvqnw.jpeg" /><figcaption>Photo by <a href="https://www.pexels.com/photo/group-of-men-paddling-while-inside-inflatable-boat-1732281/">Tom Fisk</a></figcaption></figure><p>Let’s face it — development today moves at lightning speed, and consistently delivering high-quality results can be tough. We’ve all been there: watching deadlines slip by, juggling unexpected bugs, feeling the pressure rise, and managing last-minute changes that throw everything off track.</p><p>At OLX, we acknowledge these challenges and remain committed to continuous improvement, always seeking better ways to work and thrive. By fine-tuning our processes through experimentation, adopting agile methodologies, and making targeted adjustments, we were able to turn things around and see real progress. In this article, I’ll share the experiments our team has tried, how they were implemented, what did not work, what led to significant improvements in our delivery, and how we kept the team motivated and engaged throughout the journey.</p><h3>The Power of Experimentation</h3><p>Experimentation is a remarkable teacher and has been one of the cornerstones of how we improve at OLX. It allows us to test our assumptions, learn from failures, and discover what truly drives success. Embracing curiosity and being open to discovering new possibilities, even when the path ahead is uncertain, transforms challenges into opportunities for growth. By testing different approaches and learning from each outcome, we build resilience and adaptability, two qualities that are essential in an ever-evolving environment. In our journey to improve delivery performance, experimentation played an important role. <br>Each experiment, whether successful or not, gave us valuable insights into what works best for OLX’s diverse, multifunctional teams. Our focus has always been on identifying what delivers the most value for the business while keeping the team aligned and motivated.</p><h4>Sprint Goals: A Shift in Perspective</h4><p>With this experimental mindset, we focused on how we were managing our sprint goals within Scrum. Our team had continuously operated within a Scrum framework — perhaps not in its purest form but closely aligned with its core principles. This approach seemed to work well enough, as there were no significant complaints. Initially, our projects were true team collaborations; everyone worked together toward common goals. However, over time, this began to change. Projects became more fragmented, with backend, frontend, and app members working on separate initiatives. As a result, we lost a sense of unified team objectives. Scrum sprint goals started to resemble grocery lists, reduced to little more than ticket titles.</p><p>One of our experiments involved leaving them undefined at the start of a sprint. Curious about how it turned out? Keep reading!</p><h4>From Scrum to Kanban and Back Again</h4><p>Faced with multiple initiatives and feedback from the team about too many meetings and not enough focus time, we decided in a Ways of Working (WOW) workshop to switch from Scrum to Kanban. 
This change aimed to reduce meetings, increase focus, and speed up delivery.</p><p>At first, the shift seemed promising, but it soon became challenging for developers to know what to work on next and stay connected with the team. For the Product Manager, it was equally challenging to communicate the progress and status of tasks being worked on and to ensure alignment with the shared strategy. This led to a more fragmented team, with members focused only on completing their tasks, unaware of what others were doing, even though the whole team still attended the dailies. This experience highlighted the importance of visibility and collaboration in our workflow.</p><p>Recognizing these challenges, we eventually moved back to Scrum but made some adjustments: we dropped sprint goals and story points in favor of a more flexible approach. This hybrid method allowed us to retain the benefits of structured sprints while fostering a more collaborative environment.</p><h4>Rethinking Daily Syncs: How Async Made a Difference</h4><p>At OLX, we place a high value on ensuring alignment across our multifunctional teams. In response to feedback about ‘too many meetings’, our team decided to experiment with async daily updates. We set up a Slack bot that sends a message daily at 8 am CET, prompting team members to share their updates in a thread. This approach proved especially useful for team members returning from vacation, as they could easily catch up on what had been happening, including blockers and deliveries. Since the team found it effective, we made it a permanent part of our routine.</p><h3>Learning Through Agile Training and Experimenting More</h3><p>Building on the insights gained from our experiments, we learned that there is no one-size-fits-all approach. Each team must find its rhythm by being open to experimentation and adaptable to change. Despite these changes and improvements in teamwork, we still faced challenges in delivery performance. Our cycle time was longer than expected; one person predominantly handled testing, and the team was more focused on individual tasks. We also had many tickets in progress during each sprint, leading to constant context switching. While async dailies were effective, there were still days when team members forgot to share their updates, leading to miscommunication and delays in resolving issues.</p><p>At OLX, fostering growth and empowering teams to succeed is a core priority. To support this mission, OLX invited <a href="https://www.agile42.com/en/">Agile42</a>, a company specializing in agile coaching, to collaborate with us. 
<strong>The training sessions provided more than just the basics of agile principles — they encouraged us to rethink our approach to teamwork and problem-solving.</strong></p><p>Rather than just presenting us with what to change to improve our metrics and delivery, the coaches posed questions and scenarios that got us to consider:</p><ul><li>What would a more collaborative approach look like?</li><li>How could we optimize our process to reduce cycle time?</li><li>How could we improve quality and remove distractions?</li></ul><p>This approach forced us to look beyond our usual ways of working and opened the door to more experimentation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PqjpqhhSN9TpX6lRaPI4XQ.jpeg" /><figcaption>Photo by <a href="https://www.pexels.com/photo/group-of-men-river-rafting-14189808/">TonyNojmanSK</a></figcaption></figure><h4>Putting Agile Training into Practice: Our Initiatives and Improvements</h4><p>The collaboration with Agile42 provided us with a strong foundation in agile principles, but its real impact came when we began experimenting with our processes. Rather than merely following agile practices by the book, we tailored these principles to address the specific challenges we were facing. Here’s how we applied the training to make concrete improvements:</p><p><strong>Reintroducing Sprint Goals</strong></p><p>Remember how I mentioned earlier that, when we transitioned back from Kanban to Scrum, we initially decided to stop setting sprint goals? Team members were often focused on different projects, and the goals we created looked like grocery lists. However, through discussions with the Agile42 team and with a deeper understanding of the importance of sprint goals, we realized that not everyone needs to work directly on the goal-related project. Instead, each member can contribute in other ways, such as testing or code reviews, ensuring team collaboration toward achieving a common goal.</p><p>This new perspective allowed us to set meaningful sprint goals that unified the team’s efforts, even if some roles weren’t directly involved in the project. Now, sprint goals serve as a shared purpose for each sprint, helping us prioritize effectively and stay aligned on what we aim to release by the end of each cycle.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/944/0*F0U4C1eGI-QmuSBu" /><figcaption>Old vs. New Sprint Goals</figcaption></figure><p>As shown in the image above, in the old sprint goal, each role had its own focus, creating siloed efforts even within the same domain. For example, while Backend and Frontend were both working in the same domain, their goals were separate and redundant, leading to a lack of collaboration. Similarly, the App members were pursuing an unrelated task, diluting the focus of the sprint.</p><p>The new sprint goal, however, unifies the team under one clear objective: <em>finalizing the integration of the Chat Service</em>. While the development work primarily centered on the backend, the contributions from the frontend and app members — through strategic discussions and testing — were vital to the sprint’s success. This unified approach enabled everyone to prioritize, fostered collaboration across roles, and ensured that the team worked toward delivering meaningful value by the end of the sprint.</p><blockquote><em>“A sprint goal gives the team a shared objective and a focus on something bigger than the individual stories. 
Without it, sprints can feel like just a bunch of disconnected tasks.” — Mike Cohn, Co-Founder of Scrum Alliance Inc.</em></blockquote><p><strong>Revamping Daily Meetings for Better Engagement</strong></p><p>After discussing our async dailies with the agile team, we learned that having no synchronous meetings could create potential issues. They pointed out that, without any sync dailies, team members could lose sight of the Sprint Goal, as it’s harder to maintain alignment when there isn’t regular conversation around how individual tasks connect to the broader objective. Additionally, relying entirely on async updates could lead to a disconnect within the team, as members might miss opportunities to review the board together and stay updated on each other’s progress.</p><p>In a retrospective meeting, the team decided to experiment with a mixed daily schedule. We implemented sync meetings on Mondays, Wednesdays, and Fridays to review progress face-to-face and go through the board together. These sessions started with a check-in on our sprint goals, helping everyone see where we stood and whether any blockers were preventing us from reaching them. Async dailies took place on Tuesdays and Thursdays — our designated no-meeting days — to allow for more focused work time.</p><p>This balanced approach gave us the benefit of face-to-face updates and board reviews while still preserving focus time on async days. It allowed us to stay aligned on sprint goals without overwhelming the team with daily meetings.</p><p><strong>Fostering Team Collaboration in Testing and Code Reviews</strong></p><p>Collaboration is at the heart of OLX’s agile teams, and with it the idea that quality is a shared responsibility, a concept that resonated with us as a team. To put this into practice, we adopted a new approach: each team member picks only one ticket from the “To Do” column at a time and doesn’t start a new one until the current task is fully completed. This approach reduces context-switching, especially if a ticket that is “In Review” or “In Testing” needs to return to “In Progress” for adjustments. It also helps maintain workflow consistency and delivery predictability. When too many tasks are started at once, it clogs the pipeline, leading to delays and making it harder to accurately gauge completion times.</p><p>By focusing on one task at a time, each team member takes full accountability for their current work, minimizing distractions from partially started tasks. This approach also promotes collaboration in testing and code reviews; if a team member is waiting for their code to be reviewed or tested, they can look at the board to see if there’s a ticket from another team member that needs attention. This collaboration helps ensure steady sprint progress. If no other tickets are pending review or testing, team members have the opportunity to focus on their Individual Development Plans (IDPs) until their ticket progresses.</p><p><strong>Reintroducing Catchup Meetings for Alignment and Metric Sharing</strong></p><p>Transparency and alignment are key to OLX’s success. Maintaining a consistent rhythm of feedback and transparency strengthens collaboration, improves delivery, and keeps the team engaged and motivated.</p><p>With this in mind, we reintroduced catchup meetings at the start of each month. These sessions provide an opportunity to discuss OKR progress, review bugs, share key metrics, and cover other relevant updates, fostering transparency and accountability across the team. 
To support this, we also began using Jira’s Control Chart to visualize our cycle and queue times, which allowed us to track improvements (or setbacks) over time and spot outliers. This data-driven approach has helped us identify bottlenecks early, enabling more informed planning decisions and driving continuous improvement.</p><h3>Wrapping It Up: Turning Challenges into Wins at OLX</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H9Yaa_lNHK2t_OH1" /><figcaption>Photo by <a href="https://www.pexels.com/photo/people-on-rafting-adventure-19442194/">David Hadley</a></figcaption></figure><p>Dealing with the challenges of modern software development requires flexibility, adaptability, and a constant drive to improve. Our journey — from trying out undefined sprint goals to adopting hybrid daily syncs and leveraging agile coaching — proved how important it is to stay flexible and learn from wins and setbacks.</p><p>By embracing experimentation with different approaches and seeking customized solutions, we set the stage for meaningful changes in our processes and team dynamics. The lessons learned along the way provided us with clarity and direction, helping us tackle the root causes of inefficiencies and foster a stronger culture of collaboration and accountability.</p><p>The results of these efforts speak for themselves. We not only <strong>improved our delivery performance</strong> — by reducing cycle times, better aligning priorities, and minimizing context switching — but also <strong>created a more cohesive and connected team</strong>. For example, our cycle time saw significant improvement, dropping from a median of 4 days in FY24 Q4 to just 2 days in FY25 Q3. Additionally, the number of delivery tickets has been growing steadily, with monthly increases ranging between 3–6%. This growth fluctuates depending on the time of year, as months with more team members on holiday tend to see a slight decrease in delivery volume.</p><p>Introducing meaningful sprint goals brought a shared purpose to our sprints, while the hybrid daily syncs balanced alignment and focus. Collaborative testing and code reviews further strengthened our teamwork, ensuring everyone felt involved in achieving success.</p><p>At the end of the day, these changes have transformed the way we work, proving that experimenting with agile principles and embracing change can lead to significant improvements. While challenges will continue to arise, we are now better equipped to face them together. However, as we’ve navigated through this journey of agile experimentation, we’ve also learned that making these changes requires more than just process adjustments. Managing teams under pressure is a key factor that we can’t overlook. Even with continuous improvements to our processes, we must stay mindful of how pressure can impact team dynamics and performance. 
To explore how to manage teamwork under pressure, check out my previous post on <a href="https://tech.olx.com/managing-teamwork-under-pressure-c08724346d18">Managing Teamwork Under Pressure</a>.</p><p>Thank you for reading!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e880d2805662" width="1" height="1" alt=""><hr><p><a href="https://tech.olx.com/experimenting-with-agile-transforming-team-workflows-e880d2805662">Experimenting with Agile: Transforming Team Workflows</a> was originally published in <a href="https://tech.olx.com">OLX Engineering</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>