Grab Tech

Enabling R8 optimization at scale with AI-assisted debugging

Thu, 12 Mar 2026 00:23:00 +0000

Grab is Southeast Asia’s leading superapp, providing a suite of services that bring essential needs to users throughout the region. Its offerings include ride-hailing, food delivery, parcel delivery, mobile payments, and more. With safety, efficiency, and user-centered design at heart, Grab remains dedicated to solving everyday issues and improving the lives of millions. As our app continued to expand, we identified platform-level performance challenges that were affecting user experience across the board. In this article, we share how we successfully enabled R8 optimization for the Grab Android app, achieving significant improvements in app size, startup time, and stability through innovative AI-assisted debugging techniques.

Introduction

Since 2024, our team observed a concerning trend: Application Not Responding (ANR) rates were spiking across the Grab app. Unlike typical isolated issues, the data revealed that ANRs were happening everywhere, not confined to specific features or modules. This pattern pointed to platform-level causes, with our analysis showing strong correlations between ANRs and several factors: memory pressure (particularly when garbage collection was triggered), ad-heavy user flows, complex layouts involving Jetpack Compose embedded within XML layouts, and XML views embedded within Compose code.

The Android community had long proven that R8 optimization (beyond basic code shrinking) could deliver substantial performance gains and app size reductions. As Grab has been adopting Jetpack Compose over the last two years, Google’s Jetpack Compose performance documentation specifically recommends R8 optimization for Compose-heavy applications. It became particularly relevant, making it a natural solution for our systemic performance issues.

In fact, enabling R8 optimization was not a new idea for our team. It had been identified and flagged as a high-impact solution multiple times over the years, yet each attempt fell short. Here’s why.

The challenge at scale

Our app operates at scale, with over 9 million lines of code and 100+ engineers working on it daily. While we had basic R8 shrinking enabled, advanced optimization had proven challenging despite multiple attempts over six years (with different tools and approaches over the years).

In 2022, we almost made it - successfully rolling out R8 optimization to GEA (our early access build), but unfortunately, we faced critical roadblocks that compelled us to put the project on hold. After analyzing our previous attempts and the current project situation, we identified three fundamental challenges that had to be solved simultaneously.

This article details how we overcame each challenge through targeted innovations: AI-Assisted Debugging for slow investigation cycles, Pragmatic Testing Strategy for validation at scale, and Optimized Feedback Loop for rapid iteration.

Understanding R8 optimization

Before diving into our solution, it’s important to understand what R8 optimization actually provides beyond basic R8 shrinking.

Figure 1. The R8 processing pipeline involves multiple interconnected phases that transform, analyze, and optimize code. Understanding this complexity helps explain both the benefits of optimization and why debugging issues become challenging.

What we had in place

With minifyEnabled=true and shrinkResources=true using proguard-android.txt, we already had:

Tree shaking (shrink phase): Removes unused/unreachable code.
Code minification (obfuscation): Renames classes/methods to short names.
Resource shrinking: Removes unused XML files and drawables.
Desugaring: Java 8+ compatibility.

What’s new with optimization

By switching to proguard-android-optimize.txt, we gained access to:

Method inlining: Replaces method calls with actual code, reducing call overhead.
Class merging: Makes code more compact by combining similar classes.
Constant folding: Pre-computes constant expressions at compile time.
Dead code elimination: More aggressive than tree shaking, removes unreachable branches.
Devirtualization: Converts virtual calls to direct calls when possible.

These optimizations work together to improve runtime performance while significantly reducing app size.

Three core challenges

With context on R8 optimization benefits, why is enabling it so difficult at Grab’s scale? After analyzing our previous failed attempts and the current project situation, we identified three fundamental challenges that had to be solved simultaneously.

Challenge 1: Slow debugging

R8 optimization issues are notoriously difficult to debug:

Code is obfuscated, class names become a, b, c.
Code is modified, inlined, merged, and optimized beyond recognition.
Stack traces are unreadable without proper mapping files when crashes occur.
Pinpointing the root cause requires manual reverse engineering.

Our limited resources compound the challenge: with only one engineer leading the project, most issues had to be either addressed directly or have solutions provided for other teams to fix. Manual decompilation, deobfuscation, and context gathering for each issue are inherently time-consuming, making the investigation cycle slow.

Challenge 2: Testing at scale

R8 optimization affects every corner of the app. Unlike feature-specific changes, enabling optimization transforms how the entire codebase is compiled, inlined, and optimized. A single misconfiguration or missing keep rule can break seemingly unrelated features across different modules and libraries.

When we first enabled R8 optimization, the impact was immediate and widespread: most of the app’s features simply stopped working correctly. This presented us with a deeper problem, not just how to test, but what kind of testing strategy would actually give us confidence to roll out to production.

In theory, R8 optimization works reliably with standard codebases that follow Google’s and the community’s best practices. However, the Grab app is a ~10-year-old project at a massive scale. Legacy code patterns, reflection usage, and SDK integrations accumulated over a decade create numerous edge cases.

This combination makes comprehensive testing necessary, but at our scale, it’s nearly impossible to execute due to:

Full regression testing would require significant effort from all teams across the organization.
Quality Assurance (QA) resource constraints make exhaustive testing impractical.
High-quality bar: At Grab, app stability and zero runtime errors are non-negotiable standards.

This creates a paradox: we need comprehensive testing precisely because we can’t guarantee standards everywhere, yet the scale makes such testing infeasible.

Challenge 3: Slow feedback

Due to the large scale of the project, compiling a build with R8 optimization enabled on a standard engineering laptop is physically impossible. This created a significant bottleneck: a slow feedback loop where every experimental change required a remote CI build to verify, with each R8-optimized build taking up to 2 hours to complete.

Additionally, R8 treats debug and release build types differently. At Grab, we have a QA build for QA testing. This is a debug build type with R8 enabled, pointed to our staging environment. We had to ensure this QA build’s R8 configuration matched our production build exactly. This alignment was critical for catching R8-specific issues during QA testing that would actually reflect production behavior

Our three-innovation solution

To overcome these three fundamental challenges, we developed a comprehensive strategy centered on targeted innovations that addressed each bottleneck.

Innovation 1: AI-assisted debugging

Solving challenge 1:

How do we speed up the investigation of R8 issues in obfuscated, optimized code at scale? The answer is in emerging AI technology that wasn’t available during our previous attempts.

The AI context at Grab:

Unlike 2022 and earlier attempts, the landscape had changed dramatically. After the LLM explosion, Grab proactively promoted AI (LLM) usage to boost engineering productivity. Over the past two years, Grab has dedicated 1-2 months annually for engineers to learn how to use AI efficiently. This investment in AI literacy became crucial for this project.

This year (2025), my team gained experience building MCP (Model Context Protocol) servers and identified an opportunity: applying this technology to solve the R8 debugging challenge.

Our solution:

At Grab, we use GitLab for Continuous Integration and Continuous Delivery (CI/CD). To tackle R8 debugging bottlenecks, we built a comprehensive solution combining:

Custom MCP tools.
AI-assisted GitLab CI integration.

Build MCP tools: Eliminate manual reverse engineering

Automatic Android Application Package (APK) decompilation: Parse and decompile APKs.
Stack trace deobfuscation: Automatically map obfuscated traces to source code.
Class/method context fetching: Pull relevant decompiled code sections for analysis.

AI and CI pipeline workflow:

We developed a systematic two-phase approach for investigating and fixing each runtime issue, combining AI assistance with parallel testing:

Phase 1: MCP server tools for debugging

Detect runtime issue: From End-to-End (E2E) tests, QA testing, or crash reports.
MCP tool orchestrates APK analysis: Coordinates decompilation tools for reverse engineering.
MCP tool provides decompiled code context: Pulls and decompiles problematic code sections.
Engineer and AI analysis: The engineer uses AI assistance to analyze the decompiled code context and note down multiple solution approaches.

Phase 2: GitLab CI integration

We leveraged the GitLab CLI tool (glab) and instructed AI to use it for interacting with our CI pipeline:

AI creates multiple Merge Requests (MRs): Using glab CLI, AI creates merge requests for different solution approaches from Phase 1, each triggering CI compilation.
Track progress: Maintain an MD file as the source of truth for the investigation, containing all notes about the issue (root cause analysis, test cases, test branches, CI build status).
AI fetches APK from CI: Using glab CLI to retrieve built APKs from completed CI pipelines.
Verify: Ask AI to use ADB install APK, then manually test the fix.
Iterate: If issues remain, loop back to step 2 for further analysis.

Why this worked:

Our approach functions as an AI assistant that:

Decodes the obfuscated code automatically.
Finds the relevant code sections without manual searching.
Suggests multiple solutions based on the context provided by the MCP tools.
Creates multiple test branches simultaneously and runs parallel CI builds to test different approaches.
Tracks everything to ensure no progress is lost on complex investigations.

Instead of testing solutions one-by-one (waiting 2 hours per build), AI creates multiple MRs in parallel, dramatically accelerating the verification process. Engineers focus on making decisions about which solutions to pursue while the AI handles both the mechanical work and the parallel experimentation.

The impact: Accelerating investigation

While investigating a single R8 issue might still take time, our MCP tools dramatically accelerated critical investigation tasks. Manual tasks that previously took hours (decompilation, deobfuscation, context gathering) were reduced to minutes. Additionally, AI assistance significantly sped up the analysis phase, helping engineers quickly identify patterns, suggest solutions, and explore multiple approaches in parallel, both analytically and through simultaneous CI builds, further accelerating the overall investigation process.

Innovation 2: Pragmatic testing strategy

Solving challenge 2:

How do we do testing at scale? How do we validate R8 optimization across a mature codebase containing more than nine million lines of code when comprehensive testing is necessary but impossible? Our solution came from a critical insight about R8 issues at scale.

Key insight:

From our experience, R8 issues tend to share similar root causes across the codebase. Legacy patterns like reflection usage, parser implementations, and dynamic class loading follow consistent patterns within a large codebase. This insight led to two key advantages:

Fix one, help many: Fixing one place often resolves issues in others.
Pattern recognition: Once we identify a pattern, we can search the codebase to find similar issues instead of waiting for QA to discover them.

If we could identify and fix these pattern-based issues, we could address many problems without testing every corner of the app. We decided to start with critical paths and expand from there. This “ripple effect” strategy began at the center with the most important flows, then expanded by identifying common root causes and similar patterns across the codebase.

With this foundation, we designed a validation pipeline that progressively increased confidence:

Progressive, Risk-Based validation strategy

Stage 1: E2E tests - pattern discovery phase: Fortunately, we had existing E2E tests covering most critical paths in the project, and they could be executed with R8 optimization enabled. Initially, all E2E tests failed after enabling optimization. This became our opportunity for pattern discovery: we systematically fixed issues and applied our pattern-based approach to resolve similar problems across the project.
Stage 2: QA smoke tests - coverage expansion: After E2E tests stabilized, we requested our QA team to run smoke tests on critical flows, especially those not covered by E2E automation. This caught additional edge cases and validated that the pattern-based fixes we applied were effective across different user journeys. We fixed any issues that appeared during this phase.
Stage 3: Daily QA build enablement - real-world integration: After confirming stability in controlled testing, we made a significant decision: enable R8 optimization in our daily QA build (the build our QA team uses for daily feature testing). This integrated R8 optimization into the normal development workflow without requiring additional testing effort.
Stage 4: Regression testing and Grab Early Access (GEA) - parallel production-scale validation: After confirming stability in daily QA builds, we moved to production-scale validation with two parallel tracks. Every release at Grab includes regression testing covering all critical paths and new features. With R8 optimization now enabled in the QA build, we ran regression tests using this build for a few weeks, providing sustained validation across multiple release cycles. One week after regression testing, we rolled out to GEA, Grab’s internal production release channel for Grab employees and partners. While GEA users typically receive features one week before general production rollout, for this R8 optimization project, we extended the GEA phase to 2 weeks, given the significance of the change. With hundreds of daily active users using the app in real-world production conditions during this extended period, we encountered only one remaining R8 issue during the GEA phase. This combination of regression testing and real-world GEA production usage gave us the confidence needed before full production rollout.

Testing Approaches That Don’t Work with R8:

Unit tests: Run on Java JVM, while R8 optimizations affect Android Runtime behavior - fundamentally different environments
UI tests with R8: Community solutions exist as Gradle plugins, but our tests run on Bazel - complex setup and reliability concerns

Pattern-based issue resolution:

Throughout these validation phases, when we identified R8 issues, we followed a systematic pattern-based resolution process.

Identify the issue: Catch the failure through E2E, QA, or monitoring.
Find the pattern: Analyze the root cause to identify if it’s a common pattern across the codebase.
Detect similar instances: Search the entire codebase to find the same pattern across different modules and the internal SDKs.
Coordinate fixes: Create tickets requesting teams to modify their code to prevent the same issue in their modules.

This approach required cross team coordination for fixing, but critically, not for testing. The difference is significant: asking teams to fix identified issues in their modules is much more scalable than requiring all teams to perform comprehensive testing upfront.

Production rollout results:

When we made it to production, only one issue escaped to production. Notably, we had actually detected this issue through our pattern-based approach during testing and created a ticket for the responsible team to fix it. However, with ongoing daily development, the team missed one instance when implementing the fix, which caused the production issue.

This demonstrates that while our testing strategy worked effectively, human coordination challenges can still occur at scale. With a project of this scale, having only one small production issue is considered a highly successful rollout.

This approach transformed an “impossible” comprehensive testing problem into a manageable, systematic validation process, reducing what would have been months of coordinated testing effort to days, proving that a smart strategy can overcome resource constraints.

Innovation 3: Optimized feedback loop

Solving challenge 3:

The 2-hour CI builds, and the QA configuration misalignment created a bottleneck for R8 debugging. We addressed this through a comprehensive infrastructure strategy targeting three critical areas:

Remote compilation to enable local build and fast feedback loop:

At Grab, we used to use Mainframer for remote execution to handle slow performance on local Gradle builds. However, since migrating to Bazel (only for the debug build without R8 enabled), we removed the large-scale Mainframer setup for every engineer. From that experience, to tackle the local compilation blocker for R8 builds, we decided to deploy a new Mainframer setup, a much smaller one with one powerful EC2 instance, serving as a solution for local compilation in a short time.

This targeted deployment transformed physically impossible local R8 builds into a manageable remote process, enabling engineers to test R8 changes without requiring powerful local hardware.

The performance improvement was substantial: from up to 2 hours in CI to around 1 hour with Mainframer - a ~50% reduction that enabled rapid iteration cycles essential for R8 debugging.

QA build configuration alignment:

We eliminated the critical gap between QA and production R8 behavior by aligning build configurations exactly. The key change was setting debuggable = false for QA builds while maintaining the environment configuration:

buildTypes {
    debug {
        if (isQaBuild()) {
            minifyEnabled true
            shrinkResources true
            debuggable false
            buildConfigField 'boolean', 'DEBUG', 'true'
            ...
        }
    }
}

Since, from our understanding, R8 applies different optimization levels based on the debuggable flag, with more aggressive optimizations when debuggable=false, this ensured our QA testing reflected actual production R8 processing. We preserved DEBUG = true to maintain staging environment routing while achieving R8 parity.

This infrastructure foundation was essential, providing faster feedback loops that accelerated verification and investigation, while the QA build configuration matching production exactly was critical for catching real production issues during testing.

A lucky break

Perhaps most surprising: the R8 flakiness issue that blocked us in 2022 (Issue #240077160) appears to have been resolved by the R8 team. We encountered no build determinism issues during this attempt, which significantly smoothed our path to production.

Results

After ~10 weeks of systematic implementation led by one engineer collaborating with multiple teams across the organization, we achieved substantial improvements using Android Gradle Plugin 8.6.X:

Stability: Around 25% reduction in ANR rates.
App size: A 16% decrease in download size on our reference device (zipped APK).
Performance: Nearly 27% improvement in startup time. An interesting discovery: After enabling R8 optimization, we saw ~12% app startup improvement. However, during our analysis, we discovered that our existing Baseline and Startup Profiles implementation was incorrect. We reimplemented it properly, and the combination of R8 optimization plus the corrected profiles delivered the full 27% improvement.

These results exceeded our initial targets and validated the significant effort required to enable R8 optimization at scale.

What’s next

Our journey doesn’t end here. We’re exploring several areas for continued optimization:

R8 full mode: More extreme/aggressive optimization than the current mode for additional performance benefits.
Revisit R8 keep rules: Clean up unnecessary rules that prevent optimization, and implement a governance solution to guardrail R8 rules in our pre-merge CI pipeline.

Conclusion

Enabling R8 optimization for the Grab Android app at scale required innovation beyond traditional debugging approaches. By combining AI-assisted debugging, pragmatic testing strategies, and infrastructure investment, we overcame challenges that had blocked previous attempts for many years.

For other teams considering R8 optimization at scale: the journey is challenging, but the results speak for themselves. With the right tools, strategy, and team collaboration, it’s achievable even for the largest codebases.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility, and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries: Cambodia, Indonesia, Malaysia, Myanmar, the Philippines, Singapore, Thailand, and Vietnam. Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases, or access services such as lending and insurance, all through a single app. We operate supermarkets in Malaysia under Jaya Grocer and Everrise, which enables us to bring the convenience of on-demand grocery delivery to more consumers in the country. As part of our financial services offerings, we also provide digital banking services through GXS Bank in Singapore and GXBank in Malaysia. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line. We aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Reclaiming Terabytes: Optimizing Android image caching with TLRU

Fri, 06 Mar 2026 00:23:00 +0000

Introduction

In a previous post, we discussed Project Bonsai, our initiative to reduce the Grab app’s download size. We successfully reduced the Android Application Package (APK) download size by 26%. This reduction offers a substantial advantage: it minimizes download friction, allowing users to download the app, even on slower networks. However, the battle for storage doesn’t end after installation.

The Grab app includes a wide range of features and workflows that heavily depend on image content, particularly in services like transportation and e-commerce. Although some images are packaged within the app binary, a large majority are downloaded from Grab’s server at runtime. To optimize the app’s performance and minimize server expenses, the downloaded images are cached in the app’s storage. This reduces both load times and traffic to Grab’s image server, resulting in better user experience and lower costs. Although we use Least Recently Used (LRU) cache to manage storage, many images can remain in the app storage for extended periods, even if they are no longer relevant.

This blog details how we addressed this challenge in our Grab Android app by evolving our standard LRU cache into a Time-Aware Least Recently Used (TLRU) cache. This evolution allows us to reclaim storage space without compromising user experience or increasing server costs.

Understanding LRU cache limitations

Note: In this article, when “cache” or “image cache” is mentioned, it specifically refers to disk cache, which is the persistent storage on the device’s file system, rather than in-memory cache.

The Grab Android app uses the Glide library as its primary image loading framework. Glide provides excellent features for efficient image loading, caching, and display. At its core, by default, Glide uses a cloned version of Jake Wharton’s DiskLruCache for disk-based caching.

To prevent unlimited cache growth, we configured the LRU cache with a maximum size limit of 100 MB. However, our analytics revealed that the 90th percentile (P90) of users were consistently reaching this 100 MB limit, meaning the cache was constantly at capacity. Conversely, for users whose cache hadn’t yet reached the 100 MB threshold, images were never removed, even if they were outdated by several months and no longer relevant.

Our analysis revealed that image caching was a major contributor to the app’s disk footprint, and without proactive management, this would only worsen as we continued adding features and content to Grab’s superapp.

How DiskLruCache works

The LRU cache algorithm manages storage by maintaining entries in access order and automatically evicting the oldest unused entries when space is needed.

Figure 1 and 2 illustrates how LRU cache trimming works. These diagrams present an LRU cache with a maximum size of 100 MB containing three cache entries totaling 95 MB. When a new 25 MB cache entry is added, it exceeds the cache’s maximum size.

Figure 1. A new cache entry is added to an LRU cache that's near its 100 MB capacity, exceeding the limit.

Figure 2. The LRU cache automatically trims the least recently used entry to bring the total size back within the 100 MB limit.

The challenge

While DiskLruCache efficiently manages cache size, it has a critical limitation: It does not account for the age of cached content. Due to the lack of time-based eviction rules, the cache does not remove outdated entries until it exceeds the maximum size. This meant that stale promotional images, images from infrequently used features, and outdated content continued occupying disk space indefinitely, as long as the cache remained under the size limit.

What we needed was a cache mechanism that could:

Maintain LRU cache benefits: Preserve efficient caching for users who actively use the app features.
Remove stale content based on time: Automatically identify and evict outdated entries, not just rely on storage constraints.
Protect user experience: Ensure images still load quickly without cache misses.
Keep server costs low: Avoid increased server requests from premature cache evictions.

These requirements pointed us toward an enhanced LRU approach. We needed to enhance LRU with time awareness while preserving its proven size-management capabilities.

TLRU cache: The solution

To address these limitations, we developed a new LRU cache variant named TLRU that extends traditional LRU by introducing time-based eviction while maintaining size-based cache management.

Core TLRU attributes

TLRU introduces three core attributes to manage cache entries:

Time-To-Live (TTL): A threshold that determines when a cache entry is considered expired. An entry is expired if (current_time - last_accessed) > TTL. Expired entries are automatically removed during cache operations.
Minimum cache size threshold: A safety net that ensures a baseline set of essential images always remains cached, even when entries expire. This prevents complete cache deletion when users haven’t used the app for more than the TTL period, maintaining app responsiveness for returning users instead of starting with an empty cache.
Maximum cache size: Inherited from LRU cache, this enforces the upper storage limit (100 MB in our case). When exceeded, the least recently used entries are evicted regardless of their age.

Together, these attributes ensure TLRU maintains optimal cache size by managing both storage constraints and temporal relevance, reducing app disk footprint without impacting user experience.

TLRU cache trimming in action

To better understand how TLRU works in practice, let’s walk through a comprehensive example. The following diagrams demonstrate how the TLRU cache evaluates and trims entries based on both time and size constraints.

Our TLRU cache configuration includes:

Maximum cache size: 100 MB - the storage limit that triggers size-based eviction.
Minimum size threshold: 20 MB - the safety net that protects essential cached content.
TTL: 20 days - entries older than this are considered expired.

Each cache entry includes last_accessed metadata containing the timestamp of its most recent access. When an entry is first created, this timestamp is initialized with the creation time. This timestamp determines whether an entry has expired based on the formula:

Entry is expired if: (current_time - last_accessed) > TTL

For this walkthrough, we’ll use current_time = Day 100 as our starting point.

Initial cache state analysis

Our example begins with three existing cache entries totaling 95 MB, approaching the 100 MB limit:

Item 1 (8 MB, last accessed Day 82): At 18 days old
Item 2 (30 MB, last accessed Day 81): At 19 days old
Item 3 (57 MB, last accessed Day 80): At exactly 20 days old, valid at the TTL threshold

When a new 10 MB item is added on Day 100, the cache grows to 105 MB, exceeding our 100 MB limit and triggering size-based eviction.

Figure 3. Initial TLRU cache state and the impact of adding new entries.

Size-based eviction process

When the cache exceeds its 100 MB limit, TLRU applies traditional LRU eviction logic. Item 3 is selected for eviction because:

It is the least recently used entry (oldest access time).
This demonstrates TLRU maintaining LRU behavior for size enforcement, regardless of expiration status.

Figure 4. Size-based eviction removes the least recently used entry to enforce storage limits.

Time-based eviction process

Five days later (Day 105), Item 1 and Item 2 cross the expiration threshold:

Despite operating well below the size limit (48 MB < 100 MB), TLRU evaluates expired entries for time-based eviction. Item 2 is removed because it’s expired, and the cache remains above the minimum threshold. Item 1, although also expired, is protected by the minimum threshold rule; removing it would leave only 10 MB, which falls below the 20 MB minimum.

Figure 5. Time-based eviction and minimum threshold protection working together.

TLRU behavior summary

This comprehensive example demonstrates TLRU’s three core mechanisms:

Size-based eviction: Enforces storage limits using traditional LRU ordering (Item 3 removed despite being valid).
Time-based eviction: Proactively removes expired content when safe to do so (Item 2 removed for age).
Minimum threshold protection: Preserves essential cache functionality even with expired content (Item 1 protected despite expiration).

Technical implementation

Rather than building an image cache from scratch, we recognized that Glide’s bundled DiskLruCache (originally from Jake Wharton’s implementation) already provided a mature, battle-tested foundation. This implementation is widely adopted across the Android ecosystem and handles complex edge cases like crash recovery, thread safety, and performance optimization that would require substantial effort to replicate.

Our approach was pragmatic, we cloned Glide’s DiskLruCache and extended it to support time-based expiration. This strategy allowed us to inherit the existing reliability while adding the temporal awareness we needed for TLRU.

To understand our implementation, we’ll first explore how the original DiskLruCache works, then dive into the specific modifications we made to transform it into TLRU.

Understanding DiskLruCache

DiskLruCache provides a simple cache solution that stores key-value pairs on disk, while also keeping track of their usage to evict the least recently used items when the cache reaches its maximum size. Here is an overview of how DiskLruCache is implemented:

Data storage: DiskLruCache stores its data in a specified directory, creating files for each entry.
Key-based access: Each entry has a unique key (typically a hash generated by the image loader) used to create the filename of the cached entry.
Atomic writes: When adding an entry, it creates a temporary file and writes the data to it. If successful, it atomically renames the temporary file to the final filename.
Cache retrieval: When reading from the cache, it looks up the key, opens the corresponding file on disk, and returns an InputStream to read the data.
Size management: It maintains a maximum cache size limit. When exceeded, it removes the least recently used items until it is within the specified limit.

The central component that enables this functionality is the journaling mechanism, detailed in the following section.

The journaling mechanism

The journaling mechanism in DiskLruCache is designed to maintain consistency and prevent data corruption in the cache. The journal file records all cache operations, such as adding, updating, or removing entries. The journaling mechanism is essential in rebuilding the cache metadata during initialization and performing journal compaction to clean up the journal file.

Figure 6. Example of the journaling mechanism in DiskLruCache.

Journal file format:

The journal file is a plain text file that records cache operations line by line.

DIRTY: Indicates the start of a write operation to a cache entry.
CLEAN: Indicates that a cache entry was successfully written and closed.
REMOVE: Indicates that a cache entry was removed from the cache.
READ: Indicates that a cache entry was read.

To gain a comprehensive understanding of the journal file format, refer to the following detailed explanation.

Key information: Each line includes the key and other relevant information, such as the lengths of the cache entry files.
Cache initialization: Upon initialization, DiskLruCache reads the journal file to reconstruct cache metadata in memory, determining file associations, lengths, and access order. If the journal file is corrupted or missing, the cache will be considered invalid, and DiskLruCache will remove all cache files and start fresh.
Cache operations and journal updates: When performing cache operations like adding, updating, or removing entries, DiskLruCache appends corresponding lines to the journal file, recording the operation details. For example, when starting to write a new cache entry, it writes a DIRTY line with the key, and when the write is successful, it appends a CLEAN line with the key and lengths.
Synchronization and consistency: DiskLruCache uses synchronization to ensure that only one thread can access the cache at a time, preventing race conditions and data corruption. It also uses a journalWriter (java.io.Writer) instance to append operations to the journal file, ensuring that the file is always in a consistent state.
Journal compaction: Over time, the journal file may grow with redundant operations. DiskLruCache periodically compacts the journal by creating a new file that contains only the current cache metadata, then atomically replaces the old file. The compaction process usually happens when the journal file size exceeds a certain threshold.

DiskLruCache ensures consistency and prevents data corruption by using this journaling mechanism, making it a reliable solution for disk-based caching.

Modifying DiskLruCache for TLRU

With a solid understanding of DiskLruCache’s architecture, we can now explore how we extended it to implement the TLRU cache attributes defined earlier.

Three primary modifications to DiskLruCache:

Tracking last access time
Time-based eviction logic
Backward-compatible migration

Tracking last access time

To support time-based eviction, the cache needs to track when each entry was last accessed. This information m ust persist across app restarts, so it’s stored in the journal file itself.

Modified journal format:

READ [Cache-Key] [Access-Timestamp]
CLEAN [Cache-Key] [File-Size]-[Access-Timestamp]

The timestamps are added to READ and CLEAN operations:

READ entries record when a cache entry is accessed, updating its last-access time.
CLEAN entries record the creation time when a new entry is successfully added to the cache.

Figure 7. Example of a TLRU journal file.

Time-based eviction logic

The TLRU cache leverages the existing LRU ordering to optimize expiration checking. For each cache operation, it checks if the least recently accessed entry has expired before proceeding with time-based trimming.

The diagram below shows how the TLRU cache makes the decision to remove the cache entries.

Figure 8. TLRU eviction decision flow - evaluating cache entries based on time expiration and size constraints.

The algorithm leverages the sorted nature of the cache: if the least recently accessed entry hasn’t expired, no other entries need checking. If it has expired, the cache trim operation walks through entries from oldest to newest, removing all expired ones.

Backward-compatible migration

With an extensive user base, invalidating existing cached images would cause millions of users to experience poor performance while creating massive server traffic spikes and infrastructure costs.

One of the challenges was retrieving last-access timestamps from existing LRU entries, as file system APIs do not offer reliable access time data. Our solution was to set the last-access time of all existing entries to the migration timestamp. This approach preserves all cached content and establishes a consistent baseline, although it necessitates waiting one TTL period to realize the full benefits of eviction.

We also ensured bidirectional compatibility - the original LRU implementation can read TLRU journal files by ignoring timestamp suffixes, enabling safe rollbacks if needed.

Upon completing our TLRU implementation, we focused on determining optimal values for the three core attributes: TTL duration, minimum threshold, and maximum cache size. These parameters are crucial for balancing storage optimization and cache performance, requiring careful tuning based on real user behavior.

Finding optimal configuration values

Finding optimal configuration values requires systematic experimentation and data-driven decision-making. Controlled experiments to compare the cache hit ratio with baseline LRU performance must be conducted.

Note: Cache hit ratio, our key success metric, gauges efficiency by the percentage of requests served from cache versus requiring server downloads. Lower ratios lead to higher server costs and increased user data consumption.

Our success criteria is for a cache hit ratio decrease of no more than 3 percentage points (pp) during the transition to TLRU. For instance, a decrease from 59% to 56% hit ratio would result in 7% increase in server requests. This threshold balances storage optimization with acceptable performance impact.

To mitigate potential server cost impact from our maximum acceptable 3 pp cache hit ratio drop, we worked with the server team to optimize image delivery infrastructure, enabling a confident TLRU rollout without infrastructure cost concerns.

Impact and results

After fully rolling out TLRU to production, we significantly optimized storage while preserving user experience. Post-implementation stabilization, the P95 total app size reduced by approximately 50 MB. This meant that 95% of our users experienced storage reduction up to 50 MB, with the top 5% seeing even greater savings.

With over 100 million downloads of the Grab Android app, even conservative estimates show terabytes of storage reclaimed across all user devices worldwide. This translates to better device performance, especially on low-end devices, and improved user satisfaction.

Critically, we maintained our success criteria: cache hit ratio stayed within target thresholds (no more than 3 pp decrease), with no increase in infrastructure costs. The seamless migration preserved all existing cache data without disruption.

Conclusion

At Grab, we believe that every byte matters. Our users trust us with their device storage, and we take that responsibility seriously. The TLRU implementation exemplifies our commitment to user experience. We don’t just build features, we optimize them to ensure our app respects our users’ devices. The petabytes of storage reclaimed across millions of devices aren’t just a technical achievement; it’s a reflection of our dedication to creating a lighter, faster, more respectful mobile experience.

The implementation demonstrates that meaningful improvements can be achieved through thoughtful modifications to existing, well-tested libraries. Our focus on backward compatibility and safe migration ensured zero disruption for Grab’s users, proving that user experience and technical innovation can coexist.

Join Us

Grab is Southeast Asia’s leading superapp, serving over 900 cities across eight countries (Cambodia, Indonesia, Malaysia, Myanmar, the Philippines, Singapore, Thailand, and Vietnam). Through a single platform, millions of users access mobility, delivery, and digital financial services, including ride-hailing, food delivery, payments, lending, and digital banking via GXS Bank and GXBank. Founded in 2012, Grab’s mission is to drive Southeast Asia forward by creating economic empowerment for everyone while delivering sustainable financial performance and positive social impact.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Cursor at Grab: Adoption and impact

Thu, 29 Jan 2026 00:23:00 +0000

Adoption overview

The illustration below encapsulates how Cursor is scaled across Grab, achieving rapid and widespread adoption that accelerated software development and empowered non-technical teams to build solutions.

Figure 1: Adoption overview of AI tool Cursor in Grab.

Multi-tool strategy

Grab embraces a multi-tool strategy for AI coding assistants. Rather than committing to a single solution, we experiment with multiple tools simultaneously, allowing us to compare outcomes and adopt what works. This approach keeps us flexible in a space that evolves quickly. We covered this philosophy in a previous post.

Growth

We introduced Cursor in late 2024 as one of several tools in our AI engineering toolkit. Adoption grew quickly: 98% of Tech Grabbers became monthly active users, and about 75% use it weekly. For comparison, Google’s 2025 State of AI-Assisted Software Development report highlights that even among high-performing teams, AI coding tool adoption seldom surpasses 70%. Notably, Cursor’s appeal extended beyond engineering, with non-technical teams incorporating it into their workflows.

A standout metric is Cursor’s suggestion acceptance rate, which is around 50%, surpassing the industry average of 30%. This indicates two key insights: first, the suggestions are sufficiently relevant for engineers to accept them half of the time; second, engineers maintain a critical review process rather than accepting suggestions indiscriminately. We attribute this relevance to continuous feedback loops and environment-specific tuning, ensuring suggestions remain aligned with Grab’s codebase and conventions.

Extent of adoption

Raw adoption figures don’t provide the complete picture. We aimed to determine whether engineers were truly incorporating Cursor into their daily workflows or merely experimenting with it sporadically.

The data indicates genuine integration. Approximately half of Cursor users engage with it 10 or more consecutive days each month, with some teams achieving full adoption. Over a third of merge requests now incorporate Cursor in some capacity. Engineers actively share tips and workflows via a dedicated Slack channel, fostering an organic knowledge base.

Across various teams, we’ve observed significant transitions from light usage to moderate and power user levels over the past six months.

Engineer utilization patterns

The most common patterns we see are unit test generation, code refactoring, cross-repository navigation, bug fixing, and automation of routine tasks like API scaffolding or commit messages.

Test generation is particularly popular. Writing tests manually is tedious, and Cursor’s ability to generate and iteratively refine tests has become a standard part of many engineers’ workflows. Cross-repository navigation helps with onboarding and context-switching: engineers can ask Cursor questions about unfamiliar codebases rather than hunting through documentation.

Qualitative feedback confirms what the adoption numbers suggest: tasks that took a full day to complete now take hours. Engineers report tackling refactors and test additions they would have otherwise skipped due to time pressure. Cursor doesn’t just speed up existing work; it makes previously impractical work feasible.

Integration with Grab’s stack

Integrating Cursor effectively at Grab required custom tooling. We built solutions for monorepo indexing to handle Grab’s scale and to distribute preconfigured rules that align Cursor’s suggestions with Grab-specific coding conventions. This integration ensures that Cursor understands our environment rather than offering generic suggestions.

What’s next

Cursor is one tool in a broader toolkit. Our multi-tool strategy means we’re also investing in terminal-based workflows and GrabGPT for internal knowledge retrieval. Different tools suit different workflows. The aim is to empower users, not to restrict them.

Beyond engineering, we’re expanding AI-assisted development to new personas. Our AI Upskilling workshops have trained several hundred Grabbers across five countries, including executive committee members and senior leaders who built and deployed their own apps. Non-engineers in Financial Planning and Analysis (FP&A), Operations, and regional teams are now building tools to solve their own pain points.

Our product design team has launched an initiative empowering designers to directly implement production fixes. Designers have successfully merged hundreds of merge requests, often with same-day turnaround, facilitating quicker iterations on UI fixes without the engineering queue delay. This process requires designers to be trained in Git fundamentals prior to gaining access, with initial reviews conducted by design managers.

Cursor has become part of daily work at Grab. But adoption is only half the question — the other half is impact. We’ve been running a parallel effort to measure productivity effects rigorously, using fixed-effects regression to isolate Cursor’s contribution from other factors. Early findings show a dose-response relationship: productivity gains scale with usage intensity, and the effects hold up to statistical scrutiny.

We will address the measurement methodology and present our findings in a subsequent post.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Docker lazy loading at Grab: Accelerating container startup times

Wed, 21 Jan 2026 00:23:00 +0000

Introduction

At Grab, we’ve been exploring ways to dramatically reduce container startup times for our data platforms. Large container images for services like Airflow and Spark Connect were taking minutes to download, causing slow cold starts and poor auto-scaling performance. This blog post shares our journey implementing Docker image lazy loading using eStargz and Seekable OCI (SOCI) technologies, the results we achieved, and the lessons learned along the way.

Results: The numbers speak for themselves

Benchmark results

Our initial testing on fresh nodes (nodes without cached images) showed dramatic improvements in image pull times as shown in Figure 1.

Figure 1. Table of results.

The key advantage of lazy loading is the reduction in image pull time, especially on “fresh” nodes that do not have the image cached. By analyzing detailed pod events, we can see the precise impact of using the stargz snapshotter.

During our SOCI benchmark testing, we observed an important distinction between SOCI and eStargz: SOCI maintains the same application startup time as standard images, while eStargz takes longer. For example, with Airflow, both overlayFS and SOCI achieved 5.0 seconds startup time, while eStargz took 25.0 seconds. This demonstrates that lazy loading doesn’t eliminate download time; it redistributes it. SOCI’s approach of maintaining separate indexes allows it to optimize the download-to-startup time trade-off more effectively, keeping application startup performance on par with standard images while still dramatically reducing image pull time.

Production performance

The production deployment of SOCI lazy loading has delivered significant, measurable improvements across our data platforms. Both Airflow and Spark Connect now experience 30-40% faster startup times, directly improving our ability to handle traffic spikes and scale efficiently. These improvements translate to better auto-scaling responsiveness, reduced resource waste during initialization, and improved user experience for data processing workloads. The sustained performance gains observed over time demonstrate that lazy loading is a stable, production-ready optimization that delivers consistent value.

Figure 2 and 3 illustrates the P95 startup time improvements for both services:

Figure 2. Production results: Airflow P95 startup time.

Figure 3. Production results: Spark Connect P95 startup time.

It is important to note that P95 startup time includes both the image download/pull time and the application startup time itself. This metric captures the entire system performance for both cold and hot starts on fresh and hot nodes, showing the overall system improvement rather than just cold start performance.

During the production deployment and monitoring, we gained valuable insights on SOCI configuration tuning. Following AWS’s recommended configuration from their blog on Introducing Seekable OCI: Parallel Pull Mode for Amazon EKS, we optimized our SOCI snapshotter settings:

Increased max_concurrent_downloads_per_image from 5 to 10.
Increased max_concurrent_unpacks_per_image from 3 to 10.
Increased concurrent_download_chunk_size from 8MB to 16MB (aligning with AWS’s recommendation for Elastic Container Registry (ECR)).

This configuration tuning led to a significant performance improvement: image download time on a fresh node was reduced from 60 seconds to 24 seconds, representing a 60% improvement. The key lesson here is that default SOCI configurations may not be optimal for all environments, and tuning these parameters based on your infrastructure (especially when using ECR) can yield substantial gains.

Technical background: How Docker lazy loading works

Container root filesystem (rootfs) and file organization

A container’s root filesystem, or rootfs, is the directory structure that the container sees as its root (/). It contains all the files and directories necessary for an application to run, including the application itself, its dependencies, system libraries, and configuration files. It’s an isolated filesystem, separate from the host machine’s filesystem.

The rootfs is built from a series of read-only layers that come from the container image. Each instruction in an image’s Dockerfile creates a new layer, representing a set of filesystem changes. When a container is launched, a new writable layer, often called the “container layer,” is added on top of the stack of read-only image layers. Any changes made to the running container, such as writing new files or modifying existing ones, are written to this writable layer. The underlying image layers remain untouched. This is known as a copy-on-write (CoW) mechanism.

In containerd, a snapshotter is a plugin responsible for managing container filesystems. Its primary job is to take the layers of an image and assemble them into a rootfs for a container. The default snapshotter in containerd is overlayFS, which uses the Linux kernel’s OverlayFS driver to efficiently stack layers. To assemble the rootfs, the overlayFS snapshotter creates a “merged” view of the read-only image layers:

Figure 4. How OverlayFS assembles the container filesystem.

lowerdir: The read-only image layers are used as the lowerdir in OverlayFS. These are the immutable layers from the container image.
upperdir: A new, empty directory is created to be the upperdir. This is the writable layer for the container where any changes are stored.
merged: The merged directory is the unified view of the lowerdir and upperdir. This is what is presented to the container as its rootfs.

When a container reads a file, it’s read from the merged view. When a container writes a file, it’s written to the upperdir using a copy-on-write mechanism. This is an efficient way to manage container filesystems, as it avoids duplicating files and allows for fast container startup.

The problem: Traditional container image pull

To understand the benefits of lazy loading, we first need to understand the traditional container image pull process:

Download layers: The container runtime downloads all layer tarballs that make up the image.
Unpack layers: Each layer is unpacked and extracted onto the host’s disk.
Create snapshot: The snapshotter combines these layers into a single, unified filesystem, known as the container’s rootfs.
Start container: Only after all layers are downloaded and unpacked can the container start.

This process is slow, especially for large images, as the entire image must be present on the host before the container can launch.

The solution: Remote snapshotter

To address the slow startup issue with large images, we use a remote snapshotter solution. A remote snapshotter is a special type of snapshotter that doesn’t require all image data to be locally present. Instead of downloading and unpacking all the layers, it creates a “snapshot” that points to the remote location of the data (like a container registry). The actual file content is then fetched on-demand when the container tries to read a file for the first time.

While a traditional snapshotter like overlayFS uses directories on the local disk as its lowerdir, a remote snapshotter creates a virtual lowerdir that is backed by the remote registry. This is typically done using FUSE (Filesystem in Userspace). The remote snapshotter creates a FUSE filesystem that presents the contents of the remote layer as if it were a local directory. This FUSE mount is then used as the lowerdir for the overlayFS driver. This allows the remote snapshotter to integrate with the existing overlayFS infrastructure while adding the capability of lazy-loading data from a remote source.

There are two main formats that enable remote snapshotters: eStargz and SOCI.

eStargz format

eStargz is a backward-compatible extension of the standard OCI tar.gz layer format. It has several key features that enable lazy loading:

Individually compressed files: Each file within the layer (and even chunks of large files) is compressed individually. This is the key that allows for random access to file contents.
TOC (table of contents): A JSON file named stargz.index.json is located at the end of the layer. This TOC contains metadata for every file, including its name, size, and, most importantly, its offset within the layer blob.
Footer: A small footer at the very end of the layer contains the offset of the TOC, allowing it to be easily located by reading only the last few bytes of the layer.
Chunking and verification: Large files can be broken down into smaller chunks, each with its own entry in the TOC. Each chunk also has a chunkDigest in its TOC entry, allowing for independent verification of each downloaded piece of data.
Prefetch landmark: A special file, .prefetch.landmark, can be placed in the layer to mark the end of “prioritized files”. This allows the snapshotter to intelligently prefetch the most important files for the container’s workload.

The stargz snapshotter uses the eStargz format to enable lazy loading. Here’s how it works:

Mount request: When containerd calls the Mount function, it’s the main entry point for creating a new filesystem for a layer.
Resolve and read TOC: The snapshotter fetches the layer’s footer, then fetches the stargz.index.json TOC from the remote registry. This TOC contains all the file metadata needed to create a virtual filesystem.
Mount FUSE filesystem: With the TOC in memory, the snapshotter creates a virtual filesystem using FUSE. The container can now start, as it has a valid rootfs, even though most of the file content has not been downloaded.
On-demand fetching: When the container performs a file operation like read(), the FUSE filesystem intercepts the call. The snapshotter checks a local disk cache for the requested bytes. If the data is not cached, it issues an HTTP Range request to the container registry to download only the required chunk of the layer.
Remote fetching and caching: The downloaded data is returned to the container and also written to the local cache for subsequent reads.
Prefetching for optimization: After the FUSE filesystem is mounted, a background goroutine begins downloading the prioritized files (up to the .prefetch.landmark) and can also be configured to download the entire rest of the layer in the background.

For a deeper understanding of the eStargz format and stargz snapshotter, see the stargz-snapshotter overview documentation.

SOCI format

SOCI is a technology open sourced by AWS that enables containers to launch faster by lazily loading the container image. SOCI works by creating an index (SOCI Index) of the files within an existing container image. SOCI borrows some of the design principles from stargz-snapshotter but takes a different approach:

Separate index: A SOCI index is generated separately from the container image and is stored in the registry as an OCI Artifact, linked back to the container image by OCI Reference Types.
No image conversion: This means that the container images do not need to be converted, image digests do not change, and image signatures remain valid.
Native Bottlerocket support: SOCI is natively supported on Bottlerocket OS.

For a deeper understanding of the SOCI format, see the soci-snapshotter documentation.

Building and deploying lazy-loaded images

Setting up snapshotters in EKS

When using EKS with containerd as the container runtime, you can configure remote snapshotters to enable lazy loading. Here’s how to set them up:

For stargz-snapshotter (eStargz): You need to install the containerd-stargz-grpc service first, then register it as a proxy plugin in containerd’s configuration:

# /etc/containerd/config.toml
[proxy_plugins]
[proxy_plugins.stargz]
type = "snapshot"
address = "/run/containerd-stargz-grpc/containerd-stargz-grpc.sock"

For detailed installation instructions, see the stargz-snapshotter installation documentation. The setup can be baked into an AMI for production use or tested via user data from node bootstrap scripts.

For SOCI snapshotter (Bottlerocket): On Bottlerocket nodes, enable the SOCI snapshotter via user data:

# Enable SOCI snapshotter
[settings.container-runtime]
snapshotter = "soci"

SOCI is natively supported on Bottlerocket, so no additional daemon installation is required.

Building lazy-loaded images

eStargz images can be built natively using Docker Buildx by setting the output compression to estargz:

docker buildx build 
  --platform linux/amd64 
  --output type=registry,oci-mediatypes=true,compression=estargz,force-compression=true 
  --tag $ECR_REGISTRY/airflow:$TAG 
  .

SOCI doesn’t require rebuilding images; you only need to generate a SOCI index for existing images. Since Docker doesn’t natively support SOCI index generation yet, workaround solutions include using the AWS SOCI Index Builder Using Lambda Functions or integrating SOCI index generation into your CI/CD pipeline as described in this blog post.

Key takeaway: Why we chose SOCI

We started our exploration with eStargz but ultimately chose SOCI for production deployment. The key reason is scalability and alignment with our strategy to use Bottlerocket OS for enhancing Kubernetes pod startup and security. SOCI is natively supported by Bottlerocket, which means service teams don’t need to set up and maintain the more complicated stargz snapshotter across all EKS clusters. This makes the implementation easier to maintain and provides better support from AWS.

Additionally, we learned that lazy loading doesn’t eliminate the time required to download image data; it redistributes it from startup time to runtime. While this dramatically improves cold start performance, it’s important to monitor application performance closely and tune configuration parameters based on your workload and infrastructure. We achieved a 60% improvement by optimizing SOCI’s parallel pull mode settings, demonstrating the value of proper configuration tuning.

Conclusion

Docker image lazy loading with SOCI offers a significant opportunity to improve the performance and efficiency of our services at Grab. Our testing and production deployments have shown:

4x faster image pull times on fresh nodes.
29-34% improvement in P95 startup times for production workloads.
60% improvement in image download times with proper configuration tuning.

The implementation path is clear, low-risk, and builds on proven components. This technology is production-ready, and we’re continuing to scale it across more services.

References

Databricks: Booting Databricks VMs 7x Faster for Serverless Compute - Industry case study showing how major tech companies achieve fast container startup at scale
BytePlus: Container Image Lazy Loading Solution - Enterprise implementation guide for lazy loading in production Kubernetes environments
AWS: Introducing Seekable OCI: Parallel Pull Mode for Amazon EKS - AWS’s guide to SOCI configuration and optimization

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

From deployment slop to production reality: How BriX bridges the gap with enterprise-grade AI infrastructure

Fri, 16 Jan 2026 00:23:00 +0000

Abstract

You’ve vibe-coded an AI assistant that’s a game-changer for your team. It works perfectly on your laptop. But when you try to deploy it company-wide, everything falls apart.

This is what is known as “deployment slop”—the messy reality when quick AI prototypes hit the enterprise world. Your tool suddenly becomes unreliable, insecure, and impossible to maintain. Different teams run different versions. Security flags it. IT won’t touch it. Your innovation dies.

BriX solves this. It’s a platform that takes your working AI prototype and makes it production-ready—without forcing you to become a full-stack developer. BriX handles the hard parts such as security, scaling, and data connections, so you can focus on building great tools. Switch between AI models like Claude or GPT with a click. Connect securely to your company’s data sources. Deploy once, and it just works—for everyone.

This article shows how BriX transforms AI deployment from an engineering bottleneck into a configuration task, enabling domain experts to ship enterprise-grade AI tools in days instead of months.

Introduction

Building AI tools has never been easier. With ChatGPT, Claude, and other Large Language Models (LLMs), anyone can prototype a useful AI assistant in an afternoon. Data analysts build metric query tools; product managers create research assistants. This rapid experimentation—”vibe coding”—has sparked innovation across organizations.

But then comes the hard part: deployment.

That brilliant tool you built on your laptop? It works great for you. But when your boss asks you to “roll it out to the whole company,” you hit a wall. Suddenly you need:

Security reviews (Is it leaking sensitive data?)
Reliability guarantees (What happens when 500 people use it at once?)
Access controls (Who can see what data?)
Audit trails (Who asked what, and when?)
Consistent behavior (Why does it give different answers to different people?)

Most builders aren’t DevOps engineers. They’re domain experts who had a good idea. So these tools either:

Never get deployed (innovation dies in a Jupyter notebook); or
Get deployed badly (creating “Deployment Slop”—a mess of insecure, unreliable scripts).

The three failure modes of deployment slop

The chaos problem: Everyone’s running a different version

Marketing copies your script and tweaks the prompts. Finance changed the model from GPT-4 to Claude because it’s cheaper. Sales adds their own data sources. Within weeks, you have:

Five different versions of “the same tool”.
Wildly different answers to the same question.
No one knows which version is “correct”.
Teams making decisions based on inconsistent data.

Potential risk: A senior executive receiving conflicting answers from different teams, resulting in a loss of trust.

The reliability problem: It works until it doesn’t

Your laptop script was built for one user (you). Now 50 people are using it simultaneously. The result:

Timeouts and crashes during peak hours.
No error handling (users see cryptic Python stack traces).
Rate limits hit on API calls.
No monitoring or alerts when things break.
You become the “on-call” support person for a side project.

Potential risk: The tool fails during a critical metric review leaving folks to find the solution manually.

The security problem: Accidental data leaks

Your prototype connects directly to production databases. It has your personal credentials hardcoded. There’s no:

Access control (everyone sees all data, including sensitive info).
Audit trail (no record of who queried what).
Data governance (PII might be exposed).
Compliance review (legal and security teams don’t even know it exists).

Potential risk: An employee inadvertently querying PII, resulting in a potential breach.

Who gets hit hardest?

This problem is especially painful for semi-technical builders—the domain experts who understand the business problem but aren’t DevOps engineers:

Product Managers who write SQL but not Kubernetes configs.
Data Analysts who know Python but not cloud security.
Marketing Ops who build dashboards but not CI/CD pipelines.
HR Analytics who understand people data but not infrastructure scaling.

The traditional solution is to “hand it to Engineering,” but they are backlogged for months. By the time they rebuild your tool “properly,” the business need has changed.

Solution: Enter BriX: From prototype to production in days, not months

BriX is a platform that solves the deployment problem by centralizing all the hard infrastructure work. Instead of forcing every builder to become a DevOps expert, BriX provides the production-ready foundation so you can focus on building great AI tools.

The core insight: Deployment doesn’t have to be an engineering problem. It can be a configuration problem.

What BriX does

Think of BriX as the “production layer” for AI tools. You bring your working prototype. BriX handles security, scaling, data connections, monitoring, audit trails, and consistent behavior across teams.

You configure. BriX deploys.

Figure 1. BriX infrastructure

The three core capabilities

Choose your AI model (Model agnosticism)

Different tasks need different models. BriX lets you switch between models with a dropdown—Claude, GPT, Gemini, or others. Test which works best. Change models without rewriting code. Optimize for cost vs. performance.

Example: Your finance tool uses GPT-4 for complex analysis, but a new better model is available. Change it in BriX with one click—no code changes needed.

Figure 2. Model selection interface

Connect to enterprise data securely (Model Context Protocols)

This is where BriX really shines. Your AI tool needs data—metrics, customer info, documentation. But connecting to enterprise systems securely is hard.

Model Context Protocols (MCPs) are BriX’s solution. Think of them as secure, pre-built connectors to your company’s data sources.

Why MCPs matter:

Security built-in: No hardcoded credentials, proper access controls.
Certified data: Connect only to approved, governed data sources.
No custom integration: Pre-built connectors, not custom API code.
Audit trails: Every query is logged automatically.

Example: Your marketing tool can query the metrics system to get conversion rates, search the knowledge base for campaign guidelines, and pull customer data from the data lake —all through secure, governed connections.

Technical note: MCPs use a standardized protocol, so adding new data sources doesn’t require rebuilding your tool. BriX handles the complexity.

Figure 3. BriX chat user interface

Ensure consistent behavior (System prompts and context)

Remember the “chaos problem” where everyone runs different versions? BriX solves this with centralized configurations by allowing you to lock it down for the users:

System prompts: Define your AI’s personality, tone, and guardrails once.
Context files: Upload reference documents that every instance uses.
Global enforcement: All users get the same behavior automatically.

Example: Your customer support tool has a system prompt that says “Always be empathetic, never make promises about refunds, escalate to humans for complaints.” Every support agent’s AI follows these rules—no exceptions.

Figure 4. The builder’s view

Additional feature: Flexible interfaces and collaboration

Beyond the core infrastructure, BriX offers flexible ways to consume these tools. BriX goes beyond conversational interfaces—you can host custom UIs built with any frontend framework while BriX handles the AI backend. Users can also generate and share analyses as persistent reports, turning individual queries into institutional knowledge accessible across teams via shareable links—complete with data, visualizations, and AI insights.

Figure 5. Share feature interface

The BriX workflow: A real example

Let’s see how a product manager would use BriX:

Step 1: Upload your prototype

You’ve built a Jupyter notebook that queries metrics and generates reports.
Upload it to BriX (or connect your GitHub repo).

Step 2: Configure (Not code)

Choose your AI model: Claude 4.5 Sonnet
Connect data sources: Midas (metrics), Hubble (data lake)
Set system prompt: “You’re a data analyst. Always cite sources. Format numbers with commas.”
Upload context: Your company’s metrics definitions guide.

Step 3: Lock

Lock all the configurations of your BriX.
Share with your team.

Figure 6. BriX landing page

Figure 7. The user’s view (Locks and edit not available)

Step 4: It just works

Certification by design with Brick Quality residing with the brick admin.
Focused use cases have specific system prompts, context - minimizing hallucination concerns.
People can use it simultaneously (BriX handles scaling).
Everyone gets consistent answers (same model, same prompts).
All queries are logged (audit trail automatic).
The security team is happy (proper access controls).
You’re not on-call (BriX monitors and alerts).

Time to production: 3 Days, not 3 months.

Under the hood: The BriX architecture

BriX is built on a synchronous streaming architecture—a design that prioritizes real-time responsiveness without sacrificing enterprise security. Think of it like a live sports broadcast: you see the action as it happens, not a delayed replay.

Figure 8. BriX architecture

Here’s how a single user request flows through the system, from question to answer.

The request journey: Six layers

User Question
      ↓
[1] The Frontend — Real-Time Streaming
      ↓
[2] The Gateway — FastAPI Backend
      ↓
[3] The Brain — LangGraph Orchestration
      ↓
[4] Memory — Hot and Cold Storage
      ↓
[5] Security — Identity Propagation ("On-Behalf-Of" Flow)
      ↓
[6] Data Processing — Full Context, Not Fragments
      ↓
Response streams back to user in real-time

Let’s break down each layer.

Layer 1: The frontend — Real-time streaming

Technology: React (TypeScript)
User experience: ChatGPT-style interface

The User types a question: “What’s our conversion rate in Singapore last month?”

The frontend opens a persistent connection to BriX servers. As the AI processes the question, updates stream back instantly:

“🤔 Thinking…”
“📊 Querying metrics database…”
“✅ Found 3 relevant data points…” [Final answer appears]

Why streaming matters:

Traditional approach	BriX approach
❌ User waits 30 seconds, sees nothing, then gets full answer (feels broken).	✅ User sees progress every second (feels responsive and trustworthy).

Technical implementation: Server-Sent Events (SSE) for real-time updates without WebSocket complexity.

Layer 2: The Gateway — FastAPI backend

Technology: FastAPI (Python)
Role: Central traffic controller

What it does:

Receives all incoming requests
Authenticates users (checks SSO tokens)
Routes requests to the appropriate agent
Manages rate limiting (prevents abuse)
Handles errors gracefully

Why FastAPI?

⚡ Fast (async/await for concurrent requests)
🔒 Secure (built-in authentication)
📈 Scalable (handles thousands of concurrent users)

Layer 3: The Brain — LangGraph orchestration

Technology: LangGraph (AI workflow framework)
Role: The “main agent” that coordinates everything.

Think of LangGraph as a smart router that understands intent and delegates work.

Example flow:

User asks: “Compare our Singapore and Malaysia conversion rates, then explain why they differ”.

LangGraph analyzes the question:

Task 1: Query metrics (needs Midas MCP)
Task 2: Compare data (needs calculation)
Task 3: Explain differences (needs context/knowledge base)

LangGraph delegates to specialized “MCPs”:

Midas MCP: Queries Midas for conversion data
LLM Agent: Calculates the difference
Glean MCP: Searches knowledge base for regional factors

LangGraph synthesizes: Combines results into coherent answer Why modular “Bricks”?

✅ Reliability: Each Brick is specialized (fewer hallucinations)
✅ Maintainability: Update one Brick without breaking others
✅ Extensibility: Add new Bricks for new use cases

Layer 4: Memory — Hot and cold storage

BriX uses a two-tier memory system to balance speed and durability:

Hot memory (Redis):

⚡ Ultra-fast: In-memory storage (microsecond access).
🔄 Session management: Tracks active conversations.
🔒 Distributed locks: Prevents race conditions when multiple requests happen simultaneously.
💨 Temporary: Data expires after session ends.

Cold memory (PostgreSQL):

💾 Persistent: Data stored permanently
📜 Audit trail: Every query, response, and action logged
🔍 Searchable: Users can search past conversations
📊 Analytics: Track usage patterns and performance

Example scenario:

You ask BriX a question → Hot memory tracks your active session
You close the browser → Session data moves to cold memory
You return tomorrow → BriX loads your history from cold memory
You continue the conversation → New session in hot memory

Result: Fast responses + complete history + full auditability

Layer 5: Security — Identity propagation (“On-Behalf-Of” flow)

This is where BriX’s security model shines. Instead of using a single “service account” to access all data, BriX uses your credentials for every query.

How it works:

Step 1: Authentication (Login)

You log in via SSO (e.g., Okta, Azure AD)
BriX receives a secure token that represents your identity
This token includes your permissions (what data you can access)

Step 2: Identity propagation (Query execution)

You ask: “Show me customer revenue data”
BriX doesn’t use its own credentials to query the database
Instead, BriX carries your token to the data source
The data source checks: “Does this user have permission to see revenue data?”
- If yes → Returns data
- If no → Access denied

Step 3: Audit trail

Every query is logged with:
- Who asked (your user ID)
- What they asked (the question)
- What data was accessed (the query)
- When it happened (timestamp)

Why this matters:

Traditional approach	BriX approach
❌ Service account has access to ALL data.	✅ Each user only sees their authorized data.
❌ Can't tell who accessed what.	✅ Complete audit trail per user.
❌ Security team nervous about AI tools.	✅ Security team approves (same controls as existing tools).
❌ One compromised credential = full breach.	✅ Breach limited to single user's permissions.

Real-world example:

Finance analyst asks about revenue → Sees all financial data (authorized)
Marketing analyst asks same question → Sees only marketing budget (restricted)
Same AI tool, different permissions → Security enforced automatically

Technical term: This is called “identity propagation” or “on-behalf-of flow” in enterprise security.

Layer 6: Data processing — Full context, not fragments

The old way (Retrieval Augmented Generation (RAG)):

User asks a question.
System searches for relevant document chunks.
System sends top 5 chunks to AI.
AI answers based on fragments.

Problem: AI might miss context from other parts of the document.

The BriX way (Full context):

User uploads a document.
BriX feeds the entire document into the AI’s context window.
AI reads and understands the full document.
AI answers with complete context.

Why this works now: Modern AI models (Claude, GPT-4) have massive context windows (100K+ tokens). They can process entire documents, not just snippets—resulting in more accurate answers and fewer hallucinations.

Example:

Question: “What’s our refund policy for international orders?”

RAG approach: Finds 3 snippets about refunds → Might miss international-specific rules
BriX approach: Reads entire policy document → Finds exact international refund section

Architecture summary: Why this design works

Design choice	Benefit	User impact
Streaming architecture	Real-time feedback	Feels fast and responsive
Modular Bricks	Specialized agents	Fewer errors, more reliable
Hot/Cold memory	Speed + durability	Fast responses + full history
Identity propagation	User-level security	Only see authorized data
Full context processing	Complete understanding	More accurate answers

The result: An AI platform that feels as fast as ChatGPT but with enterprise-grade security and reliability.

What using BriX actually feels like

All the technical architecture is invisible to end users. Here’s what they actually see and experience.

What users see:

Visit BriX URL
Click “Log in with SSO” (uses your existing company login)
Redirects to familiar authentication screen
Logged in automatically

What users DON’T see:

No new account creation
No password to remember
No security questionnaire
BriX inherits your existing permissions automatically

Why this matters: Zero onboarding friction. If you can access your email, you can use BriX.

The app library: Your company’s AI tools

What users see: Company’s internal “App Store” for AI tools.

Each tool is pre-configured and vetted
Click to launch (no installation)
Tools are tailored to company’s data and processes

Using a Tool: ChatGPT-style interface

What users see: See the AI “thinking” and “querying”—no black box waiting. Builds trust (“I can see it’s actually checking the data”).

Source citations: Every answer includes a data source. Click to view original data. No “trust me” answers.

Conversational follow-ups: “Why did it increase?” | “Compare to Malaysia” | “Show me a chart”

BriX remembers the context.

Data upload: Drag, drop, analyze

What users have:

Files are processed securely (encrypted).
AI reads the full content.
Users can ask questions about the files.
Files are only visible to the uploader (privacy).

Trustworthy answers: Certified data, not hallucinations

The problem BriX solves:

ChatGPT/Generic AI	BriX
❌ Makes up data ("hallucinations")	✅ Only uses your company's real data
❌ No source citations	✅ Every answer cites the source
❌ Can't access internal data	✅ Connects to your data lakes, metrics, docs
❌ Same answer for everyone	✅ Respects your permissions (you only see your data)

Why users trust it:

✅ Specific number (not vague)
✅ Source cited (can verify)
✅ Certified data (governance approved)
✅ Timestamp (know it’s current)
✅ Can export/verify (transparency)

The impact: What BriX actually changes

BriX shifts how organizations build AI tools. Here’s what that looks like in practice.

From months to days

Traditional path	BriX path
1. Domain expert has idea.	1. Domain expert has idea
2. Submits request to engineering.	2. Configures the idea in BriX.
3. Waits in backlog (weeks to months).	3. Tests with small group.
4. Engineering rebuilds it "properly".	4. Deploys to production.
5. Tool finally launches.	5. Shares with team.

What changes:

⚡ Speed (hours instead of months)
👤 Ownership (domain experts maintain their tools)
🔄 Iteration (refine based on feedback immediately)
✅ Success rate (ideas get tested instead of dying in backlog)

True democratization

Who builds tools with BriX:

The shift isn’t just engineers anymore. We’re seeing:

Product managers building feature analysis tools.
Data analysts creating custom dashboards.
Marketing ops building campaign trackers.
Sales ops creating pipeline monitors.
HR analytics building retention tools.

What this means:

Domain expertise stays with domain experts (no translation loss). Engineering focuses on platforms (not individual tool requests). Innovation happens at business speed (not constrained by engineering capacity).

The reality check:

Not every domain expert will build tools (and that’s fine). Some tools still need engineering (complex integrations, custom logic). But the bottleneck shifts from “engineering capacity” to “good ideas.”

Flexibility without fragility

What you can change without rewriting code:

Swap AI models:

Dropdown menu selection (GPT-5, Claude, Gemini)
Different teams can setup different models for their BriX
Can test new models without rebuilding tools

Add data sources:

New MCP connector (one-time setup)
All existing tools can access the new source
No need to update individual tools

Update behavior globally:

Change system prompt in one place
All instances follow new rules immediately
Useful for policy updates, compliance changes

Real example: When a company needs to update data access policies:

Traditional approach: Update each tool individually (days/weeks)
BriX approach: Update system prompt once (minutes)

Security that enables (Not blocks)

The traditional trade-off:

Secure tools = slow approval, limited functionality
Fast tools = security nightmares, compliance issues

BriX’s approach: Security is built into the platform, not added per tool.

What’s automatic:

SSO authentication (no passwords to manage)
Identity propagation (users see only their authorized data)
Audit logging (every query tracked)

What this changes:

Security team reviews the platform once (not every tool)
Builders don’t need to become security experts
Compliance is automatic (audit trails, access controls)
Tools can move fast without sacrificing governance

Real impact: Security teams that previously rejected most AI proposals can pre-approve BriX. Then tools built on BriX inherit those security controls automatically.

BriX will:

Provide infrastructure for rapid AI tool deployment.
Make it easier for domain experts to productionize ideas.
Centralize security and governance.
Reduce (not eliminate) the engineering bottleneck.
Give you a path from prototype to production.

The real impact

The biggest change isn’t technical. It’s organizational.

BriX changes the conversation from:

“Can engineering build this for us?”

to:

“Let me try building this and see if it works”

That shift—from asking permission to testing ideas—is the real impact. Some ideas will fail. That’s fine. The cost of testing is now low enough that failure is acceptable.

The ideas that succeed can scale immediately. That’s what matters.

Adoption: From zero to production reality

This isn’t theoretical. Real teams are using BriX right now:

The Universal Playground - Data analysts and product managers drop in to run quick analyses or ask questions—no setup, no credentials to configure. Just connect and go. It’s become the default “let me check something” tool.
Country Intelligence Assistant - Country Analytics built a specialized assistant that answers country-specific questions—market data, regulations, operational metrics. It’s now the go-to source for regional teams making local decisions.
Medallion Architecture Validator - A data engineer created a tool that validates table compliance with medallion architecture standards. What used to take manual reviews now happens instantly. Teams query it before deployments to catch issues early.
Conversion Funnel Analyzer - Product analyst built an assistant that tracks user conversion funnels step-by-step in a custom UI. Marketing and product teams use it daily to understand drop-off points without writing SQL.

Learnings/conclusion

The promise: Anyone can build AI tools. The reality: Anyone can build prototypes, but production requires engineering expertise most people don’t have.

BriX bridges that gap.

What BriX does

For domain experts: Build and own tools without becoming DevOps experts. Iterate in hours, not months. For engineering: Stop being the bottleneck. Secure the platform once, not every tool. For the organization: Test more ideas. Scale what works. Automatic security and compliance.

Why BriX works: Three design principles

Building BriX taught us that successful enterprise AI platforms require:

Specialization over generalization Users prefer 5 focused tools over 1 unpredictable tool. That’s why BriX uses modular “Bricks”—each specialized for specific tasks (data analysis, trend detection, document search). Narrow scope = better reliability.

Enablement over control Deployment slop isn’t a problem to eliminate—it’s evidence of demand. Don’t kill experimentation; provide the path to production. BriX lets teams experiment locally, then offers the infrastructure to scale what works.

Reliability over features Users forgive missing features. They don’t forgive unreliability. One slow response or wrong answer = they never come back. That’s why BriX prioritizes real-time streaming, certified data sources, and source citations over adding more capabilities.

The result: A platform that feels as fast as ChatGPT but with enterprise-grade security and governance.

Configure once. Analyze everywhere. Act fast.

BriX makes AI tool deployment a configuration problem, not an engineering problem.

Your domain experts have the ideas. BriX gives them the path to production.

What’s next

BriX solves deployment, but we’re not stopping there.

More data sources

We’re expanding the MCP library. If our company uses it, BriX should connect to it—securely and without custom engineering work.

Bring your own code

For technical builders who want custom logic without DevOps headaches, we’re launching a mono repo setup:

App owners own: Their code and business logic
BriX owns: Platform, security, scaling, maintenance

More BriX

Onboarding more BriX for different tech and non-tech personas.

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Demystifying user journeys: Revolutionizing troubleshooting with auto tracking

Tue, 23 Dec 2025 00:23:00 +0000

Introduction

Troubleshooting critical issues by deciphering a user’s journey on the Grab app is an extremely challenging task. With countless user journeys and multiple paths through the User Interface (UI), it’s akin to searching for a needle in a vast haystack. This challenge frequently resonates with us, the dedicated developers at Grab, as we strive to understand user behaviors, views, and interactions.

The challenge

The distinction between resolving an issue effectively versus spending hours on a wild goose chase is understanding our user journey in real-time.

The development team initially attempted to address the issue of the incomplete user journey tracking by implementing a system where a click stream event would be sent with every user interaction. However, this approach presented significant challenges due to the sheer volume of UI components—often numbering in the hundreds—and the reliance on individual developers to correctly instrument each one.

A common pitfall was that developers would occasionally overlook or forget to instrument certain user interactions, leading to breaks in the recorded user journey. This created a highly frustrating situation for both the development and product teams, as the integrity of the user journey data was consistently compromised. Despite continuous efforts to patch these bugs and address the omissions, the team found themselves in a perpetual state of reaction, constantly trying to catch up with newly discovered breaches rather than proactively preventing them. This reactive approach consumed valuable resources and hindered the ability to gain a complete and accurate understanding of user behavior.

Diagnosing system failures, application bugs, or poor user experiences in complex applications becomes inefficient without real-time performance metrics and detailed session tracking. When engineering teams rely on outdated or fragmented data, they are forced to piece together issue narratives reactively, long after the issues occur. This significantly delays the Mean Time To Resolution (MTTR). Such a reactive approach leads to increased downtime, higher operational costs, customer dissatisfaction, and a waste of developers’ time, as they spend more time “hunting” for clues rather than deploying solutions or new features.

Our ‘Eureka’ moment: AutoTrack SDK

The pivotal breakthrough that provides our unique advantage was the creation of auto tracking user journeys—our “Eureka” moment. To deliver this, we developed the new Software Development Kit (SDK) called AutoTrack.

AutoTrack is system that comprehensively records application state, UI view state, as well as user interactions - a solution that pieces together a chronicle of the user journey, from launch to interactions, as they navigate through the screens. AutoTrack SDK is built on the three core pillars:

Application state
User interactions
UI screens

Let’s delve deeper into the mechanics of how this operates.

Application state

Understanding the application state is fundamental to comprehending user behavior and, consequently, executing effective troubleshooting. The application state provides crucial insights into how a user interacts with the app, particularly concerning its visibility and how it was initiated. This encompasses tracking when the app moves between the background and foreground, as well as the various launch mechanisms.

Figure 1. Application state user flow.

Key aspects of application state that are vital to monitor include:
Application lifecycle transitions:

Background state: When the app is running but not actively displayed to the user (e.g., the user switches to another app, or the device is locked). Understanding how frequently and for how long an app resides in the background can inform power consumption analysis and the effectiveness of background tasks.
Foreground state: When the app is actively in use and displayed to the user. Monitoring transitions into and out of the foreground provides a real-time view of user engagement.
Inactive state: A temporary state where the app is in the foreground but not receiving events (e.g., an incoming call temporarily interrupts the app).
Suspended state: An app that is in the background and has been explicitly suspended by the operating system to free up resources.
Terminated state: When the app has been completely closed or crashed. Differentiating between intentional termination and crashes is critical for identifying stability issues.

Application launch mechanisms:

The way an app is launched significantly impacts the initial user experience and can influence subsequent interactions. Tracking these different launch types is essential for understanding user entry points and for debugging issues that might be specific to a particular launch method.

Explicit user launch: This is the most straightforward launch mechanism, where the user directly taps on the app icon from their device’s home screen or app drawer. This indicates a deliberate intent to use the app and often signifies a primary entry point for regular users.
Deeplinks: Deeplinks are URLs that, when clicked, open a specific page or section within a mobile app rather than a web page. They are powerful tools for enhancing user experience and engagement by providing direct access to relevant content.
Push notifications: Push notifications are messages sent by an app to a user’s device even when the app is not actively in use. Tapping on a push notification often launches the app and directs the user to a specific context related to the notification’s content.

Figure 2. Code sample for tracking application lifecycle transition.

User interactions

Real-time session tracking is a crucial component in understanding user behavior and optimizing app performance. By meticulously tracking a wide array of user interactions, the system provides invaluable insights into how users navigate and engage with the app. This granular data forms the bedrock for constructing comprehensive user journeys, allowing development teams to visualise the path a user takes from their initial entry point to achieving their goals within the app.

This deep understanding of user interactions is the most important pillar in creating accurate and insightful user journey maps. These maps, in turn, are instrumental in identifying patterns of user behavior, both positive and negative. For instance, tracking helps to identify pain points, bugs, or areas of confusion that might lead to user frustration or abandonment.

Figure 3. Sample code for real-time session tracking.

UI screen

The system leverages lifecycle events from UIViewController (iOS), Activity (Android), and Fragments (Android) to accurately identify and track which specific screen is currently displayed to the user. This granular level of screen tracking is crucial because it significantly enriches the contextual information available to us. By understanding the precise UI that users are interacting with, we can account for the dynamic nature of our app. Different geographical regions, diverse user segments, and varying operational scenarios can lead to distinct user interfaces being presented. This capability ensures that our analysis and troubleshooting efforts are always based on the actual user experience, allowing for more precise problem identification and more effective solutions.

Figure 6. Sample code of UIViewController configuration.

UI screen data

On top of that, whenever the screen appears, we capture the screen metadata where we read the full screen hierarchy. With the Screen hierarchy JSON data at hand, we employ it to train an AI model. This model, consequently, can generate an HTML file, which mirrors the user’s screen and interaction.

Disclaimer: information is redacted in compliance with GDPR/PDPA, personal data protection laws.

Figure 7. Screen hierarchy.

Applications of AutoTrack

Key applications of AutoTrack data:

Reconstructing user journeys and reproducing elusive bugs: One of the most significant benefits of AutoTrack is its ability to meticulously record user interactions within the app. This detailed session data allows our teams to precisely recreate the user journey that led to a reported issue. For bugs that are notoriously difficult to reproduce, this capability is a game-changer, eliminating hours of manual guesswork and dramatically accelerating the identification and resolution of underlying problems.
Automated issue assignment: When an issue is reported, AutoTrack data can be leveraged to automatically assign it to the most relevant team. By analysing the context of the issue within the recorded session, including the specific features or modules involved, the system can intelligently route the problem to the engineers best equipped to address it. This automation reduces triage time, ensures issues are handled by subject matter experts, and improves overall response efficiency.
Automating UI test case generation: The rich dataset provided by AutoTrack offers a powerful foundation for automating the creation of UI test cases. By observing how users interact with the interface, we can automatically generate test scripts that mimic real-world usage patterns. This not only speeds up the testing phase but also leads to more comprehensive test coverage, identifying edge cases and user flows that might otherwise be missed by manually written tests.
Understanding analytics event triggers: AutoTrack data provides a granular view into when and why specific analytics events are triggered within the application. This allows us to validate the accuracy of our analytics instrumentation, ensure that events are firing as expected, and gain deeper insights into user behavior. By understanding the precise context surrounding event triggers, we can refine our data collection strategies and derive more meaningful insights from our analytics.

Key takeaways and what’s next

AutoTrack replaces fragile manual instrumentation with a unified, real-time view of application state, screen context, and user interactions. That end-to-end trace makes elusive bugs reproducible, routes issues to the right owners, and seeds reliable UI tests—turning guesswork into grounded evidence so teams can ship fixes faster and with greater confidence.

Looking ahead, we are expanding AutoTrack across surfaces and deepening the context it captures—pairing sessions with network and performance signals, strengthening privacy guardrails, and integrating with automated triage and test generation. Look forward to reading more of our deep dives on auto-generated UI tests and how these journeys will power proactive quality across Grab’s app.

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How Grab is accelerating growth with real-time personalization using Customer Data Platform scenarios

Thu, 18 Dec 2025 00:23:00 +0000

Introduction

Delivering personalized user experiences in real-time is central to Grab’s strategy, but achieving this at scale poses significant engineering challenges. Grab’s Customer Data Platform (CDP) and Growth team has successfully delivered several real-time campaigns, driving significant business impact through enhanced personalization. These initiatives include high-impact use cases like immediate mall offers, timely traveler recommendations, precise ad retargeting, and proactive interventions during key user journey moments. At the core of these successes is Grab’s CDP, which rapidly deploys advanced real-time personalization via a powerful new capability called “Scenarios.”

About Grab’s CDP

Grab’s CDP is a centralized, reliable repository for user attributes, designed for freshness, governance, and reusability. Built on Grab’s Signal Marketplace framework, the CDP streamlines data management through automation and integration, supporting seamless interactions with internal services and toolings that power marketing, experimentation, ads, Machine Learning (ML) features, and external platforms, including Facebook, Google Ads, and TikTok.

The platform currently manages over 1,000 batch user attributes for Passengers, Drivers, and Merchants, powering diverse use cases from targeted marketing campaigns to operational decision-making across Grab’s entire ecosystem.

The need for real-time personalization

In our current CDP setup, user segments are primarily created for targeting using batch attributes that update once daily. While these batch updates provide valuable historical insights, they are not suitable for scenarios requiring real-time responsiveness. This delay prevents timely engagement with users, particularly when immediate actions can significantly enhance user experiences and conversion rates.

For example, when travelers land at an airport, they immediately benefit from timely suggestions for rides, dining options, or local attractions. Traditional batch processing cannot deliver the agility and responsiveness required for these dynamic scenarios.

Historically, real-time personalization at Grab relied heavily on engineering resources, which resulted in limited scalability and agility. Marketers and product teams often found themselves blocked by engineering bandwidth constraints, restricting experimentation and innovation.

Problem statement

The limitations of Grab’s existing personalization frameworks include:

Batch attribute delays: Daily updates are insufficient for scenarios requiring immediate user responses.
Limited dynamic enrichment: Difficulties in dynamically integrating real-time events with historical user data, weakens personalization effectiveness.
High engineering overhead: Custom solutions require extensive resources, limiting agility and innovation.

To overcome these challenges and support Grab’s vision for comprehensive personalization – including proactive recommendations and assistance – CDP needed robust real-time capabilities.

CDP Scenarios: Real-time personalization made simple

The Scenario feature revolutionizes real-time targeting within the CDP by utilizing user-initiated events, geo-fencing, historical profile data, and on-the-fly predictions. This empowers the business to deliver easy, quick, and flexible personalization without the need for complex engineering efforts.

Scenarios enable innovative use cases such as these:

Mall personalization: Real-time personalized offers upon arrival.
Traveler assistance: Immediate recommendations at airports or hotels.
Ad retargeting: Enhanced real-time ad targeting.
Conversion optimization: Timely intervention during user drop-off points.

Imagine predicting a user’s intent to drop off at a mall using both real-time and historical context. For instance, when a user books a ride to a mall, factors such as destination, time, cuisine preferences, and past behavior (e.g., affluence level) can help predict whether the user’s purpose is retail therapy, grocery shopping, or dining out. This prediction accounts for elements like time of day, day of the week, and mall location. Grab’s engineering teams can leverage this predicted intent (signal) to offer personalized actions, such as GrabPay discounts for shopping or exclusive dining offers for dinner.

Figure 1. Scenario in CDP.

Key features

Event-driven personalization: Real-time Scenarios triggered by Scribe events (Grab’s comprehensive event collection and tracking platform) combined with geo-fencing.
Historical context integration: Optionally enrich Scenarios using historical CDP data.
Predictive modeling: Deploy pre-trained models for instant user behavior predictions.
Self-serve graphical user interface (GUI): Enable marketers to create complex event sequences and validate Scenarios with synthetic data processed through Flink pipelines.
Headless application programming interfaces (APIs): Allow programmatic access and management of Scenarios.

Figure 2. Attributes for a scenario in CDP.

Self-serve Scenario creation

We designed an intuitive self-serve UI, embedded within the Grab app, empowering marketers to quickly define and deploy Scenarios. Users can specify event triggers, configure geo-fencing, incorporate historical user attributes, and select predictive models. Marketers can also validate Scenarios using synthetic data before deployment, ensuring accurate and realistic outcomes.

How it works:

Select event triggers: Choose predefined events or define custom intra-session sequences via the GUI.
Configure geo-fencing: Define Scenario activation locations, like airports or malls.
Include historical attributes (optional): Utilize batch attributes from the CDP to enrich Scenarios.
Select predictive models (optional): Train custom classifiers or pick from pre-trained Catwalk models.
Define data sink: Choose between Amphawa (DynamoDB), Kafka, or both; potentially extendable to external destinations (e.g., Appsflyer).
Once configured, metadata synchronizes automatically with our streaming service, and Scenarios become available for real-time consumption within an hour.

Proven impact: Real-world success

CDP Scenarios are already delivering measurable business results, with over 12 live production implementations. For instance, in a case study addressing Grab Unlimited subscription signup abandonment, we leveraged CDP Scenarios to increase signups by engaging users in real time within 15 minutes of them leaving the signup process.

Figure 3. Grab Unlimited sign-up journey.

To enhance conversion rates, personalized real-time nudges were deployed through Scenarios. For example, users who started the signup process but failed to complete it within 15 minutes received a follow-up notification, prompting them to finalize their registration.

Figure 4. Scenario flow for Grab Unlimited registration.

This scenario alone achieved more than a 3% uplift in subscriber conversions vs non-real-time acquisition campaigns, demonstrating Scenarios’ potential to significantly boost business outcomes.

Technical architecture: Low latency, high reliability

Figure 5. High-level scenario flow. Scenarios are designed for low latency (under 15 seconds) and high reliability.

Event registration: Popular UI events from Scribe are whitelisted and immediately available; custom events are onboarded via the CDP web portal.
Scenario creation: Users configure Scenarios through a user-friendly GUI, defining events, historical contexts, and predictive models.
Real-time Flink processing: Incoming events trigger Scenarios, evaluating user historical data via StarRocks and performing real-time predictions using pre-trained models.
Real-time data sync: Outcomes are synced back to Kafka or Amphawa (Grab’s internal feature store built on AWS DynamoDB), enriching data for use by subsequent services.
Consumption by downstream services: Kafka streams or CDP’s Profile SDK facilitates immediate, personalized user experiences.

Advancing the future of real-time personalization

As we continue to innovate, we are focused on enhancing the capabilities of CDP Scenarios to support more complex and scalable personalization use cases. Here are some key areas of improvement we are exploring:

Optimized Scenario sharding for scalable processing: To accommodate the growing number of use cases, we plan to scale and orchestrate our Flink pipeline fleet in a headless manner. This approach will improve system stability and enable seamless management of complex Scenarios across the pipeline.
Enhanced signal distribution across multiple destinations: Currently, Scenario outputs are limited to a single topic or sink. To address the increasing diversity of use cases, we aim to expand signal distribution, allowing downstream consumers to access Scenario outcomes through multiple scalable and reliable channels.
Advanced scheduling and delayed triggering: While real-time computation of Scenario signals is effective, certain use cases require delayed activation for maximum impact. We are exploring ways to compute signals instantly but trigger actions at scheduled times, such as sending a push notification for booking a return Grab ride based on the average wait time at the drop-off location.

Conclusion: Revolutionizing real-time personalization

The launch of CDP Scenarios represents a significant milestone for Grab, paving the way for scalable, efficient, and user-friendly real-time personalization. Initial successes have demonstrated its immense potential, delivering notable improvements in user engagement and conversion rates. Looking ahead, we are committed to continuously advancing Scenarios by expanding its features, integrations, and applications to further elevate user experiences across the Grab ecosystem.

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

A Decade of Defense: Celebrating Grab's 10th Year Bug Bounty Program

Mon, 01 Dec 2025 00:00:10 +0000

Introduction

Ten years ago, we launched our bug bounty program in partnership with HackerOne. Beyond a security initiative, it represented an open invitation to collaborative development. As pioneers in Southeast Asia, we began the program with 23 initial researchers, and it has since evolved into a global community of security researchers.

The strategic structure and scope of our Bug Bounty Program, combined with our continuous innovation and experimentation, have successfully captured the attention of the global security research community. Over the past decade, we have partnered with more than 850 active security researchers from HackerOne’s community of over 2 million cybersecurity professionals worldwide. These dedicated researchers work alongside us across borders and time zones, forming a collaborative defense network that helps protect over 187 million users throughout Southeast Asia. Their ongoing participation demonstrates both the maturity of our program and the trust we’ve built within the security research community.

This milestone reflects the strength of shared purpose and our sustained partnership with the HackerOne platform. It demonstrates the value of human connection and the collective understanding that security is stronger through collaboration. Here’s to a decade of partnership and to many more years of building a safer future, one collaboration at a time!

Figure 1. Ten years of achievements with our HackerOne partnership.

Evolution and growth: Adapting to a dynamic threat landscape

Over the past ten years, our program has consistently adapted to the dynamic threat landscape and integrated invaluable feedback from our research community. We have grown from a private initiative to a program that consistently ranks among the top 20 worldwide and among the top 3 in Asia on HackerOne. Key milestones from our journey include:

Expanding our horizons: Our scope significantly broadened in 2023-2024, continuously adding new assets and prominently including financial services in Indonesia and AI systems. This expansion provides researchers with more avenues to contribute to Grab’s security.
Focused mobile security: We introduced a dedicated bounty table for mobile-specific issues, recognizing the unique challenges of mobile security.
Incentivizing excellence: We regularly experiment with campaigns of various types and targets, diversifying our reward methods to include both financial rewards and recognition.
Evolving vulnerability focus: We’ve observed a significant shift in the types of vulnerabilities reported over the decade, moving from foundational issues in early years to more sophisticated and emerging categories recently.

Figure 2. The journey of our bug bounty program.

The global stage: Connecting with the best

Our program’s success is deeply rooted in its vibrant global community, which we actively foster through continuous engagement. Our strategy extends beyond the platform to major live hacking events, including the ThreatCon Live Hacking Event 2023 in Nepal and DEFCON 32’s Live Recon Village 2024 in Las Vegas. These initiatives have been instrumental in connecting us with a diverse pool of new talent and strengthening relationships with researchers across different continents. By meeting hackers where they are, we’ve not only brought new expertise into our ecosystem but also demonstrated our commitment to being an accessible and collaborative partner on a global scale.

The high participation and quality submissions from these events demonstrate the effectiveness of this approach. They’ve expanded our global security testing coverage and strengthened our standing within the worldwide cybersecurity community. Through ongoing interactions and submitted reports, we continue to see that security is a collaborative effort with no borders.

Exclusive anniversary celebrations: Global club campaigns

To commemorate our 10th anniversary, we launched three exclusive, invite-only campaigns with HackerOne’s regional clubs in Germany, Morocco, and India. These campaigns served as cultural exchanges, bringing fresh perspectives from outside our core Southeast Asian consumer markets. By engaging with these clubs, we expanded our researcher community and connected with security experts who understand different threat landscapes and methodologies, bringing outside perspectives to our systems.

In August, we also ran a broader anniversary campaign that drew significant participation from the researcher community, resulting in 461 submissions. xchopath was awarded the Best Hacker Bonus for their contributions during this campaign.

These campaigns expanded our global security testing coverage and strengthened relationships with international researcher communities. Beyond vulnerability reports, they functioned as knowledge-sharing initiatives. We connected directly with researchers to learn from their experience and feedback, creating a continuous loop of improvement. This international collaboration also informed our global expansion security strategy by providing insights into how different regions approach digital payments and authentication.

The anniversary campaigns allowed us to validate our security frameworks against diverse regulatory environments and advanced testing methodologies from established security markets, reinforcing our commitment to maintaining robust security standards.

Voices from our community

Behind every vulnerability report is a researcher who chose to help make Grab safer. Their perspectives reveal the human side of our security evolution. These individuals are not just cybersecurity experts; they are partners in our mission to protect millions of users and ensure a safe digital environment. Here are a few testimonies from participants in our past campaigns:

“The triage was very fast despite the time difference, which I really appreciated. The triaging experience was better than other programs. The huge scope and business portal with different user roles made it especially interesting to explore.” – ArtSec [H1 Germany club campaign participant]
“I liked that different countries have different features—this gives me more attack surface to explore. Response time was great, triage was very fast, and I appreciated Grab’s effort in providing fast responses. The scope was huge with a lot of wildcards for reconnaissance.” – Sicksec [H1 Morocco club campaign participant]
“More than 20 bugs were reported, and was particularly happy that bounties were being paid upon triage. The Germany team spent a lot of time on the educational part, especially for newcomers. Communication overall was very good, and the immediate response even outside working hours was really cool. SSO and authentication is my expertise and I liked that aspect of exploring the platform.” – Lauritz [H1 Germany club campaign participant]

The road ahead: Our commitment to a secure future

With a strong community of security researchers across countries and a decade of collaboration, we’ve built meaningful partnerships. Every vulnerability report represents trust, and every discovery reflects dedication to our shared mission. The program demonstrates our choice to build together rather than work in isolation, to protect rather than exploit, and to collaborate rather than compete.

While we celebrate our external community, the success of our program relies equally on our dedicated internal teams. Our cybersecurity teams form the operational foundation of this initiative. Their consistent responsiveness and researcher-focused approach have enabled vulnerability reporting to evolve into a genuine partnership, maintaining researcher trust and keeping Grab secure.

The next ten years will bring challenges we can’t yet imagine, from emerging threats in artificial intelligence to novel cryptographic approaches in a quantum-powered world. We will face them together as a community that spans cultures, time zones, and expertise.

Together, we’ll continue securing Southeast Asia’s digital future, one partnership, one discovery, one shared achievement at a time.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility, and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Real-time data quality monitoring: Kafka stream contracts with syntactic and semantic test

Wed, 26 Nov 2025 00:00:10 +0000

Introduction

In today’s data-driven landscape, monitoring data quality has become a critical need for ensuring reliable and efficient data usage across domains. High-quality data is the backbone of AI innovation, driving efficiency and unlocking new opportunities. As decentralized data ownership grows, the ability to effectively monitor data quality is essential for maintaining reliability in data systems.

Kafka streams, as a vital component of real-time data processing, play a significant role in this ecosystem. However, unreliable data within Kafka streams can lead to errors and inefficiencies for downstream users, and monitoring the quality of data within these streams has always been a challenge. This blog introduces a solution that empowers stream users to define a data contract, specifying the rules that Kafka stream data must adhere to. By leveraging this user-defined data contract, the solution performs automated real-time data quality checks, identifies problematic data as it occurs, and promptly notifies stream owners. This ensures timely action, enabling effective monitoring and management of Kafka stream data quality while supporting the broader goals of data mesh and AI-driven innovation.

Problem statement

In the past, monitoring Kafka stream data processing lacked an effective solution for data quality validation. This limitation made it challenging to identify bad data, notify users in a timely manner, and prevent the cascading impact on downstream users from further escalating.

Challenges in syntactic and semantic issue identification:

Syntactic issues: Refers to schema mismatches between producers and consumers, which can lead to deserialization errors. While schema backward compatibility can be validated upon schema evolution, there are scenarios where the actual data in the Kafka topic does not align with the defined schema. For example, this can occur when a rogue Kafka producer is not using the expected schema for a given Kafka topic. Identifying the specific fields causing these syntactic issues is a typical challenge.
Semantic issues: Refers to inconsistencies or misalignments between producers and consumers about the expected pattern or significance of each field. Unlike Kafka stream schemas, which act as a data structure contract between producers and consumers, there is no existing framework for stakeholders to define and enforce field-level semantic rules, for example, the expected length or pattern of an identifier.

Timeliness challenge in data quality monitoring: There is no real-time mechanism to automatically validate data against predefined rules, timely identify quality issues, and promptly alert stream stakeholders. Without real-time stream validation, data quality issues can sometimes persist for periods of time, impacting various online and offline downstream systems before being discovered.

Observability challenge for troubleshooting bad data: Even when problematic data is identified, stream users face difficulties in pinpointing the exact “poison data” and understanding which fields are incompatible with the schema or violate semantic rules. This lack of visibility complicates Root Cause Analysis and resolution efforts.

Solution

Our Coban platform offers a standardized data quality test and observability solution at the platform level, consisting of the following components:

Data Contract Definition: Enables Kafka stream stakeholders to define contracts that include schema agreements, semantic rules that Kafka topic data must comply with, and Kafka stream ownership details for alerting and notifications.
Automated Test Execution: Provides a long running Test Runner to automatically execute real-time tests based on the defined contract.
Real-time Data Quality Issue Identification: Detects data issues at both syntactic and semantic levels in real-time.
Alerts and Result Observability: Alerts users, simplifying observation of data quality issues via the platform.

Architecture details

The solution includes three components: Data Contract Definition, Test Execution & Data Quality Issue Identification, and Result Observability as shown in the architecture diagram in figure 1. All mentions of “Flow” from here onwards refer to the corresponding processes illustrated in figure 1.

Figure 1. Real-time Kafka Stream Data Quality Monitoring Architecture diagram.

Data Contract Definition

The Coban Platform streamlines the process of defining Kafka stream data contracts, serving as a formal agreement among Kafka stream stakeholders. This includes the following components:

Kafka Stream Schema: Represents the schema used by the Kafka topic under test and helps the Test Runner to validate schema compatibility across data streams (Flow 1.1).
Kafka Stream Configuration: Encompasses essential configurations such as the endpoint and topic name, which the platform automatically populates (Flow 1.2).
Observability Metadata: Provides contact information for notifying Kafka stream stakeholders about data quality issues and includes alert configurations for monitoring (Flow 1.3).
Kafka Stream Semantic Test Rules: Empowers users to define intuitive semantic test rules at the field level. These rules include checks for string patterns, number ranges, constant values, etc. (Flow 1.5).
LLM-Based Semantic Test Rules Recommendation: Defining dozens if not hundreds of field-specific test rules can overwhelm users. To simplify this process, the Coban Platform uses LLM-based recommendations to predict semantic test rules using provided Kafka stream schemas and anonymized sample data (Flow 1.4). This feature helps users set up semantic rules efficiently, as demonstrated in the sample UI in figure 2.

Figure 2. Sample UI showcasing LLM-based Kafka stream schema field-level semantic test rules. Note that the data shown is entirely fictional.

Data Contract Transformation

Once defined, the Coban Platform’s transformation engine converts the data contract into configurations that the Test Runner can interpret (Flow 2.1). This transformation process includes:

Kafka Stream Schema: Translates the schema defined in the data contract into a schema reference that the Test Runner can parse.
Kafka Stream Configuration: Sets up the Kafka stream as a source for the Test Runner.
Observability metadata: Sets contact information as configurations of the Test Runner.
Kafka Stream Semantic Test Rules: Transforms human-readable semantic test rules into an inverse SQL query to capture the data that violates the defined rules.

Figure 3. Illustration of semantic test rules being converted from human-readable formats into inverse SQL queries.

Test Execution & Data Quality Issue Identification

Once the Test Configuration Transformation Engine generates the Test Runner configuration (Flow 2.1), the platform automatically deploys the Test Runner.

Test Runner

The Test Runner utilises FlinkSQL as the compute engine to execute the tests. FlinkSQL was selected for its flexibility in defining test rules as straightforward SQL statements, enabling our platform to efficiently convert data contracts into enforceable rules.

Test Execution Workflow And Problematic Data Identification

FlinkSQL consumes data from the Kafka topic under test (Flow 2.2) using its own consumer group, ensuring it doesn’t impact other consumers. It runs the inverse SQL query (Flow 2.3) to identify any data that violates the semantic rules or that is syntactically incorrect in the first place. Test Runner captures such data, packages it into a data quality issue event enriched with a test summary, the total count of bad records, and sample bad data, and publishes it to a dedicated Kafka topic (Flow 3.2). Additionally, the platform sinks all such data quality events to an AWS S3 bucket (Flow 3.1) to enable deeper observability and analysis.

Result Observability

Grab’s in-house data quality observability platform, Genchi, consumes problematic data captured by the Test Runner (Flow 3.3).

Alerting

Genchi sends Slack notifications (Flow 3.5) to stream owners specified in the data contract observability metadata. These notifications include detailed information about stream issues, such as links to sample data in Coban UI, observed windows, counts of bad records, and other relevant details.

Figure 4. Sample Slack notifications

Observability

Users can access the Coban UI (Flow 3.4), displaying Kafka stream test rules and sample bad records, highlighting fields and values that violate rules.

Figure 5. In this Sample Test Result, the highlighted fields indicate violations of the semantic test rules.

Impact

Since its deployment earlier this year, the solution has enabled Kafka stream users to define contracts with syntactic and semantic rules, automate test execution, and alert users when problematic data is detected, prompting timely action. It has been actively monitoring data quality across 100+ critical Kafka topics. The solution offers the capability to immediately identify and halt the propagation of invalid data across multiple streams.

Conclusion

We implemented and rolled out a solution to assist Grab engineers in effectively monitoring data quality in their Kafka streams. This solution empowers them to establish syntactic and semantic tests for their data. Our platform’s automatic testing feature enables real-time tracking of data quality, with instant alerts for any discrepancies. Additionally, we provide detailed visibility into test results, facilitating the easy identification of specific data fields that violate the rules. This accelerates the process of diagnosing and resolving issues, allowing users to swiftly address production data challenges.

What’s next

While our current solution emphasizes monitoring the quality of Kafka streaming data, further exploration will focus on tracing producers to pinpoint the origin of problematic data, as well as enabling more advanced semantic tests such as cross-field validations. Additionally, we aim to expand monitoring capabilities to cover broader aspects like data completeness and freshness, and integrate with Gable AI to detect Data Transfer Object (DTO) changes and semantic regressions in Go producers upon committing code to the Git repository. These enhancements will pave the way for a more robust, multidimensional data quality testing solution across a wider range.

References

Driving Data Quality with Data Contracts: A Comprehensive Guide to Building Reliable, Trusted, and Effective Data Platforms by Andrew Jones

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

SpellVault’s evolution: Beyond LLM apps, towards the agentic future

Fri, 21 Nov 2025 00:00:10 +0000

Introduction

At Grab, innovation isn’t just about building new features; it’s about evolving our platforms to meet the changing needs of our users and the broader technological landscape. SpellVault, our internal AI platform, exemplifies this philosophy. When SpellVault was first launched, our vision was straightforward: empower everyone at Grab to effortlessly build and manage AI-powered apps without the need for coding. Built on the principles of Retrieval-Augmented Generation (RAG) and enhanced by plugin support, SpellVault rapidly evolved into a powerful productivity engine for the organization, enabling the creation of thousands of apps that drive automation, foster experimentation, and support production use cases.

As the AI landscape has evolved, SpellVault has grown alongside it. Initially launched as a straightforward no-code app builder for Large Language Models (LLMs), it has now evolved into a cutting-edge platform that embraces the agentic future—a future where AI goes beyond generating responses to reasoning, acting, and dynamically adapting through the use of tools and contextual understanding.

This article outlines SpellVault’s journey towards an agentic future and how we empower users to build AI Agents that are smarter, more adaptable, and ready for the future.

A no-code platform for building LLM apps

SpellVault was founded with a clear mission: to democratize access to AI for everyone at Grab, regardless of their technical expertise. Initially launched as a no-code LLM app builder, the platform was built on a foundation of RAG pipelines and basic plugin support.

Early on, we recognized that the true potential of AI apps extends beyond the capabilities of language models alone. Their real value lies in the ability to seamlessly interact with external systems and diverse data sources. This insight drove our commitment to minimizing barriers and ensuring users could access data from various sources with ease. From the very beginning, we centered our efforts on three key focus areas:

Comprehensive RAG solution with useful integrations

From the start, the SpellVault team prioritized enabling users to enhance their LLM apps with data through RAG. Rather than solely relying on the LLM’s internal information, we wanted the apps to ground their responses in up-to-date, contextually relevant, and factual information. SpellVault has built-in integrations with knowledge sources such as Wikis, Google Docs, as well as plain text and PDF uploads. These capabilities empower users to build assistants that reference relevant knowledge and provide more accurate, verifiable answers.

Plugins to fetch information on demand

To move beyond static knowledge retrieval, we needed a way for apps to act dynamically. This was made possible through SpellVault plugins—modular components that allow apps to interact with internal systems (e.g. service dashboards, incident trackers) and external APIs (e.g. search engines, weather data). Rather than being confined to their initial prompt and data, these plugins can fetch fresh information at runtime. From the available plugin types, users can create their own instances of plugins with custom settings, enabling highly specialized functionality tailored to their specific workflows. For instance, with SpellVault’s HTTP plugin, users can define custom endpoints and credentials, enabling their AI apps to make tailored HTTP calls during runtime. These custom plugins have become the backbone of many of our most impactful apps, empowering teams to seamlessly integrate SpellVault with their existing systems and processes.

Figure 1. SpellVault’s early architecture.

Making SpellVault accessible via common interfaces: Web, Slack, API

One of our primary goals was to make AI seamlessly accessible and useful within the tools users already use—whether it’s a browser or Slack. With SpellVault, users can make their AI apps in minutes and start using them via browser or Slack messaging immediately and intuitively, without requiring any additional setup. We also exposed APIs that enabled other internal services to integrate with SpellVault apps for a variety of use cases. This multi-channel approach ensured that SpellVault wasn’t just a standalone sandbox but a platform woven into existing tools and processes.

Users quickly adopted the platform, creating thousands of apps for internal productivity gains, automation, and even production use cases. The platform’s success validated our hypothesis that there was significant demand for democratized AI tools within the organization.

Figure 2. SpellVault’s web interface for LLM App configuration and chat.

Evolution over time

The AI landscape over the past few years has been defined by relentless change. New frameworks, execution paradigms, and standards have emerged in quick succession, each promising to make AI systems more powerful, more reliable, or more extensible. At Grab, we recognized that for SpellVault to stay relevant, it could not remain static. It needed to evolve in tandem with the ever-changing ecosystem, continuously incorporating valuable advancements while ensuring a seamless experience for our users.

This philosophy of continuous adaptation has guided SpellVault’s journey. From its early days as a simple RAG-powered app builder with a few plugins, the platform grew to support an extensive number of plugin types, richer execution models, and eventually a unified approach to tools. Each step was a response both to the needs of our users and to the shifting definition of what “building with AI” meant in practice. Rather than opting for a complete overhaul, SpellVault has embraced incremental advancements, ensuring that users can seamlessly benefit from new capabilities without disruption.

This approach to evolution has naturally positioned SpellVault to transition from a platform for LLM apps to one designed for AI agents. The following section delves into this transition in greater detail.

Expanding capabilities

Over time, we introduced numerous new capabilities to SpellVault, driven both by user feedback and our commitment to innovation and staying ahead of industry trends. For instance, we extended support for different plugin types, enabling integrations with tools like Slack and Kibana, and continuously added more integrations to enhance the platform’s versatility. We implemented auto-updates for users’ Knowledge Vaults, ensuring their data remained current. With more users building with the platform, ensuring the trustworthiness of responses generated by SpellVault apps became increasingly important. We included citation capability to mitigate some of that concern. Recognizing the need for more precise answers to mathematical problems, we developed a feature that enabled LLMs to solve such problems using Python runtime. Additionally, many users requested an automated way to trigger their LLM apps, which led to the creation of a Task Scheduler feature that allows LLMs to schedule actions based on natural language user input.

A significant milestone in SpellVault’s evolution was the introduction of “Workflow,” a drag-and-drop interface within the platform that empowered users to design deterministic workflows. These workflows enabled users to seamlessly combine various components from the SpellVault ecosystem—such as LLM calls, Python code execution, and Knowledge Vault lookups—in a predefined and structured manner. This enabled advanced use cases for many users.

Figure 3. Evolving tools landscape of SpellVault with increasing integrations.

Shifting the execution model

As SpellVault evolved, a fundamental shift took place in the way its apps were executed internally. We transitioned from our legacy executor system, which facilitated one-off information retrieval from the Knowledge Vault or user plugins, to a more advanced graph based executor. This empowered SpellVault’s app execution with nodes, edges, and states that supported branching, looping, and modularity. This laid the groundwork for more sophisticated agent behaviors, moving beyond the linear input-output paradigm.

This transformed all existing SpellVault apps into ‘Reasoning and Acting’ agents, better known as ReAct agents - a “one size fits many” solution that significantly enhanced the capabilities of these apps. By enabling them to leverage the Knowledge Vault and plugins in a more agentic and dynamic manner, the ReAct agent framework allowed apps to perform more complex tasks while seamlessly preserving their existing functionality, ensuring no disruption to their behavior.

In addition, the internal decoupling of the executor and prompt engineering components enabled us to design multiple execution pathways with ease. This allowed us to provide generic Deep Research capability to any SpellVault app via a simple UI checkbox, as well as sophisticated internal workflows that cater to high-ROI complex use cases like on-call alert analysis. The Deep Research capability came with SpellVault’s ability to search across internal information repositories (e.g., Slack messages, Wiki, Jira) within Grab, as well as searching online for relevant information.

Figure 4. SpellVault’s evolved architecture with more dynamic context gathering and advanced interaction modes.

Towards an agentic framework

Over time, several capabilities were added to SpellVault, including features like Python code execution and internal repository search. Initially, these functionalities were integrated directly into the core PromptBuilder class. For users, these features were primarily accessible through simple checkboxes in the user interface. As SpellVault gradually transitioned towards giving more agency to user-crafted apps, we recognized that these capabilities should instead be positioned as “Tools” for LLMs to use with greater autonomy, similar to how ReAct agent–backed apps have been using SpellVault’s user plugins. We also understood that this shift could bring a clearer mental model for users where they were no longer simply toggling features but creating AI agents with access to a defined set of tools. The agents could then decide when and how to use those tools intelligently to accomplish tasks, making the overall experience more natural and intuitive.

This recognition led to the consolidation of these scattered capabilities into a unified framework called “Native Tools.” These Native Tools, along with SpellVault’s existing user plugins—rebranded as “Community Built Tools”—formed a comprehensive collection of tools that LLMs could dynamically invoke at runtime. Despite being grouped under the same umbrella, a key distinction was maintained: Native Tools required no user-specific configuration (e.g., performing internet searches), whereas Community Built Tools were custom, user-configured entities (e.g., invoking specific HTTP endpoints) created from available plugin types, often requiring credentials or other personalized settings.

This consolidation of capabilities under a unified Tools abstraction and enabling SpellVault apps to invoke them with greater autonomy marked a pivotal milestone in the platform’s evolution. It meaningfully shifted SpellVault toward making agentic behavior more natural, discoverable, and extensible for every app.

Figure 5. SpellVault’s Unified Tools housing both Native Tools and Community Built Tools.

SpellVault as an MCP service

As we streamlined SpellVault’s internal capabilities into a unified tools framework, we also turned our focus outward to align with industry standards. The growing adoption of the Model Context Protocol (MCP) presented an opportunity for agents and clients to seamlessly interact without requiring custom integrations. To remain at the forefront of innovation, we adapted SpellVault to function as an MCP service, enabling it to actively participate in this evolving ecosystem. This extension brought two key advancements:

SpellVault apps as MCP tools: Each app created in SpellVault can now be exposed through the MCP protocol. This allows other agents or MCP-compatible clients, such as IDEs or external orchestration frameworks, to treat a SpellVault app as a callable tool. Instead of living only inside our web user interface or Slack interface, these apps become accessible building blocks that other systems can invoke dynamically.
RAG as an MCP tool: We extended the same idea to our Knowledge Vaults. Through MCP, external clients can search, retrieve, and even add information to Vaults. This effectively turns SpellVault’s RAG pipeline into an MCP-native service, making contextual grounding available to agents beyond SpellVault itself.

While building the SpellVault MCP Server, we also created TinyMCP - a lightweight open-source Python library that adds MCP capabilities to an existing FastAPI app as just another router, instead of mounting a separate app.

By exposing both apps and RAG through MCP, we shifted SpellVault from being a self-contained platform to becoming an interoperable service provider in the agentic ecosystem. Users still benefit from the no-code simplicity inside SpellVault. However, the output of their work, apps, and knowledge, are now usable by other agents and tools outside of it.

Conclusion

SpellVault’s evolution shows how a platform can adapt with the AI landscape while staying true to its original mission of making powerful technology accessible to everyone. What began as a no-code builder for LLM apps has steadily expanded into an agentic platform - one where apps can act with more intelligence, agency, and context and interact with the systems around them.

This progress wasn’t the result of a single breakthrough, but of steady, incremental improvements that introduced new capabilities while preserving ease of use. By layering in these advancements thoughtfully but boldly, SpellVault has managed to support more sophisticated agentic behaviors without compromising its original goal of democratizing AI at Grab.

Join us

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!