Cash App Code Blog

Cash Android Moves to Metro

2025-11-18T00:00:00+00:00

Cash Android has officially migrated to Metro - a modern dependency injection framework developed by Zac Sweers (read Zac’s Introducing Metro blog post). In this article, we’ll discuss the reasoning behind this change, explain how we approached the migration and tackled the technical challenges we faced, and share the results.

But before we start…

A trip down memory lane

Android teams at Block have a long history of using and building dependency injection frameworks.

Back in 2012 Square released Dagger. Over time, Dagger became the industry standard, and in 2018 it transitioned under Google’s stewardship to become the officially recommended dependency injection solution for Android. Dagger 2 has compile-time dependency graph validation, which proved extremely valuable as Cash Android grew.

2020 was the birth year of Anvil, a Kotlin compiler plugin and a suite of annotations to make it easier to extend and manage large Dagger graphs. The Cash Android team happily adopted Anvil, which helped us keep our ever-growing DI graph in check and improved our build speeds.

Fast forward to 2025, and our dependency injection setup still felt pretty solid: we could iterate with confidence, our build speeds were fine, so…

Why change?

The industry is moving fast.

Today, Cash Android codebase is almost 100% Kotlin. Dagger, our main dependency injection solution, is still very much a Java library: its annotation processor requires kapt to process Kotlin code, and it generates Java code that needs to be compiled with javac. The whole build pipeline is complex which slows down our builds.

Kotlin 2.0 was released back in 2024, with K2 - the next version of the compiler with improved performance and IDE integration - reaching stability. While we’ve upgraded to Kotlin 2.0 a while ago, we weren’t able to upgrade to K2 and had to keep the language version setting at 1.9, as Anvil didn’t support K2 yet. Since Anvil is a compiler plugin, adding K2 support required significant effort. As the Anvil team worked on adding support, Metro started gaining traction. Evaluations done by Cash and Square teams convinced us that Metro is well aligned with our long term vision for dependency injection, and therefore we decided to adopt it. As a result of this decision, Anvil transitioned to maintenance mode.

So what is Metro?

According to Metro’s documentation:

Metro is a compile-time dependency injection framework that draws heavy inspiration from Dagger, Anvil, and Kotlin-Inject. It seeks to unify their best features under one, cohesive solution while adding a few new features and implemented as a compiler plugin.

As a compiler plugin, Metro adds minimal build time overhead, noticeably improving performance. It ships with comprehensive interoperability tooling: while Metro has its own DI annotations, such as @Inject and @Provides, it can be configured to “understand” similar annotations from Dagger and Anvil, meaning we wouldn’t need to change every single file that uses those annotations during migration. And the fact that Metro is a Kotlin-first framework built for K2 means it can leverage modern language features to offer better API and developer experience. There was a lot to be excited about, and so we embarked on the journey to gradually and safely migrate Cash Android to Metro.

What did the migration look like?

Today, Cash Android is a huge 1500-module Android project serving tens of millions of customers every month, so we knew we couldn’t just YOLO rewrite everything and push the “ship” button - we needed a plan to ensure the migration is performed and rolled out as safely as technically possible.

Metro interop

We knew that Metro’s interop functionality will be the key to success, and we theorized that if we’re lucky, we should be able to get our code to a state where it can be built with both Dagger/Anvil and Metro, gated by a Gradle property. And so we introduced a Gradle property:

// gradle.properties

mad.di=AnvilDagger // Or Metro. Why "mad.di"? Don't ask!

Building the app would then look like this, which would allow us to set up CI shards building the app in both modes, to catch any potential regressions:

./gradlew app:assembleDebug -Pmad.di=AnvilDagger

// or

./gradlew app:assembleDebug -Pmad.di=Metro

Convention plugin changes

Cash Android engineers love convention plugins! They allow us to consolidate our project-specific build logic and share it between all Gradle modules, without having to copy paste configuration code. BaseDependencyInjectionPlugin is the convention plugin responsible for setting up dependency injection-related plugins and dependencies, and that’s where we would read the value of our Gradle property to decide which plugin to apply:

class BaseDependencyInjectionPlugin : Plugin<Project> {
  override fun apply(target: Project): Unit = with(target) {
    val diImplementation = providers.gradleProperty("mad.di")
      .getOrElse("AnvilDagger")
    val libs = extensions.getByName("libs") as LibrariesForLibs

    when (diImplementation) {
      "AnvilDagger" -> {
        pluginManager.apply(ANVIL_PLUGIN)
        dependencies.add("api", libs.dagger.runtime)
      }
      "Metro" -> {
        pluginManager.apply(METRO_PLUGIN)

        with(extensions.getByType(MetroPluginExtension::class.java)) {
          // We only had this option enabled during migration to debug build failures. It's not needed during normal
          // development as it produces very verbose reports and can have a slight effect on build speeds.  
          reportsDestination.set(layout.buildDirectory.dir("metro/reports"))

          interop.includeDagger(
            includeJavax = true, 
            includeJakarta = false,
          )
          interop.includeAnvil(
            includeDaggerAnvil = true, 
            includeKotlinInjectAnvil = false,
          )
        }
      }
    }
  }
}

Another important change, which we made in our BasePlugin, was to conditionally disable Kotlin language version override if we’re building with Metro:

tasks.withType(KotlinCompilationTask::class.java).configureEach { task ->
  if (diImplementation == "AnvilDagger") {
    task.compilerOptions.languageVersion.set(KotlinVersion.KOTLIN_1_9)
  }
}

Once we started building in K2 mode, we needed to fix up a few minor method deprecations here and there (like renaming toUpperCase() and toLowerCase() method calls to uppercase() and lowercase()), which was pretty straightforward.

Adjusting our code for Metro

At this point, in the best case scenario we would’ve been able to just build our project with Metro, but unsurprisingly it hadn’t been the case - there was more work to do to adjust our dependency graph to Metro.

Removing Module includes

Anvil allows @Modules to be annotated with @ContributesTo(Scope::class), which is an alternative to the @Module(includes = ...) construct that scales better for large dependency graphs like ours. As we adopted Anvil, we added @ContributesTo annotations to all our modules, but in some cases forgot to remove them from the includes clauses of aggregator modules. Metro’s validation logic turned out to be stricter than Anvil’s, which led to errors about modules being added to the DI graph twice. Luckily, this was easy to fix - we simply removed unnecessary includes clauses and kept the @ContributesTo annotations.

Converting @Component.Builder to @Component.Factory

We had a bunch of @Components with @Component.Builders that looked like this:

@Component
interface AppComponent {
  @Component.Builder
  interface Builder {
    @BindsInstance fun refWatcher(refWatcher: RefWatcher): Builder
    
    @BindsInstance fun application(app: Application): Builder
    
    fun build(): AppComponent
  }
}

Metro’s interop turns Dagger @Components into @DependencyGraphs, but there’s no construct similar to @Component.Builder in Metro. However, there’s @DependencyGraph.Factory, which maps perfectly to @Component.Factory. Converting builders to factories was trivial!

@Component
interface AppComponent {
  @Component.Factory
  fun interface Factory {
    fun create(
      @BindsInstance refWatcher: RefWatcher,
      @BindsInstance app: Application,
    ): AppComponent
  }
}

Moving scoping annotations from @Binds bindings to type declarations

We had a number of bindings that looked like this:

@Module
@ContributesTo(AppScope::class)
abstract class SettingsStoreModule {
  @Binds
  @SingleIn(AppScope::class)
  fun bindSettingsStore(real: RealSettingsStore): SettingsStore
}

Here, we’re binding RealSettingsStore implementation to the SettingsStore interface, at the same time marking RealSettingsStore as @SingleIn(AppScope::class). While this is a valid construct in Anvil and Dagger, Metro disallows scoping annotations on @Binds declarations, and for a good reason: these declarations are supposed to simply map one type (implementation) to another (interface) and shouldn’t carry any additional information. The scoping annotation should be placed on the implementation type declaration instead:

@SingleIn(AppScope::class)
class RealSettingsStore @Inject constructor(): SettingsStore

We simply had to move our scoping annotations to where they belong. Note that both annotation sites work in the exact same way in Anvil and Dagger whenever SettingsStore is injected, and since we always inject our interface types and never inject implementation types directly, we were confident this change would not cause any regressions in behavior.

Splitting up @MergeModules

This one was tricky: we had a number of Anvil’s @MergeModules used to aggregate @Modules contributed to a specific secondary scope, which would then be added to a @MergeComponent with the primary scope:

@Module
@ContributesTo(ProductionAppScope::class)
object ProductionEndpointsModule

@Module
@ContributesTo(ProductionAppScope::class)
object ProductionDbModule

@MergeModules(scope = ProductionAppScope::class)
class ProductionAppScopeMergeModule

@MergeComponent(
  scope = AppScope::class,
  modules = [ProductionAppScopeMergeModule::class],
)
interface AppComponent

@MergeComponent can only aggregate modules for a single scope, so this approach was necessary to support secondary scopes. Metro does support multiple scopes per @DependencyGraph, so we could simply convert our @MergeComponent like so:

@DependencyGraph(
  scope = AppScope::class,
  additionalScopes = [ProductionAppScope::class],
)
interface AppComponent

This, unfortunately, would’ve prevented our codebase from being built with Anvil and Dagger, which was one of the main requirements for the migration. So we had to resort to Dagger-style module includes, which is much less elegant than @MergeModules, but does the job. And we knew we’ll be able to come back and clean this up once we’ve finished rolling out the migration!

@MergeComponent(
  scope = AppScope::class,
  modules = [
    ProductionEndpointsModule::class,
    ProductionDbModule::class,
  ],
)
interface AppComponent

Removing direct calls to @Provides methods

There were a number of instances of @Provides-annotated bindings called directly from non-DI, mostly test, code:

object NetworkingModule {
  @Provides fun provideOkHttpClient(): OkHttpClient = ...
}

class PaymentsIntegrationTest {
  private val okHttpClient = NetworkingModule.provideOkHttpClient()
}

Metro doesn’t allow this, which makes sense: a dependency injection framework built as a compiler plugin should be able to rewrite DI definitions for optimization purposes, and having external code access those definitions would make it impossible. The fix we came up with was to simply split bindings into two methods, one that contains the actual binding logic and the other that calls the first one and is annotated with @Provides. The former is perfectly safe for external code to call!

object NetworkingModule {
  fun okHttpClient(): OkHttpClient = ...
  
  @Provides fun provideOkHttpClient(): OkHttpClient = okHttpClient()
}

class PaymentsIntegrationTest {
  private val okHttpClient = NetworkingModule.okHttpClient()
}

Fixing nullability issues on injected types

We had a surprisingly large number of bindings that returned nullable types for non-nullable injection sites and vice versa. Dagger, being a Java framework, does not distinguish between Kotlin’s nullable and non-nullable types, so this all worked fine at build time, but was definitely opening us up for potential NullPointerExceptions. Metro does honor nullable types, so we had to decide exactly what types we wanted in our bindings. This is a great example where Metro’s stricter validation helped us make our dependency graph more robust!

Replacing @ClassKey with custom map keys

A small number of our features relied on @IntoMap injections with @ClassKeys:

@Module
abstract class LendingActivityItemModule {
  @Binds
  @IntoMap
  @ClassKey(LendingActivityItem::class)
  abstract fun bindLendingActivityItemPresenterFactory(): LendingActivityItemPresenter.Factory = ...
}

@Module
abstract class TaxesActivityItemModule {
  @Binds
  @IntoMap
  @ClassKey(TaxesActivityItem::class)
  abstract fun bindTaxesActivityItemPresenterFactory(): TaxesActivityItemPresenter.Factory = ...
}

class PresenterFactory @Inject constructor(
  private val activityItemPresenterFactories: Map<Class<*>, ActivityItemPresenter.Factory>,
) 

While Metro does interop with @ClassKey, since it’s a Kotlin framework, it would generate a map with KClass keys, while Anvil/Dagger generated a map with Class keys. We couldn’t support both, as that would again break our requirement to be able to build the project in both modes, so we decided to introduce a custom map key:

enum class ActivityItemType {
  LENDING,
  TAXES,
}

@Retention(AnnotationRetention.RUNTIME)
@Target(
  AnnotationTarget.FUNCTION,
  AnnotationTarget.TYPE,
  AnnotationTarget.FIELD,
)
@MapKey
annotation class ActivityItemTypeKey(val type: ActivityItemType)

@Module
abstract class LendingActivityItemModule {
  @Binds
  @IntoMap
  @ActivityItemTypeKey(LENDING)
  abstract fun bindLendingActivityItemPresenterFactory(): LendingActivityItemPresenter.Factory = ...
}

@Module
abstract class TaxesActivityItemModule {
  @Binds
  @IntoMap
  @ActivityItemTypeKey(TAXES)
  abstract fun bindTaxesActivityItemPresenterFactory(): TaxesActivityItemPresenter.Factory = ...
}

class PresenterFactory @Inject constructor(
  private val activityItemPresenterFactories: Map<ActivityItemType, ActivityItemPresenter.Factory>,
) 

While this version is somewhat more verbose, it comes with additional type safety, as it ensures the number of injected keys is bounded by the ActivityItemType enum, so that’s another small win that the migration to Metro helped us unlock.

Deleting unused dependency injection code

Last but not least, we stumbled upon a bunch of unused modules, bindings, components, etc., which we happily deleted. The takeaway here is that dead code, if not deleted, will at some point require non-trivial maintenance, which is a complete waste of effort. It’s always better to simply delete something that’s not used than to keep maintaining it - dead code will live in your git history forever anyway!

One last thing - instantiating dependency graphs

While we managed to get almost the same codebase building with two distinct dependency injection configurations, there was one specific set of API calls that had to be different - the actual graph instantiation calls. With Dagger, we used to call DaggerAppComponent.factory().create(...) inside our application class to instantiate the app component, and with Metro, we had to migrate to the createGraphFactory().create(...) API. Here’s what we did:

We introduced two new custom source sets in our :app module, conditionally added to the build based on that same Gradle property:

 // app/build.gradle
    
 sourceSets {
   def diFramework = providers.gradleProperty('mad.di').getOrElse('AnvilDagger')
   if (diFramework == 'Metro') {
     main.kotlin.srcDir 'src/metro/kotlin'
   } else {
     main.kotlin.srcDir 'src/anvilDagger/kotlin'
   }
 }

We added methods returning AppComponent.Factory with the exact same signature to both source sets:

 // src/metro/kotlin/.../factories.kt
    
 import dev.zacsweers.metro.createGraphFactory
    
 internal fun appComponentFactory(): AppComponent.Factory {
   return createGraphFactory()
 }
    
 // src/anvilDagger/kotlin/.../factories.kt
    
 internal fun appComponentFactory(): AppComponent.Factory {
   return DaggerAppComponent.factory()
 }

We replaced the direct reference to DaggerAppComponent.Factory inside our application class with a reference to the appComponentFactory() method. And that’s it - the Gradle config ensured our code would always call the right version of the method based on the build property.

After a few weeks of iterative code modifications we were finally able to build our project with both frameworks with no code changes in between - that felt like magic!

The rollout

Once we did enough regression testing to ensure there were no runtime issues, we started preparing for the rollout. We knew this would be a tricky one as there’s no way to protect the change with a runtime feature flag - the decision for which DI framework to use happens at build time.

We decided that we’ll continue building the app in both modes up until we’ve fully rolled out, just in case we’d have to revert back to the Anvil + Dagger version. We actually managed to temporarily introduce regressions caused by overly eager post-K2 migration cleanup, so we set up separate CI shards that built the app in each mode, independent of what the state of the Gradle property was.

Finally, when everything was ready, we flipped the default value of the Gradle property and submitted the Metro flavor of the app build to the Play store. The rollout went smoothly and we were officially on Metro!

The results

So what did we achieve with this migration?

We were able to turn on K2 mode to benefit from the latest Kotlin compiler improvements.
We managed to modernize our dependency injection stack:
- We no longer use kapt.
- We don’t use Anvil and Dagger compilers anymore.
- Our dependency injection codegen now runs during Kotlin compilation, which is significantly simpler and faster than what we had before.
According to our benchmarks, by migrating to Metro and K2 we managed to improve clean build speeds by over 16% and incremental build speeds by almost 60%! 🎉

Scenario	Anvil/Dagger (seconds)	Metro (seconds)	Change (%)
ABI Change	28.77s	11.93s	-58.5% ⬇️
Non-ABI Change	17.45s	7.15s	-59.0% ⬇️
Raw Compilation Performance	242.97s	202.49s	-16.7% ⬇️

So what’s next?

We’re gradually migrating to Metro’s native annotations so we can disable interop.
We’re eager to adopt Metro-specific features to simplify our DI graph even further.
We’re committed to contributing back to Metro by reporting and fixing bugs, sharing design feedback and feature requests, to help the framework thrive.

Conclusion

Migrating Cash Android to Metro was a significant undertaking only made possible thanks to the collaboration between a large number of engineers from different teams at Block and the help of the open source community. We’re very happy with the results and really excited about adopting more of Metro’s features and seeing what the future holds. We hope this article will help your team migrate your app to Metro - a modern dependency injection stack and fast builds are well worth the effort!

Kotlin Multiplatform test interceptors with Burst

2025-09-04T00:00:00+00:00

Last year we announced Burst, our Kotlin Multiplatform library for parameterized tests.

I recently needed another JUnit feature that’s absent on Kotlin/Multiplatform: test rules! They offer a simple way to reuse behavior across tests. Here’s some of my favorites:

Paparazzi is Cash App’s snapshot testing for Android. The library’s main entry point is a test rule.
MockWebServer is an HTTP server focused on testing HTTP clients. A major contributor to OkHttp’s stability is its MockWebServer-powered test suite!
JUnit’s Timeout rule lets me write difficult integration tests without the risk of hanging my test suite.

JUnit rules don’t work on non-JVM platforms, so with Burst 2.8 we’re introducing a Kotlin Multiplatform alternative called TestInterceptor. It’s straightforward to create one:

class TemporaryDirectory : TestInterceptor {
  lateinit var path: Path
    private set

  override fun intercept(testFunction: TestFunction) {
    path = createTemporaryDirectory(testFunction)
    try {
      testFunction()
    } finally {
      deleteTemporaryDirectory(path)
    }
  }
}

Use @InterceptTest to apply it:

class DocumentStorageTest {
  @InterceptTest
  val temporaryDirectory = TemporaryDirectory()

  @Test
  fun happyPath() {
    DocumentWriter().write(SampleData.document, temporaryDirectory.path)
    val decoded = DocumentReader().read(temporaryDirectory.path)
    assertThat(decoded).isEqualTo(SampleData.document)
  }
}

Burst can also intercept suspending tests with CoroutineTestInterceptor. JUnit rules can’t do that!

Get Burst on GitHub.

Re-introducing Paparazzi’s Accessibility Snapshots

2025-07-14T00:00:00+00:00

Overview

As some of you may know, Paparazzi is an open source snapshot testing library allowing you to render your Android screens without a physical device or emulator. A feature of Paparazzi that may be less well known is its ability to take accessibility snapshots. While this feature has existed for quite a while, Paparazzi’s accessibility snapshotting capabilities have expanded dramatically in recent months, so I wanted to dive into what accessibility snapshots are, how Paparazzi captures them and why you might want to use this tool to help improve the accessibility of your application.

Accessibility snapshots?

Accessibility snapshots provide a way to visually inspect the semantic accessibility properties applied to each element of your view under test. Similar to Paparazzi’s regular snapshots, this allows you to create baseline images and verify any future changes against them to ensure that no regressions occur to your app’s accessibility support.

As shown in the example snapshot image below, a legend is drawn on the right side where each UI element is mapped (via colour coding) to its accessibility properties. These properties are what would be read out by screen readers your customers might use (i.e. TalkBack).

Paparazzi’s AccessibilityRenderExtension

Paparazzi creates accessibility snapshots through the use of the AccessibilityRenderExtension. The AccessibilityRenderExtension works by iterating over the View tree or SemanticsNode tree, for legacy Android views and Compose UI respectively. On each element, the accessibility semantics are extracted to display them in the legend that will be drawn alongside the UI snapshot. Additionally, the layout bounds of each element are captured to create the coloured boxes that map the elements in the UI to the text in the legend.

To create an accessibility snapshot test, the only change needed compared to a regular Paparazzi test is to add the AccessibilityRenderExtension to the renderExtensions set in your Paparazzi configuration, as follows:

@get:Rule
val paparazzi = Paparazzi(
    // ...
    renderExtensions = setOf(AccessibilityRenderExtension()),
    // ...
)

Recording and verifying accessibility snapshot tests works identically to regular Paparazzi tests.

Interpreting accessibility snapshots

While Paparazzi’s accessibility snapshots provide valuable information, you cannot rely on these screenshots alone to determine UI accessibility compliance. The snapshots require careful interpretation to verify that the set properties match your screen’s expectations, and should be used as one of several tools in a comprehensive accessibility testing strategy. When interpreting the accessibility snapshots, the top things you should look for are that all of the visually available context (e.g. text, icons that convey meaning) in the UI snapshot are represented in the legend, that elements that relate to each are grouped together (e.g. content in a row is represented as single item in the legend) and that the correct role or state is represented for each element (button, header, selected, disabled, etc). The Paparazzi docs have some additional content explaining in more detail how to ensure your UI is accessible.

Conclusion

As I mentioned at the start of this blog post, the capabilities of the AccessibilityRenderExtension have grown dramatically in recent months. Shown above is the big increase in supported semantic properties we have had (14 new properties!), many of which came from open source community feature requests!

I want to end off this blog post by encouraging anyone reading to try out Paparazzi’s accessibility snapshots in your projects! The Paparazzi docs and Github repo are great places to check out if you want any additional help getting started or if you find any issues or feature requests you would like to submit!

New Maven Central signing key and snapshot location

2025-06-13T00:00:00+00:00

In response to Sonatype announcing the end-of-life for OSSRH, we have migrated to their new publishing platform for our open source artifacts. This is otherwise a transparent change for those who consume these artifacts from Maven Central, but there are two related changes which might affect your builds.

First, the GPG key used to sign our artifacts has changed. Previously the keys varied across projects depending on how and who were publishing. Now, a company-wide shared key is used for all projects. A copy of the public key is available at code.cash.app/block.gpg for verification.

Second, projects which publish “snapshot” builds (i.e., builds from the latest commit on their integration branch) are now available in the Central Portal Snapshot repository at central.sonatype.com/repository/maven-snapshots/. Snapshot builds will also be signed with the same key as release builds.

Project Teleport: Cost-Effective and Scalable Kafka Data Processing at Block

2025-03-20T00:00:00+00:00

Introduction

In February 2022, Block acquired Australian fintech Afterpay. This acquisition necessitated the convergence of Afterpay’s Data Lake, originally hosted in the Sydney cloud region, into the Block ecosystem based in the US regions. Project “Teleport”, as the name suggests, was developed by the Afterpay data team to tackle this large-scale, cross-region data processing challenge. Built using Delta Lake and Spark on Databricks, Teleport ensures efficient, reliable, and lossless inter-region data transfer, utilizing object storage for transient data.

By incorporating a nuanced checkpoint migration technique, we performed seamless migration of legacy pipelines without reprocessing historical data. With Teleport, Afterpay data team reduced cloud egress costs by USD 540,000/annum, with zero impact on downstream user experience.

History of Kafka data archival and ingestion at Afterpay

Afterpay archives Kafka data using Confluent Sink Connectors that land hourly topic records as Avro files in the Sydney region (APSE2) of S3. Before Teleport, Spark batch jobs running on Amazon EMR processed these Avro files in the same region into Hive-partitioned Parquet tables which were then presented into Redshift via Glue catalog. The Parquet tables were written as one-to-one or one-to-many projections of Kafka-topics, with Spark transformations handling normalization and decryption.

Kafka pipelines managed by the Afterpay data team process over 9 TB of data daily and deliver data to critical business domains such as Risk Decisioning, Business Intelligence and Financial Reporting via ~200 datasets. In the legacy design, duplicate events from Kafka’s “at least once” delivery required downstream cleansing by Data Lake consumers and late-arriving records added substantial re-processing overheads.

Evolution of Afterpay’s legacy Kafka pipelines to Teleport happened in three phases. Each phase was executed in response to business requirements and optimisation opportunities.

Phase I: Converging Afterpay Frameworks into Block’s Data Ecosystem

Afterpay aligned its Data Lake architecture with Block by adopting Databricks as the primary compute platform and Delta Lake on S3 as the storage layer. As part of this transition, all Parquet tables living in APSE2 S3 were migrated to us-west-2 (USW2) S3 as Delta tables, colocating data with downstream compute.

Databricks offers the following out-of-the-box features that address some of the challenges in Kafka processing, eliminating the need to reinvent the wheel:

Autoloader handles late-arriving records with fault tolerance and exactly-once processing through checkpoint management, offers scalability to discover a large number of files, and supports schema inference and evolution.

Delta Live Tables (DLT) provides:

apply_changes API that handle Kafka data deduplication efficiently by merging incremental records into target tables; and
In-memory and in-line Data Quality (DQ) checks.

Leveraging the above capabilities, we developed Kafka Orion Ingest (KOI) – a fully meta-programmed framework for processing Kafka archives. Pipelines in KOI comprise:

A DLT job for transformations deployed in USW2;
Dynamically rendered Airflow DAG for orchestration; and
SLA alerting and in-line DQ checks.

All of these components are instantiated by simple metadata entries, simplifying deployments and maintenance.

As shown in the figure, KOI reads incremental Avro files via Autoloader, applies transformations and DQ checks, and writes external Delta tables to S3. These Delta tables are added to consumer catalogs and published to downstream services.

Alternate approaches considered for Kafka archival and ingestion

In an ideal, cost-effective architecture, source data, compute and the target tables would reside in the same cloud region. However, Afterpay’s practice of event archival in APSE2 S3 presented two key challenges in the convergence towards Block’s USW2 based Data Lake:

Migration of historical S3 objects and Kafka connectors from APSE2 to USW2 projected a huge one-time cost and engineering overhead.
Maintaining records for the same topics across two regions would add complexity to backfill operations¹ and data reconciliation.

As a trade-off, Afterpay data team adopted a hybrid approach:

Existing Kafka topics continued to land in the APSE2 region, preserving historical archival patterns.
New topics were directly landed into USW2, inline with cost effective architectural practices.

Phase II: Egress Reduction by Data Compression

Empirical analysis of Afterpay Kafka data showed that Avro to Parquet conversion achieves compression ratios close to 50% on average. This observation suggests that Parquet format is a better candidate for cross-region egress.

As Phase II, we added new clusters in APSE2 so that transformed records are moved across regions as Parquet files. This change reduced APSE2 - USW2 egress cost by ~50%.

Merge cost problem

While APSE2 egress cost was substantially reduced by Phase II, we now had a new cost challenge resulting from cross-region merges!

Delta Lake merge operations compare key columns to update or insert only necessary rows and leverage deletion vectors to track changes without rewriting files. Incremental merges into the target tables thus require the key columns to be loaded into the compute memory. With the target tables in USW2, and the compute in APSE2, each merge operation triggered costly data transfers from USW2 to APSE2 – moving huge numbers of parquet data.

At its peak, these merges incurred over $1,500 per day in S3 egress — an unsustainable expense as our data volumes continued to grow.

Phase III: Optimal Cross-Region Merge Using Teleport

Teleport workflow consists of three major components split into two stages of execution. Between the two stages, a “streaming interface”, implemented as a Delta table in APSE2 S3, maintains the latest records from the Avro files within a moving window. The stages involved in Teleport are:

Stage 1. DeltaSync jobs read incremental Avro files for each topic and append them to the corresponding streaming interfaces.

Stage 2. DLT jobs deployed to USW2 use Spark streaming APIs to read new records from the streaming interfaces, apply transformations, and perform incremental merging into the target tables in USW2.

Teleport achieves optimal cross-region merge as a result of:

Transferring compressed Parquet files (format used by Delta) from APSE2 to USW2 compute, retaining the advantages of Phase II;
Performing merge operations entirely within USW2; and
Ensuring minimal and predictable S3 storage costs by keeping the streaming interfaces transient.

These steps are orchestrated using Airflow that uses metadata configurations to determine whether to use a Phase II workflow or Teleport.

Design considerations

Catalogue-free streaming interface. By implementing the streaming interfaces as Delta tables on S3, we eliminate any need to rely on a catalogue for table maintenance operations such as creation, deletion, and vacuum.

Localised auto compaction. In Databricks environments, auto compaction jobs execute asynchronously after a Delta table is successfully written. As an additional optimization, interface tables are placed in APSE2 – allowing Databricks auto compaction to run locally.

Open source commitment. In line with Block’s commitment to open source, all the additional elements introduced by Teleport use the open source Delta format and native spark APIs. A highly available and scalable implementation of Airflow (also open source) is our standard orchestrator.

Sliding window implementation

The sliding window logic used to implement streaming interfaces (as Delta tables) ensures that only a fixed amount of recent records are retained while older ones are automatically deleted (and vacuumed) once all dependent target tables are refreshed.

Benefits of a sliding window approach:

Prevents accidental data loss due to race conditions arising from the concurrent executions of DeltaSync and DLT jobs.
Despite the use of a transient interface, late arriving records are guaranteed to be processed as long as they arrive within a defined window length.
Facilitates reconciliation and data validation, keeping recent records available for ad hoc queries and DQ checks.

The following figure demonstrates how the window moves dynamically based on refresh frequencies across three target tables: Table1, Table2, and Table3.

A Strategy for Seamless Migration to Teleport

A Spark application periodically saves its state and metadata as streaming checkpoints to a specified prefix in fault-tolerant storage systems like HDFS or S3. These checkpoints enable a Spark application to recover and resume processing seamlessly after failures or interruptions.

Phase II DLT jobs use Autoloaders in directory listing mode to incrementally process landing Avro files, with each target table maintaining checkpoints to track successfully processed files. Migration of Phase II DLT jobs to Teleport without preserving the checkpoints would trigger a full re-listing of Avro objects in S3. This would cause significant delays and substantial compute costs.

To mitigate the above challenges, we devised a “hard cut over” migration strategy that transfers the existing checkpoints from Phase II DLT jobs to the DeltaSync job, ensuring zero impact to the downstream user experience.

Avoiding history reprocessing using a “Checkpoint Transfer” job

Transitioning of Phase II DLT jobs to the Teleport workflow was carried out by a separate “Checkpoint Transfer” job in three steps:

Step 1. Initialise the streaming interface by creating an empty Delta table, replicating the source Dataframe structure.

Step 2. Migrate Phase II DLT Autoloader checkpoints to the Teleport interface table checkpoint² location. At this point, the interface Delta table remains empty, but the migrated checkpoints “trick” the DeltaSync job to think that all historical records have been processed.

After the checkpoints are migrated, the DeltaSync job is executed, loading only the newly landed Avro records into the streaming interface.

Step 3. Once the interface Delta table is populated, trigger the initial execution of Phase III DLT job in USW2. Before its initial run, the DLT job in USW2 does not have any checkpoints and treats the interface table as a new source. During this first run:

Records from the streaming interfaces are read and merged into the target tables in USW2.
Processed records are then checkpointed ensuring DLT jobs can resume incremental updates moving forward.

Using this technique, DeltaSync and DLT checkpoints were adjusted to enable uninterrupted incremental processing of target tables during migrations.

Reconciliation of records post migration

A reconciliation job compares Avro files in the landing S3 with the target Delta tables to ensure that no data was lost during the migration. This validation job runs after each migration and checks the last seven days of records for completeness.

Using the migration strategy discussed above, a total of ~120 topics were migrated in batches with negligible cost overhead and zero downtime.

The Outcome : USD 540,000/annum in Savings

Bulk migrations to Teleport commenced in November 2024, with a planned completion by March 2025. The reduction in transfer costs measured by mid-March amounts to an annual savings of ~USD 540,000.

Figure below shows the change in transfer cost averaged over a 14 day rolling window for preserving data privacy.

To further assess Teleport’s impact on reducing transfer costs, we analyzed S3 CloudTrail event logs, which track the total bytes transferred from USW2 to APSE2 for each S3 object. Once a table is migrated to Teleport, cross-region transfers from USW2 to APSE2 stop completely. Hence, the monthly savings for each migrated table corresponds to its pre-migration cross-region transfer cost. Our findings confirm that the cloud cost reductions can be directly attributed to Teleport migrations.

Project Teleport reinforces our commitment to agnostic engineering and open source, leveraging Airflow for orchestration and Spark APIs on cloud compute.

Databricks Autoloader performs periodic backfills by doing a full directory listing. Backfills may also be performed by engineers to refresh the records on a case-to-case basis. ↩
Note that, due to structural differences between DLT and Spark Streaming checkpoint directories, modifications to the checkpoint files are required for this transfer mechanism to work. ↩

Cash App on PlanetScale Metal

2025-03-11T00:00:00+00:00

Intro

At Cash App, we have a few gigantic databases that we ask a lot of. Our solution to managing this kind of capacity has been to utilize Vitess, as a piece of middleware that sits in front of hundreds of otherwise normal MySQL instances, handling the hard work of figuring out what query traffic routes where.

We historically ran this in our own datacenters for many years, however alongside a larger cloud migration effort we elected to work with PlanetScale to move to their cloud managed product. This utilized their standard configuration of each VTTablet and MySQL instance cohabiting the same Kubernetes pod container, backed by a volume mount. VTTablet is Vitess’ middleware that fronts a single MySQL instance, which you can think of as the contact point for the SQL proxy. In this setup we can think of individual shards as essentially fairly normal MySQL servers.

Moving to PlanetScale was a game changer for the team, as we historically run pretty light, and time previously spent maintaining a fairly large bespoke architecture can now be spent on developer experience tooling that makes current and future developers’ lives easier. Over the course of the last few months Cash App and PlanetScale have been working together to migrate our fleet to their new product, PlanetScale Metal, and I wanted to dig in a bit into the whys and hows of this change.

Problems with Volume Storage

Over time after the migration we started noticing issues with our storage volumes. Periodically the volumes would slowly degrade, with performance draining over several minutes, before eventually recovering or dying completely. These events were happening often enough to generate some pager noise and thrash as we dug into the problem. Additionally, as we were in the final phases of cleaning up the cloud lift, we were unable to turn on PlanetScale’s Orc autofailover mechanism, meaning a person had to log in and failover the shard manually.

After consulting with our cloud provider, we decided to switch to a more advanced class of volume temporarily, which cost quite a bit more, but offered much higher availability guarantees. This did mitigate the waves of degradation, however, we ran into another issue: sometimes shards would fail to accept writes, at times for up to 15 minutes. During these periods write traffic would queue up in MySQL, making calls involving that shard much slower than usual. We unfortunately have a decent amount of cross-shard traffic, so this was problematic.

Talking with our cloud provider, it looked like what was happening was we were hitting our IOPS (input/output operations per second) limits with occasionally spiky traffic, leading to this unexpected failure mode. The advice was to increase our limit, however this was frustrating as this would be another bump in cost, and our traffic is generally fairly predictable in most cases.

What is Metal anyway?

Given all these challenges, PlanetScale proposed utilizing their new Metal product on our workload. Metal is unique in the way it utilizes our cloud provider’s instance compute, running on the fastest NVMe (nonvolatile memory express). Rather than utilize separate storage and compute, the machines instead have their own physical storage. This is intended for high throughput data loads such as MySQL, cutting down on hops to get to your data and providing a more consistent failure path when things go wrong.

This, of course, comes with the tradeoff of your machine and data being tied to each other. With traditional volume storage, if the machine goes down, you simply mount it to a different one and are back in business. With instance storage this scenario requires rebuilding the replica. This is a big part of why enabling semi-sync replication is a prerequisite for using Metal, as you have confidence that writes won’t drop on the floor. Additionally, PlanetScale’s backup restore system is very well exercised; it’s a normal part of verifying the backup process every time it’s run.

managing costs

The other big difference is Metal bundles up the components you need to run a database shard into a single machine, potentially providing savings compared to a traditional standalone compute + volume storage setup. The two components in that setup are billed separately, with the quota of IOPS required for storage coming at a premium for our needs. When we move to instances the compute and storage come at a single cost, and with Metal, there is virtually no limit on the max amount of IOPS. Thus by using the local storage built into Metal we remove the unbounded risk of costs associated with buying more IOPS from cloud providers.

p99 off a cliff

The cost savings are nice to be sure, however day-to-day the real wins come from the stability and power of running on this kind of setup. Since moving, we’ve seen much more predictable failure modes, where write buffering is a thing of the past. Additionally, the changes to p99 latency were dramatic, cutting our main workload by 50%.

Additionally, during a recent event we saw query traffic double beyond normal values for a period, and while this was happening response times and metrics were very comfortably nominal, something we certainly haven’t been able to say in the past.

We are very happy with our decision to migrate to PlanetScale Metal which enabled us to achieve the rare outcome of improvements in performance, cost, and reliability – a win for our customers and our business.

Data Safety Levels Framework: The foundation of how we look at data in Block

2025-01-16T00:00:00+00:00

One of our foundational principles at Block is incorporating privacy and the protection of customer data into every layer of our software systems. This commitment goes beyond meeting the numerous regulatory requirements for how we process and manage customer data that we face as a financial technology company: we believe protecting this customer data is essential to building and maintaining our customers’ trust in us.

One of the biggest challenges in protecting customer data turns out to be devising a system for thinking about data sensitivity that lends itself to engineering scalable solutions that can be automated and built transparently into our systems so they simply work. Data itself is complex and sensitivity can vary based on context. Solutions often either ignore the sensitivity variance or overly simplify this complexity, resulting in under-protection of the data for the customer or overly rigid systems that hinder innovation and limit our ability to serve customers effectively.

In this post, we introduce the Data Safety Levels (DSL) Framework that we initially built for Cash App and have since extended across the rest of our diverse product ecosystem, including Square and TIDAL. The DSL framework forms the foundation of the way we understand data. It acknowledges the complexities of data by recognizing that data:

Exists as part of a larger set, with sensitivity being an emergent property of the set rather than individual elements.
Is contextual, requiring us to consider context when determining its management and usage policies.

This framework has created a strong foundation for us to build guidelines and policies on top of which allow us to better show not just our compliance with our regulatory requirements but also our commitment to customer trust.

Our Origin Story

We had long had an internal policy around classifying and handling sensitive data, especially PCI-relevant data and Personally Identifying Information (PII). This framework, for the most part, classified each semantic type of data as being either Public, Confidential, Basic PII, or Secret PII. Over time, it grew increasingly complicated with specific requirements around particular data being covered by either a PCI standard, SOX, PII, or MNPI. This policy made engineering increasingly complicated as it required both service and platform engineers to be aware of the nuances of various standards and regulations when their underlying questions were really: “can Security sign-off on my design doc yet?”. It also resulted in many questions to security teams like, “is this particular data type PII?” for which the answer was always (frustratingly), “well, it depends.”

Coincidentally or not, with a lot of extra time to read things on the Internet during a global pandemic, we learned about the US Centers for Disease Control and Prevention (CDC) Biosafety Level system for rating the risk levels of biological agents and approving facilities for storing and handling them. The World Health Organization also publishes laboratory biosafety manuals with more elements of this framework including the risk assessment methodology that assigns one of four levels to particular biological agents as well as laboratory safety requirements for handling biological agents at each level. The framework of assigned risk levels and increasing control requirements made sense to us as inspiration for another type of thing that we did not want to accidentally expose to people: regulated and sensitive data.

Why a Dataset-Oriented Approach?

In practice data usually exists as part of a larger set, where the relationships between elements can impact their overall sensitivity. A phone number on its own may not be as sensitive as a phone number combined with a precise home address and transaction history. The DSL framework allows us to reason about such combinations and ensure that data is classified appropriately based on its aggregate sensitivity, not just on the sensitivity of individual elements.

For example, in our Cash App Investing operations, the DSL classification for customer data doesn’t just consider individual components like an account number or government-issued ID—it considers how these pieces combine to potentially elevate the risk of exposure. Thus, each dataset’s DSL is determined by considering the highest level of sensitivity found within its components, ensuring that we adopt the strictest safeguards when necessary.

Problems DSL Framework Addresses

The DSL Framework was developed to address these needs. It provides:

Actionable Guidance for Teams: The framework translates data sensitivity into clear data safety levels that dictate the required policies for handling specific data sets, from consumer information to merchant financials.
Consistency Across the Organization: By using a unified framework, all teams across the various products at Block Inc. (Cash App, Square, TIDAL) have a common language and understanding of how to secure data. This consistency is crucial for ensuring we meet the highest security standards globally.
Compliance Across Jurisdictions: Block operates in multiple jurisdictions, each with its own set of regulatory requirements for data security and privacy. The DSL Framework is an effective tool for mapping these regulatory requirements to our internal security practices. By using the DSLs as the benchmark, we can ensure that we meet or exceed the data security obligations in regions like the U.S., Europe, and Asia. Our DSL rubrics and guidelines help streamline product development by providing a uniform and self-service framework for engineering teams to ensure that all data meets the necessary standards.
Incremental Security Controls: With four Data Safety Levels ranging on a numerical scale from DSL-1 to DSL-4 (lowest to highest sensitivity), each DSL builds on the preceding level with additional controls to ensure the security measures we apply are calibrated to the risks involved. For example, highly sensitive data, such as tax return information, is classified at DSL-4, meaning it requires stringent protections like application-layer encryption and multi-party authorization.

Key Components of the DSL Framework

The DSL framework at Block is actionable for both automated and manual processes, providing a clear roadmap for platform and product development teams to understand what protections they must implement based on the data they are handling. Here are some of its critical components:

Data Classification Rubrics: To determine a dataset’s DSL requirement, we apply specific rubrics designed for different data domains to perform a risk assessment of the data set to determine the appropriate DSL requirement. We have rubrics for Consumer Personal Data, Payment Card Data, Merchant Data, and more. These rubrics standardize how we assess sensitivity, ensuring that each dataset receives a consistent and accurate classification.
Data Safety Guidelines: The DSL framework is complemented by our Data Safety Guidelines, which define the minimum protections that systems must implement based on their DSL rating. These guidelines include measures like access controls, encryption standards, auditability, and more. Systems approved at a particular DSL must meet all the prescribed security controls for that level and any lower levels, ensuring a robust baseline of security.
Automation and Manual Processes: The DSL framework is designed to integrate seamlessly into our workflows, leveraging automation to classify data and verify that proper protections are in place. At the same time, manual reviews ensure that our systems comply with specific regulatory requirements and address any nuanced security needs that automation alone cannot handle.
Access and Usage Controls: The framework’s effectiveness also relies on enforcing appropriate access controls. For example, datasets classified at DSL-4 or higher are generally protected with multi-party authorization (MPA), ensuring that no single individual can access sensitive data without additional oversight. This prevents unilateral actions that could jeopardize data security and demonstrates our commitment to upholding customer privacy.

Real-World Examples of the DSL Framework at Work

Tokenized Payment Data

Payment card data, such as Primary Account Numbers (PANs) and Card Verification Codes, are highly sensitive and classified as DSL-4. By applying our DSL Framework, we require this data to be encrypted at the application layer before it is stored or transmitted. Fidelius, our tokenization service, manages such data to ensure it remains secure during payment processing and at rest. The DSL Framework allows downstream systems, with lower safety level capabilities, to process this data without compromising on security, as long as strict encryption standards are upheld.

Cash App Investing Data

Cash App Investing (CAI) data, such as trading patterns or Social Security Numbers (SSNs), also falls into higher DSLs—typically DSL-3 or DSL-4, depending on the specifics. The DSL classification ensures that appropriate access controls and encryption are in place, including requiring employee fingerprinting for access to the most sensitive records. This not only adheres to regulatory requirements, such as FINRA rules, but also demonstrates our commitment to proactively protecting customer data.

Tax Return Information

Tax Return Information (TRI) collected through Cash App Taxes is classified as DSL-4, given its highly sensitive nature. Compliance with IRS requirements and ensuring privacy of TRI is a non-negotiable part of our operations. The DSL Framework supports this by enforcing strict encryption, auditability, and access controls—all designed to minimize the likelihood of unauthorized disclosure or misuse.

DSL is just the Start

The DSL framework is live and has expanded steadily over the years of its adoption. New products as well as feedback loops from internal audits, security incidents, and regulatory changes have translated to identification of new semantic types and classification rubrics as well as new mapping of data to safety levels.

Developing our perspective on data has been a collaborative effort between Security, Governance and Compliance and most importantly, Product. Starting from our inspiration in WHO’s biosafety levels, we have intentionally challenged ourselves to understand data, its lifecycle and its requirements in a holistic and systematic manner, with the knowledge that automation is a must given the scale of the data we deal with.

This framework is also just the beginning of the story. Now that we have a systematic way of conceptualizing our data, we need to complement it by our Data Safety Guidelines and the implementation of these guidelines in a scalable, automated and transparent way that seamlessly integrates into our systems.

This blog is also the first in its series as we describe some of the challenges and solutions we have encountered in this space.

Data Safety is for Everyone

Block is committed to improving Data Safety in our community. In the coming months, we hope to open source the DSL framework and allow others to not just use and adopt this foundation but also build upon it and enhance the protection of customer data across the industry. We look forward to hearing from you.

Dispatchers.Unconfined and why you actually want EmptyCoroutineContext

2025-01-15T00:00:00+00:00

Dispatchers.Unconfined is one of kotlinx.coroutines’ built in CoroutineDispatchers. It’s different from other built in dispatchers as it’s not backed by a thread pool or other asynchronous primitive. Instead, Dispatchers.Unconfined is hardcoded to never change threads when entering its context (this is called “dispatching”). It’s pretty easy to verify this from its (simplified) implementation:

object Unconfined : CoroutineDispatcher() {
  override fun isDispatchNeeded(context: CoroutineContext) = false

  override fun dispatch(context: CoroutineContext, block: Runnable) {
    throw UnsupportedOperationException()
  }
}

This behavior is different from Dispatchers.Main or Dispatchers.Default, which will change threads if you’re not already on one of their preferred thread(s). As a result, code using Dispatchers.Unconfined will always execute synchronously when entering its context.

In practice, this means that any code inside of the Dispatchers.Unconfined has no guarantees about what thread it will run on. This can create subtle bugs as dispatching occurs both when entering a new context and when returning from it. Consider this example where we read some text on the IO dispatcher then update the main thread with the result:

// Pretend these dispatchers are injected.
val ioDispatcher = Dispatchers.IO
val mainDispatcher = Dispatchers.Main

withContext(ioDispatcher) {
  val firstText = readFile(1)
  val secondText = readFile(2)
  withContext(mainDispatcher) {
    textView.text = firstText
    delay(1.seconds)
    textView.text = secondText
  }
}

If we’re testing this function, say in a screenshot test, and we know our test starts on the main thread we may want to avoid dispatching entirely so our test executes synchronously on the calling dispatcher. We can do this by injecting Dispatchers.Unconfined for our IO and main dispatchers:

val ioDispatcher = Dispatchers.Unconfined
val mainDispatcher = Dispatchers.Unconfined

withContext(ioDispatcher) {
  val firstText = readFile(1)
  val secondText = readFile(2)
  withContext(mainDispatcher) {
    textView.text = firstText
    delay(1.seconds)
    textView.text = secondText // This line will crash!
  }
}

However, this change introduces a crash as delay changes the context to Dispatchers.Default internally and because we’re using Dispatchers.Unconfined we never dispatch back to the main thread. When we try to update textView’s text it will throw a CalledFromWrongThreadException.

This example also shows how Dispatchers.Unconfined breaks one of coroutines’ best features: making threading a local consideration. When we use Dispatchers.Main or Dispatchers.Default we don’t have to worry about dispatching back to the right thread after calling another suspend fun - it’s handled for us.

There’s a better way

Typically we use withContext to change the CoroutineDispatcher, but withContext actually accepts a CoroutineContext. You can think of CoroutineContext as equivalent to Map. When we invoke withContext(Dispatchers.Unconfined) we’re overwriting the current context’s CoroutineDispatcher key with Dispatchers.Unconfined.

Instead, we should use EmptyCoroutineContext as it doesn’t update the current context’s CoroutineDispatcher. This means we don’t dispatch when calling withContext(EmptyCoroutineContext) as the coroutine context doesn’t change but we’ll still dispatch back to the right thread if another function like delay changes the context. Let’s reexamine the above example using EmptyCoroutineContext instead of Dispatchers.Unconfined:

val ioDispatcher = EmptyCoroutineContext
val mainDispatcher = EmptyCoroutineContext

withContext(ioDispatcher) {
  val firstText = readFile(1)
  val secondText = readFile(2)
  withContext(mainDispatcher) {
    textView.text = firstText
    delay(1.seconds)
    textView.text = secondText // Does not crash.
  }
}

Using EmptyCoroutineContext lets us continue executing synchronously on the main thread and avoids crashing as we correctly dispatch back to the main thread after delay. It’s for these reasons that at Cash App we inject all our dispatchers as CoroutineContexts and inject EmptyCoroutineContext in our tests:

class MoneyPresenter @Inject constructor(
    @IoDispatcher private val ioDispatcher: CoroutineContext,
)

In fact, there are actually very few cases where you need to reference the CoroutineDispatcher class at all. CoroutineScope(), withContext, and CoroutineContext.plus all accept a CoroutineContext. CoroutineContext is also more flexible as there are other elements you can add like CoroutineName for debugging purposes. I’d recommend replacing all your references to CoroutineDispatcher with CoroutineContext - especially if you maintain a public API. Coil updated its public API to accept CoroutineContext instead of CoroutineDispatcher in 3.0. Thanks to Bill Phillips for suggesting this change!

Also thanks to Bill Phillips, Jesse Wilson, and Raheel Naz for reviewing this blog post.

Encryption using data-specific keys

2024-12-12T00:00:00+00:00

At Cash App, securing our customers’ data is not just a priority - it’s a responsibility. One cornerstone of our approach to achieve that is encryption. We’re obsessed with it. We’ve been using encryption at the application layer so much that our security team’s motto is “encrypt everything”. Application layer encryption specifically has been our bread and butter because it removes the data storage tier from the compliance scope and security threat model, allows granular access control over who can decrypt the data, enables easy integration with privacy regulations, and much more.

Introduction

In the last blog post we wrote about Application Layer Encryption in AWS back in **checks notes**… 2020, we discussed how we’re creating KMS instances and Tink keysets for Cash App services running in the cloud.
Here’s a quick reminder of the general layout of how it works:

Each service in our cloud backend has its own KMS Customer Managed Key (CMK) instance associated with it.
To create a new service encryption key, we’ll generate a Tink keyset and then call the necessary CMK to encrypt it (using envelope encryption).
The encrypted Tink keyset is then persisted in the service’s resource folder; and committed to Git.
When the service starts up, it calls the CMK to decrypt the Tink keyset, and then store it in-memory to encrypt/decrypt data as needed.

Things have changed quite a bit in our app-layer encryption setup from that original design. Our initial approach was tailored to secure the data managed and stored by individual services. In that design, encryption keys were tightly coupled with the services using them, aligning well with our early needs. However, as we expanded app-layer encryption to encompass our data transport infrastructure—spanning gRPC and our Kafka event bus—this service-centric model began to show its limitations. The tight coupling made the system less flexible and more challenging to scale. To address these challenges, we evolved our encryption strategy. The next generation of app-layer encryption shifted from a service-centric model to a data-centric model, decoupling encryption keys from individual services and instead associating them directly with the data itself. This change enabled us to maintain robust security while enhancing flexibility and scalability across our infrastructure. We refer to this latest evolution of our encryption infrastructure as “Data Keys”.

There are 2 main differences between these approaches. First of all and most importantly, we switched from having a map of (CMK instance → service) to a (CMK instance → encryption key). That might seem like a minor detail, but it is very significant; it means that encryption keys can be associated with their own independent CMK. This makes it possible to have multiple services access the same key.

Secondly, moving away from service-centric keys also affects where encrypted key material may be stored, such as in a more accessible S3 bucket instead of in the service’s resources directory.

Key Access Controls With IAM

When an encryption key has its own CMK, an AWS IAM policy can be attached to the CMK defining which roles can access it, and what APIs they can use.

{
  "Version": "2012-10-17",
  "Statement": {
    "Effect": "Allow",
    "Principal": {
      "AWS": "arn:aws:sts:::assumed-role/key-name/session-name"
    },
    "Action": [
      "kms:Decrypt"
    ],
    "Resource": "*"
  }
}

When specifying the AWS Principals in the IAM policy, assumed-role session principals can be used to ensure that only roles assumed via AWS STS are allowed access. When clients rely on short-lived, dynamically generated access via STS, it reduces the risk of long-term credential exposure and limits the impact of compromised access.

A relevant IAM Role can be defined for each encryption key in a dedicated AWS account. The AWS account data-keys-roles has a role for each data key (AKA “bastion role” or “access role”) that grants permission to decrypt that data key. This access role’s trust policy allows a role in other consumer accounts to assume the access role. identified by the isolation of the role(s) in question. Thus, enabling access to our encryption keys The “bastion role” pattern is one implementation of AWS’ “IAM Role Chaining” concept, from different AWS accounts and business units in Block.

Key Storage In S3

Since encryption keys are no longer tied to a single service, it becomes impractical to store encrypted Tink keysets in the service’s resources directory or commit them to Git. To address this, encrypted Tink keysets can be stored in a dedicated S3 bucket within the same AWS account as the CMKs. This approach not only centralizes key management but also leverages S3’s built-in versioning, enabling the recovery of keysets in case of accidental deletion or overwrites. Security remains intact because the Tink keysets are encrypted, and access to the corresponding CMKs is strictly governed by IAM roles, ensuring that only authorized services can decrypt them.

Provisioning Data Keys

So far, we’ve described the following resources needed in this design:

CMK per encrypted Tink keyset in the data-keys account
IAM Role for each KMS instance in the data-keys account
Encrypted Tink keyset stored in an S3 bucket in the data-keys account
IAM policy for each CMK specifying the role above as the principal in the data-keys account
IAM Role for role chaining in the data-keys-roles account
IAM policy specifying which principals can assume the role in the data-keys-roles account

Provisioning these AWS resources should be easily accomplished using Terraform. Creating and encrypting the Tink keysets is straightforward with tools like Tinkey. And the last remaining step is to upload the encrypted keyset to the dedicated S3 bucket.

All of the above tasks are easily accomplished in a simple bash script, and executed via most CI platforms. Which means that now, key provisioning and management is completely self-served, fully audited, and automated.

Lessons Learned

The main improvement with this design is the de-coupling of encryption keys and services. The ability to share encryption keys between services and workloads, and even other cloud accounts and consumers naturally led to the situation that keys became associated with the data they protect.

For example, encryption keys can be created per Kafka topic. Any workload or service that needs to produce or consume data from a specific topic, must have access to that topic’s encryption key. In fact, this change in how keys are being provisioned and used was so much easier and friendlier for engineers to use, that it led to a big spike in adoption of data encryption and keys being created that we’re now encrypting on average more than 8TB of data a day.

The same happened with protocol-buffer messages (to be continued)…

KotlinPoet 2.0 is here!

2024-11-05T00:00:00+00:00

Lo! KotlinPoet 2.0 doth grace our realm!

~~William Shakespeare~~ ChatGPT

KotlinPoet is an ergonomic Kotlin and Java API for generating Kotlin source files. Source code generation is a useful technique in scenarios that involve annotation processing or interacting with metadata files: popular libraries, such as SQLDelight and Moshi, use KotlinPoet to generate source code.

After originally releasing KotlinPoet 1.0 in 2018, today we’re announcing the next major version of the library - KotlinPoet 2.0!

We decided to keep KotlinPoet 2.0 source- and binary-compatible with 1.0 to make the migration as seamless as possible. That said, 2.0 ships with an important behavior change:

Spaces don’t wrap by default

KotlinPoet 1.x was designed to replace space characters with newline characters whenever a given line of code exceeded the length limit, so the following FunSpec:

val funSpec = FunSpec.builder("foo")
  .addStatement("return (100..10000).map { number -> number * number }.map { number -> number.toString() }.also { string -> println(string) }")
  .build()

Would be generated as follows, honoring the line length limit:

public fun foo(): Unit = (100..10000).map { number -> number * number }.map { number ->
  number.toString() }.also { string -> println(string) }

This usually led to slightly better code formatting, but could also lead to compilation errors in generated code. Depending on where this function occurred in the generated code, it could be printed out as follows:

public fun foo(): Unit = (100..10000).map { number -> number * number }.map { number -> number.toString() }.also
  { string -> println(string) } // Doesn't compile, "also {" has to be on one line!

Developers could mark spaces that aren’t safe to wrap with the · character, but the discoverability of this feature wasn’t great:

val funSpec = FunSpec.builder("foo")
  .addStatement("return (100..10000).map·{ number -> number * number }.map·{ number -> number.toString() }.also·{ string -> println(string) }")
  .build()

KotlinPoet 2.0 does not wrap spaces, even if the line of code they occur in exceeds the length limit. The newly introduced ♢ character can be used to mark spaces that are safe to wrap, which can improve code formatting:

val funSpec = FunSpec.builder("foo")
  .addStatement("return (100..10000).map { number ->♢number * number♢}.map { number ->♢number.toString()♢}.also { string ->♢println(string)♢}")
  .build()

The generated code here is similar to the original example:

public fun foo(): Unit = (100..10000).map { number -> number * number }.map { number ->
  number.toString() }.also { string -> println(string) }

The · character has been preserved for compatibility, but its behavior is now equivalent to the regular space character.

Please let us know if you’re experiencing any issues with the new release by opening an issue in our issue tracker, or starting a discussion if you’d like to provide general feedback or are looking for help on using the library.

Get KotlinPoet 2.0 on GitHub today!