Dual-Embedding Trust Scoring
2026-02-25 · https://tech.scribd.com/blog/2026/content-trust-score

Scribd is a digital library serving academics and lifelong learners, offering hundreds of millions of documents. That scale and openness present a significant concern: content trust and safety. Protecting our library from undesirable and unsafe content is a top priority, but the multilingual and multimodal (text and images) nature of our platform makes this mission challenging. While third-party tools exist, they often fall short, lacking the nuance to handle our specific trust and safety categories.

To this end, we combined Generative AI (GenAI) signals and our proprietary multilingual embeddings with classical machine learning methods to develop our Content Trust Score. This metric reflects the severity with which a document violates a specific trust pillar, enabling us to identify high-risk content and take appropriate action. Ultimately, the score allows us to build a more robust and scalable moderation system, ensuring a safer and more reliable experience for all users while preserving the rich diversity of our user-generated content.

The data and methodologies presented here are for research purposes and do not represent Scribd’s overall moderation or policy implementation.

Content Trust Pillars

Guided by our internal Trust & Safety framework, we defined four top-level concern pillars and prioritized our current efforts on them:

  • Illegal: Documents that contain or promote illegal materials or activities
  • Explicit: Sexual or shocking content
  • Privacy/PII: Documents that violate privacy or contain Personally Identifiable Information (PII)
  • Low Quality: Junk, gibberish, low information, or non-semantic documents

To maintain a clear project scope, we focused our research on these four semantic-heavy pillars where our embedding-based approach offers the greatest impact. The remaining violation types are out of scope and are addressed by other specialized detection algorithms.

From Embeddings to Trust Score

Datasets & Features

We leveraged annotated data at Scribd, which includes human-assigned trust labels, to craft our core modeling dataset of roughly 100,000 documents. This dataset was split 90-to-10 into training and testing data, distributed across the four trust pillars. The training set was used exclusively to derive the Content Trust Pillar embeddings, while the testing set provided the initial basis for comparison between content- and description-based scores. In addition to the four primary Trust Pillars, we also included documents not violating any trust & safety pillars. These clean documents serve as the “baseline” in our analyses. It is important to note that the data presented here is for discussion purposes and does not reflect the actual category distributions within the Scribd corpus.

Table 1. Document distribution across Trust Pillars. This table details the percentage of labeled documents within the training and testing datasets. Note that the Clean documents are included separately as the baseline.
Trust Pillar    Training Dataset    Testing Dataset
Illegal         1.49%               1.56%
Explicit        0.39%               0.41%
PII/Privacy     5.43%               5.48%
Low Quality     2.18%               2.17%
Clean           90.51%              90.38%


The core feature of our project is the 128-dimensional semantic embedding for every document, generated using the LaBSE model fine-tuned on our in-house dataset. Semantic embeddings are dense, numerical vector representations of text in a high-dimensional space. The goal of the embeddings is to map linguistic meaning into this vector space such that pieces of text with similar semantics are positioned mathematically closer together. The degree of similarity between texts can therefore be quantified by the distance between their respective vectors. For instance, in Figure 1, the words “circle” and “square” are positioned closer to each other than to words like “crocodiles” or “alligators”, because they are semantically more similar to each other. This allows us to represent all the text in our documents as vectors of numbers and accurately quantify their semantic relationships.

Figure 1. Conceptual visualization of semantic embeddings.

Content Trust Score

The first step in generating the Trust Score was creating the representative vectors for each trust pillar. Using the semantic embeddings, we generated the Content Trust Pillar embeddings for each trust pillar by averaging the embeddings of all documents with that pillar’s label in the training dataset. The large size of the training dataset helps ensure the representativeness of these Pillar embeddings.

The content trust score for a Trust Pillar was then computed as the cosine similarity between the document’s embedding and the corresponding Trust Pillar’s embedding. Crucially, all scores are generated and evaluated exclusively using the testing dataset to strictly avoid data leakage and circularity in our analysis. Our hypothesis is that documents closely matching a specific trust pillar will yield a high similarity score against that Pillar’s embedding, while non-matching documents will yield a low score.
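As a concrete sketch, the centroid averaging and cosine-similarity scoring described above reduce to a few lines of vector math. This is a toy 3-dimensional example with made-up numbers; the production embeddings are 128-dimensional, and the function names are illustrative:

```python
import numpy as np

def pillar_embedding(doc_embeddings: np.ndarray) -> np.ndarray:
    """Average the embeddings of all training documents carrying a pillar's label."""
    return doc_embeddings.mean(axis=0)

def trust_score(doc: np.ndarray, pillar: np.ndarray) -> float:
    """Cosine similarity between a document embedding and a pillar embedding."""
    return float(np.dot(doc, pillar) / (np.linalg.norm(doc) * np.linalg.norm(pillar)))

# Toy training documents labeled with one pillar (3-dim for readability)
labeled_train = np.array([[0.9, 0.1, 0.0],
                          [0.8, 0.2, 0.1]])
pillar = pillar_embedding(labeled_train)

# A new document that closely matches the pillar scores near 1.0
doc = np.array([0.85, 0.15, 0.05])
score = trust_score(doc, pillar)
```

Because the pillar embeddings are pre-computed once from the training set, scoring a document at inference time costs only a handful of dot products.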

This concept is visualized in Figure 2, where each “Pillar” represents a distinct trust pillar centroid. Individual documents are clustered around their respective pillar, illustrating that the closer a document’s embedding is to a specific Trust Pillar embedding, the higher its calculated similarity score, which confirms a stronger thematic match to that pillar.

Figure 2. Conceptual visualization of Trust Pillar embeddings and document similarity in a high-dimensional space. Each colored dot represents a single document.

Enhancing Semantics with Description Embeddings

While the content-based semantic embeddings are generally effective, they struggle in certain cases where the raw text is not fully informative. Specifically, these embeddings may fail when documents are extremely long, image-heavy, or contain meaningless repetitive text.

In these scenarios, a brief content summary can provide a superior document representation. For example, Figure 3 illustrates a document containing presentation slides where the raw text is minimal, yet the user-provided description is quite informative.

Figure 3. Example of an extremely long document with good descriptive metadata. This example demonstrates how a concise, user-provided description (bottom box) provides more focused, informative text for embedding than the raw content of an extremely long document.

However, since users often do not provide adequate descriptions upon document upload, we rely on large language models (LLMs) to generate descriptive summaries based on the content. Figure 4 demonstrates this necessity, showing a document with lengthy and repetitive text where the LLM-generated descriptions (GenAI descriptions) summarize the core topic effectively.

Consequently, we generated a second set of document semantic embeddings, and the corresponding Content Trust Pillar embeddings, based on the LLM-generated descriptions. This dual approach allowed us to compute the content trust score using the alternative, enhanced representation.

Figure 4. Example of a document with meaningless, repetitive content (top). The LLM successfully analyzes and summarizes the document, providing a usable description for embedding generation (bottom).

Content- vs. Description-Based Trust Scores

For each trust pillar, we compared the distribution of the content trust scores derived from the document’s content to their GenAI-description-based counterparts, using the approximately 10,000-document testing dataset. To ensure a fair comparison, we included only documents for which both sets of scores were available. Our results reveal that content-based trust scores outperformed the scores generated from GenAI descriptions for all Trust Pillars (Figure 5a-c) except the Low Quality pillar (Figure 5d).

For the majority of Trust Pillars, the content-based scores demonstrated strong discrimination: they were higher for documents truly violating a given pillar (True Positives) than for documents violating other trust pillars or clean documents. Conversely, for these same pillars, the GenAI-description-based scores were indistinguishable from those of other documents, or showed significantly less separation compared to the content-based counterparts. This suggests that while content-based embeddings offer a superior representation for general trust identification, the descriptive embeddings provided little added value for these pillars.

This performance pattern is reversed for Low Quality documents. Specifically, the content-based scores for Low Quality documents were ineffective, proving to be indistinguishable from those violating other trust pillars or those labeled as clean. The GenAI-based approach, however, showed a distinct advantage: GenAI-description-based scores were significantly higher for Low Quality documents compared to all others. This result indicates that the descriptive summary is crucial for accurately identifying this specific type of document.

Figure 5. Trust Score Distribution Comparison of Content vs. GenAI-Description Trust Scores. Violin plots showing the distribution of trust scores for documents belonging to a specific violation pillar (blue) compared to all other documents (red; other pillars in scope or clean documents).

For completeness and to verify that our results were not skewed by the presence of other violating documents, we conducted a final comparative analysis by isolating the scores of labeled documents against only the clean, non-violating documents. As evident in Figure 6, the core patterns persist: the content-based scores consistently yield superior separation between violating content (blue) and clean content (green) for the Illegal, Explicit, and PII/Privacy pillars (Figure 6a-c). In sharp contrast, the GenAI-description-based scores for these same three pillars exhibit significantly greater distribution overlap. Conversely, for the Low Quality pillar (Figure 6d), the GenAI-description method again established a much clearer boundary from the clean documents than the content-based method, further validating our hybrid scoring approach.

Figure 6. Trust Score Distribution Comparing Pillars Exclusively to Clean Documents. Violin plots showing the distribution of scores for documents belonging to a specific violation pillar (blue) compared only to Clean documents (green).

Score Generation for All Documents

Based on these differentiating findings, we adopted a hybrid scoring approach: we use the content-based trust scores for the Illegal, Explicit, and PII/Privacy pillars, and the GenAI-description-based trust scores for the Low Quality pillar. This decision enabled the computation of the most effective Content Trust Scores for all documents in our library across every trust pillar.
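A minimal sketch of the hybrid selection, assuming pillar names and per-pillar score dictionaries that are illustrative rather than Scribd's actual schema:

```python
# Which score source won for each pillar (per the comparison above)
SCORE_SOURCE = {
    "illegal": "content",
    "explicit": "content",
    "pii_privacy": "content",
    "low_quality": "genai_description",
}

def hybrid_trust_scores(content_scores: dict, description_scores: dict) -> dict:
    """Pick the better-performing score source for each pillar."""
    sources = {"content": content_scores, "genai_description": description_scores}
    return {pillar: sources[src][pillar] for pillar, src in SCORE_SOURCE.items()}

# Toy scores for one document
scores = hybrid_trust_scores(
    {"illegal": 0.82, "explicit": 0.31, "pii_privacy": 0.44, "low_quality": 0.20},
    {"illegal": 0.60, "explicit": 0.28, "pii_privacy": 0.40, "low_quality": 0.71},
)
```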

Classification Through Threshold Setting

The content trust score reflects the extent to which a document violates a specific pillar – a high score indicates that the document closely resembles the defined trust violation type. To build a classification system that flags violations, we must determine an optimal score threshold.

Strategic Thresholding: Prioritizing Precision

In this work, we chose to prioritize precision to build a high-confidence classification system. Our goal is to maintain a very low mislabeling rate, specifically aiming for a false positive rate (FPR) close to 1%. This decision is driven by the need to minimize user friction – incorrectly flagging documents as violating trust pillars would be an undesirable user experience, making the avoidance of high FPR our primary concern.

Building the Evaluation Dataset

The inherently low document count for certain violation types (e.g., Explicit) prevented us from performing reliable analyses to determine classification thresholds. To address this methodological challenge, we developed an expanded evaluation dataset. This was built by taking the original modeling data (both training and testing sets) and augmenting it with a high volume of additional human-annotated documents from our existing corpus. By incorporating this high-volume, high-quality labeled data, we established a more comprehensive baseline for threshold analysis. To ensure fair comparisons between the content-based and GenAI-description-based scores, we filtered the data to only include documents with both scores available. This refinement resulted in a final working total of approximately 109,000 documents in the evaluation dataset.

Final Classification Thresholds

For each of the four in-scope trust pillars, we calculated classification metrics, specifically recall and false positive rate (FPR), across a range of thresholds (0.5 to 0.95). Adhering to our rigorous safety standards, we prioritized precision to maintain an FPR close to 1%. This conservative thresholding strategy was chosen to minimize user friction associated with false flagging. The final score thresholds for the classification systems of the four Trust Pillars are summarized in Table 2.
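Mechanically, a threshold sweep of this kind can be sketched as follows; the scores and labels below are toy values, not our evaluation data:

```python
import numpy as np

def recall_fpr(scores, labels, threshold):
    """Recall and false positive rate when flagging documents with score >= threshold.
    labels: 1 = violates the pillar, 0 = does not."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    flagged = scores >= threshold
    tp = np.sum(flagged & (labels == 1))   # true positives
    fp = np.sum(flagged & (labels == 0))   # false positives
    recall = tp / max(np.sum(labels == 1), 1)
    fpr = fp / max(np.sum(labels == 0), 1)
    return float(recall), float(fpr)

# Toy scores and labels; sweep a few thresholds from the 0.5-0.95 range
scores = [0.9, 0.85, 0.7, 0.6, 0.4, 0.3, 0.82, 0.2]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
sweep = {t: recall_fpr(scores, labels, t) for t in (0.5, 0.8, 0.95)}
```

The chosen operating point is then the threshold whose FPR sits closest to the 1% target.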

Table 2. Classification metrics at the chosen thresholds for the Trust Pillars.
Trust Pillar    Score Threshold    Recall     False Positive Rate
Illegal         0.80               71.83%     0.79%
Explicit        0.80               10.22%     1.07%
PII/Privacy     0.75               3.82%      0.62%
Low Quality     0.60               27.20%     0.52%

The analysis revealed that the Illegal pillar achieved the optimal balance of metrics, securing a high recall of 72% while maintaining an excellent FPR of 0.79%. The Low Quality pillar, which relies on the GenAI-description-based scores, achieved a respectable recall of 27.2% with a very low FPR of 0.52%. This outcome validates our decision to utilize the descriptive embeddings for this challenging content type.

However, this high performance was not replicated across all Trust Pillars. Specifically, the strict FPR target limited the system’s ability to capture certain violations, with Explicit and PII/Privacy achieving recalls of only 10% and 4%, respectively. This disparity highlights the inherent challenge of identifying documents violating these two pillars, as their topical language is much broader and less defined than that of the other classes.

These results serve as an initial performance baseline. We are actively exploring internal refinements to our embedding representations and scoring logic, as well as integrating complementary models, to progressively enhance detection sensitivity. Our goal is to expand coverage across these more complex pillars while strictly upholding our commitment to a low false-positive environment.

Discussion

Our work demonstrates a straightforward and flexible content moderation system by effectively leveraging classical machine learning principles (cosine similarity, thresholding) alongside modern Large Language Models (LLMs) for superior document representation. This hybrid approach offers several key operational and technical advantages:

Technical and Operational Advantages

  • Scalability and Efficiency: The final content trust score calculation relies on simple vector mathematics (cosine similarity) against pre-computed pillar embeddings. This allows the system to run efficiently at scale with a low computational cost for real-time inference.
  • Customizable Representations: The system is easy to fine-tune, allowing us to quickly update the trust category representations (the Pillar Embeddings) using new data. This flexibility is critical for adapting the system to the unique data and specific violation nuances present in our library.
  • Enhanced Contextual Understanding: Incorporating LLM-generated summaries provides a level of contextual understanding that helps handle the nuance and ambiguity often present in challenging document types (e.g., extremely long documents or those with minimal text).
  • Resilience to Emerging Threats: The use of semantic embeddings, which capture underlying meaning rather than just keywords, allows the system to adapt well to new or evolving types of harmful content without requiring constant manual rule updates.

Potential Applications

The Content Trust Score and the underlying classification system created in this project open the door to various critical applications at Scribd:

  • Content Safety in Discovery: Serving as a primary filter to ensure safe content appears prominently in search results and recommendation feeds. Our N-way testing experiments revealed that filtering unsafe content from search results significantly increases core business metrics (e.g., signup) and user engagement (e.g., read time).

Further Reading

This project was recently presented at TrustCon 2025. For those interested in a visual walkthrough of the dual-embedding approach, you can view the full presentation slides on Slideshare.

Acknowledgments

This work was a collaborative effort, and we are incredibly grateful to the following individuals and teams for their invaluable contributions:

  • Rafael Lacerda, Monique Alves Cruz, and Seyoon Kim for their strategic guidance and steadfast support throughout the project.
  • John Strenio for his foundational research and exploratory work that paved the way for this initiative.
  • Kara Killough for her diligent efforts in building the high-quality annotated datasets that powered our models.
  • The Search and Recommendation Teams for their partnership and agility in integrating the trust scores, directly driving the measurable improvements in our user experience and business metrics.
Eric Chang

Screaming in the Cloud
2026-02-10 · https://tech.scribd.com/blog/2026/screaming-in-the-cloud

Scribd has absolutely fascinating data-at-scale problems, all the way down to the fundamentals of how we use AWS S3. In my previous post I wrote about the design of Content Crush and how Scribd is consolidating objects in S3 to minimize our costs. Related to that work, I was fortunate enough to join the (in)famous Corey Quinn to talk about engineering around extreme S3 scale:

Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive. Normal solutions don’t work anymore. Tyler shares how with this much data, you can’t just throw money at the problem, but rather you have to engineer your way out.

You can also listen on Everand or watch via the Last Week in AWS YouTube channel.

R Tyler Croy

Deploying a Cost-Effective, Scalable PhotoDNA System for CSAM Detection
2026-01-20 · https://tech.scribd.com/blog/2026/photodna-csam-detection

Child safety is a non‑negotiable responsibility for any platform that hosts user‑generated content. Over the last year, we designed and deployed a production system that detects known Child Sexual Abuse Material (CSAM) using PhotoDNA perceptual hashes, integrates with the National Center for Missing and Exploited Children’s (NCMEC) reporting system, and scales efficiently across our ingestion surfaces. This post explains the problem we set out to solve, how PhotoDNA hashing works, the online child-protection ecosystem (NCMEC, Tech Coalition, Project Lantern), our architecture and operational model, cost considerations, and key learnings.

Note: This article discusses safety technology at a high level. We intentionally omit sensitive operational details to protect the effectiveness of these defenses.

Problem: Accurate CSAM detection at scale, within strict safety and cost constraints

We needed to:

  • Accurately detect known CSAM at upload and in historical backfills.
  • Minimize false positives while keeping latency low on critical paths.
  • Meet obligations for reporting to NCMEC and preserve chain‑of‑custody evidence.
  • Fit within pragmatic cost envelopes and scale elastically with traffic.
  • Integrate into Scribd’s existing ML and batch compute ecosystem for observability, auditability, and maintainability.

The ecosystem: Tech Coalition, Project Lantern, PhotoDNA, and NCMEC

  • Tech Coalition and Project Lantern: An industry consortium and initiative to strengthen cross‑platform child safety, including responsible signal sharing that helps disrupt abusers across services. Lantern focuses on sharing signals that increase detection of predatory accounts and coordinated abuse while respecting privacy and legal constraints.
  • PhotoDNA: A perceptual hashing technology created by Microsoft in collaboration with Dartmouth College. PhotoDNA transforms an image into a robust hash that stays stable across common modifications (resize, recompress, minor color adjustments). Matching is performed against vetted hash sets of known illegal content.
  • NCMEC (National Center for Missing & Exploited Children): A US-based nonprofit with a Congressional mandate to serve as the clearinghouse and triage for CSAM reports in the United States via the CyberTipline. NCMEC also serves as the steward of vetted CSAM hash data sets. US-based platforms are required to report confirmed CSAM and retain appropriate artifacts for law enforcement.

How PhotoDNA CSAM detection works (at a glance)

  1. An image is normalized (e.g., resized, converted to a canonical colorspace). For PDFs, we first extract images.
  2. A perceptual transformation produces a PhotoDNA hash vector.
  3. We compare that hash to vetted hash sets using a distance threshold tuned to minimize both false positives and false negatives.
  4. A match triggers automated containment (quarantine/blocks), evidence preservation, safety review, and NCMEC reporting workflows.
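The matching step (3) can be illustrated with toy vectors. This is a sketch only: real PhotoDNA hashes are 144-byte vectors, and the vetted distance metric and thresholds come from Microsoft’s guidance, not from this example:

```python
import numpy as np

# Illustrative only: the distance function and threshold here are placeholders
# for the vetted metric, not the actual PhotoDNA algorithm.
def matches(hash_a: np.ndarray, hash_b: np.ndarray, threshold: int) -> bool:
    """Flag a match when the distance between two perceptual hashes is small."""
    distance = np.sum(np.abs(hash_a.astype(int) - hash_b.astype(int)))
    return bool(distance <= threshold)

rng = np.random.default_rng(0)
known = rng.integers(0, 256, size=144)           # stand-in for a vetted hash
slightly_modified = known.copy()
slightly_modified[:3] += 1                       # recompression-style perturbation
unrelated = rng.integers(0, 256, size=144)       # hash of an unrelated image
```

The point of a perceptual hash is visible even in this toy: a minor perturbation stays within the threshold, while an unrelated image lands far outside it.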

Architecture

At a high level, we separate event-driven, highly parallel PhotoDNA hash generation from a high-throughput, GPU-based batch PhotoDNA matcher. The components in our design are AWS services, but equivalents from any other hyperscaler will suffice.

Key properties:

  • The deterministic matching path is GPU‑parallel, horizontally scalable, and isolated from image transform and hash generation.
  • Hash set updates are versioned and rolled atomically; match records include hash‑set version.
  • Matches are logged and reviewed.

PhotoDNA CSAM Detection System diagram

The diagram above shows the high-level architecture of our PhotoDNA CSAM Detection System. The system is designed to be cost-effective, scalable, and efficient.

Hasher and matcher details

Hasher: event-driven and highly parallel

  • Image sources: raw images and images extracted from PDFs (embedded image extraction is deterministic and versioned).
  • Parallelism: Each PDF document is processed in parallel by evented and isolated compute (AWS Lambda).
  • Storage: PhotoDNA hashes are versioned and stored for every extracted image.
  • Observability: structured metrics (throughput, error codes, backlog depth) and end‑to‑end lineage identifiers provide for auditability.

Matcher: high‑throughput batch

  • Vetted hash sets are loaded for matching; where feasible, keep structures memory‑resident to maximize throughput.
  • Batched distance computations with conservative thresholds minimize false positives; thresholds and policies are versioned.
  • Aggregation: combine duplicate or near‑duplicate image evidence into per‑asset decisions and preserve the strongest evidence for review.
  • Events and evidence: emit match events to quarantine/review flows and include hash‑set version and metadata for audit.

Lessons Learned & Best Practices

Which NCMEC hash set to use?

We prioritize vetted, legally curated sources:

  • Primary: NCMEC‑provided hash sets for known CSAM.
  • Supplementary: Industry‑shared signals via Tech Coalition initiatives (e.g., Project Lantern) where applicable and approved.

Operationally, we version, verify, and roll out hash updates.

Where do GPUs come in?

In our final design and implementation, graphical processing units (GPUs) materially improved throughput and unit cost for PhotoDNA hashing when run as SageMaker Batch workloads. We containerized the PhotoDNA pipeline and executed it on GPU‑backed instances to accelerate matching, enabling us to meet tight batch Service-level objectives (SLOs) and backfill schedules with fewer nodes.

  • Batched matching on GPU nodes via SageMaker Batch/Processing reduced runtimes significantly.
  • GPU‑accelerated transforms improved end‑to‑end throughput.
  • Higher throughput per node reduced cost at scale.

Learnings from Microsoft’s PhotoDNA guidance

  • Preprocessing matters: adhere to canonical normalization steps (grayscale, downsample strategy) or use the vetted cloud service where appropriate.
  • Treat thresholds conservatively; don’t repurpose perceptual distances beyond vetted safety use cases.
  • Keep auditable logs of match context and system versions; separate operational telemetry from sensitive evidence artifacts.

Machine learning (ML) deployment at Scribd: Observability and operational rigor

Although PhotoDNA isn’t “a model” we train, we run complementary ML components and rigorous observability:

  • Weights & Biases (W&B): We host versioned models in the W&B Model Registry, with lineage and provenance for audit. SageMaker Batch jobs resolve the model from the registry to ensure reproducibility.
  • AWS SageMaker Batch Inference: We run batch inference jobs using standardized containers, consistent IAM boundaries, and autoscaling.

Cost model

We sized for steady‑state uploads and periodic backfills:

  • Compute: GPU‑backed SageMaker Batch for PhotoDNA hashing improved throughput/SLOs and, when saturated, delivered better $/throughput than equivalently provisioned CPU fleets.
  • Storage: Keep only what is necessary for safety review and legal retention. Use lifecycle policies and tiering for aging artifacts.
  • Queueing and elasticity: Amazon Simple Queue Service (SQS) buffers absorb bursts; autoscaling workers maintain SLOs without overprovisioning.
  • Hash set operations: Updates are small; cost is dominated by compute and storage around matches and evidence.

In practice, the unit economics are driven by: input volume, match rate (rare but higher cost per event), retention windows, and backfill cadence.

Wins

  • Safety‑first by design: Deterministic matching path is simple, fast, and auditable.
  • Operational clarity: Clear blast‑radius boundaries between hashing, matching, enrichment, and reporting.
  • Scalable and cost‑effective: GPU‑accelerated hashing on SageMaker Batch achieved high throughput and favorable unit economics at scale.
  • Stronger together: Collaboration with the ecosystem improves coverage and response speed.

Operational guardrails and compliance

  • Strict identity and access management (IAM) boundaries; least‑privilege for all safety components.
  • Immutable logging with retention; separate telemetry from sensitive evidence.
  • Privacy and data minimization: collect only what’s necessary for safety and compliance.

Acknowledgments

This was truly a cross‑functional effort. Thank you:

Collaboration highlights

  • Ongoing alignment with NCMEC reporting workflows (evidence packaging, retention, and audit trails).
  • Incorporating best practices from Microsoft’s PhotoDNA guidance for normalization and thresholding.
  • Participation with industry groups (e.g., Tech Coalition/Project Lantern) to improve cross‑platform defenses.

Appendix: FAQs

  • Does PhotoDNA require GPUs? No. However, in our SageMaker Batch implementation, GPUs significantly improved throughput and cost for large‑scale hashing, so we run hashing on GPU for batch workloads.
  • How are false positives handled? Conservative thresholds plus human‑in‑the‑loop review on any flagged item before reporting or account actions.
Anish Kumar

Supercharging S3 Intelligent Tiering with Content Crush
2026-01-12 · https://tech.scribd.com/blog/2026/content-crush

Scribd and Slideshare have been using AWS S3 for almost twenty years and store hundreds of billions of objects, making storage management quite a challenge. My focus at Scribd has generally been on data and storage, but only in the past twelve months have I started to really focus on one of our hardest technology problems: cost-effective storage and availability for the hundreds of billions of objects that represent our content library.

Since adopting S3 for our object storage in 2007, a lot has changed with the service, most notably Intelligent Tiering, introduced in 2018. At a very high level, Intelligent Tiering allows object access patterns to dictate the storage tier for a small per-object monitoring fee. Behind the scenes, S3 handles moving infrequently accessed objects into cheaper storage.

For most organizations simply adopting Intelligent Tiering is the right solution to save on S3 storage costs. For Scribd however the sheer number of objects in our buckets makes the problem much more complex.

“Cost management is an architecture problem.”
— Mike Julian of Duckbill

The small per-object monitoring fee adds up to some serious numbers. While monitoring 100 million objects costs $250/month, monitoring 100 billion objects costs $250,000/month. Billion is such a big number that it is hard to make sense of it sometimes.

The difference between a million and a billion is three orders of magnitude. Intelligent Tiering was not going to work for Scribd unless we found a way to reduce or remove hundreds of billions of objects!
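As a back-of-the-envelope check, the figures above are consistent with a monitoring fee of $0.0025 per 1,000 objects per month (an assumption here; check current AWS pricing):

```python
# Assumed Intelligent-Tiering monitoring fee; verify against current AWS pricing.
FEE_PER_1000_OBJECTS = 0.0025  # USD per month

def monthly_monitoring_cost(object_count: int) -> float:
    """Monthly Intelligent-Tiering monitoring cost for a given object count."""
    return object_count / 1000 * FEE_PER_1000_OBJECTS

cost_100m = monthly_monitoring_cost(100_000_000)      # ~$250/month
cost_100b = monthly_monitoring_cost(100_000_000_000)  # ~$250,000/month
```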

Content Crush

When users upload a document or presentation to Scribd and Slideshare a lot of machinery kicks in to process the file and converts it into a multitude of smaller files to ensure clear and correct rendering on a variety of devices. Further post-processing is done to help Scribd’s systems understand the document with a multitude of generated textual and image-based metadata. As a result one file upload might result in hundreds or sometimes thousands of different objects being produced in various storage locations.

Content Crush is the system we have built to bring all these objects back into a single stored object while preserving a virtualized keyspace and the discrete retrieval semantics for systems which rely on these smaller files.

Before Content Crush a single document upload could produce something like the following tree:

s3://bucket/guid/
               /info.json
               /metadata.json
               /imgs/
                    /0.jpg
                    /1.jpg
               /fmts/
                    /original.pdf
                    /v1.tar.bz2
                    /v2.zip
               /other/
                     /random.uuid
                     /debian.iso
                     /dbg.txt

After Content Crush these different objects are collapsed into a single Apache Parquet file in S3:

s3://bucket/guid.parquet

We became intimately familiar with the Parquet file format from our work creating delta-rs. The format was designed in a way that really excels in object storage systems like AWS S3. For example:

  • S3 allows GetObject with byte ranges for partial reads of an object, most importantly it allows for negative offset reads. This allows fetching the last N bytes of a file.
  • Parquet stores its metadata at the end of a file, with the last 8 bytes indicating the length of the footer. One can read all of a file’s metadata with two calls: GetObject(-8) followed by GetObject(-(footer_len + 8)).
  • Parquet’s footer metadata records the byte offsets of the different row groups, allowing retrieval of one of N row groups rather than requiring full object reads.
  • Additional user-provided metadata in the file footer allows for further optimizations around selective reads.
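The two-call metadata read is easy to sketch. Assuming a `get_suffix(n)` helper that wraps GetObject with a `Range: bytes=-n` header (the helper name is ours, not a real S3 API), fetching the footer looks roughly like:

```python
import struct

def read_parquet_footer(get_suffix):
    """Fetch a Parquet footer with two suffix reads.

    `get_suffix(n)` returns the last n bytes of the object, i.e. an
    S3 GetObject with `Range: bytes=-n`, so the object's total size
    never needs to be known up front.
    """
    # Read 1: the 8-byte tail is a 4-byte little-endian footer length
    # followed by the b"PAR1" magic bytes.
    footer_len, magic = struct.unpack("<I4s", get_suffix(8))
    if magic != b"PAR1":
        raise ValueError("not a Parquet file")
    # Read 2: the footer sits immediately before that 8-byte tail.
    return get_suffix(footer_len + 8)[:footer_len]

# Simulated object: row-group data, footer bytes, footer length, magic.
footer = b"thrift-encoded-metadata"
obj = b"row-group-bytes" + footer + struct.pack("<I", len(footer)) + b"PAR1"
assert read_parquet_footer(lambda n: obj[-n:]) == footer
```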

Parquet file layout

Without Apache Parquet, Content Crush fundamentally would not work. There is prior art for “compressing objects” into S3 with other formats, but for our purposes they all have downsides:

  • Zip: Streamable but not suitable for random access inside the file
  • Tar: Also streamable, but it has the same issue as zip, plus there is nuance between different implementations.
  • Build your own: I looked into this but all my designs ended up looking like a less-good version of Apache Parquet.

The original prototype implementation used S3 Object Lambda, which allowed for a seamless drop-in for existing S3 clients: applications could switch from one S3 Access Point to another without any indication that they were accessing “crushed” files. Since Object Lambda has ceased to be, Content Crush is being moved over to an S3 API-compatible service.

Downsides

No optimization is ever free, and crushed assets have a couple of caveats that are important to consider:

  • Retrieval of a single “file” within a crushed object requires at least two GetObject calls to retrieve the appropriate data. The worst case is three, since most Parquet readers will read the footer length, then the footer, and then fetch the data they seek. We can typically optimize this by hinting at the footer size with a 95% estimate.
  • This system works well with relatively static objects, since editing a “file” inside of a crushed object requires the whole object to be read and then re-written. There are also concurrency concerns with object updates: we must ensure that only one process is updating an object at a time.
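The footer-size hint can be sketched as a speculative over-read: fetch `hint + 8` trailing bytes in one request, and fall back to a second request only when the estimate was too small (function and parameter names are illustrative):

```python
import struct

def read_footer_with_hint(get_suffix, hint: int):
    """Read a Parquet footer using `hint` as an estimated footer size.

    `get_suffix(n)` returns the last n bytes of the object (an S3
    GetObject with `Range: bytes=-n`). Returns the footer bytes and
    the number of requests made."""
    tail = get_suffix(hint + 8)
    footer_len, magic = struct.unpack("<I4s", tail[-8:])
    if magic != b"PAR1":
        raise ValueError("not a Parquet file")
    if footer_len <= hint:
        # The estimate covered the footer: one round trip total.
        return tail[-(footer_len + 8):-8], 1
    # Estimate too small: one more ranged read for the full footer.
    return get_suffix(footer_len + 8)[:footer_len], 2

footer = b"x" * 100
obj = b"row-group-data" + footer + struct.pack("<I", len(footer)) + b"PAR1"
assert read_footer_with_hint(lambda n: obj[-n:], hint=128) == (footer, 1)
assert read_footer_with_hint(lambda n: obj[-n:], hint=16) == (footer, 2)
```

With a hint that is right ~95% of the time, most reads pay one request instead of two or three.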

A related downside with maintaining an S3 API-compatible service is that retrieving multiple files inside of single objects cannot be easily pipelined or streamed. There are a number of ways to solve for this that I am exploring, but they all converge on a different API scheme entirely to take advantage of HTTP2.

Upsides!

The ability to effectively use S3 Intelligent Tiering is by far the largest benefit of this approach. With a dramatic reduction in object counts we can adopt S3 Intelligent Tiering for large buckets in a way that provides major cost improvements.

Fewer objects also makes tools like S3 Batch Operations viable for these massive buckets.

There are also hidden performance optimizations now available that were not possible before. For example, for heavily requested objects there are now AZ-local caching opportunities, whether at the API service layer or simply by pulling popular objects into S3 Express One Zone.


Much of this work is ongoing and not completely open source. None of this would have been possible without the stellar work by the folks in the Apache Arrow Rust community building the high-performance parquet crate. After we set off on this path we learned of their similar work in Querying Parquet with Millisecond Latency.

There remains plenty of work to be done building the foundational storage and content systems at Scribd which power one of the world’s largest digital libraries. If you’re interested in learning more we have a lot of positions open right now!

Presentation

Content Crush was originally shared at the August 2025 FinOps Meetup hosted by duckbill, with the slides from that event hosted on Slideshare below:

2025-08 San Francisco FinOps Meetup from RTylerCroy
]]>
R Tyler Croy
Don’t hardcode IAM credentials in GitHub!2026-01-06T00:00:00+00:002026-01-06T00:00:00+00:00https://tech.scribd.com/blog/2026/teraform-oidc-moduleScribd deploys a lot of code from GitHub to AWS using GitHub Actions, which means many of our Actions need to access AWS resources. Managing AWS API keys and tokens for different IAM users is time-consuming, brittle, and insecure. Managing key distribution between AWS and GitHub also makes it difficult to track which keys go where, when they should be rotated, and what permissions those keys have. Fortunately, AWS supports creating OpenID Connect identity providers, which are an ideal tool to handle this kind of cross-cloud authentication in a more maintainable way.

From the AWS documentation:

IAM OIDC identity providers are entities in IAM that describe an external identity provider (IdP) service that supports the OpenID Connect (OIDC) standard, such as Google or Salesforce.

You use an IAM OIDC identity provider when you want to establish trust between an OIDC-compatible IdP and your AWS account.

The following diagram from GitHub’s documentation gives an overview of how GitHub’s OIDC provider integrates with your workflows and cloud provider:

OIDC diagram from GitHub documentation

From within GitHub Actions we can specify the repository and role to assume via the aws-actions/configure-aws-credentials action, which will automatically configure the necessary credentials for AWS SDK operations inside the job.

Our newly open sourced terraform-oidc-module makes setting up the resources necessary to bridge the gap between AWS and GitHub much simpler.


Tying OIDC together between AWS and a single GitHub repository starts with the aws_iam_openid_connect_provider resource, but then developers must also configure resources and permissions for common deployment tasks such as:

  • accessing an S3 bucket with read-only permissions
  • accessing an S3 bucket with write permissions
  • accessing ECR with read-only permissions
  • accessing ECR with write permissions
  • accessing some AWS service with some specific permission set

Without the terraform-oidc-module, redoing this work for every repository in the organization to ensure segmentation of permissions becomes very tedious.
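For a sense of what the module wraps, the manual setup for a single repository looks roughly like the following sketch (resource names and values are illustrative):

```hcl
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["example0000example000example"]
}

resource "aws_iam_role" "deploy" {
  name = "github-deploy-example"

  # Only tokens minted for this repo's main branch may assume the role.
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRoleWithWebIdentity"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Condition = {
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:REPO_ORG/REPO_NAME:ref:refs/heads/main"
        }
      }
    }]
  })
}
```

Multiply this by per-repository policies for S3, ECR, and other services, and the appeal of a reusable module becomes clear.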

module "oidc" {
  source = "git::https://github.com/scribd/terraform-oidc-module.git?ref=v1.0.0"

  name = "example"
  url = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
  thumbprint_list = ["example0000example000example"]
  repo_ref = ["repo:REPO_ORG/REPO_NAME:ref:refs/heads/main"]

  custom_policy_arns = [aws_iam_policy.example_policy0.arn, aws_iam_policy.example_policy1.arn]

  tags = {
    Terraform = "true"
    Environment = "dev"
  }
}

I hope you find this useful for getting started with OIDC and GitHub Actions!

]]>
Oleh Motrunych
Building a Scalable Data Lake Backup System with AWS2025-09-22T00:00:00+00:002025-09-22T00:00:00+00:00https://tech.scribd.com/blog/2025/building-scalable-data-warehouse-backup-systemWe designed and implemented a scalable, cost-optimized backup system for S3 data warehouses that runs automatically on a monthly schedule. The system handles petabytes of data across multiple databases and uses a hybrid approach: AWS Lambda for small workloads and ECS Fargate for larger ones. At its core, the pipeline performs incremental backups — copying only new or changed parquet files while always preserving delta logs — dramatically reducing costs and runtime compared to full backups. Data is validated through S3 Inventory manifests, processed in parallel, and stored in Glacier for long-term retention. To avoid data loss and reduce storage costs, we also implemented a safe deletion workflow. Files older than 90 days, successfully backed up, and no longer present in the source are tagged for lifecycle-based cleanup instead of being deleted immediately. This approach ensures reliability, efficiency, and safety: backups scale seamlessly from small to massive datasets, compute resources are right-sized, and storage is continuously optimized.

Open Data Warehouse Backup System diagram


Our old approach had problems:

  • Copying over the same files all the time – not effective from a cost perspective
  • Timeouts when manifests were too large for Lambda
  • Redundant backups inflating storage cost
  • Orphaned files piling up without clean deletion

We needed a systematic, automated, and cost-effective way to:

  • Run monthly backups across all databases
  • Scale from small jobs to massive datasets
  • Handle incremental changes instead of full copies
  • Safely clean up old data without risk of data loss

The Design at a Glance

We built a hybrid backup architecture on AWS primitives:

  • Step Functions – orchestrates the workflow
  • Lambda – lightweight jobs for small manifests
  • ECS Fargate – heavy jobs with no timeout constraints
  • S3 + S3 Batch Ops – storage and bulk copy/delete operations
  • EventBridge – monthly scheduler
  • Glue, CloudWatch, Secrets Manager – reporting, monitoring, secure keys
  • IAM – access and roles

The core idea: do not copy files that are already in the backup, always copy the delta logs, and route small manifests to Lambda and big ones to ECS.


How It Works

  1. Database Discovery

    Parse S3 Inventory manifests
    Identify database prefixes
    Queue for processing (up to 40 in parallel)

  2. Manifest Validation

    Before we touch data, we validate:

    • JSON structure
    • All CSV parts present
    • File counts + checksums match
      If incomplete → wait up to 30 minutes before retry
  3. Routing by Size

    • ≤25 files → Lambda (15 minutes, 5GB)
    • >25 files → ECS Fargate (16GB RAM, 4 vCPUs, unlimited runtime)
  4. Incremental Backup Logic

    • Load exclusion set from last backup
    • Always include delta logs
    • Only back up parquet files not yet in backup
    • Ignore non-STANDARD storage classes (we use Intelligent-Tiering; over time files can go to Glacier and we don’t want to touch them)
    • Process CSVs in parallel (20 workers)
    • Emit new manifest + checksum for integrity
  5. Copying Files

    • Feed manifests into S3 Batch Operations
    • Copy objects into Glacier storage
  6. Safe Deletion

    • Compare current inventory vs. incremental manifests
    • Identify parquet files that:
      • Were backed up successfully
      • No longer exist in source
      • Are older than 90 days
    • Tag them for deletion instead of deleting immediately
    • Deletion is performed using S3 lifecycle configuration for cost-optimized deletion
    • Tags include timestamps for rollback + audit
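The selection rules in step 4 can be sketched as a filter over the inventory (a simplified illustration; the names and manifest shapes are ours, not the production pipeline's):

```python
def plan_backup(inventory, backed_up):
    """Choose which objects an incremental run should copy.

    `inventory` yields (key, storage_class) rows from the S3 Inventory
    manifest; `backed_up` is the exclusion set loaded from the previous
    backup's manifest.
    """
    to_copy = []
    for key, storage_class in inventory:
        if "_delta_log/" in key:
            to_copy.append(key)   # delta logs are always included
        elif storage_class != "STANDARD":
            continue              # leave tiered/Glacier objects untouched
        elif key.endswith(".parquet") and key not in backed_up:
            to_copy.append(key)   # only parquet files not yet backed up
    return to_copy

inventory = [
    ("db/t/_delta_log/000.json", "STANDARD"),
    ("db/t/part-0.parquet", "STANDARD"),   # already backed up: skipped
    ("db/t/part-1.parquet", "STANDARD"),   # new: copied
    ("db/t/part-2.parquet", "GLACIER"),    # non-STANDARD: skipped
]
assert plan_backup(inventory, {"db/t/part-0.parquet"}) == [
    "db/t/_delta_log/000.json",
    "db/t/part-1.parquet",
]
```

The resulting list becomes the new manifest that is checksummed and handed to S3 Batch Operations.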

Error Handling & Resilience

  • Retries with exponential backoff + jitter
  • Strict validation before deletes
  • Exclusion lists ensure delta logs are never deleted
  • ECS tasks run in private subnets with VPC endpoints
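The retry behavior can be sketched with exponential backoff and full jitter, a common pattern for AWS API calls (parameter values and names here are illustrative):

```python
import random
import time

def with_backoff(op, attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Run `op`, retrying on failure with exponential backoff + jitter."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            # Full jitter: sleep a random amount up to the capped backoff.
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Example: an operation that fails twice before succeeding.
calls = []
def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("throttled")
    return "ok"

assert with_backoff(flaky, sleep=lambda s: None) == "ok"
assert len(calls) == 3
```

Jitter spreads retries out so that many workers throttled at once do not all retry in lockstep.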

Cost & Performance Gains

  • Incremental logic = no redundant transfers
  • Lifecycle rules = backups → Glacier, old ones cleaned
  • Size-based routing = Lambda for cheap jobs, ECS for heavy jobs
  • Parallelism = 20 CSV workers per manifest, 40 DBs at once

Lessons Learned

  • Always validate manifests before processing
  • Never delete immediately → tagging first saved us money
  • Thresholds matter: 25 files was our sweet spot
  • CloudWatch + Slack reports gave us visibility we didn’t have before

Conclusion

By combining Lambda, ECS Fargate, and S3 Batch Ops, we’ve built a resilient backup system that scales from small to massive datasets. Instead of repeatedly copying the same files, the system now performs truly incremental backups — capturing only new or changed parquet files while always preserving delta logs. This not only minimizes costs but also dramatically reduces runtime.

Our safe deletion workflow ensures that stale data is removed without risk, using lifecycle-based cleanup rather than immediate deletion. Together, these design choices give us reliable backups, efficient scaling, and continuous optimization of storage. What used to be expensive, error-prone, and manual is now automated, predictable, and cost-effective.

]]>
Oleh Motrunych
Let’s save tons of money with cloud-native data ingestion!2025-08-01T00:00:00+00:002025-08-01T00:00:00+00:00https://tech.scribd.com/blog/2025/cloud-native-data-ingestionDelta Lake is a fantastic technology for quickly querying massive data sets, but first you need those massive data sets! In this talk from Data and AI Summit 2025 I dive into the cloud-native architecture Scribd has adopted to ingest data from AWS Aurora, SQS, Kinesis Data Firehose and more!

By using off-the-shelf open source tools like kafka-delta-ingest, oxbow and Airbyte, Scribd has redefined its ingestion architecture to be more event-driven, reliable, and most importantly: cheaper. No jobs needed! Attendees will learn how to use third-party tools in concert with a Databricks and Unity Catalog environment to provide a highly efficient and available data platform.

This architecture will be presented in the context of AWS but can be adapted for Azure, Google Cloud Platform or even on-premise environments. The slides are also available on Scribd!

]]>
R Tyler Croy
Terraform module to manage Oxbow Lambda and its components2025-03-14T00:00:00+00:002025-03-14T00:00:00+00:00https://tech.scribd.com/blog/2025/terraform-oxbow-moduleOxbow is a project that turns an existing storage location containing Apache Parquet files into a Delta Lake table. It is intended to run both as an AWS Lambda and as a command line application. We are excited to introduce terraform-oxbow, an open-source Terraform module that simplifies the deployment and management of AWS Lambda and its supporting components. Whether you’re working with AWS Glue, Kinesis Data Firehose, SQS, or DynamoDB, this module provides a streamlined approach to infrastructure as code (IaC) in AWS.

✨ Why terraform-oxbow?

Managing event-driven architectures in AWS can be complex, requiring careful orchestration of multiple services. terraform-oxbow abstracts much of this complexity, enabling users to configure key components with simple boolean flags and module parameters. This is an easy and efficient way to have a Delta table created from Apache Parquet files.

🚀Features

With terraform-oxbow, you can deploy:

  • AWS Oxbow Lambda with customizable configurations
  • Kinesis Data Firehose for real-time data streaming
  • SQS and SQS Dead Letter Queues for event-driven messaging
  • IAM policies for secure access management
  • S3 bucket notifications to trigger Lambda functions
  • DynamoDB tables for data storage and locking
  • AWS Glue Catalog and Tables for schema management

⚙️ How It Works

This module follows a modular approach, allowing users to enable or disable services based on their specific use case. Here are a few examples:

  • To enable the AWS Glue Catalog and Tables: enable_aws_glue_catalog_table = true

  • To enable a Kinesis Data Firehose delivery stream: enable_kinesis_firehose_delivery_stream = true

  • To enable S3 bucket notifications: enable_bucket_notification = true

  • To enable the advanced Oxbow Lambda setup for multi-table filtered optimization: enable_group_events = true

A few considerations to keep in mind:

  • S3 bucket notification limitations: due to AWS constraints, an S3 bucket can only have a single notification configuration. If you need to trigger multiple Lambda functions from the same S3 bucket, consider using event-driven solutions like SNS or SQS.

  • IAM policy management: the module provides the necessary permissions but follows the principle of least privilege. Ensure your IAM policies align with your security requirements.

  • Scalability and optimization: the module allows fine-grained control over Lambda concurrency, event filtering, and data processing configurations to optimize costs and performance.
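Putting the flags together, a hypothetical invocation might look like the following (the source ref and flag combination are illustrative, not a pinned release):

```hcl
module "oxbow" {
  source = "git::https://github.com/scribd/terraform-oxbow.git?ref=v1.0.0"

  # Create a Glue catalog table alongside the Delta table.
  enable_aws_glue_catalog_table = true

  # Trigger the Lambda from S3 bucket notifications rather than Firehose.
  enable_bucket_notification              = true
  enable_kinesis_firehose_delivery_stream = false

  # Advanced multi-table filtered optimization for grouped events.
  enable_group_events = false
}
```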

]]>
Oleh Motrunych
Cloud-native Data Ingestion with AWS Aurora and Delta Lake2025-01-15T00:00:00+00:002025-01-15T00:00:00+00:00https://tech.scribd.com/blog/2025/cloud-native-data-ingestionOne of the major themes for Infrastructure Engineering over the past couple of years has been higher reliability and better operational efficiency. In a recent session with the Delta Lake project I was able to share the work led by Kuntal Basu and a number of other people to dramatically improve the efficiency and reliability of our online data ingestion pipeline.

Join Kuntal Basu, Staff Infrastructure Engineer, and R. Tyler Croy, Principal Engineer at Scribd, Inc. as they take you behind the scenes of Scribd’s data ingestion setup. They’ll break down the architecture, explain the tools, and walk you through how they turned off-the-shelf solutions into a robust pipeline.

Video

Presentation

]]>
R Tyler Croy
The Evolution of the Machine Learning Platform2024-02-05T00:00:00+00:002024-02-05T00:00:00+00:00https://tech.scribd.com/blog/2024/evolution-of-mlplatformMachine Learning Platforms (ML Platforms) have the potential to be a key component in achieving production ML at scale without large technical debt, yet ML Platforms are not often understood. This document outlines the key concepts and paradigm shifts that led to the conceptualization of ML Platforms in an effort to increase an understanding of these platforms and how they can best be applied.

Technical Debt and development velocity defined

Development Velocity

Machine learning development velocity refers to the speed and efficiency at which machine learning (ML) projects progress from the initial concept to deployment in a production environment. It encompasses the entire lifecycle of a machine learning project, from data collection and preprocessing to model training, evaluation, validation, deployment, and testing, whether for new models or for the re-training, validation, and re-deployment of existing models.

Technical Debt

The term “technical debt” in software engineering was coined by Ward Cunningham, who used the metaphor of financial debt to describe the trade-off between implementing a quick and dirty solution to meet immediate needs (similar to taking on financial debt for short-term gain) and taking the time to build a more sustainable and maintainable solution (akin to avoiding financial debt but requiring more upfront investment). Just as financial debt accumulates interest over time, technical debt can accumulate and make future development more difficult and expensive.

The idea behind technical debt is to highlight the consequences of prioritizing short-term gains over long-term maintainability and the need to address and pay off this “debt” through proper refactoring and improvements. The term has since become widely adopted in the software development community to describe the accrued cost of deferred work on a software project.

Technical Debt in Machine Learning

Though originally a software engineering concept, technical debt is also relevant to machine learning systems; in fact, the landmark Google paper suggests that ML systems have a propensity to easily accumulate this technical debt.

Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems

Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems

As the machine learning (ML) community continues to accumulate years of experience with live systems, a widespread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive

Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems

Technical debt is especially important to consider when trying to move fast. Moving fast is easy; moving fast without acquiring technical debt is a lot more complicated.

The Evolution Of ML Platforms

DevOps – The paradigm shift that led the way

DevOps is a methodology in software development which advocates for teams owning the entire software development lifecycle. This paradigm shift from fragmented teams to end-to-end ownership enhances collaboration and accelerates delivery. DevOps has become standard practice in modern software development, and its adoption has been widespread, with many organizations considering it an essential part of their software development and delivery processes. Some of the principles of DevOps are:

  1. Automation

  2. Continuous Testing

  3. Continuous Monitoring

  4. Collaboration and Communication

  5. Version Control

  6. Feedback Loops

Platforms – Reducing Cognitive Load

This shift to DevOps, with teams owning the entire development lifecycle, introduces a new challenge: additional cognitive load. Cognitive load can be defined as

The total amount of mental effort a team uses to understand, operate and maintain their designated systems or tasks.

Skelton & Pais (2019) Team Topologies

The weight of this additional load, introduced by DevOps teams owning the entire software development lifecycle, can hinder productivity, prompting organizations to seek solutions.

Platforms emerged as a strategic solution, delicately abstracting unnecessary details of the development lifecycle. This abstraction allows engineers to focus on critical tasks, mitigating cognitive load and fostering a more streamlined workflow.

The purpose of a platform team is to enable stream-aligned teams to deliver work with substantial autonomy. The stream-aligned team maintains full ownership of building, running, and fixing their application in production. The platform team provides internal services to reduce the cognitive load that would be required from stream-aligned teams to develop these underlying services.

Skelton & Pais (2019) Team Topologies

Infrastructure Platform teams enable organisations to scale delivery by solving common product and non-functional requirements with resilient solutions. This allows other teams to focus on building their own things and releasing value for their users

Rowse & Shepherd (2022) Building Infrastructure Platforms

ML Ops – Reducing technical debt of machine learning

The ability of ML systems to rapidly accumulate technical debt has given rise to the concept of MLOps. MLOps is a methodology that takes inspiration from and incorporates the best practices of DevOps, tailoring them to address the distinctive challenges inherent in machine learning. It applies the established principles of DevOps to machine learning while recognizing that the actual ML code comprises merely a fraction of a real-world ML system, serving as a crucial bridge between development and the ongoing intricacies of maintaining ML systems. MLOps provides a collection of concepts and workflows designed to promote efficiency, collaboration, and sustainability across the ML lifecycle. Correctly applied, MLOps can play a pivotal role in controlling technical debt and ensuring the efficiency, reliability, and scalability of the machine learning lifecycle over time.

Scribd’s ML Platform – MLOps and Platforms in Action

At Scribd we have developed a machine learning platform which provides a curated developer experience for machine learning developers. This platform has been built with MLOps in mind which can be seen through its use of common DevOps principles.

  1. Automation:
    • Applying CI/CD strategies to model deployments through the use of Jenkins pipelines which deploy models from the Model Registry to AWS-based endpoints.
    • Automating model training through the use of Airflow DAGs and allowing these DAGs to trigger the deployment pipelines to deploy a model once re-training has occurred.
  2. Continuous Testing:
    • Applying continuous testing as part of a model deployment pipeline, removing the need for manual testing.
    • Increased tooling to support model validation testing.
  3. Monitoring:
    • Monitoring real time inference endpoints
    • Monitoring training DAGs
    • Monitoring batch jobs
  4. Collaboration and Communication:
    • Feature Store which provides feature discovery and re-use
    • Model Database which provides model collaboration
  5. Version Control:
    • Applying version control to experiments, machine learning models and features

References

Bottcher (2018, March 5). What I Talk About When I Talk About Platforms. https://martinfowler.com/articles/talk-about-platforms.html

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F., & Dennison, D. (2015). Hidden Technical Debt in Machine Learning Systems.

Fowler (2022, October 20). Conway’s Law. https://martinfowler.com/bliki/ConwaysLaw.html

Galante. What is Platform Engineering? https://platformengineering.org/blog/what-is-platform-engineering

Humanitec. State of Platform Engineering Report.

Hodgson (2023, July 19). How Platform Teams Get Stuff Done. https://martinfowler.com/articles/platform-teams-stuff-done.html

Murray (2017, April 27). The Art of Platform Thinking. https://www.thoughtworks.com/insights/blog/platforms/art-platform-thinking

Rouse (2017, March 20). Technical Debt. https://www.techopedia.com/definition/27913/technical-debt

Rowse & Shepherd (2022). Building Infrastructure Platforms. https://martinfowler.com/articles/building-infrastructure-platform.html

Skelton & Pais (2019) Team Topologies

]]>
Ben Shaw