To this end, we capitalized on Generative AI (GenAI) signals and our proprietary multilingual embeddings, in conjunction with classical machine learning methods, to develop our Content Trust Score. This metric reflects the severity of a document violating a specific trust pillar, enabling us to identify high-risk content and take appropriate actions. Ultimately, the score allows us to build a more robust and scalable moderation system, ensuring a safer and more reliable experience for all users while preserving the rich diversity of our UGC.
The data and methodologies presented here are for research purposes and do not represent Scribd’s overall moderation or policy implementation.
According to our internal Trust & Safety framework, we defined and prioritized our current efforts on four top-level concern pillars:
To maintain a clear project scope, we focused our research on these four semantic-heavy pillars where our embedding-based approach offers the greatest impact. The remaining violation types are out of scope and are addressed by other specialized detection algorithms.
We leveraged annotated data at Scribd, which includes human-assigned trust labels, to craft our core modeling dataset of roughly 100,000 documents. This dataset was split 90-to-10 into training and testing sets and distributed across the four trust pillars. The training set was used exclusively to derive the Content Trust Pillar embeddings, while the testing set provided the initial basis for comparison between content- and description-based scores. In addition to the four primary Trust Pillars, we also included documents not violating any trust & safety pillars. These clean documents serve as the “baseline” in our analyses. It is important to note that the data presented here is for discussion purposes and does not reflect the actual category distributions within the Scribd corpus.
| Trust Pillar | Training Dataset | Testing Dataset |
|---|---|---|
| Illegal | 1.49% | 1.56% |
| Explicit | 0.39% | 0.41% |
| PII/Privacy | 5.43% | 5.48% |
| Low Quality | 2.18% | 2.17% |
| Clean | 90.51% | 90.38% |
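A stratified 90-to-10 split like the one described above can be sketched as follows. This is a toy illustration with made-up labels and document IDs, not Scribd’s actual pipeline or data:

```python
import random
from collections import defaultdict

def stratified_split(docs, test_frac=0.10, seed=42):
    """Split (doc_id, pillar_label) pairs so each pillar keeps
    roughly the same share in both the train and test sets."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for doc_id, label in docs:
        by_label[label].append(doc_id)
    train, test = [], []
    for label, ids in by_label.items():
        rng.shuffle(ids)
        cut = int(len(ids) * test_frac)
        test += [(d, label) for d in ids[:cut]]
        train += [(d, label) for d in ids[cut:]]
    return train, test

# Toy corpus: 1,000 docs, 10% carrying a (made-up) violating pillar label
docs = [(i, "clean" if i % 10 else "illegal") for i in range(1000)]
train, test = stratified_split(docs)
```

Stratifying per label matters here because some pillars (e.g. Explicit at 0.39%) are rare enough that a naive random split could leave the test set with almost no examples.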
The core feature of our project is the 128-dimensional semantic embeddings for every document, which were generated using the LaBSE model, fine-tuned on our in-house dataset. Specifically, semantic embeddings are dense, numerical vector representations of text in a high-dimensional space. The goal of the embeddings is to map linguistic meaning into this vector space such that pieces of text with similar semantics are positioned mathematically closer together. Moreover, the degree of similarity between texts can be quantified by the distance between their respective vectors. For instance, in Figure 1, the words “circle” and “square” sit closer to each other than to words like “crocodiles” or “alligators”, since they are semantically more similar. This allows us to represent the text of each document as a vector of numbers and accurately quantify semantic relationships between documents.
The first step in generating the Trust Score was creating the representative vectors for each trust pillar. Using the semantic embeddings, we generated the Content Trust Pillar embeddings for each trust pillar by averaging the embeddings of all documents with that pillar’s label in the training dataset. The large size of the training dataset helps ensure the representativeness of these Pillar embeddings.
The content trust score for a Trust Pillar was then computed as the cosine similarity between the document’s embedding and the corresponding Trust Pillar’s embedding. Crucially, all scores are generated and evaluated exclusively using the testing dataset to strictly avoid data leakage and circularity in our analysis. Our hypothesis is that documents closely matching a specific trust pillar will yield a high similarity score against that Pillar’s embedding, while non-matching documents will yield a low score.
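The two steps above, averaging training embeddings into pillar centroids and scoring documents by cosine similarity, can be sketched as follows. The vectors here are toy 3-dimensional stand-ins for the real 128-dimensional LaBSE embeddings, and the labels are illustrative:

```python
import numpy as np

def pillar_centroids(embeddings, labels):
    """Average the training embeddings of each pillar to get its centroid."""
    return {p: embeddings[labels == p].mean(axis=0) for p in np.unique(labels)}

def trust_score(doc_vec, centroid):
    """Cosine similarity between a document vector and a pillar centroid."""
    return float(np.dot(doc_vec, centroid) /
                 (np.linalg.norm(doc_vec) * np.linalg.norm(centroid)))

train = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1],   # "explicit"-like docs
                  [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]])  # "clean"-like docs
labels = np.array(["explicit", "explicit", "clean", "clean"])
centroids = pillar_centroids(train, labels)

doc = np.array([0.95, 0.15, 0.05])  # a test doc resembling the explicit cluster
score = trust_score(doc, centroids["explicit"])
```

As hypothesized in the text, a document near a pillar’s cluster scores high against that pillar’s centroid and low against the others.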
This concept is visualized in Figure 2, where each “Pillar” represents a distinct trust pillar centroid. Individual documents are clustered around their respective pillar, illustrating that the closer a document’s embedding is to a specific Trust Pillar embedding, the higher its calculated similarity score, which confirms a stronger thematic match to that pillar.
While the content-based semantic embeddings are generally effective, they struggle in certain cases where the raw text is not fully informative. Specifically, these embeddings may fail when documents are extremely long, image-heavy, or contain meaningless repetitive text.
In these scenarios, a brief content summary can provide a superior document representation. For example, Figure 3 illustrates a document containing presentation slides where the raw text is minimal, yet the user-provided description is quite informative.
However, since users often do not provide adequate descriptions upon document upload, we rely on large language models (LLMs) to generate descriptive summaries based on the content. Figure 4 demonstrates this necessity, showing a document with lengthy and repetitive text where the LLM-generated descriptions (GenAI descriptions) summarize the core topic effectively.
Consequently, we generated a second set of document semantic embeddings and the corresponding Content Trust Pillar embeddings based on the LLM-generated descriptions. This dual approach allowed us to compute the content trust score using the alternative, enhanced representation.
For each trust pillar, we compared the distribution of the content trust scores derived from the document’s content to their GenAI-description-based counterparts, using the approximately 10,000-document testing dataset. To ensure a fair comparison, we included only documents for which both sets of scores were available. Our results reveal that content-based trust scores outperformed the scores generated from GenAI descriptions for all Trust Pillars (Figure 5a-c) except the Low Quality pillar (Figure 5d).
For the majority of Trust Pillars, the content-based scores demonstrated strong discrimination: they were higher for documents truly violating a given pillar (True Positives) than for documents violating other trust pillars or clean documents. Conversely, for these same pillars, the GenAI-description-based scores were indistinguishable from those of other documents, or showed significantly less separation compared to the content-based counterparts. This suggests that while content-based embeddings offer a superior representation for general trust identification, the descriptive embeddings provided little added value for these pillars.
This performance pattern is reversed for Low Quality documents. Specifically, the content-based scores for Low Quality documents were ineffective, proving to be indistinguishable from those violating other trust pillars or those labeled as clean. The GenAI-based approach, however, showed a distinct advantage: GenAI-description-based scores were significantly higher for Low Quality documents compared to all others. This result indicates that the descriptive summary is crucial for accurately identifying this specific type of document.
For completeness and to verify that our results were not skewed by the presence of other violating documents, we conducted a final comparative analysis by isolating the scores of labeled documents against only the clean, non-violating documents. As evident in Figure 6, the core patterns persist: the content-based scores consistently yield superior separation between violating content (blue) and clean content (green) for the Illegal, Explicit, and PII/Privacy pillars (Figure 6a-c). In sharp contrast, the GenAI-description-based scores for these same three pillars exhibit significantly greater distribution overlap. Conversely, for the Low Quality pillar (Figure 6d), the GenAI-description method again established a much clearer boundary from the clean documents than the content-based method, further validating our hybrid scoring approach.
Based on these differentiating findings, we adopted a hybrid scoring approach: we use the content-based trust scores for the Illegal, Explicit, and PII/Privacy pillars, and the GenAI-description-based trust scores for the Low Quality pillar. This decision enabled the computation of the most effective Content Trust Scores for all documents in our library across every trust pillar.
The content trust score reflects the extent to which a document violates a specific pillar – a high score indicates that the document closely resembles the defined trust violation type. To build a classification system that flags violations, we must determine an optimal score threshold.
In this work, we chose to prioritize precision to build a high-confidence classification system. Our goal is to maintain a very low mislabeling rate, specifically aiming for a false positive rate (FPR) close to 1%. This decision is driven by the need to minimize user friction – incorrectly flagging documents as violating trust pillars would be an undesirable user experience, making the avoidance of high FPR our primary concern.
The inherently low document count for certain violation types (e.g., Explicit) prevented us from performing reliable analyses to determine classification thresholds. To address this methodological challenge, we developed an expanded evaluation dataset. This was built by taking the original modeling data (both training and testing sets) and augmenting it with a high volume of additional human-annotated documents from our existing corpus. By incorporating this high-volume, high-quality labeled data, we established a more comprehensive baseline for threshold analysis. To ensure fair comparisons between the content-based and GenAI-description-based scores, we filtered the data to only include documents with both scores available. This refinement resulted in a final working total of approximately 109,000 documents in the evaluation dataset.
For each of the four in-scope trust pillars, we calculated classification metrics, specifically recall and false positive rate (FPR), across a range of thresholds (0.5 to 0.95). Consistent with the priorities above, we selected thresholds that keep the FPR close to 1%, minimizing the user friction associated with false flagging. The final score thresholds for the classification systems of the four Trust Pillars are summarized in Table 2.
| Trust Pillar | Score Threshold | Recall | False Positive Rate |
|---|---|---|---|
| Illegal | 0.80 | 71.83% | 0.79% |
| Explicit | 0.80 | 10.22% | 1.07% |
| PII/Privacy | 0.75 | 3.82% | 0.62% |
| Low Quality | 0.60 | 27.20% | 0.52% |
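The threshold selection described above can be sketched as a simple sweep over candidate thresholds, keeping the highest-recall threshold whose FPR stays at or below the ~1% target. The scores below are synthetic, not Scribd data:

```python
import numpy as np

def pick_threshold(scores, is_violation, fpr_target=0.01):
    """Sweep thresholds 0.50..0.95 and keep the highest-recall one
    whose false positive rate stays at or below fpr_target."""
    best = None
    for t in np.arange(0.50, 0.96, 0.05):
        flagged = scores >= t
        tp = np.sum(flagged & is_violation)
        fp = np.sum(flagged & ~is_violation)
        recall = tp / max(is_violation.sum(), 1)
        fpr = fp / max((~is_violation).sum(), 1)
        if fpr <= fpr_target and (best is None or recall > best[1]):
            best = (round(float(t), 2), float(recall), float(fpr))
    return best  # (threshold, recall, fpr) or None if no threshold qualifies

rng = np.random.default_rng(0)
viol = rng.uniform(0.6, 1.0, 200)      # violating docs tend to score high
clean = rng.uniform(0.0, 0.7, 10000)   # clean docs score low
scores = np.concatenate([viol, clean])
labels = np.concatenate([np.ones(200, bool), np.zeros(10000, bool)])
threshold, recall, fpr = pick_threshold(scores, labels)
```

The trade-off reported in Table 2 falls out of exactly this kind of sweep: tightening the FPR target pushes the threshold up and the recall down.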
The analysis revealed that the Illegal pillar achieved the optimal balance of metrics, securing a high recall of 72% while maintaining an excellent FPR of 0.79%. The Low Quality pillar, which relies on the GenAI-description-based scores, achieved a respectable recall of 27.2% with a very low FPR of 0.52%. This outcome validates our decision to utilize the descriptive embeddings for this challenging content type.
However, this high-performance scenario was not replicated across all Trust Pillars. Specifically, the strict FPR target limited the system’s ability to capture certain violations, with Explicit and PII/Privacy achieving only a recall of 10% and 4%, respectively. This disparity highlights the inherent challenges in identifying documents violating these two pillars, as their topical language is much broader and less defined compared to the other classes.
These results serve as an initial performance baseline. We are actively exploring internal refinements to our embedding representations and scoring logic, as well as integrating complementary models, to progressively enhance detection sensitivity. Our goal is to expand coverage across these more complex pillars while strictly upholding our commitment to a low false-positive environment.
Our work demonstrates a straightforward and flexible content moderation system by effectively leveraging classical machine learning principles (cosine similarity, thresholding) alongside modern Large Language Models (LLMs) for superior document representation. This hybrid approach offers several key operational and technical advantages:
The Content Trust Score and the underlying classification system created in this project open the door to various critical applications at Scribd:
This project was recently presented at TrustCon 2025. For those interested in a visual walkthrough of the dual-embedding approach, you can view the full presentation slides on Slideshare.
This work was a collaborative effort, and we are incredibly grateful to the following individuals and teams for their invaluable contributions:
Checking if files are damaged? $100K. Using newer S3 tools? Way too expensive. Normal solutions don’t work anymore. Tyler shares how, at this scale, you can’t just throw money at the problem; you have to engineer your way out.

You can also listen on Everand or watch via the Last Week in AWS YouTube channel.
Note: This article discusses safety technology at a high level. We intentionally omit sensitive operational details to protect the effectiveness of these defenses.
We needed to:
At a high level, we separate event-driven, highly parallel PhotoDNA hash generation from a high-throughput, GPU-based batch PhotoDNA matcher. The components in our design are AWS services, but equivalents from any other hyperscaler will suffice.
Key properties:

The diagram above shows the high-level architecture of our PhotoDNA CSAM Detection System. The system is designed to be cost-effective, scalable, and efficient.
We prioritize vetted, legally curated sources:
Operationally, we version, verify, and roll out hash updates.
In our final design and implementation, graphics processing units (GPUs) materially improved throughput and unit cost for PhotoDNA hashing when run as SageMaker Batch workloads. We containerized the PhotoDNA pipeline and executed it on GPU-backed instances to accelerate matching, enabling us to meet tight batch service-level objectives (SLOs) and backfill schedules with fewer nodes.
Although PhotoDNA isn’t “a model” we train, we run complementary ML components and rigorous observability:
We sized for steady‑state uploads and periodic backfills:
In practice, the unit economics are driven by: input volume, match rate (rare but higher cost per event), retention windows, and backfill cadence.
This was truly a cross‑functional effort. Thank you:
Since adopting S3 for our object storage in 2007, a lot has changed with the service, most notably Intelligent Tiering, which was introduced in 2018. At a very high level, Intelligent Tiering allows object access patterns to dictate the storage tier for a small per-object monitoring fee. Behind the scenes, S3 manages moving infrequently accessed objects into cheaper storage.
For most organizations simply adopting Intelligent Tiering is the right solution to save on S3 storage costs. For Scribd however the sheer number of objects in our buckets makes the problem much more complex.
> Cost management is an architecture problem.
>
> Mike Julian of duckbill
The small per-object monitoring fee adds up to some serious numbers: while monitoring 100 million objects costs $250/month, the monitoring fees for 100 billion objects come to $250,000/month. The gap between a million and a billion is easy to underestimate, and it meant Intelligent Tiering was not going to work for Scribd unless we found a way to reduce or remove hundreds of billions of objects!
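As a back-of-envelope check of those figures, assuming Intelligent Tiering’s published monitoring rate of $0.0025 per 1,000 objects per month (verify against current AWS pricing before relying on it):

```python
# Intelligent Tiering monitoring fee (assumed published rate)
FEE_PER_1000_OBJECTS = 0.0025  # USD per month

def monthly_monitoring_cost(object_count):
    """Monthly Intelligent Tiering monitoring cost for a bucket."""
    return object_count / 1000 * FEE_PER_1000_OBJECTS

cost_100m = monthly_monitoring_cost(100_000_000)       # 100 million objects
cost_100b = monthly_monitoring_cost(100_000_000_000)   # 100 billion objects
```

The linear per-object fee is exactly why collapsing object counts, rather than negotiating rates, is the lever that matters here.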
When users upload a document or presentation to Scribd or Slideshare, a lot of machinery kicks in to process the file and convert it into a multitude of smaller files, ensuring clear and correct rendering on a variety of devices. Further post-processing generates a multitude of textual and image-based metadata to help Scribd’s systems understand the document. As a result, one file upload might result in hundreds or sometimes thousands of different objects being produced in various storage locations.
Content Crush is the system we have built to bring all these objects back into a single stored object while preserving a virtualized keyspace and the discrete retrieval semantics for systems which rely on these smaller files.
Before Content Crush a single document upload could produce something like the following tree:
```
s3://bucket/guid/
    /info.json
    /metadata.json
    /imgs/
        /0.jpg
        /1.jpg
    /fmts/
        /original.pdf
        /v1.tar.bz2
        /v2.zip
    /other/
        /random.uuid
        /debian.iso
        /dbg.txt
```
After Content Crush these different objects are collapsed into a single Apache Parquet file in S3:
s3://bucket/guid.parquet
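Conceptually, crushing packs many small payloads into one object while preserving a byte-range index so each original key can still be fetched discretely. The sketch below uses a plain byte buffer and a dict index as stand-ins for the Parquet row groups and footer the real system uses; keys and contents are illustrative:

```python
import io

def crush(files):
    """files: dict of key -> bytes. Pack all payloads into one blob
    and record each key's (offset, length) in an index."""
    buf = io.BytesIO()
    index = {}
    for key, data in files.items():
        start = buf.tell()
        buf.write(data)
        index[key] = (start, len(data))
    return buf.getvalue(), index

def fetch(blob, index, key):
    """Discrete retrieval semantics: read back one 'file' by byte range,
    analogous to a ranged GetObject against the crushed object."""
    start, length = index[key]
    return blob[start:start + length]

files = {
    "guid/info.json": b'{"pages": 12}',
    "guid/imgs/0.jpg": b"\xff\xd8...",
}
blob, index = crush(files)
```

In the real system, Parquet’s footer plays the role of this index, which is what preserves the virtualized keyspace described above.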
We became intimately familiar with the Parquet file format from our work creating delta-rs. The format was designed in a way that really excels in object storage systems like AWS S3. For example:
- S3 supports `GetObject` with byte ranges for partial reads of an object; most importantly, it allows for negative-offset reads, which fetch the last N bytes of a file.
- The Parquet footer can therefore be located with a `GetObject(-8)` followed by a `GetObject(-footer_len)`.
- Readers can fetch individual row groups out of a file’s N row groups rather than requiring full object reads.

Without Apache Parquet, Content Crush fundamentally would not work. There is prior art for “compressing objects” into S3 with other formats, but for our purposes they all have downsides:
The original prototype implementation used S3 Object Lambda, which allowed for a seamless drop-in for existing S3 clients: applications could switch from one S3 Access Point to another without any indication that they were accessing “crushed” files. Since Object Lambda has ceased to be, Content Crush is being moved over to an S3 API-compatible service.
No optimization is ever free, and crushed assets have a couple of caveats that are important to consider:
- Reads may require multiple `GetObject` calls to retrieve the appropriate data. The worst case is three, since most Parquet readers will read the footer length, the footer, and then fetch the data they seek. We can typically optimize this by hinting at the footer size with a 95% estimate.
- A related downside of maintaining an S3 API-compatible service is that retrieving multiple files inside of single objects cannot be easily pipelined or streamed. There are a number of ways to solve for this that I am exploring, but they all converge on a different API scheme entirely to take advantage of HTTP2.
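The footer-first read pattern behind that worst case can be sketched with suffix (negative-offset) range reads, which is how a Parquet reader locates metadata over S3. The bytes below mimic a Parquet file tail (footer, 4-byte little-endian length, `PAR1` magic); this is a toy emulation, not a real Parquet parser:

```python
import struct

def suffix_read(obj, n):
    """Emulate an S3 GetObject suffix range read: the last n bytes."""
    return obj[-n:]

def read_footer(obj):
    # Read 1: the last 8 bytes hold the 4-byte little-endian footer
    # length followed by the "PAR1" magic.
    tail = suffix_read(obj, 8)
    footer_len = struct.unpack("<I", tail[:4])[0]
    assert tail[4:] == b"PAR1", "not a Parquet-style tail"
    # Read 2: a suffix read covering the footer plus the 8-byte tail.
    # Read 3 (not shown) would fetch the row-group data itself.
    return suffix_read(obj, footer_len + 8)[:footer_len]

# Toy object laid out like a Parquet tail: data, footer, length, magic
footer = b"{toy-footer-metadata}"
obj = b"row-group-data..." + footer + struct.pack("<I", len(footer)) + b"PAR1"
```

Hinting at the footer size, as described above, collapses reads 1 and 2 into a single over-fetching suffix read in the common case.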
The ability to effectively use S3 Intelligent Tiering is by far the largest benefit of this approach. With a dramatic reduction in object counts we can adopt S3 Intelligent Tiering for large buckets in a way that provides major cost improvements.
Fewer objects also makes tools like S3 Batch Operations viable for these massive buckets.
There are also hidden performance optimizations now available that were not possible before. For example, for heavily requested objects there are now AZ-local caching opportunities, whether at the API service layer or simply by pulling popular objects into S3 Express One Zone.
Much of this work is on-going and not completely open source. None of this would have been possible without the stellar work by the folks in the Apache Arrow Rust community building the high-performance parquet crate. After we set off on this path we learned of their similar work in Querying Parquet with Millisecond Latency.
There remains plenty of work to be done building the foundational storage and content systems at Scribd which power one of the world’s largest digital libraries. If you’re interested in learning more we have a lot of positions open right now!
Content Crush was originally shared at the August 2025 FinOps Meetup hosted by duckbill, with the slides from that event hosted on Slideshare below:
From the AWS documentation:
> IAM OIDC identity providers are entities in IAM that describe an external identity provider (IdP) service that supports the OpenID Connect (OIDC) standard, such as Google or Salesforce.
>
> You use an IAM OIDC identity provider when you want to establish trust between an OIDC-compatible IdP and your AWS account.
The following diagram from GitHub’s documentation gives an overview of how GitHub’s OIDC provider integrates with your workflows and cloud provider:

From within GitHub Actions we can specify the repository and role to assume via the aws-actions/configure-aws-credentials action, which will automatically configure the necessary credentials for AWS SDK operations inside the job.
Our newly open sourced terraform-oidc-module makes setting up the resources necessary to bridge the gap between AWS and GitHub much simpler.
Tying OIDC together between AWS and a single GitHub repository starts with the `aws_iam_openid_connect_provider` resource, but then developers must also configure resources and permissions for common deployment tasks. Redoing this work for every repository in the organization to ensure segmentation of permissions becomes very tedious without the terraform-oidc-module.
```hcl
module "oidc" {
  source = "git::https://github.com/scribd/terraform-oidc-module.git?ref=v1.0.0"

  name            = "example"
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["example0000example000example"]
  repo_ref        = ["repo:REPO_ORG/REPO_NAME:ref:refs/heads/main"]

  custom_policy_arns = [aws_iam_policy.example_policy0.arn, aws_iam_policy.example_policy1.arn]

  tags = {
    Terraform   = "true"
    Environment = "dev"
  }
}
```
I hope you find this useful for getting started with OIDC and GitHub Actions!
We built a hybrid backup architecture on AWS primitives:
The core idea: skip files that are already backed up, always copy the delta log, and route work by size, with small manifests running in Lambda and big ones in ECS.
1. **Database Discovery**: parse S3 Inventory manifests, identify database prefixes, and queue them for processing (up to 40 in parallel).
2. **Manifest Validation**: before we touch any data, we validate the manifests.
3. **Routing by Size**: small manifests are handled by Lambda, large ones by ECS.
4. **Incremental Backup Logic**: copy only new or changed files, always including the delta logs.
5. **Copying Files**
6. **Safe Deletion**: remove stale data via lifecycle-based cleanup rather than immediate deletion.
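The incremental selection rule, skip data files already present in the backup but always refresh the delta log, can be sketched as follows (keys are hypothetical):

```python
def files_to_copy(source_keys, backed_up_keys):
    """Return the source keys that an incremental backup run should copy:
    everything under _delta_log/ plus any data file not yet backed up."""
    backed_up = set(backed_up_keys)
    copy = []
    for key in source_keys:
        if "/_delta_log/" in key:       # always refresh the delta log
            copy.append(key)
        elif key not in backed_up:      # new or changed data file
            copy.append(key)
    return copy

source = ["db/table/_delta_log/000.json", "db/table/part-0.parquet",
          "db/table/part-1.parquet"]
backup = ["db/table/_delta_log/000.json", "db/table/part-0.parquet"]
```

Under this rule only the delta log and the one new Parquet file are copied, which is where the cost and runtime savings described below come from.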
By combining Lambda, ECS Fargate, and S3 Batch Ops, we’ve built a resilient backup system that scales from small to massive datasets. Instead of repeatedly copying the same files, the system now performs truly incremental backups — capturing only new or changed parquet files while always preserving delta logs. This not only minimizes costs but also dramatically reduces runtime.
Our safe deletion workflow ensures that stale data is removed without risk, using lifecycle-based cleanup rather than immediate deletion. Together, these design choices give us reliable backups, efficient scaling, and continuous optimization of storage. What used to be expensive, error-prone, and manual is now automated, predictable, and cost-effective.
By using off-the-shelf open source tools like kafka-delta-ingest, oxbow and Airbyte, Scribd has redefined its ingestion architecture to be more event-driven, reliable, and most importantly: cheaper. No jobs needed! Attendees will learn how to use third-party tools in concert with a Databricks and Unity Catalog environment to provide a highly efficient and available data platform.
This architecture will be presented in the context of AWS but can be adapted for Azure, Google Cloud Platform or even on-premise environments. The slides are also available on Scribd!
Managing event-driven architectures in AWS can be complex, requiring careful orchestration of multiple services. Terraform-oxbow abstracts much of this complexity, enabling users to configure key components with simple boolean flags and module parameters. This is an easy and efficient way to have Delta tables created using Apache Parquet files.
With terraform-oxbow, you can deploy:
This module follows a modular approach, allowing users to enable or disable services based on their specific use case. Here are a few examples:
To enable AWS Glue Catalog and Tables:

```hcl
enable_aws_glue_catalog_table = true
```

To enable the Kinesis Data Firehose delivery stream:

```hcl
enable_kinesis_firehose_delivery_stream = true
```

To enable S3 bucket notifications:

```hcl
enable_bucket_notification = true
```

To enable the advanced Oxbow Lambda setup for multi-table filtered optimization:

```hcl
enable_group_events = true
```
AWS S3 bucket notifications have limitations: due to AWS constraints, an S3 bucket can only have a single notification configuration. If you need to trigger multiple Lambda functions from the same S3 bucket, consider using event-driven solutions like SNS or SQS.
IAM Policy Management: The module provides the necessary permissions but follows the principle of least privilege. Ensure your IAM policies align with your security requirements.
Scalability and Optimization: The module allows fine-grained control over Lambda concurrency, event filtering, and data processing configurations to optimize costs and performance.
Join Kuntal Basu, Staff Infrastructure Engineer, and R. Tyler Croy, Principal Engineer at Scribd, Inc. as they take you behind the scenes of Scribd’s data ingestion setup. They’ll break down the architecture, explain the tools, and walk you through how they turned off-the-shelf solutions into a robust pipeline.
Machine learning development velocity refers to the speed and efficiency at which machine learning (ML) projects progress from initial concept to deployment in a production environment. It encompasses the entire lifecycle of an ML project: data collection and preprocessing; model training, evaluation, validation, deployment, and testing for new models; and the re-training, validation, and deployment of existing models.
The term “technical debt” in software engineering was coined by Ward Cunningham. Cunningham used the metaphor of financial debt to describe the trade-off between implementing a quick and dirty solution to meet immediate needs (similar to taking on financial debt for short-term gain) and taking the time to build a more sustainable and maintainable solution (akin to avoiding financial debt but requiring more upfront investment). Just as financial debt accumulates interest over time, technical debt can accumulate and make future development more difficult and expensive.
The idea behind technical debt is to highlight the consequences of prioritizing short-term gains over long-term maintainability and the need to address and pay off this “debt” through proper refactoring and improvements. The term has since become widely adopted in the software development community to describe the accrued cost of deferred work on a software project.
Originally a software engineering concept, technical debt is also relevant to machine learning systems; in fact, the landmark Google paper argues that ML systems have a propensity to accumulate technical debt easily.
> Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems.
>
> Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems
> As the machine learning (ML) community continues to accumulate years of experience with live systems, a widespread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.
>
> Sculley et al. (2015), Hidden Technical Debt in Machine Learning Systems
Technical debt is especially important to consider when trying to move fast. Moving fast is easy; moving fast without acquiring technical debt is a lot more complicated.
DevOps is a software development methodology which advocates for teams owning the entire software development lifecycle. This paradigm shift from fragmented teams to end-to-end ownership enhances collaboration and accelerates delivery. DevOps has become standard practice in modern software development, with many organizations considering it an essential part of their development and delivery processes. Some of the principles of DevOps are:
- Automation
- Continuous Testing
- Continuous Monitoring
- Collaboration and Communication
- Version Control
- Feedback Loops
This shift to DevOps, with teams owning the entire development lifecycle, introduces a new challenge: additional cognitive load. Cognitive load can be defined as
The total amount of mental effort a team uses to understand, operate and maintain their designated systems or tasks.
The weight of this additional load, introduced by teams owning the entire software development lifecycle, can hinder productivity, prompting organizations to seek solutions.
Platforms emerged as a strategic solution, abstracting away unnecessary details of the development lifecycle. This abstraction allows engineers to focus on critical tasks, mitigating cognitive load and fostering a more streamlined workflow.
The purpose of a platform team is to enable stream-aligned teams to deliver work with substantial autonomy. The stream-aligned team maintains full ownership of building, running, and fixing their application in production. The platform team provides internal services to reduce the cognitive load that would be required from stream-aligned teams to develop these underlying services.
Infrastructure Platform teams enable organisations to scale delivery by solving common product and non-functional requirements with resilient solutions. This allows other teams to focus on building their own things and releasing value for their users
The ability of ML systems to rapidly accumulate technical debt has given rise to the concept of MLOps. MLOps is a methodology that takes inspiration from and incorporates the best practices of DevOps, tailoring them to address the distinctive challenges inherent in machine learning. It applies the established principles of DevOps to machine learning while recognizing that the actual ML code is merely a fraction of a real-world ML system, and it serves as a crucial bridge between development and the ongoing intricacies of maintaining ML systems. MLOps provides a collection of concepts and workflows designed to promote efficiency, collaboration, and sustainability across the ML lifecycle. Correctly applied, MLOps can play a pivotal role in controlling technical debt and ensuring the efficiency, reliability, and scalability of the machine learning lifecycle over time.
At Scribd we have developed a machine learning platform which provides a curated developer experience for machine learning developers. This platform has been built with MLOps in mind which can be seen through its use of common DevOps principles.
Fowler, M. (2022, October 20). Conway’s Law. https://martinfowler.com/bliki/ConwaysLaw.html

Humanitec. State of Platform Engineering Report.

Rouse, M. (2017, March 20). Technical Debt. https://www.techopedia.com/definition/27913/technical-debt