PDF Association (https://pdfa.org), PDF’s technical community

EAA and Mass Documents: Why Accessible PDFs Become a System-Level Issue
https://pdfa.org/eaa-and-mass-documents-why-accessible-pdfs-become-a-system-level-issue/
Thu, 12 Mar 2026

The European Accessibility Act (EAA) is fundamentally changing how organizations must think about PDFs. While accessibility has often been treated as a final step in the document process, this approach quickly reaches its limits when dealing with mass documents. Invoices, contract documents, confirmations, and system-generated letters are produced automatically in large volumes every day and are deeply embedded in digital business processes. If accessibility is only addressed after documents have been generated, the result is manual rework, inconsistent quality, and uncertainty when demonstrating compliance with regulations such as the EAA.

For organizations that rely on automated document generation, accessibility therefore becomes a structural issue rather than an isolated task. Instead of correcting individual PDFs afterward, accessibility needs to be integrated directly into the document generation process. This is particularly relevant in environments such as SAP Commerce and other SAP-based systems, where document production is highly automated and closely linked to core business processes.

This article explains why accessible PDFs for mass documents must be treated as a system-level requirement, how the EAA changes expectations for system integrators, and why accessibility needs to be embedded directly into document architecture. It also highlights how template-based approaches enable organizations to generate accessible documents automatically, consistently, and at scale within existing IT system landscapes.


Why PDF/A-b Fails Machine Reading
https://pdfa.org/why-pdfa-b-fails-machine-reading/
Thu, 12 Mar 2026

There is a specific class of document failure that is uniquely treacherous because it is entirely invisible. You open the file. The text renders perfectly. A human reader sees exactly what was intended. But ask a machine to extract that text—a search indexer, a RAG pipeline, a language model—and what comes back is garbage. Or nothing at all.

This isn’t a bug in the extraction software. It is a direct consequence of what the PDF/A-b standard actually guarantees, and, crucially, what it deliberately ignores.

Glyphs Are Not Characters

To understand why this happens, we have to look at how PDF architecture handles text. It is less intuitive than most people assume. When a PDF renderer displays text on your screen, it isn’t drawing letters. It is drawing glyphs—the visual shapes associated with characters in a specific font program.

A standard PDF stores text as a sequence of these glyph identifiers. The renderer maps them to drawing instructions, and you see readable words. But structurally? The file is just holding a list of shape references.

To make that text searchable or extractable, you need a separate data structure: a ToUnicode CMap. This mapping table explicitly records which Unicode code point belongs to which glyph. Without it, extraction software is flying blind. It knows a shape was drawn on the page, but it has no structural idea what that shape actually means. This gap—between rendering a shape and encoding a character—is where PDF/A-b archives quietly fall apart.
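One way to see this distinction in a concrete file is to compare how many font objects carry a ToUnicode entry. The Python sketch below is a crude byte-level heuristic, not a real PDF parser: the function name is invented, the sample dictionaries are made up for illustration, and compressed object streams in real files would hide entries from a raw byte scan.

```python
import re

def tounicode_coverage(pdf_bytes: bytes) -> float:
    """Rough heuristic: fraction of font objects carrying a /ToUnicode CMap.

    Caveat: real PDFs often pack objects into compressed object streams,
    which a raw byte scan cannot see; a production check needs a parser.
    """
    fonts = len(re.findall(rb"/Type\s*/Font\b", pdf_bytes))
    cmaps = len(re.findall(rb"/ToUnicode\b", pdf_bytes))
    if fonts == 0:
        return 1.0  # no fonts, nothing to map
    return min(cmaps / fonts, 1.0)

# Invented sample font dictionaries, one mapped and one not:
mapped = b"<< /Type /Font /Subtype /TrueType /ToUnicode 12 0 R >>"
unmapped = b"<< /Type /Font /Subtype /TrueType >>"
print(tounicode_coverage(mapped))    # 1.0
print(tounicode_coverage(unmapped))  # 0.0
```

A coverage below 1.0 is a warning sign that some text in the file can only be recovered by guesswork.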

What PDF/A-b Actually Promises

PDF/A-b (where the “b” stands for Basic) is defined under ISO 19005 to guarantee one thing: reliable visual reproduction. Its sole commitment is that a document will look exactly the same decades from now as it does today, regardless of the operating system or software used to open it.

To achieve this, PDF/A-b mandates that all fonts be fully embedded within the file. No external references, no risky runtime font substitutions. The visual output is permanently locked in.

However, PDF/A-b does not require those embedded fonts to carry a ToUnicode CMap. The standard mandates visual fidelity, not machine readability. For its original, historical purpose—preserving a visual record—this was a perfectly coherent trade-off. But for modern enterprises trying to feed their archives into AI models, it is a massive structural blind spot.

Where Extraction Breaks

When text extraction libraries (like PyMuPDF, Apache PDFBox, or the parsers buried inside enterprise content management systems) attempt to read a PDF, they look for the ToUnicode CMap first. If it is missing, they fall back to heuristics: guessing via standard encoding tables, glyph name lookups, or internal font metrics.
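The difference between mapped extraction and heuristic fallback can be simulated in a few lines. Everything below is invented for illustration: the glyph IDs, the ToUnicode mapping, and the fallback table stand in for a real subset font, where the subsetter assigns glyph IDs in arbitrary order.

```python
# Hypothetical subset font: the authoring tool assigned glyph IDs in
# whatever order the subsetter encountered them on the page.
glyph_stream = [3, 1, 2]  # glyphs drawn on the page, in order

# With a ToUnicode CMap, the mapping is explicit and unambiguous.
to_unicode = {3: "P", 1: "D", 2: "F"}

# Without it, an extractor can only guess, e.g. by treating glyph IDs as
# indices into a standard encoding table -- simply wrong for a subset font.
fallback_table = {1: "A", 2: "B", 3: "C"}

def extract(stream, cmap):
    return "".join(cmap.get(g, "\ufffd") for g in stream)

print(extract(glyph_stream, to_unicode))      # "PDF" -- correct
print(extract(glyph_stream, fallback_table))  # "CAB" -- plausible-looking garbage
```

Note that the fallback output throws no error: it is valid text, just the wrong text.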

These fallbacks might survive a simple Latin text using standard system fonts. They fail completely against the reality of enterprise archives: documents with custom subset-embedded fonts, legacy files from older authoring tools, or scanned materials processed through basic OCR engines that never enforced Unicode normalisation.

The failure modes are notoriously messy. Sometimes the parser returns an empty string. Sometimes it spits out replacement characters. But the most dangerous failure mode is when it returns text that looks plausible but is fundamentally wrong—characters misidentified because a glyph index happened to align with a different code point in the fallback table. The downstream system accepts it as fact, because no error was ever thrown.

The RAG Pipeline Hallucination

This brings us to the modern AI stack. A Retrieval-Augmented Generation (RAG) system ingests documents by converting their text into vector embeddings. The quality of that embedding is entirely held hostage by the quality of the extracted text.

If you feed a PDF/A-b archive without Unicode mappings into a RAG pipeline, you get one of three outcomes. The document becomes functionally invisible (an empty embedding). It returns nonsensical results (a corrupted embedding). Or, worst of all, it generates a plausible-but-incorrect embedding built on silently misidentified characters.

The phrase “garbage in, garbage out” is too generous here. Garbage implies the system knows the data is bad. The failure mode here is confident wrongness—a language model retrieving and synthesizing corrupted text without a single warning flag that the source encoding was compromised.
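A partial mitigation is a quality gate between extraction and embedding. The sketch below is illustrative only, with made-up thresholds; as the comment notes, it catches empty and visibly corrupted output but not the worst case described above, where the text is plausible and wrong.

```python
def extraction_looks_sound(text: str, min_chars: int = 20,
                           max_replacement_ratio: float = 0.01) -> bool:
    """Reject extractions that are empty, near-empty, or riddled with
    U+FFFD replacement characters before they reach an embedding model.

    Limitation: this flags empty and visibly corrupted output, but NOT
    plausible text built from silently misidentified glyphs.
    """
    if len(text.strip()) < min_chars:
        return False
    bad = text.count("\ufffd")
    return bad / max(len(text), 1) <= max_replacement_ratio

print(extraction_looks_sound(""))             # False: empty extraction
print(extraction_looks_sound("\ufffd" * 50))  # False: visibly corrupted
print(extraction_looks_sound("The quarterly revenue figures were restated."))  # True
```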

The PDF/A-2u Mandate

ISO 19005 addressed this exact gap with the introduction of the “u” (Unicode) conformance level. PDF/A-2u enforces all the strict visual fidelity rules of Level b, but adds a non-negotiable condition: every font must include a ToUnicode CMap that correctly maps all glyphs to their Unicode equivalents.

Visually, a Level b and a Level u file are indistinguishable. Structurally, they exist in different eras. A PDF/A-2u file can be ingested by any standards-compliant extraction library without heuristics, without guessing, and without silent data corruption.

This is why specifying “PDF/A compliance” in a procurement document or data governance policy is essentially meaningless unless you specify the conformance level. If your archive is ever going to be queried by software—whether for legal eDiscovery, search indexing, or LLM ingestion—Level b is insufficient. The standard isn’t broken; we just stopped using it exclusively for human eyes.

OCR and PDF/A: The Foundation Your Enterprise AI Is Missing
https://pdfa.org/ocr-and-pdfa-the-foundation-your-enterprise-ai-is-missing/
Wed, 11 Mar 2026

Executive Summary

The Archival Deficit: Why Most Enterprise AI Initiatives Are Built on Blind Spots

The race to deploy AI across the enterprise has a quiet assumption embedded in it: that corporate data is machine-readable. For most organisations, it is not—and nobody budgeted for that.

Decades of fragmented, desktop-centric document management have left the typical enterprise with a repository that is part structured database, part digital landfill. Scanned contracts from 2007. Legacy regulatory filings saved as flat image PDFs. Acquisition records from a company absorbed fifteen years ago, migrated once, and never touched since. Individually, each of these documents represents institutional knowledge. Collectively, they are invisible to every AI ingestion engine currently being deployed to unlock that knowledge.

The failure mode is not obvious. A Retrieval-Augmented Generation (RAG) pipeline encountering a flat, image-based PDF does not return an error. It returns something—a hallucinated answer assembled from whatever corrupted text fragments the parser managed to extract before giving up. In a legal, compliance, or financial context, a confidently wrong answer is considerably more dangerous than no answer at all. This is not a prompt engineering problem. It is not a model quality problem. It is a document infrastructure problem, and it precedes every other AI investment on the roadmap.

What This Whitepaper Argues

The case made across the following sections is this: document processing has outgrown the desktop. What was once a reasonable default—installing a PDF client on every workstation and letting employees manage their own files—is now an active liability across three distinct enterprise risk dimensions, each of which demands the same architectural response:

  • AI readiness is a geometry problem, not a software problem: Effective OCR for enterprise AI ingestion is not about recognising characters; it is about preserving spatial relationships. A pipeline that skips layout analysis—missing the grid of a financial table or the logical flow of a multi-column brief—produces text that looks correct to a human but is structurally corrupted to a language model. Getting this right requires multi-stage cloud processing, not a desktop application running on a laptop between meetings.
  • Format obsolescence is a slow-motion compliance failure: A document stored in a standard PDF today may be unrenderable in fifteen years if it relies on external fonts or deprecated plugins. ISO 19005 (specifically the PDF/A-2u conformance level) prevents this by locking every font, colour profile, and character encoding inside the file. Enforcing this standard at scale requires centralised infrastructure; policy alone does not work.
  • The endpoint is the weakest link in document security: Beyond the active threat of PDF-based malware, the most pervasive risk is unmanaged data sprawl. Every time an employee downloads a sensitive document to process it locally, an unmanaged copy is created. Cloud-native document processing eliminates this by design, ensuring the endpoint remains a viewport rather than a storage vault.

The Path Forward

The remediation strategy is sequential, not simultaneous. First, stop new dark data from entering the repository by routing all document workflows through a centralised cloud pipeline immediately. Second, audit the existing archive computationally to separate structured documents from flat raster files. Third, process the backlog systematically through a high-fidelity OCR pipeline, outputting Unicode-mapped, geometrically structured, PDF/A-2u compliant assets at scale.
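The audit step, separating structured documents from flat raster files, can be sketched as a byte-level triage. This is a rough illustration with invented sample bytes: it looks for font resources versus image-only XObjects, and files using PDF 1.5+ compressed object streams would need a real parser rather than this kind of scan.

```python
def classify(pdf_bytes: bytes) -> str:
    """Crude triage: 'text' if font resources are present, 'raster' if the
    file only carries image XObjects, 'unknown' otherwise.

    Caveat: PDF 1.5+ can pack dictionaries into compressed object streams
    where a byte scan cannot see them; this only shows the triage shape.
    """
    has_fonts = b"/Font" in pdf_bytes
    has_images = b"/Subtype /Image" in pdf_bytes or b"/Subtype/Image" in pdf_bytes
    if has_fonts:
        return "text"
    if has_images:
        return "raster"
    return "unknown"

# Invented fragments: a flat scan page vs. a born-digital page.
scan = b"<< /Type /XObject /Subtype /Image /Width 2480 /Height 3508 >>"
born_digital = b"<< /Resources << /Font << /F1 7 0 R >> >> >>"
print(classify(scan))          # raster
print(classify(born_digital))  # text
```

Files landing in the "raster" bucket are the ones that need the OCR pipeline in step three.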

The result is not merely a better document management system. It is the foundation that every AI initiative in the organisation is currently missing—and quietly failing because of. The bottleneck in enterprise AI is not the model, the vector database, or the retrieval algorithm. It is the document format. This is the most tractable problem on the AI roadmap, with a known solution. The only remaining question is whether it gets treated with the urgency it deserves.


To explore how your organisation can transition to a secure, cloud-native document architecture, visit PDF Smart’s Enterprise Solutions.

Section 1: The Ingestion Problem Nobody Budgeted For

The pitch sounds straightforward: give employees a chat interface, connect it to your document repository, and let an AI answer questions from proprietary data. Every major cloud vendor is selling some version of this. The reality in production is messier.

Retrieval-Augmented Generation (RAG) systems depend entirely on the quality of what gets ingested. An LLM cannot retrieve what the pipeline never correctly processed. And the first thing that breaks a RAG pipeline—not the vector database, not the embedding model, not the prompt engineering—is the document format. When an AI ingestion engine hits a flat, image-based PDF, it reads pixels, not text. The result is a null value, a skipped document, or worse: a confident hallucination built on corrupted text fragments that a failing parser managed to extract.

This is not a niche edge case. An enterprise migrating fifteen years of archived contracts, purchase orders, and compliance filings to a RAG-powered knowledge base will find that the majority of those documents are opaque to the AI entirely—unless an Optical Character Recognition (OCR) layer has already converted them into machine-readable, structured text. The industry has spent considerable attention on tuning LLMs; comparatively little has gone into auditing what the data pipeline actually ingests.

The Compliance Cost of a Dark Archive

The AI readiness problem is real, but it may not be the most pressing reason to remediate unstructured archives. That distinction belongs to regulatory exposure.

GDPR Article 15 and California’s CCPA both grant individuals the right to receive copies of their personal data on demand—typically within 30 days. eDiscovery requests in litigation operate under similarly unforgiving timelines. Neither compliance scenario is served by a legal team manually sifting through scanned TIFF images or flat PDFs that resist keyword search. The Association for Intelligent Information Management (AIIM) estimates that employees in document-intensive roles already spend 30–40% of their working time simply searching for information. When that information is locked inside unreadable raster images, that number does not improve—it compounds.

The operational math is stark. Audit costs scale with repository size. If a single compliance officer needs three minutes to manually review a flat document that a database query could surface in three seconds, multiply that across two million archived files. The bottleneck is not a people problem. It is a document format problem.
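Using the figures above, the arithmetic can be made explicit:

```python
files = 2_000_000
manual = files * 180    # 3 minutes of manual review per file, in seconds
queried = files * 3     # 3 seconds per file via database query, in seconds

print(manual / 3600)    # 100,000 person-hours of manual review
print(queried / 3600)   # roughly 1,667 hours of query time
```

The sixty-fold gap is what "audit costs scale with repository size" means in practice.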

The Scale of the Archival Deficit

The volume of unstructured data sitting dormant in enterprise repositories is difficult to overstate. Gartner estimates that 80% of all enterprise data today exists in unstructured formats, and between 40% and 90% of that—depending on industry—qualifies as “dark data”: stored, retained, but entirely unused. BCG research puts the figure at roughly 50% of everything companies archive.

Three operational metrics from PDF Smart’s document processing telemetry illustrate the scope of this problem in concrete terms:

  • The Raster Majority: Across enterprise archive migration projects processed through PDF Smart’s cloud infrastructure, 67% of all uploaded documents are identified on ingestion as flat, image-based files—raster PDFs, TIFFs, or scanned JPEG composites—that require OCR before any downstream processing can occur.
  • Retrieval Velocity: Once flat files are processed through PDF Smart’s automated OCR pipeline and indexed as searchable, structured PDFs, organizations report an average 57% reduction in document retrieval time during compliance audits and eDiscovery exercises.
  • Storage Optimization: Converting high-resolution flat scans into compressed, text-layer PDFs reduces cloud storage overhead by an average of 41%, as structured PDF compression substantially outperforms the bloated raster formats typical of legacy scanning workflows.

Section 2: Decoding the Standards — PDF/A and the Archival Imperative

The Problem OCR Alone Cannot Solve

Getting text out of a scanned image is a solved problem. What happens to that text afterwards is not.

Once an enterprise has converted its dark archive into machine-readable documents, it faces a quieter, slower threat: format obsolescence. Digital files do not degrade the way paper does—they do not yellow or fade. They break all at once, usually at the worst possible moment, when the software that created them no longer exists and the fonts they reference have not shipped with an operating system in a decade.

The legal and operational implications of this are serious. Retention mandates in financial services, healthcare, and public sector procurement routinely demand that documents remain accessible and reproducible for 20, 30, even 99 years. A contract saved as a standard PDF in 2004—relying on an embedded ActiveX plugin, a system font pulled from a Windows XP installation, or a colour profile linked to an external ICC registry—may render as a blank page or a cascade of substitution characters today. The organisation stored it. The organisation retained it. The organisation simply cannot read it.

This is the archival imperative: structured and searchable is not enough. Documents must also be self-contained.

What PDF/A Actually Is (and Isn’t)

PDF/A—formally ISO 19005—is not simply a “safer” version of PDF. It is a deliberately constrained subset of the format, designed around a single governing principle: everything required to render the document must live inside the file itself, permanently.

That constraint has teeth. PDF/A prohibits JavaScript, embedded audio and video, real-time data connections, and encryption. More consequentially, it mandates that all fonts, colour profiles, and image assets are fully embedded rather than externally referenced. Open a PDF/A-compliant document on a machine with no internet connection, on an operating system that did not exist when the file was created, and it must render identically to how it looked the day it was signed. That guarantee is not advisory—it is the standard.

What PDF/A is not is a magic wand. Converting a poorly structured document to PDF/A compliance does not repair its content, correct OCR errors from an earlier processing step, or retroactively add semantic structure. The format preserves what is there. Which makes the quality of what gets put in—particularly the OCR output—the critical upstream variable.

Conformance Levels: The Detail That Derails Most Implementations

Specifying “PDF/A compliance” in a procurement requirement or data governance policy is nearly meaningless without specifying which conformance level. The standard has three, and for AI readiness purposes, the differences are not minor.

| Conformance Level | What It Guarantees | AI & Enterprise Readiness |
|---|---|---|
| Level b (Basic) | Accurate visual reproduction; the document looks right. | Low AI utility: pixels render correctly, but text extraction is not guaranteed; RAG pipelines may still fail. |
| Level u (Unicode) | All Level b requirements, plus every character maps to a standard Unicode value. | High AI utility: text is reliably extractable, searchable, and ingestible by LLMs and vector embedding pipelines. |
| Level a (Accessible) | All Level u requirements, plus full structural tagging: reading order, headers, table boundaries. | Maximum utility: optimal for accessibility mandates, structured data mining, and the most demanding RAG ingestion workflows. |

The gap between Level b and Level u is where most enterprise AI projects quietly fail. A document can be visually perfect—legible to a human auditor, archivally preserved—while remaining functionally invisible to a language model because its character encoding was never normalised to Unicode. IT teams that deploy RAG systems against a PDF/A-b archive and then wonder why retrieval quality is poor are, more often than not, encountering exactly this gap.

For any organisation building an AI-ready data lake, PDF/A-2u is the practical minimum. Level a compliance is worth pursuing where accessibility regulation applies—the UK’s Public Sector Bodies Accessibility Regulations, for instance, or the European Accessibility Act—and where document structure (tables, hierarchical headers, reading order) is material to downstream data extraction.

Why Cloud Enforcement Is the Only Enforcement That Works

The technical case for PDF/A is straightforward. The operational case for how to enforce it at scale is where most enterprise implementations stall.

Asking employees to manually validate export settings before saving a file is not a compliance strategy. It is a wishlist. In practice, a financial analyst under deadline pressure, a paralegal processing a hundred documents before a filing date, or a procurement officer merging a vendor portfolio on a Friday afternoon is not reconfiguring PDF output settings. They are saving the file and moving on. The result is an archive that is mostly compliant, with exceptions distributed unpredictably across millions of documents—which is a compliance exposure, not a compliance programme.

The only reliable enforcement mechanism is to remove the decision from the user entirely. Cloud-native document processing infrastructure—routing conversion, compression, merging, and OCR through a centralised pipeline like PDF Smart rather than fragmented desktop applications—means that PDF/A-2u conformance becomes an output condition, not a user responsibility. The document enters the workflow in whatever format it was created. It exits as a validated, Unicode-mapped, self-contained archival asset. The end user experiences none of the friction. The data lake accumulates nothing but compliant, AI-ready documents.

This is the operational logic behind treating document standardisation as infrastructure rather than policy. Policies get ignored. Infrastructure does not give users the option.

Section 3: The Mechanics of Enterprise OCR — Beyond Basic Extraction

Getting Words Off a Page Is the Easy Part

Character recognition has been a solved problem since the early 1990s. The hard part—the part that determines whether an OCR output is usable by a downstream AI system or quietly corrupted—is spatial reasoning.

A pixel knows nothing about the pixel next to it. A naive OCR engine processing a scanned regulatory filing sees shapes that resemble letters and converts them into a text string, reading left to right, top to bottom, in a single pass. That is fine for a one-column memo. It is disastrous for a two-column legal brief, where the engine fuses the left and right columns into interleaved nonsense. It is worse for a financial table, where the engine strips away the grid and emits a flat list of orphaned figures with no relationship to their row or column headers.

The document looks fine to a human. To a RAG pipeline parsing the underlying text layer, it is garbage in. And garbage in, as any ML practitioner will tell you, does not produce useful answers—it produces confident wrong ones. This is why enterprise OCR is not primarily a character recognition problem. It is a geometry problem.

The Pipeline Most Implementations Skip Half Of

Producing a structured, AI-ready PDF from a flat raster scan requires several sequential processing stages. The output quality of each stage constrains the ceiling of every stage that follows.

  • Pre-processing and normalisation: Before a character is identified, the image must be remediated—algorithmically deskewed to correct crooked scanner beds, binarised to sharpen contrast, and cleaned of scan artifacts: line noise, bleed-through from the reverse side of thin paper, coffee-ring shadows. Skipping this step does not just reduce accuracy; it introduces systematic errors that no downstream correction can reliably fix.
  • Zoning and bounding box detection: The engine maps the geometry of the page, drawing boundaries around distinct content regions—body copy, headers, footnotes, captions, margin annotations—and flagging graphic elements that should not be parsed as text at all. This is where multi-column layouts are correctly identified, rather than read straight across.
  • Reading order determination: Establishing spatial boundaries is not the same as knowing the sequence those regions should be read in. A sidebar on the right half of a page may be visually adjacent to body copy but logically separate from it. Reading order heuristics determine the correct traversal path, so that the extracted text string reflects the document’s intended narrative flow rather than its physical geography.
  • Table and structure extraction: The most computationally demanding phase, and the one most often poorly implemented. A table is not just a grid of numbers—it is a set of relationships. Every data point has a row header and a column header, and those relationships must survive extraction intact if the data is to be queryable. A well-structured OCR engine translates this into tagged XML or PDF/A-a structural data; a mediocre one flattens it into a list and discards the relationships entirely.
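A minimal sketch of why the reading-order stage matters: the zones, coordinates, and single-split column heuristic below are invented for illustration, while real layout engines use much richer models than one fixed column boundary.

```python
# Each zone: (label, x, y), with x growing rightward and y growing downward.
zones = [
    ("left-1",  50, 100),
    ("right-1", 320, 100),
    ("left-2",  50, 400),
    ("right-2", 320, 400),
]

def naive_order(zones):
    """Single pass, top-to-bottom then left-to-right: interleaves columns."""
    return [z[0] for z in sorted(zones, key=lambda z: (z[2], z[1]))]

def column_aware_order(zones, column_split=300):
    """Two-column heuristic: read the left column fully, then the right."""
    left = sorted((z for z in zones if z[1] < column_split), key=lambda z: z[2])
    right = sorted((z for z in zones if z[1] >= column_split), key=lambda z: z[2])
    return [z[0] for z in left + right]

print(naive_order(zones))         # ['left-1', 'right-1', 'left-2', 'right-2'] -- interleaved
print(column_aware_order(zones))  # ['left-1', 'left-2', 'right-1', 'right-2'] -- narrative flow
```

The naive pass produces exactly the "interleaved nonsense" described above for a two-column brief.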

Why Desktop Software Cannot Solve a Big Data Problem

A 400-page scanned contract processed through a local desktop PDF application will consume the available CPU of a standard corporate laptop for several minutes, lock the interface, and drain the battery noticeably. That is one document. Scale that to a 50,000-document discovery payload that legal needs processed before a Monday morning court filing, and the architectural failure of desktop-centric processing becomes immediately apparent.

Local processing is sequential by default. A laptop processes file one, then file two, then file three. There is no parallelisation. There is no elastic scaling. There is no way to throw more compute at a batch job because the compute is physically fixed to the endpoint sitting on someone’s desk. Enterprises that have attempted large-scale archive remediation using distributed desktop tooling—routing documents to employee machines via shared drives or scheduled tasks—consistently find the same outcome: inconsistent output quality, unpredictable processing times, and IT support queues full of “the PDF software froze again” tickets.

Enterprise OCR at archive scale is a data infrastructure problem. It requires treating document processing the way organisations already treat data transformation pipelines: as a workload that belongs in the cloud, not on an endpoint.

What Cloud-Native Ingestion Actually Changes

Shifting OCR execution from the local endpoint to a centralised cloud architecture replaces fixed hardware ceilings with dynamic, parallel compute. A cloud pipeline does not serialise a batch of ten thousand documents. It distributes them across concurrent processing threads, with infrastructure scaling automatically to match the volume of the workload. The time required to process a single document and the time required to process a hundred thousand documents are no longer on the same curve.
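The contrast between sequential endpoint processing and parallel processing can be sketched with Python's standard library. The per-document work here is simulated with a short sleep, and a real pipeline distributes across machines rather than threads on one host; the point is only the shape of the two curves.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def process_document(doc_id: int) -> str:
    """Stand-in for OCR plus layout analysis on one document."""
    time.sleep(0.01)  # simulated work
    return f"doc-{doc_id}: processed"

docs = range(100)

start = time.perf_counter()
sequential = [process_document(d) for d in docs]  # the desktop model
seq_time = time.perf_counter() - start

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=25) as pool:  # the cloud model, in miniature
    parallel = list(pool.map(process_document, docs))
par_time = time.perf_counter() - start

assert sequential == parallel  # same output, very different wall-clock time
print(f"sequential: {seq_time:.2f}s, parallel: {par_time:.2f}s")
```

Adding workers moves the batch time toward the per-document time, which is exactly what a fixed endpoint cannot do.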

PDF Smart’s cloud processing telemetry, aggregated across enterprise archive migration projects, yields the following benchmarks:

  • Batch Processing Speed: Utilising parallel cloud compute, enterprise clients reduce bulk archive conversion time by an average of 81% compared to benchmarked desktop software processing the same document set sequentially—a workload that takes a distributed desktop deployment four days completes in under eighteen hours on equivalent cloud infrastructure.
  • Language and Semantic Accuracy: PDF Smart’s cloud OCR engine supports dynamic recognition across 183 languages, including right-to-left scripts and legacy character sets common in pre-2000 archived materials, achieving an average character accuracy rate of 97.3% even on degraded source scans below 150 DPI.
  • Geometric Fidelity: By enforcing multi-stage layout analysis across every document processed—not as an optional enhancement but as a mandatory pipeline step—93% of output documents retain their exact original formatting structure, with tables, multi-column layouts, and reading order correctly mapped for downstream vector embedding and RAG ingestion.

Section 4: The End of the Endpoint — Why Local PDF Processing Is a Security Liability

The Threat Vector Nobody Talks About in AI Procurement

Enterprise security teams spend considerable energy debating which LLM vendor to trust with proprietary data. They spend comparatively little time examining the application that opens the documents before they get anywhere near an LLM. That asymmetry is a problem.

Desktop PDF software is one of the most persistently exploited attack surfaces in enterprise IT, and has been for the better part of two decades. The reason is architectural: PDFs are not digital paper. They are execution environments. The format has historically supported embedded JavaScript, external process launching, and dynamic media rendering—capabilities that exist for legitimate purposes and are routinely weaponised for illegitimate ones. When a user opens a maliciously crafted PDF on a corporate workstation, the payload does not execute in isolation. It executes inside the corporate network perimeter, with access to whatever that endpoint can reach. From there, lateral movement to file servers, credential stores, and connected systems is well-documented and well-practised by threat actors.

Organisations respond to this by layering endpoint detection and response (EDR) tools, application whitelisting, and patch management programmes on top of the same fundamentally vulnerable architecture. These are damage-limitation measures. The more direct solution is to stop processing sensitive documents on the endpoint entirely.

The Cloud as a Sanitisation Layer

Moving document processing—OCR, compression, format conversion, redaction—to a cloud-native pipeline does something that no amount of endpoint hardening can replicate: it removes the endpoint from the execution path.

In a properly architected cloud document pipeline, uploaded files are ingested and processed inside ephemeral, isolated server containers. If a malicious file enters the system, any embedded exploit attempts to execute against a hardened, short-lived Linux environment with no network adjacency to corporate infrastructure. It finds nothing useful and is destroyed along with the container the moment processing completes. The output routed back to the enterprise is a sanitised, structurally flat PDF/A file. The original payload never reaches a user’s machine.

This is not a theoretical security posture. It is a direct application of zero-trust principles to document workflows: assume the file is hostile, process it somewhere the damage is contained, and only return the verified output. The endpoint becomes a viewport into the document, not the environment in which the document executes.

Data Sprawl: The Governance Problem That Compounds Quietly

The malware vector is visible and dramatic. The data sprawl problem is quieter, more pervasive, and in many regulatory environments, more immediately costly.

Every time an employee downloads a sensitive document to run OCR, apply a redaction, or add a signature using desktop software, they create an unmanaged copy of that document. It lands in a Downloads folder. It gets silently swept into a personal iCloud or Google Drive backup. It persists on a hard drive that will eventually be lost, stolen, repurposed, or improperly decommissioned. Multiply that behaviour across a thousand employees processing documents daily and the result is not a data governance programme—it is an uncontrolled proliferation of sensitive corporate data across devices the IT department cannot inventory, monitor, or wipe.

Cloud-native processing eliminates this by design rather than by policy. Documents are transmitted over TLS connections using AES-256 encryption, processed entirely within secure server memory, and returned directly to the enterprise data lake or designated storage repository. Nothing rests on the endpoint. The distinction matters because policies get ignored and people get busy, while architecture simply does not give users the option to create the problem in the first place.

Architectural Security: Local vs. Cloud-Native

For IT and security teams conducting zero-trust compliance audits of document infrastructure, the differences between the two models are not marginal.

  • Threat Isolation. Legacy desktop PDF software: Poor — malicious files execute directly on the user’s OS, within the corporate network perimeter. Cloud-native processing (PDF Smart): Strong — execution is contained within ephemeral sandboxed containers; the endpoint never touches the payload.
  • Data Sprawl. Legacy desktop PDF software: High risk — processing requires local file downloads, creating unmanaged sensitive data copies on employee devices. Cloud-native processing (PDF Smart): Eliminated — documents are processed in memory and returned to secure storage; no local retention occurs.
  • Auditability. Legacy desktop PDF software: Fragmented — IT has no visibility into local file edits, conversions, or copies until a file is re-uploaded to a managed system. Cloud-native processing (PDF Smart): Comprehensive — every operation is logged in a centralised, immutable audit trail, queryable for compliance review.
  • Patch Surface. Legacy desktop PDF software: Large and persistent — each installed desktop client is a versioned application requiring ongoing patch management and vulnerability monitoring. Cloud-native processing (PDF Smart): Minimal — updates are deployed centrally at the infrastructure level; no client-side patch cycle required.
  • Standards Enforcement. Legacy desktop PDF software: User-dependent — output quality and compliance settings rely on individual configuration choices. Cloud-native processing (PDF Smart): Automated — PDF/A conformance, encryption standards, and access controls are enforced at the infrastructure level on every file.

The security case for cloud-native document processing is not contingent on AI readiness or archival compliance, though both benefit from it. It stands on its own: the endpoint is the most consistently breached layer of enterprise IT, and document processing is one of the most active threat vectors within it. Removing that workload from the endpoint is not an upgrade to existing security architecture. It is a fundamental change in where risk lives.

Section 5: The Infrastructure Imperative — Standardising with PDF Smart

Three Problems, One Root Cause

Enterprise IT has a habit of solving the same problem three times over because three different teams own three different symptoms.

The AI team is troubleshooting why RAG retrieval quality is poor and blaming the embedding model. The compliance team is managing an audit backlog because keyword search fails on half the archive. The security team is patching desktop PDF vulnerabilities and chasing down unmanaged file copies on employee laptops. All three are writing separate budget requests, attending separate vendor meetings, and reaching separate conclusions.

They are looking at the same failure from different angles. The root cause—in each case—is that document processing was never treated as infrastructure. It was treated as a desktop utility, distributed across thousands of endpoints, governed by individual user behaviour, and optimised for nothing in particular. You cannot build a reliable AI data lake on an architecture that lets employees individually configure their PDF export settings any more than you can enforce data retention policy by asking people to name their files consistently. The policy exists. The architecture does not support it.

What Centralised Processing Actually Changes

PDF Smart’s architecture resolves this not by adding another governance layer on top of the existing model, but by replacing the model. Document processing moves off the endpoint entirely and into a unified cloud pipeline. What changes downstream from that single architectural shift is significant.

  • Zero-footprint execution: Documents are transmitted over TLS connections using AES-256 encryption to ephemeral, sandboxed processing containers. The endpoint never holds the payload. Whether a user is running a multi-pass OCR batch on a thousand archived contracts or compressing a financial portfolio before a board meeting, the compute happens remotely and the local machine remains uninvolved. There is no installed client to patch, no local copy to govern, no execution surface to harden.
  • Automated standardisation: The platform functions as a normalisation layer regardless of what enters it. A user can upload a folder of flat TIFFs from a 2003 filing cabinet scan, a set of Word documents exported without PDF/A settings, and a batch of rasterised JPEG composites from a mobile scanning app. What comes back is a set of Unicode-mapped, geometrically structured, PDF/A-2u compliant documents—ready for vector embedding and RAG ingestion without any manual remediation step.
  • Centralised auditability: Because every document action—conversion, OCR extraction, table mapping, eSignature—passes through the same pipeline, the audit trail is not assembled after the fact from fragmented local logs. It is generated automatically, stored centrally, and queryable on demand. For compliance teams responding to a Subject Access Request or a litigation hold, that difference is not a convenience. It is the difference between a manageable process and an emergency.

A Sequenced Remediation Strategy

Transitioning a legacy document environment to a structured, AI-ready data lake does not require a multi-year programme freeze or a wholesale infrastructure replacement. It requires sequencing the work correctly.

The first priority is stopping new dark data from entering the repository. Route all new document creation, conversion, merging, and signing workflows through a cloud-native pipeline immediately. Every document processed from this point forward exits as a compliant, searchable, AI-ready asset. The backlog stops growing.

The second step is auditing the existing archive—not manually, but computationally. Separate the documents that already carry a structured text layer from those that are flat raster images. This categorisation is the basis for prioritising remediation: high-value, frequently accessed documents first, then systematically outward.
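As a rough illustration of what a computational audit might look like, the sketch below classifies PDFs by scanning their raw bytes for font versus image markers. The function name `classify_pdf` and the heuristic itself are assumptions for illustration only; real audits should parse files properly, since compressed object streams can hide both markers from a byte scan.

```python
def classify_pdf(data: bytes) -> str:
    """Crude triage heuristic: a PDF that references fonts almost
    certainly carries a text layer; one that only embeds images is
    likely a flat scan. Compressed object streams can hide both
    markers, so treat this as a first-pass sort, not a verdict."""
    has_fonts = b"/Font" in data
    has_images = b"/Image" in data or b"/DCTDecode" in data
    if has_fonts:
        return "text-layer"
    if has_images:
        return "flat-raster"
    return "unknown"
```

Running this across a repository yields the two piles the audit step calls for: documents that already carry a text layer, and flat raster scans queued for OCR remediation.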

The third step is processing the flat archive through a high-fidelity OCR pipeline to establish Unicode mapping, correct reading order, geometric bounding boxes, and PDF/A-2u conformance. This is not a one-time project. It is a pipeline that runs until the backlog is cleared, then retires.

Done in sequence, the result is an archive that is fully searchable, legally defensible, secure by design, and ready for AI ingestion—without a single employee having to change how they work.

The Real Cost of Deferring Document Infrastructure

Billions are currently being invested in LLMs, vector databases, and RAG pipelines. Most of those investments are predicated on the assumption that the underlying documents are readable. A significant proportion are not.

The irony is that the cheapest part of an enterprise AI stack—the document format—is the one most likely to determine whether the expensive parts deliver any value at all. A RAG system built on a dark archive does not return poor answers. It returns confident answers derived from whatever fragments the parser managed to extract, with no indication of what it missed. That is not a minor inefficiency. In a legal, financial, or compliance context, it is a liability.

Document infrastructure is not a prerequisite task to complete before the real AI work begins. It is the real AI work. Treating it as such—as a critical enterprise workload rather than a desktop utility—is the precondition for everything else in this document to function as described.

To explore how your organisation can transition to a secure, cloud-native document architecture, visit PDF Smart’s Enterprise Solutions.

]]>
Use Case: Efficient Accessibility for Public Authorities in the State of Berlin https://pdfa.org/use-case-efficient-accessibility-for-public-authorities-in-the-state-of-berlin/?utm_source=rss&utm_medium=rss&utm_campaign=use-case-efficient-accessibility-for-public-authorities-in-the-state-of-berlin Fri, 06 Mar 2026 17:05:11 +0000 https://pdfa.org/?p=217686 Accessibility in the State of Berlin

At the State Office for Health and Social Affairs (Lageso), many documents are created every day that are used both internally and externally. However, the standard functions in Microsoft Word are often not sufficient to meet legal accessibility requirements. Through a state license of the State of Berlin, Lageso gained access to axesWord, a solution for creating accessible documents efficiently.

axesWord in Broad Use

With axesWord, Lageso introduced a solution that enables accessibility directly within the work process. Today, specialist departments work independently without needing support from accessibility officers. Sources of error and duplicate work steps were significantly reduced, while the automated checking function increased document quality. Through widespread use, accessibility has become firmly embedded in everyday workflows and noticeably relieves the central contact person.

Detlef Köppel, State Office for Health and Social Affairs:

“If only a small team of experts uses axesWord, a bottleneck arises. Through broad adoption, many employees can create accessible documents directly within their work processes and relieve the accessibility officers.”

Structured Processes and Reliable Results

Today, processes at Lageso are clearer and faster. Documents are structured before publication, many errors are detected automatically, and specialist departments create accessible content independently. Especially for documents that need to be created quickly, such as reminders, notices, or published PDFs, preparation with axesWord saves a great deal of effort, as a wide range of document types is covered. Templates for notices also use axesWord to automatically convert document properties such as tables, image placeholders, or headers and footers into accessible formats. The result: noticeably more efficient workflows, consistently positive feedback, and the reassuring sense that accessibility is considered from the very beginning.

]]>
Accessible math in PDF – finally! https://pdfa.org/accessible-math-in-pdf-finally/?utm_source=rss&utm_medium=rss&utm_campaign=accessible-math-in-pdf-finally Wed, 04 Mar 2026 03:15:49 +0000 https://pdfa.org/?p=217459 On the web, math is made accessible to users with disabilities via MathML technology. Until very recently, math in PDF wasn’t accessible to those users, who had to make do with unstructured, alternative text descriptions of mathematical formulae that cannot be represented using braille math codes.

Today, math in PDF 2.0 is finally fully accessible so that math can be navigated by capable AT providing speech and/or braille.

Printed pages have included formulas for centuries. In the digital age, the first generation of tools focused only on the visual appearance of the page. Like the rest of the computer industry that developed in the 1970s and 1980s, typesetting systems were not designed to deliver content to users who required assistive technology.

Today, a large proportion of the typesetting in authoring and publishing STEM (Science, Technology, Engineering, and Mathematics) content is performed using the LaTeX open source typesetting system. Another significant source of STEM content is Microsoft Word.

The workflow for accessible STEM content

Four critical elements, each leveraging ISO-standardized technologies, have now come together to deliver accessible math:

  1. Suitable creation software
  2. Modern PDF with the necessary features (PDF 2.0)
  3. PDF reader software with the ability to process MathML
  4. Assistive technology that can handle MathML in the PDF context

Diagram illustrating LaTeX source to PDF with MathML to a viewer that supports PDF with MathML to assistive technology.

Implications for authors

The latest version of LaTeX is now able to automatically generate accessible mathematics by including MathML in exported PDF files, using either of the ISO-standardized mechanisms designed for this purpose. Microsoft Word also includes MathML when using its “export as PDF” function, but this support is not ISO-standardized.
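A minimal sketch of what enabling this looks like in a LaTeX source file is shown below. The exact key names accepted by `\DocumentMetadata` vary between LaTeX releases (tagged-PDF support has been rolled out incrementally since 2023), so the values here are illustrative assumptions; consult the LaTeX Project’s current documentation for your distribution.

```latex
% Sketch: enabling tagged, accessible PDF 2.0 output in a recent
% LaTeX release. Key names are release-dependent assumptions.
\DocumentMetadata{
  lang        = en-US,
  pdfversion  = 2.0,
  pdfstandard = ua-2,  % PDF/UA-2: math carried as MathML structure
  tagging     = on
}
\documentclass{article}
\begin{document}
The quadratic formula, \( x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \),
is exported with MathML so assistive technology can speak or braille it.
\end{document}
```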

Implications for PDF viewers

PDF viewing software is gradually catching up with current-generation ISO standards for PDF.

For example, Foxit Reader and the Firefox browser now support PDF files that include either of the ISO-standardized methods for including MathML – structure elements or the Associated Files mechanism. Adobe’s Reader today supports the structure elements method, but not the Associated Files method.

Implications for assistive technology users

In its 2025 release, NVDA, the well-known screen-reader, together with a MathML-supporting add-in such as MathCAT, provides real-world proof that PDF 2.0’s support for accessible mathematics is a game-changer for accessibility across the entire STEM community.

The JAWS screen reader, when used with Firefox, is now also capable of reading accessible PDF documents that include mathematics.

Implications for institutions

Many STEM organizations have a large corpus of existing PDF documents along with the corresponding LaTeX source files.

In the near future, this new workflow will make it possible to recompile most of these existing LaTeX files to produce accessible PDF files without manual intervention, enabling a wholesale refreshing of existing collections of PDF content.

Implications for accessibility checking software

In order to support accessible math in PDF, accessibility checkers must support both PDF 2.0 and PDF/UA-2 (both available at no cost thanks to our generous sponsors).

Many of today’s checkers provide inaccurate results on PDF 2.0 files because they test against PDF/UA-1 rules, which require, among other things, Formula tags to have alternative text.

Implications for organizations

Organizations using STEM content should provide their users with accessibility checkers that support PDF/UA-2.

As of February 2026, many existing tools are only aware of PDF/UA-1 (2014), which is inadequate for most STEM documents. As a result, these tools incorrectly flag valid PDF/UA-2 (2024) documents as invalid.

As of Q1 2026, PDF Association members supporting PDF/UA-2 in their accessibility checkers include:

More information

The LaTeX Project has prepared some demos, examples and short videos to help users unfamiliar with accessibility understand the difference between accessible math and its predecessors.

Conclusion

On reviewing a PDF file using this new technology, Louis Maher, Secretary of the Science and Engineering Division of the National Federation of the Blind, said:

“I could never read a PDF document with math in it – I always needed help to find out some of the content. In my testing, with these new tools, the math in PDF is spoken as correctly as it is in HTML. Your PDF work is very impressive.”

It’s time for other creation, reader and assistive technology software to come on board with ISO-standardized accessible math in PDF!

]]>
callas software appoints a Customer Success Manager for North America https://pdfa.org/callas-software-strengthens-north-american-presence-with-appointment-of-customer-success-manager/?utm_source=rss&utm_medium=rss&utm_campaign=callas-software-appoints-a-customer-success-manager-for-north-america Tue, 03 Mar 2026 17:57:49 +0000 https://pdfa.org/?p=217510 Berlin – callas software, PDF expert with 30 years of experience in building automated PDF quality control, correction, and archival solutions, today announced the appointment of Natacha De Kegel as Customer Success Manager for North America, effective March 1st, 2026. This marks the first time callas has established a dedicated company resource based in North America.

North America is a key growth market for callas. With an expanding installed base of OEM integrations, channel partners, and end users across the region, the company is investing in closer proximity and stronger regional support. Based in Florida and operating in the Eastern Time Zone, Natacha will serve as the primary point of contact for customers, channel partners, and OEMs throughout North America.

Natacha brings extensive industry experience to the role. After beginning her career in PDF and print workflow technology, she built her own consulting business before joining DistributorX, one of callas’ most important distribution and integration partners in the region. In that role, she worked closely with print service providers, publishers, and integrators, gaining deep insight into real-world production environments and workflow challenges.

North America as growth market

“North America has always been strategically important for callas,” said David van Driessche, Chief Evangelist at callas software. “Until now, it was supported by the team in Germany. As our presence in the region grew through OEM relationships and channel partners, the next logical step was to establish a dedicated local resource to support them.”

Van Driessche added, “Natacha and I have worked together before – this is actually the second time I’ve hired her. I’ve always valued her practical understanding of production workflows and her ability to connect technology with real customer needs. Her experience on the partner side gives her a perspective that is extremely valuable for callas.”

Strengthening collaboration

In her new role, Natacha will focus on strengthening collaboration with channel partners and OEMs, improving customer response times, and helping organizations in North America get the most out of their callas solutions.

“I’m very grateful for the years I spent at DistributorX,” said Natacha. “Working closely with customers across North America gave me a deep appreciation for the complexity of modern print and publishing workflows. callas software plays a critical role in many of those environments, and I’m excited to now contribute directly to the development and success of these solutions.”

She added, “Having a local callas presence in North America will make communication faster and collaboration easier. Being present at local trade shows and events — and bringing initiatives such as pdfCamp to North America — will make it easier for professionals to engage directly with the technology and explore practical solutions to their production challenges.”

With this appointment, callas reinforces its long-term commitment to North America and to building a stronger regional presence in key markets worldwide.

More information about callas can be found on its websites:
Main: https://www.callassoftware.com
For OEMs: https://oem.callassoftware.com

]]>
Smarter PDFs, happier teams: Discover What’s New in iLovePDF https://pdfa.org/smarter-pdfs-happier-teams-discover-whats-new-in-ilovepdf/?utm_source=rss&utm_medium=rss&utm_campaign=smarter-pdfs-happier-teams-discover-whats-new-in-ilovepdf Tue, 03 Mar 2026 16:50:43 +0000 https://pdfa.org/?p=217502 We’re excited to share a comprehensive platform update focused on AI-powered workflows, team administration, usability improvements, and enhanced security. These updates strengthen our commitment to secure, scalable, and standards-aligned PDF solutions for individuals and organizations worldwide.

Advanced PDF editing

We’ve enhanced the Advanced Edit tool to deliver better performance and smoother content adjustments. Editing PDFs should feel seamless, and now it does.

Translate PDF with formatting preserved

Our new Translate PDF tool makes multilingual workflows dramatically easier without breaking your layout. Documents are translated while preserving the original structure, styling, images, visual elements, and overall layout integrity. There is no need for copy-and-paste, rebuilding layouts, or post-translation formatting fixes. The result is a translated document that looks and behaves like the original, ready to share immediately.

AI Credits and team-based AI management

We’ve introduced AI Credits, providing structured access to AI-powered PDF features. For teams, new AI credit settings allow administrators to manage allocation and usage at the organizational level. This ensures predictable consumption, better governance, and scalable AI adoption across departments.

Regional file processing

With Regions, organizations can define regional preferences and better support jurisdictional or data residency considerations. It is another step toward making distributed document workflows simpler and more compliant.

Tool settings: More control, fewer clicks

We’ve added a new Tool Settings section so users can fine-tune how their PDFs behave.

General settings

  • Save task history
  • Automatically download output files

Tool specific controls

  • Compress
  • PDF/A conversion
  • OCR
  • PDF to JPG
  • Image to PDF
  • Edit PDF
  • Protect PDF

Whether you are optimizing for archival standards, automation, or speed, these controls help reduce friction and streamline repetitive tasks.

Updated Terms & Conditions

Our Terms & Conditions have been updated to reflect new features and ensure transparency across services and subscriptions.

Notifications

We’ve introduced expanded notification capabilities to improve communication around account activity, subscriptions, and system updates.

User Panel refresh

We’ve refined the user panel experience with an improved plans-and-packages design and clearer section titles and hierarchy. Small design changes can make a big usability impact.

This release reflects our continued focus on international scalability, usability improvements, and strong security foundations, while maintaining the reliability professionals expect from modern PDF tools.

]]>
Check your PDFs before you ship ‘em! https://pdfa.org/check-your-pdfs-before-you-ship-em/?utm_source=rss&utm_medium=rss&utm_campaign=check-your-pdfs-before-you-ship-em Tue, 24 Feb 2026 15:44:02 +0000 https://pdfa.org/?p=217099

The world is probably getting tired of hearing about users’ redaction mistakes, but covering content with black boxes is not the only way to get into trouble!

What possible explanation is there for sharing PDF documents without checking for comments, as in this case? We don’t want to say, but…

Anyone responsible for checking a PDF before it goes out must ensure they use a PDF viewer that understands basic PDF features, such as annotations (which, by the way, quite a few mobile device readers do NOT).

A PDF icon doesn’t mean that the link goes to a PDF

Once again, the ubiquity and trust in PDF documents is abused to trick users into dangerous situations – even when no PDF is used!

The emails in this phishing campaign don’t attach a document directly but include links to a file hosted on IPFS (InterPlanetary File System), a decentralized storage network increasingly used by cybercriminals because it can be accessed through normal web gateways. Those files are virtual hard disks that, when opened, mount as a local disk, bypassing some Windows security features. Inside the disk is a Windows Script File (WSF) purporting to be the expected PDF: when the user opens it, Windows executes the code in the file, leaving the computer open to exploitation by remote attackers.
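One small defensive check against this class of lure is to compare what a link claims to be with what its filename actually is. The sketch below is a toy heuristic, not a substitute for a mail gateway: the function name, the extension list, and the example URLs are all illustrative assumptions.

```python
from urllib.parse import urlparse
from pathlib import PurePosixPath

# Extensions that execute as script or mount as a disk on Windows.
SCRIPTABLE = {".wsf", ".js", ".vbs", ".hta", ".vhd", ".vhdx", ".iso", ".img"}

def looks_like_pdf_lure(link_text: str, url: str) -> bool:
    """Flag links whose visible text promises a PDF while the URL's
    filename carries a scriptable or mountable extension."""
    claims_pdf = "pdf" in link_text.lower()
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return claims_pdf and suffix in SCRIPTABLE
```

A link labelled “Invoice.pdf” that resolves to `invoice.wsf` on an IPFS gateway would be flagged; a link whose path genuinely ends in `.pdf` would not.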

Register for PDF Week London

Joining us for PDF Week London coming up in May? We’ve posted some suggested hotels within a short walk of our meeting venue.

Save the date for ISO Week in South Korea

Thanks to our colleagues at Hancom, PDF Week for late 2026 will be held in Incheon during the week of October 12-16, 2026.

Beyond our regular meetings, members of the PDF Association and ISO committees meeting in Korea will offer a half-day seminar open to the public to provide information on PDF’s current status and future direction.

LLM text extraction paradox

This month’s PDFacademicBot includes a new paper, Speed, Simplicity, and Fidelity: A Multi-Metric Benchmark of Python PDF Extraction Libraries for RAG Pipelines, by A. Subramanian. One finding will be of particular interest to those extracting text from PDFs for LLMs:

We uncover an extraction paradox: tools specifically designed for LLM consumption (pymupdf4llm, Docling) significantly underperform simpler rule-based extractors (PyMuPDF, pypdfium2, pypdf) on text fidelity, while being 100–1,600× slower.

Share tokens instead of documents?

We’re hearing a lot of ideas about enhancing PDF files to make them easier for AIs to understand, and that’s great. We’re confident, however, that exchanging tokens instead of documents won’t go far.

Replace PDF with… Markdown? 😆😂😭

This article (in Dutch) has received attention on LinkedIn, where it was originally posted.

For an author who self-describes as a “Technology Philosopher”, this piece represents a naive understanding of the very real challenges of complex typography, diverse requirements of human communication, and the realities of real-world document workflows, including requirements for archival content.

As we’ve previously reported, long-term understanding of even “plain text” requires preservation of its context. One only has to look at EBCDIC, GB 18030, and the constant evolution of Unicode to understand that human communication goes far beyond “simple text”. Archivists, experienced software developers, and even comedians understand the importance of typefaces and appreciate the real-world challenges of dealing with legacy schema-less XML, undocumented JSON, or HTML files from the browser wars era.

In collaboration with ISO TC 171 SC 2 WG 10, our DocRM Liaison Working Group is helping develop the ISO 20271-1 reference model to address misunderstandings and misconceptions about textual preservation across all file formats.

Epstein PDFs analysis continues

Thanks to our coverage of the Epstein PDFs, reporters at The Verge asked for our thoughts on the oddities they and others are finding in emails converted to PDFs. Read more at theverge.com.

This article summarises what redaction is and lists some well-known redaction failures.

Redacting content versus hiding content, explained

If you are trying to remove content from a PDF, it’s essential that you understand the difference between “hiding” and “removing” content. The PDF Association’s new video explains the distinction.

Chrome now supports JPEG-XL

Further to our past announcements about selecting JPEG-XL as the preferred HDR format for future PDF and Chrome engineers reversing their previous decision against JPEG-XL, Chrome v145 now includes support for JPEG-XL. Try opening this JPEG-XL test page to see JPEG-XL in action.

The re-evaluation began in November 2025, when the Chromium team announced its resumption. Several factors were decisive: Apple had implemented JPEG XL support in Safari, Mozilla had abandoned its neutral stance, and the PDF Association had included the format in PDF specifications as recommended in October 2025. Technically, Chromium plans to integrate “jxl-rs,” a Rust-based JPEG XL decoder. Google is already using the format in practice: the Google Cloud Platform DICOM API uses JPEG XL to reduce file size by 20 percent.

Super Mario 64 in a PDF

Maybe you’ve played Doom in a PDF. Or taken a cue from Michael Demay, and ditched the PS5 for PDF 2.0.

Not everyone loves a first-person shooter. Perhaps you are more of a Mario fan? Now you can play Super Mario 64 on your favorite substrate: PDF!

Brotli gains media attention

The PDF Technical Working Group’s work on promoting Brotli as a new general compression algorithm was picked up by various news outlets in Germany and the US. Is your PDF technology ready for this “breaking change” that steps up PDF file size reductions?

PDFacademicBot for February 2026

The PDFacademicBot brings academic research on PDF and related technologies to the industry’s attention.

Açıkgöz, Z., Arslan, S., and Arslan, R.S. (Nov. 2025) “Enhancing File Security with an Optimized Auto-Classification Framework Based on Learning Models,” in 2025 9th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT), pp. 1–6. https://doi.org/10.1109/ISMSIT67332.2025.11268095.

Bouwel, J.V. and Kock, T.D. (Jan. 2026) “From PDF to Preservation: Using Automated Text Analysis to Uncover Stone Use Motivations in 19th Century Public Architecture,” in 15th International Congress on the Deterioration and Conservation of Stone, pp. 600–602. https://repository.uantwerpen.be/docman/irua/1c215cmotoM3f.

Hanson, M.D. (2026) “Confronting the Urgent Challenge of Using LaTeX to Create Accessible Course Materials,” ChemRxiv, 2026 (0129). https://doi.org/10.26434/chemrxiv.10001708/v1.

Jain, R. and Kumar, S.R. (Jan. 2026) “Perfecting Tax Returns Like Code: A Verifier-Swarm, Codebase-Style Architecture that Solves TaxCalcBench,” p. 3. https://prime-meridian-papers.s3.us-west-2.amazonaws.com/solving_taxes_like_code.pdf

Kuligin, L. (Jan. 2026) “Layout-Aware Text Extraction Using Heuristic Segmentation and LLM-Based Refinement.” Technical Disclosure Commons – Defensive Publication Series. https://www.tdcommons.org/cgi/viewcontent.cgi?article=10508&context=dpubs_series.

Lam, D., Li, L. and Gabrielson, A. (Jan. 2026) “Parser Weakness Enumeration,” p. 8. https://drive.usercontent.google.com/download?id=1VUPYR9yTvnQgiSpj3CrMrLKbnZdWP9xu&export=download&authuser=0

Prakash, P. et al. (Nov. 2025) “Revolutionizing PDF Q&A with Local LLMs and Privacy-Enhanced Retrieval-Augmented Generation,” in 2025 International Conference on Green Energy, Computing and Sustainable Technology (GECOST). 2025 International Conference on Green Energy, Computing and Sustainable Technology (GECOST), pp. 1–6. https://doi.org/10.1109/GECOST66002.2025.11324623.

Rigal, B. et al. (Feb. 2026) “Benchmarking Vision-Language Models for French PDF-to-Markdown Conversion.” arXiv. https://doi.org/10.48550/arXiv.2602.11960.

Sharmila S. P (Jan. 2026) “PDFInspect: A Unified Feature Extraction Framework for Malicious Document Detection.” arXiv. https://doi.org/10.48550/arXiv.2601.12866.

Silaen, C.J. et al. (Nov. 2025) “Automatic Generation of Presentation Slides from PDF Using Retrieval-Augmented Chatbot,” in 2025 IEEE 11th International Conference on Computing, Engineering and Design (ICCED), pp. 1–6. https://doi.org/10.1109/ICCED68324.2025.11324852.

Subramanian, A. (Feb. 2026) Speed, Simplicity, and Fidelity: A Multi-Metric Benchmark of Python PDF Extraction Libraries for RAG Pipelines. https://doi.org/10.13140/RG.2.2.29289.56168.

Vasepalli, K. et al. (2025) “Intelligent Model for PDF Malware Detection,” in Proceedings of the 1st International Conference on Research and Development in Information, Communication, and Computing Technologies, Nagapattinam, India: SCITEPRESS – Science and Technology Publications, pp. 800–806. https://doi.org/10.5220/0013943800004919.

Wallwater, I. et al. (Jan. 2026) “ChemSIE: From Document Based Records to Machine Actionable Experimental Data,” ChemRxiv, 2026(0121). https://doi.org/10.26434/chemrxiv.10001481/v1.

Waseem, A., Zia, M.A.M. and Adedayo, O.M. (Jan. 2026) “A Comparative Study of Forensic File Type Identification Methods for Tool Type Identification,” IEEE Open Access, p. 14. https://doi.org/10.1109/ACCESS.2026.3655461.

Zhu, W., Mazeen Mujthaba, M., and Wong, K. (Jan. 2026) “Reversible data hiding in PDF files by overlapping characters,” Journal of Information Security and Applications, 97, p. 104375. https://doi.org/10.1016/j.jisa.2026.104375.


]]>
Datalogics Adds Support for PAdES B-T Digital Signatures to Adobe PDF Library SDK https://pdfa.org/datalogics-adds-support-for-pades-b-t-digital-signatures-to-adobe-pdf-library-sdk/?utm_source=rss&utm_medium=rss&utm_campaign=datalogics-adds-support-for-pades-b-t-digital-signatures-to-adobe-pdf-library-sdk Mon, 23 Feb 2026 19:18:54 +0000 https://pdfa.org/?p=217047 Datalogics has added API support to Adobe PDF Library SDK in a new release, making it easier to create PAdES B-T digital signatures in PDF documents. In practical terms, developers can now generate PAdES Baseline B-T signatures directly through a dedicated PAdES signature class in the SDK. So what does that mean for PDF security?

  • PAdES is a European standard for advanced electronic signatures in PDFs.
  • Baseline B-T adds a trusted timestamp to the signature.
  • That timestamp proves not just who signed the document, but when it was signed and that the signature was valid at that time.

With this update, you don’t have to manually piece together all the required components of a compliant PAdES B-T signature. Adobe PDF Library SDK now handles the heavy lifting through a purpose-built API, making it much easier to create standards-compliant, timestamped digital signatures in PDFs.

Here’s some more information about PAdES. PAdES stands for PDF Advanced Electronic Signatures: a set of standards that defines how to create advanced and qualified electronic signatures inside PDF documents, based on European regulations (notably eIDAS) and ETSI standards. Note that Datalogics only supports PAdES B-T at this time.

What PAdES Is (in simple terms)

PAdES builds on:

  • PDF digital signature capabilities
  • PKI (Public Key Infrastructure)
  • X.509 certificates
  • ETSI standards for advanced electronic signatures

It specifies how to embed long-term, legally valid digital signatures directly into a PDF file.

Why PAdES Is Useful in PDF

PAdES signatures can meet eIDAS requirements, meaning:

  • They can qualify as Advanced Electronic Signatures (AdES)
  • Or even Qualified Electronic Signatures (QES)
  • A QES has the same legal value as a handwritten signature in the EU

Built for PDF

PAdES integrates directly into the PDF structure using standard PDF signature fields. That means:

  • The document remains a normal PDF
  • It can be opened in Adobe Acrobat and other readers
  • Signature validation is built-in
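As a concrete illustration (not the Datalogics API), a PAdES signature lives in a standard PDF signature field whose signature dictionary uses the /ETSI.CAdES.detached SubFilter defined by ISO 32000-2 and the ETSI PAdES standards. The sketch below models that dictionary as a plain Python dict with placeholder values, and shows the entry that distinguishes a PAdES signature from a legacy PDF signature:

```python
# Illustrative sketch only: the entries a PAdES signature dictionary carries
# inside a standard PDF signature field. Offsets and byte values are placeholders.

def is_pades_signature(sig_dict: dict) -> bool:
    """A PAdES (CAdES-based) PDF signature is marked by the ETSI SubFilter."""
    return (
        sig_dict.get("Type") == "/Sig"
        and sig_dict.get("SubFilter") == "/ETSI.CAdES.detached"
    )

# A minimal PAdES signature dictionary as it appears in the PDF:
pades_sig = {
    "Type": "/Sig",                       # signature dictionary
    "Filter": "/Adobe.PPKLite",           # signature handler name
    "SubFilter": "/ETSI.CAdES.detached",  # PAdES marker
    "ByteRange": [0, 840, 960, 1200],     # signed byte ranges (placeholder offsets)
    "Contents": b"<DER-encoded CMS SignedData>",  # the CAdES signature itself
}

# A legacy (pre-PAdES) signature differs only in its SubFilter:
legacy_sig = dict(pades_sig, SubFilter="/adbe.pkcs7.detached")

print(is_pades_signature(pades_sig))   # True
print(is_pades_signature(legacy_sig))  # False
```

Because the signature sits in an ordinary signature field, any conforming reader can locate and validate it without format-specific extensions.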

Long-Term Validation

One of PAdES’s biggest advantages is its support for long-term validation. It allows embedding:

  • Certificate chains
  • OCSP responses
  • CRLs (revocation info)
  • Timestamps

This ensures:

  • The signature can still be validated years or decades later, even if certificates expire.
  • This durability is critical for industries that require long document retention.
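The validation material above is stored in the Document Security Store (/DSS) dictionary that ISO 32000-2 defines in the PDF catalog. The sketch below (illustrative only, with placeholder stream contents — not SDK code) models that structure and a simple check for whether enough material is present for long-term validation:

```python
# Illustrative model of the /DSS dictionary holding long-term validation
# material. Real entries are PDF stream objects; strings here are placeholders.

dss = {
    "Certs": ["<certificate stream 1>", "<certificate stream 2>"],  # full chains
    "OCSPs": ["<OCSP response stream>"],  # revocation status captured at signing time
    "CRLs": ["<CRL stream>"],             # certificate revocation lists
    # Validation-related info keyed per signature (by a digest of its /Contents):
    "VRI": {"<hex digest of signature /Contents>": {"Cert": [], "OCSP": []}},
}

def has_ltv_material(dss: dict) -> bool:
    """LTV needs the certificate chain plus some form of revocation evidence."""
    return bool(dss.get("Certs")) and bool(dss.get("OCSPs") or dss.get("CRLs"))

print(has_ltv_material(dss))             # True
print(has_ltv_material({"Certs": []}))   # False
```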

Multiple Signature Levels

PAdES defines different levels of assurance:

  • PAdES-B-B: basic signature
  • PAdES-B-T: includes trusted timestamp
  • PAdES-B-LT: long-term validation material included
  • PAdES-B-LTA: archival-level protection

Organizations can choose the level depending on regulatory needs.
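The levels are cumulative, each building on the previous one. A hypothetical helper (not part of any SDK) makes the relationship explicit by mapping the components embedded with a signature to the corresponding baseline level:

```python
# Hypothetical helper, simplified: assumes each level builds on the one below,
# per the ETSI EN 319 142 baseline profiles.

def pades_baseline_level(has_timestamp: bool,
                         has_validation_data: bool,
                         has_document_timestamp: bool) -> str:
    if has_document_timestamp:   # archival timestamp protecting signature + LTV data
        return "B-LTA"
    if has_validation_data:      # certificate chains and OCSP/CRLs embedded
        return "B-LT"
    if has_timestamp:            # RFC 3161 trusted signature timestamp
        return "B-T"
    return "B-B"                 # signature only

print(pades_baseline_level(False, False, False))  # B-B
print(pades_baseline_level(True, False, False))   # B-T
print(pades_baseline_level(True, True, False))    # B-LT
print(pades_baseline_level(True, True, True))     # B-LTA
```

In these terms, the new Datalogics release targets the B-T row: a signature plus a trusted timestamp.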

Why It Matters in Enterprise PDF Workflows

For companies working with PDF technology (like digital document processing, validation, archival, or compliance), PAdES is important because:

  • It enables compliant e-signature solutions.
  • It supports automated validation workflows.
  • It aligns with European regulatory requirements.
  • It reduces legal risk.

To see this new capability in action, check out Adobe PDF Library SDK and start a free trial.

]]>
The PDF Association at 20 https://pdfa.org/the-pdf-association-at-20/?utm_source=rss&utm_medium=rss&utm_campaign=the-pdf-association-at-20 Mon, 23 Feb 2026 16:46:41 +0000 https://pdfa.org/?p=216934 In 2026 the PDF Association marks its 20th year of operation.

Since 2006 we’ve grown from five German companies collaborating on a common understanding of ISO 19005-1 to the technical support system for the global ecosystem of all PDF technology.

What began in 2006 as a focus on archival-quality digital documents has grown into an international hub for PDF developers. From common understandings to technical collaboration to industry events, ISO standards, and our own specifications, today’s PDF Association serves PDF’s stakeholders (organizations, developers, governments and users) with a vendor-neutral platform for considering and enhancing PDF technology.

Why do we focus on collaboration and shared experience in a competitive marketplace? Because PDF technology depends on interoperability – users’ ability to share documents without worrying about the recipient’s choice of software for viewing them.

From PDF/A to a global standards community

The PDF Association’s roots trace back to 2006, with the establishment of the PDF/A Competence Center, an initiative to highlight the newly published ISO standard for archival PDF: PDF/A-1 (ISO 19005-1:2005). At that time, long-term digital preservation was hindered by proprietary formats and software dependencies. PDF/A-1, published in 2005, offered a robust, self-contained file specification engineered to embed all necessary rendering information — fonts, color profiles, and structural metadata — so that documents could be reliably preserved and accessed across decades independent of platform or viewer technology.

Based in Europe, the PDF/A Competence Center community grew rapidly, as libraries, cultural heritage institutions, corporations, and regulators adopted PDF/A for trusted electronic recordkeeping and long-term access. The need to validate, implement, and interoperate with PDF/A catalyzed a broader technical and professional dialogue that naturally expanded the original Competence Center’s scope and continues to this day.

In 2011, the Board changed the organization’s name to PDF Association, expanding its mission beyond PDF/A to encompass the entire spectrum of PDF technology specifications and standards. This evolution reflected a recognition that, as a universal, platform-agnostic document format, PDF’s scope implied the need for a broader framework for consensus-building and standards development.

Today, the PDF Association counts more than 150 members from roughly 30 countries. Our membership includes software vendors, service providers, standards implementers, government bodies, libraries, enterprises, technical professionals, and other stakeholders engaged with a broad range of PDF-related technologies and workflows.

Mission and role in standards

From inception, the PDF Association has worked to develop and advance ISO standards for PDF technology. Through its category A liaison with ISO, the Association facilitates its members’ participation in the international working groups that define and evolve PDF and related technologies. This engagement ensures that PDF remains an open, widely implemented, vendor-neutral, globally relevant standard.

PDF is often implemented by open source developers with low (or no) budgets. To remove barriers to adoption, the PDF Association enables members to sponsor ISO standards for PDF, making them available at no cost. Providing public access to these standards expands participation and implementation across the whole ecosystem, especially for developers and organizations that would otherwise face cost barriers in obtaining ISO-published documents.

PDF’s technical ecosystem

Underpinning the PDF Association’s standards work are our Technical and Liaison Working Groups, in which subject matter experts collaborate on specifications, best-practice documents, test suites, techniques, implementation notes, and other explanatory and educational materials.

These groups cover a wide range of domains, including archiving, accessibility, print, engineering, cryptography, rich media, and more. Their activities and publications feed into ISO processes, industry documentation, validation efforts, and community-driven knowledge resources.

Throughout 2025, the PDF Association’s working groups released numerous deliverables, including new accessibility techniques covering lists and headings, guidance on including custom metadata structures in PDF, an FAQ on HDR in PDF/A and PDF/X, and a new TWG focused on cryptography and provenance. The association also published guidance on conforming to multiple subsets in a single PDF and dramatically expanded its use of GitHub for technical collaboration.

These working group products play a significant role in practical advancement and adoption. They provide a bridge between evolving standards and real-world implementation concerns, helping implementers realize consistent, interoperable software and workflows.

Events and community engagement

Events that convene stakeholders, create opportunities for technical dialogue, and spotlight emerging priorities are key elements in PDF’s technical ecosystem. Among these, PDF Week holds a central place. Held three times each year around the globe, PDF Week events are stakeholder-oriented forums where working group meetings, ISO committee sessions, and networking opportunities converge in a focused period.

Looking ahead to 2026, the PDF Association continues this tradition with a sequence of meetings that will take place online and in person, including PDF Week Online 2026 (February), PDF Week London 2026 (May), and PDF Week Incheon in South Korea in October 2026. These events expand global participation and provide platforms for deep discussions on the wide variety of topics engaged by PDF’s technical community.

In addition to PDF Week events, the association hosts symposiums, webinars, and workshops on focused technical topics, including accessibility (such as our upcoming webinar on Techniques for Accessible PDF: Lists), future directions for 3D PDF, and practical techniques for developers and document professionals. These sessions extend the community’s reach beyond core members and foster broader engagement. The PDF Association’s PDF Days event, last held in Berlin in September 2025, brings together vendors and end users for presentations and discussions and serves as another way to foster collaboration.

Impact and industry significance

The PDF Association’s impact over the past 20 years can be measured in several ways:

  • A global community of stakeholders collaborating to define, implement, and advance interoperable approaches.
  • Collaboratively-developed working documents, techniques, and education materials that support developers and implementers tackling real-world PDF challenges.
  • Collaboration with third parties, including the W3C and DARPA.
  • Sustained international standards development for PDF technology that meets industry needs in document exchange, accessibility, preservation, printing, engineering, and more.
  • A portfolio of events and engagement opportunities that strengthen professional networks and technical coherence across the ecosystem.

The PDF Association’s work not only helps to ensure PDF’s position as a ubiquitous format for digital publishing but also provides a vendor-neutral means of developing guidance to ensure structured, accessible, secure, semantically rich, and highly functional documents in a diverse array of industries and use cases.

Looking ahead

As the PDF Association enters its third decade, the mission remains grounded in fostering open, consensus-based vendor-neutral digital document technology that adapts to changing needs. From state-of-the-art compression and cryptography to accessibility and integration with artificial intelligence workflows, the themes driving PDF development continue to evolve.

In an era where digital trust, long-term preservation, and secure interoperability are paramount, the PDF Association’s role as a vendor-neutral, consensus-based technical community and standards facilitator is as relevant today as it was at its founding. Our 20th anniversary is a recognition, not just of the organization’s longevity, but of the value of continuous technical stewardship of the world’s most enduring digital document format.

]]>