
hueper/pdf-verapdf-service


VeraPDF Validation Service

Live Demo | License: MIT | OpenAPI 3.0

HTTP service for PDF/UA and PDF/A validation using VeraPDF.

Live Demo: https://pdf-verapdf-service.onrender.com


Why This Exists

The German Barrierefreiheitsstärkungsgesetz (BFSG), which came into force on 28 June 2025, implements the European Accessibility Act (EAA) at national level. Together with increasingly strict procurement requirements from libraries and public institutions, this means that accessibility compliance is no longer optional for publishers. In practice, PDF/UA has become the central standard for meeting these obligations.

Enterprise-grade validation tools exist, but not every team or organisation has the budget, infrastructure, or organisational capacity to deploy them at scale. This service explores what is realistically achievable with constrained resources (limited CPU, limited memory, and simple cloud infrastructure) while still addressing real publishing requirements.

At the same time, in many publishing workflows accessibility validation still happens late, manually, and file by file. This makes it hard to scale, difficult to audit, and poorly suited for integration into modern production pipelines. Treating PDF validation as an API rather than a desktop tool enables earlier feedback, reproducible checks, and clearer responsibility boundaries between editorial, production, and technology.

This service wraps the open-source VeraPDF validation engine in a modern, cloud-native API. It enables:

  • Accessibility audits at scale — batch validation of entire document repositories
  • CI/CD integration — validate PDFs as part of automated publishing workflows
  • Real-time feedback — WebSocket-based progress reporting for responsive UIs
  • Cloud infrastructure — deploy on Render’s free tier or your own AWS environment

What This Is (and What It Is Not)

This repository is a deliberately small, opinionated reference implementation of how PDF/UA and PDF/A validation can be exposed as a cloud-native service using open standards.

It is not a finished product, not a hosted offering, and not intended to replace existing enterprise validation tools.

It is meant to spark discussion around:

  • automation vs. manual QA in publishing
  • accessibility validation as infrastructure, not as a desktop task
  • what 'good enough' cloud deployments look like for mid-sized publishers

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                              Internet                           │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
                                  ▼
                        ┌──────────────────┐
                        │   Load Balancer  │
                        │   (ALB / Render) │
                        └─────────┬────────┘
                                  │
                                  ▼
                  ┌────────────────────────────────┐
                  │        VeraPDF Service         │
                  │      (Java 21 / Javalin)       │
                  │                                │
                  │  ┌──────────────────────────┐  │
                  │  │   Validation Engine      │  │
                  │  │   - Queue management     │  │
                  │  │   - Progress tracking    │  │
                  │  │   - Concurrent execution │  │
                  │  └──────────────────────────┘  │
                  │                                │
                  │  ┌──────────────────────────┐  │
                  │  │   WebSocket Handler      │  │
                  │  │   - Real-time updates    │  │
                  │  │   - Session management   │  │
                  │  └──────────────────────────┘  │
                  └────────────────────────────────┘

Key Technical Decisions

| Decision | Rationale |
| --- | --- |
| Javalin over Spring | Minimal footprint for a focused microservice; faster cold starts on free-tier hosting |
| WebSocket for progress | PDF validation can take 30+ seconds; real-time feedback prevents timeout assumptions |
| Queue with admission control | Graceful degradation under load; capacity signals via Retry-After headers |
| OpenAPI as contract | Contract tests validate responses against the spec; consumers get accurate documentation |
| Environment-based config | 12-factor compliance; same image works from laptop to production |
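
On the client side, the Retry-After capacity signal can be honoured with a small helper. This is a minimal sketch, not part of the repository: the function name `retryDelayMs` and the 5-second fallback are assumptions, and the status code the service returns when the queue is full is not stated here, so the helper only looks at the header value. Retry-After may carry either a number of seconds or an HTTP date, and both forms are handled.

```javascript
// Hypothetical client helper: turn a Retry-After header value into a wait time in ms.
// fallbackMs is an assumed default for a missing or unparseable header.
function retryDelayMs(headerValue, nowMs = Date.now(), fallbackMs = 5000) {
  if (headerValue == null) return fallbackMs;
  const value = String(headerValue).trim();
  if (value === '') return fallbackMs;

  // Form 1: delta-seconds, e.g. "30"
  if (/^\d+$/.test(value)) return Number(value) * 1000;

  // Form 2: HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs;

  // Never return a negative delay if the date is already in the past
  return Math.max(0, dateMs - nowMs);
}
```

A caller would typically read the header from the response (for example `response.headers.get('Retry-After')` with `fetch`) and wait that long before resubmitting.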

API

| Endpoint | Method | Description |
| --- | --- | --- |
| /health | GET | Health check |
| /status | GET | Server status, queue info, and capacity |
| /config | GET | Current service configuration |
| /profiles | GET | List available validation profiles |
| /validate/async | POST | Validate a PDF with WebSocket progress updates |
| /validate/batch | POST | Validate multiple PDFs synchronously |

Batch Validation (API Consumers)

Validate multiple PDFs in a single synchronous request:

# Single file
curl -F "[email protected]" -F "profile=ua1" \
  https://verapdf-service.onrender.com/validate/batch

# Multiple files
curl -F "[email protected]" -F "[email protected]" -F "profile=ua1" \
  https://verapdf-service.onrender.com/validate/batch

Response:

{
  "totalFiles": 2,
  "compliantCount": 1,
  "nonCompliantCount": 1,
  "totalDurationSeconds": 28.5,
  "results": [
    {
      "compliant": true,
      "profile": "ua1",
      "profileName": "PDF/UA-1 (Universal Accessibility)",
      "rulesViolated": 0,
      "failedChecks": 0,
      "passedChecks": 150,
      "violations": [],
      "validationDurationSeconds": 12.3,
      "fileSize": 1024000,
      "summary": "Document is compliant with PDF/UA-1",
      "filename": "doc1.pdf"
    }
  ]
}
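
For CI/CD use, a response of this shape can be reduced to a pass/fail decision plus a list of offending files. The sketch below relies only on the fields shown in the sample response above; the function name `summarizeBatch` and the shape of its return value are illustrative assumptions, not part of the service.

```javascript
// Hypothetical helper: reduce a /validate/batch response body to a CI verdict.
// Uses only fields that appear in the sample response (results, compliant,
// filename, profile, failedChecks, nonCompliantCount, totalFiles).
function summarizeBatch(body) {
  const results = Array.isArray(body.results) ? body.results : [];

  // Collect every result that is not explicitly marked compliant
  const failures = results
    .filter((r) => r.compliant !== true)
    .map((r) => ({
      filename: r.filename,
      profile: r.profile,
      failedChecks: r.failedChecks,
    }));

  return {
    // Pass only if both the aggregate counter and the per-file results agree
    ok: body.nonCompliantCount === 0 && failures.length === 0,
    totalFiles: body.totalFiles,
    failures,
  };
}
```

In a pipeline step, the process exit code could then be derived from `summarizeBatch(body).ok`, with `failures` printed to the build log.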

Async Validation (Frontend / WebSocket)

For responsive UIs, use async validation with real-time progress:

curl -F "[email protected]" -F "profile=ua1" \
  https://verapdf-service.onrender.com/validate/async

Response (202 Accepted):

{
  "validationSessionId": "550e8400-e29b-41d4-a716-446655440000",
  "status": "queued",
  "queuePosition": 1,
  "estimatedWaitSeconds": 30,
  "message": "Connect to WebSocket for progress updates."
}

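Connect to the WebSocket at wss://verapdf-service.onrender.com/ws and register the session once the connection is open:

const ws = new WebSocket('wss://verapdf-service.onrender.com/ws');

// send() throws if the socket is still connecting, so register in onopen
ws.onopen = () => {
  ws.send(JSON.stringify({
    type: 'register',
    validationSessionId: 'your-session-id' // value from the 202 response
  }));
};

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data);
  // msg.type: 'queued' | 'started' | 'progress' | 'complete' | 'error'
};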
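
The five message types can be folded into a small piece of UI state. The following is a sketch under stated assumptions: only `type` is documented above, so the reducer stores each message untouched instead of reading any payload fields, and the names `initialState`, `nextState`, and `registerMessage` are hypothetical.

```javascript
// Message types as listed in the snippet above
const KNOWN_TYPES = new Set(['queued', 'started', 'progress', 'complete', 'error']);
// Assumption: 'complete' and 'error' end the session
const TERMINAL_TYPES = new Set(['complete', 'error']);

const initialState = { phase: 'idle', done: false, last: null };

// Pure reducer: returns the next UI state for an incoming message
function nextState(state, msg) {
  if (!msg || !KNOWN_TYPES.has(msg.type)) return state; // ignore unknown messages
  if (TERMINAL_TYPES.has(state.phase)) return state;    // ignore anything after the end
  return { phase: msg.type, done: TERMINAL_TYPES.has(msg.type), last: msg };
}

// Build the register frame from the 202 response body of /validate/async
function registerMessage(accepted) {
  return JSON.stringify({
    type: 'register',
    validationSessionId: accepted.validationSessionId,
  });
}
```

Inside `ws.onmessage`, the handler would call `state = nextState(state, JSON.parse(event.data))` and re-render; once `state.done` is true the socket can be closed.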

Limits & Profiles

Default limits (configurable via environment):

  • Maximum 10 files per batch
  • Maximum 20MB per file
  • Maximum 200MB total request size

Available profiles: ua1, ua2, 1a, 1b, 2a, 2b, 2u, 3a, 3b, 3u, 4, 4e, 4f
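
A client can check a batch against these defaults before uploading, which avoids a round trip for requests the service would reject anyway. This is a minimal sketch: `checkBatch`, `isKnownProfile`, and `DEFAULT_LIMITS` are hypothetical names, and it assumes 1 MB means 1024 × 1024 bytes, which may differ from how the service counts. Because the limits are configurable on the server, they are passed in as a parameter.

```javascript
// Defaults taken from the list above; MB is assumed to mean 1024 * 1024 bytes
const DEFAULT_LIMITS = {
  maxFiles: 10,
  maxFileBytes: 20 * 1024 * 1024,
  maxTotalBytes: 200 * 1024 * 1024,
};

// Profile identifiers as listed above
const PROFILES = new Set([
  'ua1', 'ua2', '1a', '1b', '2a', '2b', '2u', '3a', '3b', '3u', '4', '4e', '4f',
]);

function isKnownProfile(profile) {
  return PROFILES.has(profile);
}

// files: array of { name, size } objects (browser File objects have this shape).
// Returns a list of human-readable problems; an empty list means the batch fits.
function checkBatch(files, limits = DEFAULT_LIMITS) {
  const problems = [];
  if (files.length === 0) problems.push('no files selected');
  if (files.length > limits.maxFiles) {
    problems.push(`too many files: ${files.length} > ${limits.maxFiles}`);
  }
  let totalBytes = 0;
  for (const file of files) {
    totalBytes += file.size;
    if (file.size > limits.maxFileBytes) {
      problems.push(`${file.name} exceeds the per-file limit`);
    }
  }
  if (totalBytes > limits.maxTotalBytes) {
    problems.push('total size exceeds the request limit');
  }
  return problems;
}
```

The server remains the authority on limits; a check like this only improves feedback in the UI.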


Run Locally

With Docker

cd backend
docker build -t verapdf-service .
docker run -p 8080:8080 verapdf-service

With Maven

Requires Java 21+.

cd backend
mvn package
java -jar target/verapdf-service-1.0.0.jar

Or for development:

mvn compile exec:java

Service runs at http://localhost:8080.

Configuration

Copy .env.example to .env and customize:

# Resource limits
VERAPDF_LIMIT_MAX_QUEUE_SIZE=5
VERAPDF_LIMIT_MAX_CONCURRENT=1
VERAPDF_LIMIT_MAX_FILE_SIZE_MB=20

# Validation defaults
VERAPDF_VALIDATION_DEFAULT_PROFILE=ua1

See backend/.env.example for all options including deployment profiles for different resource tiers.


CI/CD

This project uses a deliberately minimal CI pipeline that focuses on build reproducibility and API stability rather than exhaustive quality gates. The goal is fast feedback and production confidence, not pipeline complexity.

Deployment

Render (Recommended for demos)

  1. Push this repo to GitHub
  2. Go to render.com → New + → Web Service
  3. Connect your repo
  4. Render auto-detects the Dockerfile
  5. Select Free plan, confirm port 8080
  6. Click Deploy

AWS (Production)

The terraform/ directory contains a complete ECS Fargate deployment:

cd terraform

# Configure
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your container image URI

# Deploy
terraform init
terraform plan
terraform apply

See terraform/README.md for detailed configuration options.


Production Considerations

When deploying to production, review these settings:

| Setting | Default | Production Recommendation |
| --- | --- | --- |
| VERAPDF_SERVER_CORS_ALLOW_ALL | true | Set to false and specify allowed origins |
| VERAPDF_LIMIT_MAX_CONCURRENT | 1 | Increase based on available CPU |
| Log retention | 7 days | Adjust in Terraform for compliance needs |
| HTTPS | Not included | Add ACM certificate and HTTPS listener |

Project Structure

pdf-verapdf-service/
├── openapi.yaml          # API specification (contract source of truth)
├── backend/
│   ├── Dockerfile
│   ├── pom.xml
│   └── src/
│       ├── main/java/com/pdfvalidator/
│       │   ├── Application.java        # Entry point, route definitions
│       │   ├── Config.java             # Environment-based configuration
│       │   ├── ProgressAwarePdfValidator.java
│       │   ├── ValidationWebSocket.java
│       │   └── ...
│       └── test/java/com/pdfvalidator/
│           └── ContractTest.java       # OpenAPI contract validation
├── frontend/
│   └── src/
│       ├── App.jsx
│       ├── useValidationWebSocket.jsx  # WebSocket hook
│       └── ProgressIndicator.jsx
└── terraform/
    ├── main.tf
    ├── ecs.tf
    ├── alb.tf
    └── ...

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/improvement)
  3. Run tests (cd backend && mvn test)
  4. Submit a pull request

For major changes, please open an issue first to discuss the approach.


License

This project is licensed under the MIT License.

This software uses VeraPDF (MPL-2.0 / GPL-3.0), Javalin (Apache 2.0), and other open-source libraries.
