TypeScript library for extracting structured data from archaeological excavation report PDFs
English | 한국어
⚠️ macOS Only: This project currently supports only macOS (Apple Silicon or Intel). See @heripo/pdf-parser README for detailed system requirements.
ℹ️ Notes (v0.1.x):
- Mixed Script Detection: Korean-Hanja mixed documents are automatically detected and corrected via VLM (Vision Language Model)
- TOC Dependency: Reports without a TOC will fail (intentional). Rare extraction failures will be addressed via human intervention
- Vertical Text: Old vertical-text documents with Chinese numeral page numbers are a long-term goal, not currently scheduled
Online Demo: Try it without local installation → engine-demo.heripo.com
- Introduction
- Key Features
- Architecture
- Installation
- Packages
- Usage Examples
- Demo Application
- Documentation
- Roadmap
- Contributing
- Citation and Attribution
- License
heripo engine is a collection of tools for analyzing archaeological excavation report PDFs and extracting structured data. It is designed to effectively process documents that span hundreds of pages and contain complex layouts, tables, diagrams, and photographs.
heripo lab is an open-source R&D group that combines archaeological domain knowledge with software engineering expertise to drive practical research efficiency.
- Role: Design of LLM-based unstructured data extraction pipeline and system implementation
- Background: Software Engineer (B.S. in Computer Science and B.A. in Archaeology)
- Research:
- A Study on Archaeological Informatization Using Large Language Models (LLMs): Proof of Concept for an Automated Metadata Extraction Pipeline from Archaeological Excavation Reports (2025, Heritage: History and Science Vol. 58 No. 3, KCI Listed)
- Role: Archaeological data ontology design, data schema definition, and academic validation
- Background: Ph.D. Student in Archaeology, M.A. in Cultural Informatics
- Research:
- Considerations for Structuring Maritime Cultural Heritage Data (2025, Journal of the Island Culture No. 66, KCI Listed)
- Semantic Data Design for Maritime Cultural Heritage: Focusing on Ancient Shipwrecks and Wooden Tablets Excavated from the Taean Mado waters (2025, Master's Thesis)
- Role: Development of archaeology research platforms
- Background:
- Software Engineer
- M.A. in Archaeology (Coursework Completed)
- B.A. in Archaeology
- B.A. in Library and Information Science
Archaeological excavation reports contain valuable cultural heritage information, but are often available only in PDF format, making systematic analysis and utilization difficult. heripo engine solves the following problems:
- OCR Quality: High accuracy recognition of scanned documents using Docling SDK
- Structure Extraction: Automatic identification of document structure including table of contents, chapters/sections, images, and tables
- Cost Efficiency: Cost savings through local processing instead of cloud OCR (free)
Beyond Archaeology: While heripo engine is optimized for archaeological reports, its PDF structuring capabilities (text, tables, images, TOC extraction) work well with heavily damaged scanned PDFs and documents from other domains (architecture, history, etc.). Feel free to fork and adapt it to your needs.
Raw Data Extraction → Archaeological Data Ledger → Archaeological Data Standard → Domain Ontology → DB Storage
| Stage | Description |
|---|---|
| Raw Data Extraction | Document data structurally extracted in the original format of PDF reports (no archaeological interpretation) |
| Data Ledger | Immutable ledger structured using a universal model covering global archaeology |
| Data Standard | Extensible standard model (base standard → country-specific → domain-specific extensions) |
| Ontology | Domain-specific semantic models and knowledge graphs |
| DB Storage | Independent storage and utilization for each pipeline stage |
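The stage ordering in the table above can be written down as a small typed list. This is only an illustrative sketch: `STAGES`, `Stage`, and `stagesAfter` are hypothetical names introduced here and are not part of any published heripo package.

```typescript
// Hypothetical model of the pipeline order from the table above.
const STAGES = [
  'Raw Data Extraction',
  'Archaeological Data Ledger',
  'Archaeological Data Standard',
  'Domain Ontology',
  'DB Storage',
] as const;

// Union of the five stage names.
type Stage = (typeof STAGES)[number];

// Returns every stage that comes after the given one, in pipeline order.
function stagesAfter(stage: Stage): Stage[] {
  return STAGES.slice(STAGES.indexOf(stage) + 1);
}

console.log(stagesAfter('Archaeological Data Ledger'));
```

Keeping the order in a single `as const` tuple means the `Stage` type and the runtime order cannot drift apart.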
Current Implementation (v0.1.x):
- ✅ PDF parsing and OCR (Docling SDK)
- ✅ Document structure extraction (TOC, chapters/sections, page mapping)
- ✅ Image/table extraction and caption parsing
Planned Stages:
- Immutable Ledger (universal archaeological model, concept extraction)
- Extensible Standardization (hierarchical standard model, normalization)
- Ontology (semantic model, knowledge graph)
- Production Ready (performance optimization, API stability)
For a detailed roadmap, see docs/roadmap.md.
- High-Quality OCR: Document recognition using Docling SDK (ocrmac / Apple Vision Framework)
- Mixed Script Auto-Detection & Correction: Automatically detects Korean-Hanja mixed pages and corrects them via VLM. ocrmac excels at speed and quality for large-scale processing but cannot handle mixed character systems, so only the affected pages are sent for VLM correction
- Apple Silicon Optimized: GPU acceleration on M1/M2/M3/M4/M5 chips
- Automatic Environment Setup: Automatic Python virtual environment and docling-serve installation
- Image Extraction: Automatic extraction and saving of images from PDFs
- TOC Extraction: Automatic TOC recognition with rule-based + LLM fallback
- Hierarchical Structure: Automatic generation of chapter/section/subsection hierarchy
- Page Mapping: Actual page number mapping using Vision LLM
- Caption Parsing: Automatic parsing of image and table captions
- LLM Flexibility: Support for various LLMs including OpenAI, Anthropic, Google
- ProcessedDocument: Intermediate data model optimized for LLM analysis
- DoclingDocument: Raw output format from Docling SDK
- Type Safety: Complete TypeScript type definitions
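The mixed-script item in the list above says that only Korean-Hanja mixed pages are sent to the VLM. One plausible way to pick such pages is a Unicode-range heuristic, sketched below. This is not necessarily how heripo engine detects them. The function names, the 5% threshold, and the assumption that some text source preserves Hanja code points (for example, an embedded PDF text layer) are all hypothetical.

```typescript
// Hangul syllables: U+AC00 to U+D7A3. CJK Unified Ideographs (Hanja): U+4E00 to U+9FFF.
const HANGUL = /[\uAC00-\uD7A3]/g;
const HANJA = /[\u4E00-\u9FFF]/g;

interface ScriptStats {
  hangul: number;
  hanja: number;
  hanjaRatio: number; // share of Hanja among Hangul + Hanja characters
}

// Counts Hangul and Hanja characters in a page's text.
function scriptStats(text: string): ScriptStats {
  const hangul = (text.match(HANGUL) ?? []).length;
  const hanja = (text.match(HANJA) ?? []).length;
  const total = hangul + hanja;
  return { hangul, hanja, hanjaRatio: total === 0 ? 0 : hanja / total };
}

// Flags a page when both scripts occur and Hanja reaches the (assumed) threshold.
function isMixedScriptPage(text: string, minHanjaRatio = 0.05): boolean {
  const { hangul, hanja, hanjaRatio } = scriptStats(text);
  return hangul > 0 && hanja > 0 && hanjaRatio >= minHanjaRatio;
}

// Select the indices of pages that would be routed to VLM correction.
const pages = ['발굴조사 보고서', '遺蹟의 層位는 다음과 같다', 'Plate 12'];
const pagesForVlm = pages
  .map((text, index) => (isMixedScriptPage(text) ? index : -1))
  .filter((index) => index >= 0);

console.log(pagesForVlm); // [1]
```

Filtering pages first keeps the fast OCR path for most of the document and limits the slower VLM calls to the pages that need them.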
heripo engine is organized as a pnpm workspace-based monorepo.
heripo-engine/
├── packages/                 # Core libraries
│   ├── pdf-parser/           # PDF → DoclingDocument
│   ├── document-processor/   # DoclingDocument → ProcessedDocument
│   ├── model/                # Data models and type definitions
│   └── shared/               # Internal utilities (not published)
├── apps/                     # Applications
│   └── demo-web/             # Next.js web demo
└── tools/                    # Build tool configurations
    ├── logger/               # Logging utility (not published)
    ├── tsconfig/             # Shared TypeScript config
    ├── tsup-config/          # Build config
    └── vitest-config/        # Test config
For detailed architecture explanation, see docs/architecture.md.
- macOS (Apple Silicon or Intel)
- Node.js >= 24.0.0
- pnpm >= 10.0.0
- Python 3.9 - 3.12 (⚠️ Python 3.13+ is not supported)
- jq (JSON processing tool)
- poppler (PDF text extraction tools)
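The version constraints above are easy to get wrong, especially the Python upper bound. The following is a minimal preflight sketch, not part of heripo engine. The helper names are hypothetical, and the functions expect a bare version string such as `3.11.4` or `v24.1.0`, not the full output of `python --version`.

```typescript
// Parses "3.11.4" or "v24.1.0" into [major, minor]; returns null when malformed.
function parseMajorMinor(version: string): [number, number] | null {
  const match = /^v?(\d+)\.(\d+)/.exec(version.trim());
  return match ? [Number(match[1]), Number(match[2])] : null;
}

// Python 3.9 through 3.12 inclusive; 3.13+ is rejected.
function isSupportedPython(version: string): boolean {
  const parsed = parseMajorMinor(version);
  if (parsed === null) return false;
  const [major, minor] = parsed;
  return major === 3 && minor >= 9 && minor <= 12;
}

// Node.js >= 24.0.0.
function isSupportedNode(version: string): boolean {
  const parsed = parseMajorMinor(version);
  return parsed !== null && parsed[0] >= 24;
}

console.log(isSupportedPython('3.11.4')); // true
console.log(isSupportedPython('3.13.0')); // false
console.log(isSupportedNode('v24.1.0')); // true
```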
# Install Python 3.11 (recommended)
brew install [email protected]
# Install jq
brew install jq
# Install poppler
brew install poppler
# Install Node.js and pnpm
brew install node
npm install -g pnpm
For a detailed installation guide, see @heripo/pdf-parser README.
# Install individual packages
pnpm add @heripo/pdf-parser
pnpm add @heripo/document-processor
pnpm add @heripo/model
# Or install all at once
pnpm add @heripo/pdf-parser @heripo/document-processor @heripo/model
| Package | Version | Description |
|---|---|---|
| @heripo/pdf-parser | 0.1.x | PDF parsing and OCR |
| @heripo/document-processor | 0.1.x | Document structure analysis and LLM processing |
| @heripo/model | 0.1.x | Data models and type definitions |
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { DocumentProcessor } from '@heripo/document-processor';
import { Logger } from '@heripo/logger';
import { PDFParser } from '@heripo/pdf-parser';

const logger = Logger(...);

// 1. PDF Parsing
const pdfParser = new PDFParser({
  port: 5001,
  logger,
});
await pdfParser.init();

const tokenUsageReport = await pdfParser.parse(
  'path/to/report.pdf',
  'report-001',
  async (outputPath) => {
    // 2. Document Processing (inside callback)
    const processor = new DocumentProcessor({
      logger,
      fallbackModel: anthropic('claude-opus-4-5'),
      pageRangeParserModel: openai('gpt-5.2'),
      tocExtractorModel: openai('gpt-5.1'),
      captionParserModel: openai('gpt-5-mini'),
      textCleanerBatchSize: 10,
      captionParserBatchSize: 5,
      captionValidatorBatchSize: 5,
    });

    // NOTE: `doclingDocument` is not defined in this snippet. It stands for the
    // DoclingDocument produced by the parser; how it is loaded is not shown here.
    const { document, usage } = await processor.process(
      doclingDocument,
      'report-001',
      outputPath,
    );

    // 3. Use Results
    console.log('TOC:', document.chapters);
    console.log('Images:', document.images);
    console.log('Tables:', document.tables);
    console.log('Footnotes:', document.footnotes);
    console.log('Token Usage:', usage.total);
  },
  true, // cleanupAfterCallback
  {}, // PDFConvertOptions
);

// Cleanup
await pdfParser.dispose();

// Specify LLM models per component + fallback retry
const processor = new DocumentProcessor({
  logger,
  fallbackModel: anthropic('claude-opus-4-5'), // For retry on failure
  pageRangeParserModel: openai('gpt-5.2'),
  tocExtractorModel: openai('gpt-5.1'),
  validatorModel: openai('gpt-5.2'),
  visionTocExtractorModel: openai('gpt-5-mini'),
  captionParserModel: openai('gpt-5-nano'),
  textCleanerBatchSize: 20,
  captionParserBatchSize: 10,
  captionValidatorBatchSize: 10,
  maxRetries: 3,
  maxValidationRetries: 3,
  enableFallbackRetry: true, // Automatically retry with fallbackModel on failure (default: false)
  onTokenUsage: (report) => console.log('Token usage:', report.total),
});

Try it without local installation:
https://engine-demo.heripo.com
The online demo has a daily usage limit (3 times). For full functionality, local execution is recommended.
A web application providing real-time PDF processing monitoring:
cd apps/demo-web
cp .env.example .env
# Set LLM API keys in .env file
pnpm install
pnpm dev
Then open http://localhost:3000 in your browser.
Key Features:
- PDF upload and processing option configuration
- Real-time processing status monitoring (SSE)
- Processing result visualization (TOC, images, tables)
- Job queue management
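The status monitoring listed above uses Server-Sent Events (SSE). As a rough illustration of what a client has to do with such a stream, the sketch below parses raw SSE text into events. The event name `progress` and the JSON payload are invented for the example and are not taken from the demo application. The parser also assumes it receives complete frames, so buffering across network chunks is left out.

```typescript
interface SseEvent {
  event: string; // defaults to "message" when no "event:" field is present
  data: string; // multiple "data:" lines are joined with newlines
}

// Parses text in the SSE wire format: fields per line, blank line between events.
function parseSse(chunk: string): SseEvent[] {
  const events: SseEvent[] = [];
  for (const block of chunk.split(/\r?\n\r?\n/)) {
    let event = 'message';
    const data: string[] = [];
    for (const line of block.split(/\r?\n/)) {
      if (line.startsWith('event:')) {
        event = line.slice('event:'.length).trim();
      } else if (line.startsWith('data:')) {
        // A single leading space after the colon is not part of the value.
        data.push(line.slice('data:'.length).replace(/^ /, ''));
      }
    }
    if (data.length > 0) events.push({ event, data: data.join('\n') });
  }
  return events;
}

// Hypothetical stream: one named progress event followed by a default message.
const sample = 'event: progress\ndata: {"step":"ocr","percent":40}\n\ndata: done\n\n';
for (const e of parseSse(sample)) {
  console.log(e.event, e.data);
}
```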
For detailed usage, see apps/demo-web/README.md.
- Architecture Document - System design and structure
- Roadmap - Development plans and vision
- Contributing Guide - How to contribute
- Security Policy - Vulnerability reporting procedure
- Code of Conduct - Community code of conduct
Current version: v0.1.x (Initial Release)
- ✅ PDF parsing with OCR
- ✅ Document structure extraction (TOC, chapters/sections)
- ✅ Image/table extraction
- ✅ Page mapping
- ✅ Caption parsing
- Universal data model design covering global archaeology
- Archaeological concept extraction (features, artifacts, strata, excavation units)
- LLM-based information extraction pipeline
- Hierarchical standard model design (base → country-specific → domain-specific)
- Normalization pipeline
- Data validation
- Domain-specific semantic models
- Knowledge graph construction
- Performance optimization
- API stability guarantee
- Comprehensive testing
For details, see docs/roadmap.md.
# Install dependencies
pnpm install
# Build all
pnpm build
# Type check
pnpm typecheck
# Lint
pnpm lint
pnpm lint:fix
# Format
pnpm format
pnpm format:check
# Run all tests
pnpm test
pnpm test:coverage
pnpm test:ci
# Test specific package
pnpm --filter @heripo/pdf-parser test
pnpm --filter @heripo/document-processor test
# Build specific package
pnpm --filter @heripo/pdf-parser build
# Test specific package (with coverage)
pnpm --filter @heripo/pdf-parser test:coverage
# Watch mode for specific package
pnpm --filter @heripo/pdf-parser dev
Thank you for contributing to the heripo engine project! For contribution guidelines, see CONTRIBUTING.md.
- Fork this repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'feat: add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Create a Pull Request
- All tests must pass (pnpm test)
- 100% code coverage must be maintained
- ESLint and Prettier rules must be followed
- Commit messages must follow Conventional Commits
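For the Conventional Commits requirement, a header check can be approximated with one regular expression. This is a simplified sketch, not the project's actual lint configuration. The list of allowed types is an assumption based on commonly used ones, and `isConventionalHeader` is a hypothetical helper.

```typescript
// Commonly used types (assumed); the Conventional Commits spec itself only defines feat and fix.
const TYPES = ['feat', 'fix', 'docs', 'style', 'refactor', 'perf', 'test', 'build', 'ci', 'chore', 'revert'];

// type, optional (scope), optional "!" for breaking changes, then ": " and a description.
const HEADER = new RegExp(`^(${TYPES.join('|')})(\\([\\w.-]+\\))?!?: \\S.*$`);

// Checks only the first line of the commit message.
function isConventionalHeader(message: string): boolean {
  const firstLine = message.split('\n')[0] ?? '';
  return HEADER.test(firstLine);
}

console.log(isConventionalHeader('feat: add amazing feature')); // true
console.log(isConventionalHeader('Add amazing feature')); // false
```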
- Issue Tracker: GitHub Issues
- Discussions: GitHub Discussions
- Security Vulnerabilities: See Security Policy
If you use this project in research, services, or derivative works, please include the following attribution:
Powered by heripo engine
Such attribution helps support the open-source project and gives credit to contributors.
For academic papers or research documents, you may use the following BibTeX entry:
@software{heripo_engine,
author = {Kim, Hongyeon and Cho, Hayoung and Kim, Gaeun},
title = {heripo engine: TypeScript Library for Extracting Structured Data from Archaeological Excavation Report PDFs},
year = {2026},
url = {https://github.com/heripo-lab/heripo-engine},
note = {Apache License 2.0}
}
This project is distributed under the Apache License 2.0.
This project uses the following open-source projects:
- Docling SDK - PDF parsing and OCR
- Vercel AI SDK - LLM integration
heripo lab | GitHub | heripo engine