heripo engine

TypeScript library for extracting structured data from archaeological excavation report PDFs


English | 한국어

⚠️ macOS Only: This project currently supports only macOS (Apple Silicon or Intel). See @heripo/pdf-parser README for detailed system requirements.

ℹ️ Notes (v0.1.x):

  • Mixed Script Detection: Korean-Hanja mixed documents are automatically detected and corrected via VLM (Vision Language Model)
  • TOC Dependency: Reports without a TOC fail to process (by design); rare extraction failures will be addressed through human intervention
  • Vertical Text: Support for older vertical-text documents with Chinese-numeral page numbers is a long-term goal and is not currently scheduled

🌐 Online Demo: Try it without local installation → engine-demo.heripo.com


Introduction

heripo engine is a collection of tools for analyzing archaeological excavation report PDFs and extracting structured data. It is designed to effectively process documents that span hundreds of pages and contain complex layouts, tables, diagrams, and photographs.

About heripo lab

heripo lab is an open-source R&D group that combines archaeological domain knowledge with software engineering expertise to make archaeological research more efficient in practice.

Kim, Hongyeon (Lead Engineer)

  • Role: Design of LLM-based unstructured data extraction pipeline and system implementation
  • Background: Software Engineer (B.S. in Computer Science and B.A. in Archaeology)

Cho, Hayoung (Domain Researcher)

Kim, Gaeun (Software Engineer)

  • Role: Development of archaeology research platforms
  • Background:
    • Software Engineer
    • M.A. in Archaeology (Coursework Completed)
    • B.A. in Archaeology
    • B.A. in Library and Information Science

Why heripo engine?

Archaeological excavation reports contain valuable cultural heritage information, but are often available only in PDF format, making systematic analysis and utilization difficult. heripo engine solves the following problems:

  • OCR Quality: High accuracy recognition of scanned documents using Docling SDK
  • Structure Extraction: Automatic identification of document structure including table of contents, chapters/sections, images, and tables
  • Cost Efficiency: Free local processing instead of paid cloud OCR

Beyond Archaeology: While heripo engine is optimized for archaeological reports, its PDF structuring capabilities (text, tables, images, TOC extraction) work well with heavily damaged scanned PDFs and documents from other domains (architecture, history, etc.). Feel free to fork and adapt it to your needs.

Data Pipeline

Raw Data Extraction → Archaeological Data Ledger → Archaeological Data Standard → Domain Ontology → DB Storage

| Stage | Description |
| --- | --- |
| Raw Data Extraction | Document data structurally extracted in the original format of PDF reports (no archaeological interpretation) |
| Data Ledger | Immutable ledger structured using a universal model covering global archaeology |
| Data Standard | Extensible standard model (base standard → country-specific → domain-specific extensions) |
| Ontology | Domain-specific semantic models and knowledge graphs |
| DB Storage | Independent storage and utilization for each pipeline stage |
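The pipeline stages described above can be sketched as a simple TypeScript union. This is purely illustrative; the stage names and the `nextStage` helper are assumptions for exposition, not part of the actual @heripo/model API.

```typescript
// Illustrative sketch of the pipeline stages (hypothetical names, not the real API).
type PipelineStage =
  | 'raw-extraction' // PDF → structurally extracted document data
  | 'data-ledger'    // immutable, universal archaeological model
  | 'data-standard'  // extensible standard model (base → country → domain)
  | 'ontology'       // semantic models and knowledge graphs
  | 'db-storage';    // independent storage per stage

// Each stage consumes the output of the previous one.
const order: PipelineStage[] = [
  'raw-extraction',
  'data-ledger',
  'data-standard',
  'ontology',
  'db-storage',
];

function nextStage(stage: PipelineStage): PipelineStage | undefined {
  return order[order.indexOf(stage) + 1];
}

console.log(nextStage('raw-extraction')); // 'data-ledger'
```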

Current Implementation (v0.1.x):

  • ✅ PDF parsing and OCR (Docling SDK)
  • ✅ Document structure extraction (TOC, chapters/sections, page mapping)
  • ✅ Image/table extraction and caption parsing

Planned Stages:

  • 🔜 Immutable Ledger (universal archaeological model, concept extraction)
  • 🔜 Extensible Standardization (hierarchical standard model, normalization)
  • 🔜 Ontology (semantic model, knowledge graph)
  • 🔜 Production Ready (performance optimization, API stability)

For a detailed roadmap, see docs/roadmap.md.

Key Features

PDF Parsing (@heripo/pdf-parser)

  • High-Quality OCR: Document recognition using Docling SDK (ocrmac / Apple Vision Framework)
  • Mixed Script Auto-Detection & Correction: Automatically detects Korean-Hanja mixed pages and corrects them via VLM. ocrmac offers the best speed and quality for large-scale processing but cannot handle mixed character systems, so only the affected pages are routed to the VLM for correction
  • Apple Silicon Optimized: GPU acceleration on M1/M2/M3/M4/M5 chips
  • Automatic Environment Setup: Automatic Python virtual environment and docling-serve installation
  • Image Extraction: Automatic extraction and saving of images from PDFs
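The mixed-script detection step can be illustrated with a minimal sketch. The `isMixedScript` helper below is hypothetical and not the actual @heripo/pdf-parser implementation; it only shows the core idea of flagging pages that contain both Hangul and Hanja so that only those pages are sent to the slower VLM pass.

```typescript
// Hypothetical helper: flag pages whose OCR text mixes Hangul and Hanja.
const HANGUL = /[\uAC00-\uD7A3]/; // precomposed Hangul syllables
const HANJA = /[\u4E00-\u9FFF]/;  // CJK Unified Ideographs

function isMixedScript(pageText: string): boolean {
  // A page qualifies for VLM correction only if both scripts appear.
  return HANGUL.test(pageText) && HANJA.test(pageText);
}

isMixedScript('제1장 유구 遺構 개요'); // true: Hangul and Hanja on one page
isMixedScript('제1장 유구 개요');      // false: Hangul only
```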

Document Processing (@heripo/document-processor)

  • TOC Extraction: Automatic TOC recognition with rule-based + LLM fallback
  • Hierarchical Structure: Automatic generation of chapter/section/subsection hierarchy
  • Page Mapping: Actual page number mapping using Vision LLM
  • Caption Parsing: Automatic parsing of image and table captions
  • LLM Flexibility: Support for various LLMs including OpenAI, Anthropic, Google
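The rule-based side of TOC extraction can be sketched as a pattern match on "title … page number" lines. The regex below is a hypothetical illustration; the actual extractor's rules and its LLM fallback are more involved.

```typescript
// Hypothetical rule: a TOC line is a title, dot leaders or spaces, and a page number.
const TOC_LINE = /^(?<title>.+?)[\s.·…]{2,}(?<page>\d{1,4})$/;

function parseTocLine(line: string): { title: string; page: number } | null {
  const m = TOC_LINE.exec(line.trim());
  if (!m?.groups) return null;
  return { title: m.groups.title.trim(), page: Number(m.groups.page) };
}

parseTocLine('II. Excavation Results ............ 45');
// → { title: 'II. Excavation Results', page: 45 }
parseTocLine('Figure captions follow each plate.'); // → null (no page number)
```

Lines that match no rule would then be handed to the LLM fallback.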

Data Models (@heripo/model)

  • ProcessedDocument: Intermediate data model optimized for LLM analysis
  • DoclingDocument: Raw output format from Docling SDK
  • Type Safety: Complete TypeScript type definitions
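For a rough sense of what these models expose, here is a minimal sketch. The top-level field names (chapters, images, tables, footnotes) appear in the usage example in this README, but the element shapes below are simplified assumptions, not the real @heripo/model definitions.

```typescript
// Simplified sketch of the processed-document shape (assumed element types;
// see @heripo/model for the real definitions).
interface Chapter { title: string; page: number; children?: Chapter[] }

interface ProcessedDocumentSketch {
  chapters: Chapter[];
  images: { caption?: string; path: string }[];
  tables: { caption?: string }[];
  footnotes: string[];
}

// Walk the chapter hierarchy and count every section at any depth.
function countSections(chapters: Chapter[]): number {
  return chapters.reduce((n, c) => n + 1 + countSections(c.children ?? []), 0);
}

const doc: ProcessedDocumentSketch = {
  chapters: [
    { title: 'I. Overview', page: 1, children: [{ title: '1. Location', page: 3 }] },
  ],
  images: [],
  tables: [],
  footnotes: [],
};
countSections(doc.chapters); // → 2
```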

Architecture

heripo engine is organized as a pnpm workspace-based monorepo.

```
heripo-engine/
├── packages/               # Core libraries
│   ├── pdf-parser/         # PDF → DoclingDocument
│   ├── document-processor/ # DoclingDocument → ProcessedDocument
│   ├── model/              # Data models and type definitions
│   └── shared/             # Internal utilities (not published)
├── apps/                   # Applications
│   └── demo-web/           # Next.js web demo
└── tools/                  # Build tool configurations
    ├── logger/             # Logging utility (not published)
    ├── tsconfig/           # Shared TypeScript config
    ├── tsup-config/        # Build config
    └── vitest-config/      # Test config
```

For detailed architecture explanation, see docs/architecture.md.

Installation

System Requirements

  • macOS (Apple Silicon or Intel)
  • Node.js >= 24.0.0
  • pnpm >= 10.0.0
  • Python 3.9 - 3.12 (⚠️ Python 3.13+ is not supported)
  • jq (JSON processing tool)
  • poppler (PDF text extraction tools)

```bash
# Install Python 3.11 (recommended)
brew install python@3.11

# Install jq
brew install jq

# Install poppler
brew install poppler

# Install Node.js and pnpm
brew install node
npm install -g pnpm
```

For detailed installation guide, see @heripo/pdf-parser README.

Package Installation

```bash
# Install individual packages
pnpm add @heripo/pdf-parser
pnpm add @heripo/document-processor
pnpm add @heripo/model

# Or install all at once
pnpm add @heripo/pdf-parser @heripo/document-processor @heripo/model
```

Packages

| Package | Version | Description |
| --- | --- | --- |
| @heripo/pdf-parser | 0.1.x | PDF parsing and OCR |
| @heripo/document-processor | 0.1.x | Document structure analysis and LLM processing |
| @heripo/model | 0.1.x | Data models and type definitions |

Usage Examples

Basic Usage

```typescript
import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';
import { DocumentProcessor } from '@heripo/document-processor';
import { Logger } from '@heripo/logger';
import { PDFParser } from '@heripo/pdf-parser';

const logger = Logger(...);

// 1. PDF Parsing
const pdfParser = new PDFParser({
  port: 5001,
  logger,
});

await pdfParser.init();

const tokenUsageReport = await pdfParser.parse(
  'path/to/report.pdf',
  'report-001',
  async (outputPath) => {
    // 2. Document Processing (inside callback)
    const processor = new DocumentProcessor({
      logger,
      fallbackModel: anthropic('claude-opus-4-5'),
      pageRangeParserModel: openai('gpt-5.2'),
      tocExtractorModel: openai('gpt-5.1'),
      captionParserModel: openai('gpt-5-mini'),
      textCleanerBatchSize: 10,
      captionParserBatchSize: 5,
      captionValidatorBatchSize: 5,
    });

    // `doclingDocument` is the DoclingDocument produced by the parser in step 1
    const { document, usage } = await processor.process(
      doclingDocument,
      'report-001',
      outputPath,
    );

    // 3. Use Results
    console.log('TOC:', document.chapters);
    console.log('Images:', document.images);
    console.log('Tables:', document.tables);
    console.log('Footnotes:', document.footnotes);
    console.log('Token Usage:', usage.total);
  },
  true, // cleanupAfterCallback
  {}, // PDFConvertOptions
);

// Cleanup
await pdfParser.dispose();
```

Advanced Usage

```typescript
// Specify LLM models per component + fallback retry
const processor = new DocumentProcessor({
  logger,
  fallbackModel: anthropic('claude-opus-4-5'), // For retry on failure
  pageRangeParserModel: openai('gpt-5.2'),
  tocExtractorModel: openai('gpt-5.1'),
  validatorModel: openai('gpt-5.2'),
  visionTocExtractorModel: openai('gpt-5-mini'),
  captionParserModel: openai('gpt-5-nano'),
  textCleanerBatchSize: 20,
  captionParserBatchSize: 10,
  captionValidatorBatchSize: 10,
  maxRetries: 3,
  maxValidationRetries: 3,
  enableFallbackRetry: true, // Automatically retry with fallbackModel on failure (default: false)
  onTokenUsage: (report) => console.log('Token usage:', report.total),
});
```

Demo Application

Online Demo

Try it without local installation:

🔗 https://engine-demo.heripo.com

The online demo has a daily usage limit (3 times). For full functionality, local execution is recommended.

Web Demo (Next.js)

A web application providing real-time PDF processing monitoring:

```bash
cd apps/demo-web
cp .env.example .env
# Set LLM API keys in .env file

pnpm install
pnpm dev
```

Open http://localhost:3000 in your browser.

Key Features:

  • PDF upload and processing option configuration
  • Real-time processing status monitoring (SSE)
  • Processing result visualization (TOC, images, tables)
  • Job queue management

For detailed usage, see apps/demo-web/README.md.

Documentation

Package Documentation

Roadmap

Current version: v0.1.x (Initial Release)

v0.1.x - Raw Data Extraction (Current)

  • ✅ PDF parsing with OCR
  • ✅ Document structure extraction (TOC, chapters/sections)
  • ✅ Image/table extraction
  • ✅ Page mapping
  • ✅ Caption parsing

v0.2.x - Immutable Ledger

  • Universal data model design covering global archaeology
  • Archaeological concept extraction (features, artifacts, strata, excavation units)
  • LLM-based information extraction pipeline

v0.3.x - Extensible Standardization

  • Hierarchical standard model design (base → country-specific → domain-specific)
  • Normalization pipeline
  • Data validation

v0.4.x - Ontology

  • Domain-specific semantic models
  • Knowledge graph construction

v1.0.x - Production Ready

  • Performance optimization
  • API stability guarantee
  • Comprehensive testing

For details, see docs/roadmap.md.

Development

Monorepo Commands

```bash
# Install dependencies
pnpm install

# Build all
pnpm build

# Type check
pnpm typecheck

# Lint
pnpm lint
pnpm lint:fix

# Format
pnpm format
pnpm format:check

# Run all tests
pnpm test
pnpm test:coverage
pnpm test:ci

# Test specific package
pnpm --filter @heripo/pdf-parser test
pnpm --filter @heripo/document-processor test
```

Package-Specific Commands

```bash
# Build specific package
pnpm --filter @heripo/pdf-parser build

# Test specific package (with coverage)
pnpm --filter @heripo/pdf-parser test:coverage

# Watch mode for specific package
pnpm --filter @heripo/pdf-parser dev
```

Contributing

Thank you for contributing to the heripo engine project! For contribution guidelines, see CONTRIBUTING.md.

How to Contribute

  1. Fork this repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Create a Pull Request

Development Guidelines

  • All tests must pass (pnpm test)
  • 100% code coverage must be maintained
  • ESLint and Prettier rules must be followed
  • Commit messages must follow Conventional Commits

Community

Citation and Attribution

If you use this project in research, services, or derivative works, please include the following attribution:

Powered by heripo engine

Such attribution helps support the open-source project and gives credit to contributors.

BibTeX Citation

For academic papers or research documents, you may use the following BibTeX entry:

```bibtex
@software{heripo_engine,
  author = {Kim, Hongyeon and Cho, Hayoung and Kim, Gaeun},
  title = {heripo engine: TypeScript Library for Extracting Structured Data from Archaeological Excavation Report PDFs},
  year = {2026},
  url = {https://github.com/heripo-lab/heripo-engine},
  note = {Apache License 2.0}
}
```

License

This project is distributed under the Apache License 2.0.

Acknowledgments

This project uses the following open-source projects:


heripo lab | GitHub | heripo engine