LiteParse

LiteParse is a standalone OSS PDF parsing tool focused exclusively on fast and light parsing. It provides high-quality spatial text parsing with bounding boxes, without proprietary LLM features or cloud dependencies. Everything runs locally on your machine.

Hitting the limits of local parsing? For complex documents (dense tables, multi-column layouts, charts, handwritten text, or scanned PDFs), you'll get significantly better results with LlamaParse, our cloud-based document parser built for production document pipelines. LlamaParse handles the hard stuff so your models see clean, structured data and markdown.

👉 Sign up for LlamaParse free

Overview

  • Fast Text Parsing: Spatial text parsing using PDF.js
  • Flexible OCR System:
    • Built-in: Tesseract.js (zero setup, works out of the box!)
    • HTTP Servers: Plug in any OCR server (EasyOCR, PaddleOCR, custom)
    • Standard API: Simple, well-defined OCR API specification
  • Screenshot Generation: Generate high-quality page screenshots for LLM agents
  • Multiple Output Formats: JSON and Text
  • Bounding Boxes: Precise text positioning information
  • Standalone Binary: No cloud dependencies, runs entirely locally
  • Multi-platform: Linux, macOS (Intel/ARM), Windows

Installation

CLI Tool

Option 1: Global Install (Recommended)

Install globally via npm to use the lit command anywhere:

npm i -g @llamaindex/liteparse

Then use it:

lit parse document.pdf
lit screenshot document.pdf

On macOS and Linux, liteparse can also be installed via brew:

brew tap run-llama/liteparse
brew install llamaindex-liteparse

Option 2: Install from Source

You can clone the repo and install the CLI globally from source:

git clone https://github.com/run-llama/liteparse.git
cd liteparse
npm install
npm run build
npm pack
npm install -g ./liteparse-*.tgz

Agent Skill

You can use liteparse as an agent skill, downloading it with the skills CLI tool:

npx skills add run-llama/llamaparse-agent-skills --skill liteparse

Alternatively, copy the SKILL.md file into your own skills setup.

Usage

Parse Files

# Basic parsing
lit parse document.pdf

# Parse with a specific output format
lit parse document.pdf --format json -o output.json

# Parse specific pages
lit parse document.pdf --target-pages "1-5,10,15-20"

# Parse without OCR
lit parse document.pdf --no-ocr
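
The --target-pages value mixes single pages and inclusive ranges. As a rough illustration of how such a spec can be expanded into page numbers, here is a minimal sketch. The function name expandTargetPages is hypothetical and is not part of the LiteParse API; LiteParse's own parsing and validation rules may differ.

```typescript
// Hypothetical helper for illustration only; not part of the LiteParse API.
// Expands a page spec such as "1-5,10,15-20" into sorted, unique page numbers.
function expandTargetPages(spec: string): number[] {
  const pages = new Set<number>();
  for (const rawPart of spec.split(',')) {
    const part = rawPart.trim();
    if (part === '') continue; // tolerate stray commas
    // Either a single page ("10") or an inclusive range ("15-20").
    const match = /^(\d+)(?:-(\d+))?$/.exec(part);
    if (!match) throw new Error(`Invalid page spec segment: "${part}"`);
    const start = Number(match[1]);
    const end = match[2] !== undefined ? Number(match[2]) : start;
    // Assumption: pages are 1-based and ranges must be ascending.
    if (start < 1 || end < start) {
      throw new Error(`Invalid page range: "${part}"`);
    }
    for (let page = start; page <= end; page++) pages.add(page);
  }
  return Array.from(pages).sort((a, b) => a - b);
}

const expanded = expandTargetPages('1-5,10,15-20');
```

Here expanded holds 12 page numbers: 1 through 5, then 10, then 15 through 20.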

Batch Parsing

You can also parse an entire directory of documents:

lit batch-parse ./input-directory ./output-directory

Generate Screenshots

Screenshots let LLM agents extract visual information that text alone cannot capture.

# Screenshot all pages
lit screenshot document.pdf -o ./screenshots

# Screenshot specific pages
lit screenshot document.pdf --target-pages "1,3,5" -o ./screenshots

# Custom DPI
lit screenshot document.pdf --dpi 300 -o ./screenshots

# Screenshot page range
lit screenshot document.pdf --target-pages "1-10" -o ./screenshots
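
To get a feel for what --dpi means for image size, the sketch below converts a page size in PDF points to pixels. It assumes the renderer scales by dpi / 72 (PDF user space defaults to 72 units per inch) and rounds to whole pixels. The helper name pageSizeInPixels is hypothetical, and LiteParse's exact rounding may differ.

```typescript
// PDF user space defaults to 72 units ("points") per inch.
const POINTS_PER_INCH = 72;

// Hypothetical helper for illustration only; not part of the LiteParse API.
function pageSizeInPixels(
  widthPt: number,
  heightPt: number,
  dpi: number
): { width: number; height: number } {
  // Assumption: the renderer scales by dpi / 72 and rounds to whole pixels.
  return {
    width: Math.round((widthPt * dpi) / POINTS_PER_INCH),
    height: Math.round((heightPt * dpi) / POINTS_PER_INCH),
  };
}

// US Letter is 612 x 792 pt.
const atDefaultDpi = pageSizeInPixels(612, 792, 150); // { width: 1275, height: 1650 }
const at300Dpi = pageSizeInPixels(612, 792, 300); // { width: 2550, height: 3300 }
```

Doubling the DPI doubles each dimension, so the pixel count (and roughly the memory needed per page) grows fourfold.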

Library Usage

Install as a dependency in your project:

npm install @llamaindex/liteparse
# or
pnpm add @llamaindex/liteparse

import { LiteParse } from '@llamaindex/liteparse';

const parser = new LiteParse({ ocrEnabled: true });
const result = await parser.parse('document.pdf');
console.log(result.text);

Buffer / Uint8Array Input

You can pass raw bytes directly instead of a file path. PDF buffers are parsed with zero disk I/O — no temp files are written:

import { LiteParse } from '@llamaindex/liteparse';
import { readFile } from 'fs/promises';

const parser = new LiteParse();

// From a file read
const pdfBytes = await readFile('document.pdf');
const result = await parser.parse(pdfBytes);

// From an HTTP response
const response = await fetch('https://example.com/document.pdf');
const buffer = Buffer.from(await response.arrayBuffer());
const result2 = await parser.parse(buffer);

Non-PDF buffers (images, Office documents) are written to a temp directory for format conversion. Screenshots also work with buffer input:

const screenshots = await parser.screenshot(pdfBytes, [1, 2, 3]);
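
If you want to know in advance whether a buffer is likely to take the in-memory PDF path or the temp-directory conversion path, one option is to check for the PDF file header yourself. The sketch below is an assumption about how such a check could look, not a description of LiteParse's own detection logic, and looksLikePdf is a hypothetical name.

```typescript
// Hypothetical helper for illustration only; not part of the LiteParse API.
// A PDF file starts with the ASCII header "%PDF-".
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"

function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  // Simplification: only offset 0 is checked, so a file with leading
  // junk bytes before the header is reported as non-PDF.
  return PDF_MAGIC.every((value, index) => bytes[index] === value);
}

const pdfHeader = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]); // "%PDF-1.7"
const pngHeader = new Uint8Array([0x89, 0x50, 0x4e, 0x47]); // start of a PNG file

const pdfDetected = looksLikePdf(pdfHeader); // true
const pngDetected = looksLikePdf(pngHeader); // false
```

Because Buffer is a subclass of Uint8Array in Node.js, the same function accepts the result of readFile directly.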

CLI Options

Parse Command

$ lit parse --help
Usage: lit parse [options] <file>

Parse a document file (PDF, DOCX, XLSX, PPTX, images, etc.)

Options:
  -o, --output <file>     Output file path
  --format <format>       Output format: json|text (default: "text")
  --ocr-server-url <url>  HTTP OCR server URL (uses Tesseract if not provided)
  --no-ocr                Disable OCR
  --ocr-language <lang>   OCR language(s) (default: "en")
  --num-workers <n>       Number of pages to OCR in parallel (default: CPU cores - 1)
  --max-pages <n>         Max pages to parse (default: "10000")
  --target-pages <pages>  Target pages (e.g., "1-5,10,15-20")
  --dpi <dpi>             DPI for rendering (default: "150")
  --no-precise-bbox       Disable precise bounding boxes
  --preserve-small-text   Preserve very small text
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

Batch Parse Command

$ lit batch-parse --help
Usage: lit batch-parse [options] <input-dir> <output-dir>

Parse multiple documents in batch mode (reuses PDF engine for efficiency)

Options:
  --format <format>       Output format: json|text (default: "text")
  --ocr-server-url <url>  HTTP OCR server URL (uses Tesseract if not provided)
  --no-ocr                Disable OCR
  --ocr-language <lang>   OCR language(s) (default: "en")
  --num-workers <n>       Number of pages to OCR in parallel (default: CPU cores - 1)
  --max-pages <n>         Max pages to parse per file (default: "10000")
  --dpi <dpi>             DPI for rendering (default: "150")
  --no-precise-bbox       Disable precise bounding boxes
  --recursive             Recursively search input directory
  --extension <ext>       Only process files with this extension (e.g., ".pdf")
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

Screenshot Command

$ lit screenshot --help
Usage: lit screenshot [options] <file>

Generate screenshots of PDF pages

Options:
  -o, --output-dir <dir>  Output directory for screenshots (default: "./screenshots")
  --target-pages <pages>  Page numbers to screenshot (e.g., "1,3,5" or "1-5")
  --dpi <dpi>             DPI for rendering (default: "150")
  --format <format>       Image format: png|jpg (default: "png")
  --config <file>         Config file (JSON)
  -q, --quiet             Suppress progress output
  -h, --help              display help for command

OCR Setup

Default: Tesseract.js

# Tesseract is enabled by default
lit parse document.pdf

# Specify language
lit parse document.pdf --ocr-language fra

# Disable OCR
lit parse document.pdf --no-ocr

By default, Tesseract.js downloads language data from the internet on first use. For offline or air-gapped environments, set the TESSDATA_PREFIX environment variable to a directory containing pre-downloaded .traineddata files:

export TESSDATA_PREFIX=/path/to/tessdata
lit parse document.pdf --ocr-language eng

You can also pass tessdataPath in the library config:

const parser = new LiteParse({ tessdataPath: '/path/to/tessdata' });

Optional: HTTP OCR Servers

For higher accuracy or better performance, you can use an HTTP OCR server. We provide ready-to-use example wrappers for EasyOCR and PaddleOCR in ocr/easyocr/ and ocr/paddleocr/.

You can integrate any OCR service by implementing the simple LiteParse OCR API specification (see OCR_API_SPEC.md).

The API requires:

  • POST /ocr endpoint
  • Accepts file and language parameters
  • Returns JSON: { results: [{ text, bbox: [x1,y1,x2,y2], confidence }] }

See the example servers in ocr/easyocr/ and ocr/paddleocr/ as templates.

For the complete OCR API specification, see OCR_API_SPEC.md.
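
When writing a client or a custom server, it can help to validate the response shape listed above. The following is a minimal sketch of a type guard for that JSON. The type and function names (OcrResult, OcrResponse, isOcrResponse) are ours, not names from LiteParse, and OCR_API_SPEC.md remains the authority on required fields and value ranges.

```typescript
// Shape taken from the response described above; the type names are hypothetical.
interface OcrResult {
  text: string;
  bbox: [number, number, number, number]; // [x1, y1, x2, y2]
  confidence: number;
}

interface OcrResponse {
  results: OcrResult[];
}

function isOcrResult(value: unknown): value is OcrResult {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const bbox = candidate['bbox'];
  return (
    typeof candidate['text'] === 'string' &&
    typeof candidate['confidence'] === 'number' &&
    Array.isArray(bbox) &&
    bbox.length === 4 &&
    bbox.every((n) => typeof n === 'number' && Number.isFinite(n))
  );
}

// Returns true only if the payload has a `results` array of well-formed entries.
function isOcrResponse(value: unknown): value is OcrResponse {
  if (typeof value !== 'object' || value === null) return false;
  const results = (value as Record<string, unknown>)['results'];
  return Array.isArray(results) && results.every(isOcrResult);
}

// Made-up sample payload for illustration.
const samplePayload: unknown = JSON.parse(
  '{"results":[{"text":"Hello","bbox":[10,20,110,40],"confidence":0.98}]}'
);
const sampleIsValid = isOcrResponse(samplePayload); // true
```

Rejecting malformed responses at the boundary keeps bounding-box arithmetic downstream free of undefined or NaN coordinates.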

Multi-Format Input Support

LiteParse can automatically convert various document formats to PDF before parsing, so it is not limited to PDF input.

Supported Input Formats

Office Documents (via LibreOffice)

  • Word: .doc, .docx, .docm, .odt, .rtf
  • PowerPoint: .ppt, .pptx, .pptm, .odp
  • Spreadsheets: .xls, .xlsx, .xlsm, .ods, .csv, .tsv

Just install the dependency and LiteParse will automatically convert these formats to PDF for parsing:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
apt-get install libreoffice

# Windows
choco install libreoffice-fresh # might require admin permissions

On Windows, you might need to add the directory containing the LibreOffice CLI executable (generally C:\Program Files\LibreOffice\program) to the PATH environment variable and restart the machine.

Images (via ImageMagick)

  • Formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .webp, .svg

Just install ImageMagick and LiteParse will convert images to PDF for parsing (with OCR):

# macOS
brew install imagemagick

# Ubuntu/Debian
apt-get install imagemagick

# Windows
choco install imagemagick.app # might require admin permissions

Environment Variables

Variable | Description
TESSDATA_PREFIX | Path to a directory containing Tesseract .traineddata files. Used for offline/air-gapped environments where Tesseract.js cannot download language data from the internet.
LITEPARSE_TMPDIR | Override the temp directory used for format conversion and intermediate files. Defaults to the OS temp directory (os.tmpdir()). Useful in containerized or read-only filesystem environments.

Configuration

You can configure parsing options via CLI flags or a JSON config file. The config file allows you to set sensible defaults and override as needed.

Config File Example

Create a liteparse.config.json file:

{
  "ocrLanguage": "en",
  "ocrEnabled": true,
  "maxPages": 1000,
  "dpi": 150,
  "outputFormat": "json",
  "preciseBoundingBox": true,
  "preserveVerySmallText": false
}

For HTTP OCR servers, just add ocrServerUrl:

{
  "ocrServerUrl": "http://localhost:8828/ocr",
  "ocrLanguage": "en",
  "outputFormat": "json"
}

Use with:

lit parse document.pdf --config liteparse.config.json
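
One way to reason about how defaults, a config file, and CLI flags combine is as a layered merge. The sketch below assumes the precedence is built-in defaults, then the config file, then CLI flags; that ordering is our assumption and is not stated in this README. Field names come from the config example, default values from the lit parse --help output, and ParseConfig and resolveConfig are hypothetical names.

```typescript
// Field names mirror the config file example; the interface itself is hypothetical.
interface ParseConfig {
  ocrLanguage: string;
  ocrEnabled: boolean;
  ocrServerUrl?: string;
  maxPages: number;
  dpi: number;
  outputFormat: 'json' | 'text';
  preciseBoundingBox: boolean;
  preserveVerySmallText: boolean;
}

// Defaults as shown in the `lit parse --help` output.
const DEFAULT_CONFIG: ParseConfig = {
  ocrLanguage: 'en',
  ocrEnabled: true,
  maxPages: 10000,
  dpi: 150,
  outputFormat: 'text',
  preciseBoundingBox: true,
  preserveVerySmallText: false,
};

// Assumed precedence: defaults < config file < CLI flags (later spreads win).
// Note: a key explicitly set to undefined would still override an earlier layer.
function resolveConfig(
  fileConfig: Partial<ParseConfig>,
  cliFlags: Partial<ParseConfig>
): ParseConfig {
  return { ...DEFAULT_CONFIG, ...fileConfig, ...cliFlags };
}

const resolved = resolveConfig(
  { dpi: 200, outputFormat: 'json', maxPages: 1000 }, // from the config file
  { dpi: 300 } // from the command line
);
// resolved.dpi === 300, resolved.outputFormat === 'json', resolved.maxPages === 1000
```

Keys that neither layer sets, such as ocrLanguage here, fall through to the defaults.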

Development

We provide a detailed AGENTS.md/CLAUDE.md, which we recommend using when developing with coding agents.

# Install dependencies
npm install

# Build TypeScript (Linux/macOS)
npm run build

# Build TypeScript (Windows)
npm run build:windows

# Watch mode
npm run dev

# Test parsing
npm test

License

Apache 2.0

Credits

Built on top of: