Zero-dependency document-to-markdown conversion for Node.js.
Convert HTML, plain text, and markdown documents into clean, structure-preserving markdown suitable for RAG (Retrieval-Augmented Generation) pipelines, knowledge base construction, and LLM ingestion. Accepts string or Buffer input, auto-detects the format, routes to the appropriate converter, extracts metadata and image references, and returns a typed ConversionResult. No external services, no Python runtime, no network calls -- everything runs locally in Node.js.
npm install docling-node-ts
Requires Node.js 18 or later.
import { convert } from 'docling-node-ts';
// Convert HTML to markdown
const result = convert('<h1>Quarterly Report</h1><p>Revenue grew <strong>15%</strong> year-over-year.</p>');
console.log(result.markdown);
// # Quarterly Report
//
// Revenue grew **15%** year-over-year.
console.log(result.metadata);
// { wordCount: 5, headingCount: 1, imageCount: 0, readingTimeMinutes: 1 }
console.log(result.durationMs);
// 2
// Convert a Buffer with auto-detection
import { readFileSync } from 'fs';
const buf = readFileSync('report.html');
const { markdown, metadata, images, warnings } = convert(buf);
- HTML to Markdown -- Converts headings (h1-h6), paragraphs, bold, italic, strikethrough, inline code, links, images, ordered and unordered lists (including nested), GFM pipe tables, fenced code blocks with language hints, blockquotes, horizontal rules, <figure>/<figcaption>, and <sup>/<sub> elements.
- Plain Text to Markdown -- Detects setext-style headings (underlined with === or ---), ALL CAPS headings, unordered and ordered lists, and paragraph breaks. Normalizes list markers and line endings.
- Markdown Normalization -- Cleans and normalizes existing markdown: collapses excessive blank lines, standardizes list markers to -, normalizes heading levels to eliminate gaps, fixes broken links with empty hrefs, and ensures consistent spacing around headings.
- Format Auto-Detection -- Detects the input format automatically using file extension, magic bytes (for Buffer inputs), and content analysis (HTML tags, markdown patterns). Supports explicit format override via options.
- Metadata Extraction -- Returns word count, heading count, image count, and estimated reading time. For HTML inputs, extracts title, author, and date from <title>, <meta>, and Open Graph tags.
- Image Reference Extraction -- Collects all image references from HTML with their id, alt text, and src path. Can be disabled with extractImages: false.
- Binary Format Guidance -- Detects PDF, DOCX, and PPTX inputs (via magic bytes or extension) and returns informative messages with suggested packages (pdfjs-dist, mammoth, jszip) and code examples. No binary parsers are bundled to keep the dependency tree at zero.
- HTML Sanitization -- Strips <script>, <style>, <noscript>, <iframe>, <svg>, <canvas>, <nav>, <footer>, <header>, and <aside> elements. Decodes HTML entities including numeric and hex character references.
- Zero Dependencies -- No runtime dependencies. Only devDependencies for building and testing.
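The entity decoding mentioned under HTML Sanitization can be pictured with a small sketch. This is only an illustration of decoding named, decimal, and hex character references; the function name decodeEntities and the short named-entity table are hypothetical and are not taken from the library, whose implementation and entity coverage may differ.

```typescript
// Hypothetical sketch: decode a handful of named entities plus decimal and
// hex character references. Not the library's actual implementation.
const NAMED_ENTITIES = new Map<string, string>([
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
  ['nbsp', '\u00a0'],
]);

function decodeEntities(text: string): string {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z]+);/gi, (whole, body: string) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      // Leave out-of-range references untouched rather than throwing.
      return code <= 0x10ffff ? String.fromCodePoint(code) : whole;
    }
    // Unknown named entities are passed through unchanged.
    return NAMED_ENTITIES.get(body) ?? whole;
  });
}
```

Passing unknown or out-of-range references through unchanged keeps the sketch lossless on input it does not understand.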
The primary conversion function. Accepts a string or Buffer, auto-detects the format (or uses the explicit format from options), converts to markdown, and returns a ConversionResult.
function convert(input: string | Buffer, options?: ConvertOptions): ConversionResult;
Parameters:
| Parameter | Type | Description |
|---|---|---|
| input | string \| Buffer | The document content to convert |
| options | ConvertOptions | Optional conversion settings |
Returns: ConversionResult
import { convert } from 'docling-node-ts';
const result = convert('<table><thead><tr><th>Name</th><th>Age</th></tr></thead><tbody><tr><td>Alice</td><td>30</td></tr></tbody></table>');
console.log(result.markdown);
// | Name | Age |
// | --- | --- |
// | Alice | 30 |
Convenience function that converts HTML to markdown. Equivalent to calling convert(html, { format: 'html' }).
function convertHtml(html: string): ConversionResult;
import { convertHtml } from 'docling-node-ts';
const { markdown } = convertHtml('<ul><li>First</li><li>Second</li></ul>');
// - First
// - Second
Cleans and normalizes existing markdown. Standardizes list markers, normalizes heading levels, collapses blank lines, replaces broken links (empty hrefs) with their plain link text, and ensures consistent formatting. Equivalent to calling convert(md, { format: 'markdown' }).
function convertMarkdown(md: string): ConversionResult;
import { convertMarkdown } from 'docling-node-ts';
const { markdown } = convertMarkdown('# Title\n\n\n\n\n#### Skipped Level\n\n* Item');
// # Title
//
// ## Skipped Level
//
// - Item
Converts plain text to markdown. Detects headings, lists, and paragraph structure. Equivalent to calling convert(text, { format: 'text' }).
function convertText(text: string): ConversionResult;
import { convertText } from 'docling-node-ts';
const { markdown } = convertText('INTRODUCTION\n\nSome body text.\n\n1) First step\n2) Second step');
// ## INTRODUCTION
//
// Some body text.
//
// 1. First step
// 2. Second step
Detects the format of a document from its content or file name.
Detection priority:
- File extension from fileName (.pdf, .docx, .pptx, .html, .htm, .xhtml, .txt, .md, .markdown)
- Magic bytes for Buffer inputs (%PDF for PDF, PK\x03\x04 for ZIP-based Office formats)
- Content analysis (HTML tags, markdown patterns)
- Default: 'text'
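The priority order above can be sketched roughly as follows. This is an illustrative approximation, not the library's code: sketchDetectFormat, EXTENSION_MAP, and the content-analysis regular expressions are all assumptions, and the sketch reports any ZIP signature as 'docx' because telling DOCX from PPTX would require inspecting the archive.

```typescript
// Hypothetical sketch of the detection order: extension, magic bytes,
// content analysis, then the 'text' default.
type SketchFormat = 'html' | 'markdown' | 'text' | 'pdf' | 'docx' | 'pptx';

const EXTENSION_MAP: Record<string, SketchFormat> = {
  '.pdf': 'pdf', '.docx': 'docx', '.pptx': 'pptx',
  '.html': 'html', '.htm': 'html', '.xhtml': 'html',
  '.txt': 'text', '.md': 'markdown', '.markdown': 'markdown',
};

function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  return magic.every((value, i) => bytes[i] === value);
}

function sketchDetectFormat(input: string | Uint8Array, fileName?: string): SketchFormat {
  // 1. File extension, when a file name is supplied.
  if (fileName) {
    const dot = fileName.lastIndexOf('.');
    const byExtension = dot === -1 ? undefined : EXTENSION_MAP[fileName.slice(dot).toLowerCase()];
    if (byExtension) return byExtension;
  }

  // 2. Magic bytes, for binary input only (a Node.js Buffer is a Uint8Array).
  let text: string;
  if (typeof input === 'string') {
    text = input;
  } else {
    if (startsWith(input, [0x25, 0x50, 0x44, 0x46])) return 'pdf'; // "%PDF"
    // "PK\x03\x04" is the ZIP signature; assumed to be DOCX in this sketch.
    if (startsWith(input, [0x50, 0x4b, 0x03, 0x04])) return 'docx';
    text = Array.from(input.subarray(0, 1024), (b) => String.fromCharCode(b)).join('');
  }

  // 3. Content analysis with deliberately simple, assumed patterns.
  if (/<(html|body|div|p|h[1-6]|table)\b[^>]*>/i.test(text)) return 'html';
  if (/^#{1,6}\s+\S/m.test(text) || /\[[^\]]+\]\([^)]*\)/.test(text)) return 'markdown';

  // 4. Default.
  return 'text';
}
```

Checking the extension first means a file name hint overrides whatever the content looks like, which matches the stated priority.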
function detectFormat(input: string | Buffer, fileName?: string): InputFormat;
import { detectFormat } from 'docling-node-ts';
detectFormat('', 'report.pdf'); // 'pdf'
detectFormat('<html><body>Hi</body></html>'); // 'html'
detectFormat('# Title\n\n## Section'); // 'markdown'
detectFormat('Just plain text.'); // 'text'
const pdfBuffer = Buffer.from('%PDF-1.4 ...');
detectFormat(pdfBuffer); // 'pdf'
Extracts metadata from a markdown string. Computes word count, heading count, image count, and estimated reading time.
function extractMetadata(markdown: string): Pick<
DocumentMetadata,
'wordCount' | 'headingCount' | 'imageCount' | 'readingTimeMinutes'
>;
import { extractMetadata } from 'docling-node-ts';
const meta = extractMetadata('# Title\n\nSome **bold** text with .\n');
// { wordCount: 4, headingCount: 1, imageCount: 1, readingTimeMinutes: 1 }
Word counting strips markdown syntax (headings, bold/italic, code blocks, image references, links, blockquotes, horizontal rules, table pipes, and HTML tags) before counting. Reading time is calculated at 200 words per minute, rounded up, with a minimum of 1 minute.
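The reading-time rule stated above amounts to a one-line calculation. The helper below is a hypothetical restatement of that rule (the name estimateReadingTimeMinutes is not part of the package), shown only to make the rounding and the one-minute floor concrete.

```typescript
// Hypothetical helper restating the documented rule: 200 words per minute,
// rounded up, never less than 1 minute.
function estimateReadingTimeMinutes(wordCount: number, wordsPerMinute = 200): number {
  return Math.max(1, Math.ceil(wordCount / wordsPerMinute));
}

console.log(estimateReadingTimeMinutes(0));    // 1
console.log(estimateReadingTimeMinutes(200));  // 1
console.log(estimateReadingTimeMinutes(201));  // 2
console.log(estimateReadingTimeMinutes(1000)); // 5
```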
The return type of all conversion functions.
interface ConversionResult {
  /** The converted markdown string */
  markdown: string;
  /** Extracted document metadata */
  metadata: DocumentMetadata;
  /** Image references found in the document */
  images: ImageReference[];
  /** Per-page content breakdown (for paginated formats) */
  pages: PageContent[];
  /** Warnings generated during conversion */
  warnings: string[];
  /** Conversion duration in milliseconds */
  durationMs: number;
}
Options for the convert function.
interface ConvertOptions {
  /** Explicitly specify the input format (skips auto-detection) */
  format?: InputFormat;
  /** Whether to extract image references (default: true) */
  extractImages?: boolean;
  /** Whether to preserve document structure like headings and lists (default: true) */
  preserveStructure?: boolean;
  /** Maximum number of pages to process (for paginated formats) */
  maxPages?: number;
  /** Whether to insert page break markers (default: false) */
  pageBreaks?: boolean;
  /** File name hint for format detection */
  fileName?: string;
}
Supported input format identifiers.
type InputFormat = 'html' | 'markdown' | 'text' | 'pdf' | 'docx' | 'pptx';
Metadata extracted from a converted document.
interface DocumentMetadata {
  title?: string;
  author?: string;
  date?: string;
  pageCount?: number;
  wordCount: number;
  headingCount: number;
  imageCount: number;
  readingTimeMinutes: number;
}
A reference to an image found in the document.
interface ImageReference {
  /** Unique identifier for the image (e.g., "img-1") */
  id: string;
  /** Alt text for the image */
  alt: string;
  /** Source URL or path of the image */
  src: string;
  /** Page number where the image was found (if applicable) */
  page?: number;
}
Content of a single page in a paginated document.
interface PageContent {
  /** Page number (1-based) */
  pageNumber: number;
  /** Markdown content of the page */
  markdown: string;
  /** Headings found on this page */
  headings: string[];
}
Skip auto-detection by specifying the format explicitly:
const result = convert(content, { format: 'html' });
Provide a file name for extension-based format detection:
const result = convert(buffer, { fileName: 'report.html' });
Suppress image reference collection:
const result = convert(html, { extractImages: false });
console.log(result.images); // []
Produce plain text output with no markdown syntax:
const result = convert('# Heading\n\n**bold** and *italic*', {
  format: 'markdown',
  preserveStructure: false,
});
console.log(result.markdown);
// Heading
//
// bold and italic
All conversion functions are synchronous and do not throw under normal operation. Errors and edge cases are communicated through the warnings array in the ConversionResult.
When a binary format (PDF, DOCX, PPTX) is detected, the library does not throw. Instead, it returns a ConversionResult with an informative markdown message describing the detected format, suggested external packages, and example code:
const result = convert(pdfBuffer);
console.log(result.warnings);
// [
// 'Binary format "pdf" detected. Install a dedicated parser for full support.',
// 'Suggested packages: `pdfjs-dist`, `pdf-parse`, `pdf2json`'
// ]
If the detected format does not match any known converter, the input is treated as plain text and a warning is added:
// result.warnings: ['Unexpected format: xyz. Treating as plain text.']
Empty strings and whitespace-only input produce minimal output without errors:
const result = convert('');
console.log(result.markdown); // '\n'
console.log(result.metadata.wordCount); // 0
Use docling-node-ts as the first stage in a document ingestion pipeline. The output markdown is designed for downstream chunking and embedding:
import { convert } from 'docling-node-ts';
function ingestDocument(html: string) {
  const { markdown, metadata, images, warnings } = convert(html);
  if (warnings.length > 0) {
    console.warn('Conversion warnings:', warnings);
  }
  // Chunk the markdown for embedding (e.g., with chunk-smart)
  // const chunks = chunkMarkdown(markdown, { maxTokens: 512 });
  return { markdown, metadata, images };
}
import { convert } from 'docling-node-ts';
function handleUpload(buffer: Buffer, originalFileName: string) {
  const result = convert(buffer, { fileName: originalFileName });
  return {
    markdown: result.markdown,
    title: result.metadata.title,
    wordCount: result.metadata.wordCount,
    readingTime: result.metadata.readingTimeMinutes,
    imageCount: result.images.length,
  };
}
When converting HTML, the library extracts metadata from <head> elements:
import { convert } from 'docling-node-ts';
const html = `
<html>
<head>
<title>Annual Report 2024</title>
<meta name="author" content="Finance Team">
<meta name="date" content="2024-12-01">
<meta property="og:title" content="Annual Report">
</head>
<body>
<h1>Annual Report</h1>
<p>Revenue increased by 20%.</p>
</body>
</html>
`;
const result = convert(html);
console.log(result.metadata.title); // 'Annual Report 2024'
console.log(result.metadata.author); // 'Finance Team'
console.log(result.metadata.date); // '2024-12-01'
Title extraction priority: <title> tag, then og:title. Author extraction checks both name="author" and property="article:author". Date extraction checks both name="date" and property="article:published_time".
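A rough sketch of that fallback logic is shown below. It is an assumption-laden illustration rather than the library's parser: sketchHeadMetadata, firstMatch, and metaContent are hypothetical names, the regular expressions only handle double-quoted attributes with name/property placed before content, and the order in which the two author and date sources are tried is assumed (the text only says both are checked).

```typescript
// Hypothetical sketch: try each pattern in priority order and return the
// first non-empty capture.
function firstMatch(html: string, patterns: RegExp[]): string | undefined {
  for (const pattern of patterns) {
    const match = pattern.exec(html);
    if (match && match[1] !== undefined && match[1].trim() !== '') {
      return match[1].trim();
    }
  }
  return undefined;
}

// Assumes double-quoted values and that name/property precedes content.
function metaContent(attr: 'name' | 'property', key: string): RegExp {
  return new RegExp(`<meta\\s+${attr}="${key}"\\s+content="([^"]*)"`, 'i');
}

function sketchHeadMetadata(html: string) {
  return {
    // <title> first, then og:title, as documented.
    title: firstMatch(html, [/<title[^>]*>([^<]*)<\/title>/i, metaContent('property', 'og:title')]),
    // Order of the two sources below is an assumption of this sketch.
    author: firstMatch(html, [metaContent('name', 'author'), metaContent('property', 'article:author')]),
    date: firstMatch(html, [metaContent('name', 'date'), metaContent('property', 'article:published_time')]),
  };
}
```

A real HTML parser would be more robust than regular expressions here; the sketch only aims to show the priority chain.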
Clean up markdown from external sources that may have inconsistent formatting:
import { convertMarkdown } from 'docling-node-ts';
const messy = `
# Title
#### Jumped Heading Level
* Mixed
+ List
- Markers
Click [broken]() link.
[Valid link](https://example.com)
`;
const { markdown } = convertMarkdown(messy);
// Heading levels normalized (#### becomes ##)
// List markers standardized to -
// Broken link text extracted without brackets
// Excessive blank lines collapsed
Tables are converted to GitHub Flavored Markdown pipe tables with column normalization and pipe escaping:
import { convertHtml } from 'docling-node-ts';
const html = `
<table>
<thead>
<tr><th>Product</th><th>Q1</th><th>Q2</th></tr>
</thead>
<tbody>
<tr><td>Widget A</td><td>$1,200</td><td>$1,500</td></tr>
<tr><td>Widget B</td><td>$800</td><td>$950</td></tr>
</tbody>
</table>
`;
const { markdown } = convertHtml(html);
// | Product | Q1 | Q2 |
// | --- | --- | --- |
// | Widget A | $1,200 | $1,500 |
// | Widget B | $800 | $950 |
Rows with fewer columns are padded with empty cells. Pipe characters (|) inside cell content are escaped as \|.
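The padding and escaping behavior can be illustrated with a small standalone renderer. This is a minimal sketch under the assumption that cells arrive as plain strings; toPipeTable and escapeCell are hypothetical names and do not reflect how the library is structured internally.

```typescript
// Hypothetical sketch: render rows as a GFM pipe table, padding short rows
// and escaping pipe characters inside cells.
function escapeCell(cell: string): string {
  return cell.replace(/\|/g, '\\|');
}

function toPipeTable(header: string[], rows: string[][]): string {
  // The widest row (or the header) decides the column count.
  const width = Math.max(header.length, ...rows.map((row) => row.length));
  const pad = (cells: string[]): string[] =>
    Array.from({ length: width }, (_, i) => escapeCell(cells[i] ?? ''));
  const line = (cells: string[]): string => `| ${cells.join(' | ')} |`;

  return [
    line(pad(header)),
    line(Array.from({ length: width }, () => '---')),
    ...rows.map((row) => line(pad(row))),
  ].join('\n');
}

console.log(toPipeTable(['A', 'B'], [['1'], ['x|y', '2']]));
// | A | B |
// | --- | --- |
// | 1 |  |
// | x\|y | 2 |
```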
This package is written in TypeScript and ships type declarations (dist/index.d.ts) alongside the compiled JavaScript. All public types are exported from the package entry point:
import type {
  ConversionResult,
  ConvertOptions,
  InputFormat,
  DocumentMetadata,
  ImageReference,
  PageContent,
} from 'docling-node-ts';
Compiled with strict: true, targeting ES2022 with CommonJS module output.
MIT