Defuddle extracts the main content from web pages, removing clutter like comments, sidebars, headers, and footers to return clean, readable HTML.
npm install defuddle
For Node.js use, install a DOM implementation:
npm install defuddle linkedom
Or use JSDOM:
npm install defuddle jsdom
To use the CLI globally, install with -g, or use npx to run without installing globally:
# Install globally
npm install -g defuddle
# Or use npx
npx defuddle parse https://example.com/article
In the browser, create a Defuddle instance with a Document object and call parse().
import Defuddle from 'defuddle';
const result = new Defuddle(document).parse();
console.log(result.content); // cleaned HTML string
console.log(result.title); // page title
console.log(result.author); // author name
You can also parse HTML strings using DOMParser:
const parser = new DOMParser();
const doc = parser.parseFromString(htmlString, 'text/html');
const result = new Defuddle(doc).parse();
Pass options as the second argument:
const result = new Defuddle(document, {
  url: 'https://example.com/article',
  debug: true
}).parse();
The Node.js API accepts a DOM Document from any implementation (JSDOM, linkedom, happy-dom, etc.) and returns a promise.
import { parseHTML } from 'linkedom';
import { Defuddle } from 'defuddle/node';
const { document } = parseHTML(htmlString);
const result = await Defuddle(document, 'https://example.com/article', {
  markdown: true
});
Or with JSDOM:
import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';
const dom = new JSDOM(htmlString, { url: 'https://example.com/article' });
const result = await Defuddle(dom.window.document, 'https://example.com/article');
Note: for defuddle/node to import properly, your package.json must have "type": "module".
Defuddle includes a CLI for parsing web pages from the terminal. You can run it with npx or install it globally with npm install -g defuddle.
# Parse a local HTML file
npx defuddle parse page.html
# Parse a URL
npx defuddle parse https://example.com/article
# Output as markdown
npx defuddle parse page.html --markdown
# Output as JSON with metadata
npx defuddle parse page.html --json
# Extract a specific property
npx defuddle parse page.html --property title
# Save output to a file
npx defuddle parse page.html --output result.html
| Option | Alias | Description |
|---|---|---|
| --output <file> | -o | Write output to a file instead of stdout |
| --markdown | -m | Convert content to markdown |
| --md | | Alias for --markdown |
| --json | -j | Output as JSON with metadata and content |
| --property <name> | -p | Extract a specific property |
| --debug | | Enable debug mode |
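When the CLI is driven from a script, the flags in the table above can be assembled programmatically. The sketch below uses a hypothetical helper, buildDefuddleArgs, which is not part of Defuddle. It only builds the argument list from the documented flags and does not run the command.

```javascript
// Hypothetical helper, not part of Defuddle: builds an argument list for
// `npx defuddle parse ...` using only the flags documented in the table above.
function buildDefuddleArgs(source, opts = {}) {
  const args = ['defuddle', 'parse', source];
  if (opts.markdown) args.push('--markdown');
  if (opts.json) args.push('--json');
  if (opts.property) args.push('--property', opts.property);
  if (opts.output) args.push('--output', opts.output);
  if (opts.debug) args.push('--debug');
  return args;
}

const args = buildDefuddleArgs('page.html', { markdown: true, output: 'result.md' });
console.log(['npx', ...args].join(' '));
// prints "npx defuddle parse page.html --markdown --output result.md"
```

The resulting array can be handed to execFile from node:child_process with 'npx' as the command, which avoids shell quoting issues with file names.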
Options can be passed when creating a Defuddle instance (browser) or as the third argument (Node.js).
| Option | Type | Default | Description |
|---|---|---|---|
| url | string | | URL of the page being parsed |
| markdown | boolean | false | Convert content to Markdown |
| separateMarkdown | boolean | false | Keep content as HTML and return Markdown in contentMarkdown |
| removeExactSelectors | boolean | true | Remove elements matching exact selectors (ads, social buttons, etc.) |
| removePartialSelectors | boolean | true | Remove elements matching partial selectors |
| removeHiddenElements | boolean | true | Remove elements hidden via CSS (display:none, visibility:hidden, etc.) |
| removeLowScoring | boolean | true | Remove non-content blocks by scoring (navigation, link lists, etc.) |
| removeSmallImages | boolean | true | Remove small images (icons, tracking pixels, etc.) |
| removeImages | boolean | false | Remove images from the output |
| useAsync | boolean | true | Allow async extractors to fetch from third-party APIs when no local content is available |
| standardize | boolean | true | Standardize HTML (footnotes, headings, code blocks, etc.) |
| contentSelector | string | | CSS selector to use as the main content element, bypassing auto-detection |
| language | string | | Preferred language (BCP 47 tag, e.g. en, fr). Sets Accept-Language header and selects transcript language. |
| includeReplies | boolean \| 'extractors' | 'extractors' | Include replies: 'extractors' for site-specific extractors only, true for all, false for none |
| debug | boolean | false | Enable debug logging and return debug info in the response |
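To make the effective configuration explicit, the documented defaults can be written out and merged with overrides before they are passed to Defuddle. This is a minimal sketch: DOCUMENTED_DEFAULTS and resolveOptions are hypothetical names, and how Defuddle itself treats unknown option names is not stated here, so the typo check is purely a convenience on the caller's side.

```javascript
// Defaults copied from the options table. Options without a documented
// default (url, contentSelector, language) are intentionally left out.
const DOCUMENTED_DEFAULTS = Object.freeze({
  markdown: false,
  separateMarkdown: false,
  removeExactSelectors: true,
  removePartialSelectors: true,
  removeHiddenElements: true,
  removeLowScoring: true,
  removeSmallImages: true,
  removeImages: false,
  useAsync: true,
  standardize: true,
  includeReplies: 'extractors',
  debug: false,
});

// Hypothetical helper: merges overrides onto the documented defaults and
// rejects option names that do not appear in the table (catches typos).
function resolveOptions(overrides = {}) {
  const known = new Set([
    ...Object.keys(DOCUMENTED_DEFAULTS),
    'url',
    'contentSelector',
    'language',
  ]);
  for (const key of Object.keys(overrides)) {
    if (!known.has(key)) throw new TypeError(`Unknown Defuddle option: ${key}`);
  }
  return { ...DOCUMENTED_DEFAULTS, ...overrides };
}

const options = resolveOptions({ url: 'https://example.com/article', markdown: true });
console.log(options.markdown, options.removeImages, options.includeReplies);
// prints "true false extractors"
```

The resolved object can be passed as the second argument in the browser or as the third argument in Node.js.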
The parse() method returns an object with the following properties:
| Property | Type | Description |
|---|---|---|
| content | string | Cleaned HTML string of the extracted content |
| contentMarkdown | string | Markdown version (when separateMarkdown is true) |
| title | string | Title of the article |
| description | string | Description or summary |
| author | string | Author of the article |
| site | string | Name of the website |
| domain | string | Domain name of the website |
| favicon | string | URL of the website's favicon |
| image | string | URL of the article's main image |
| language | string | Language of the page in BCP 47 format (e.g. en, en-US) |
| published | string | Publication date |
| wordCount | number | Number of words in the extracted content |
| parseTime | number | Time taken to parse in milliseconds |
| metaTags | object[] | Meta tags from the page |
| schemaOrgData | object | Schema.org data extracted from the page |
| extractorType | string | Type of site-specific extractor used, if any |
| debug | object | Debug info including content selector and removals (when debug: true) |
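A common use of these properties is to store the metadata alongside the content. The sketch below builds a YAML-style front matter block from a result object. toFrontMatter is a hypothetical helper, the choice of fields is arbitrary, and it skips both undefined and empty-string values because the value Defuddle returns for missing metadata is not specified here.

```javascript
// Hypothetical helper: turns selected parse() result properties into a
// YAML-style front matter block. Values are JSON-encoded so strings are quoted.
function toFrontMatter(result) {
  const fields = ['title', 'author', 'site', 'published', 'language', 'wordCount'];
  const lines = fields
    .filter((key) => result[key] !== undefined && result[key] !== '')
    .map((key) => `${key}: ${JSON.stringify(result[key])}`);
  return ['---', ...lines, '---'].join('\n');
}

// A hand-written object standing in for a real parse() result.
const sample = { title: 'Example', author: '', site: 'Example Site', wordCount: 1200 };
console.log(toFrontMatter(sample));
```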
Defuddle is available in three bundles:
| Bundle | Import | Description |
|---|---|---|
| Core | defuddle | Browser usage. No dependencies. Handles math content but without MathML/LaTeX conversion fallbacks. |
| Full | defuddle/full | Includes math equation parsing (MathML ↔ LaTeX) and Markdown conversion via Turndown. |
| Node.js | defuddle/node | For Node.js. Accepts any DOM Document (linkedom, JSDOM, happy-dom, etc.). Includes full capabilities for math and Markdown conversion. |
The core bundle is recommended for most use cases.
Defuddle standardizes HTML elements to provide a consistent input for downstream tools like Markdown converters.
Code blocks are standardized. Line numbers and syntax highlighting are removed, but the language is retained.
<pre>
<code data-lang="js" class="language-js">
// code
</code>
</pre>
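Because the language is kept in a predictable place, downstream tools can read it back from the cleaned HTML. The following is a sketch only: listCodeLanguages is a hypothetical helper that assumes the double-quoted data-lang attribute shape shown above, and it uses a regular expression simply to stay dependency-free.

```javascript
// Sketch: list the languages of standardized code blocks in a content string.
// Assumes the <code data-lang="..."> shape shown above.
function listCodeLanguages(html) {
  const pattern = /<code\b[^>]*\bdata-lang="([^"]*)"/g;
  return [...html.matchAll(pattern)].map((match) => match[1]);
}

const content = '<pre><code data-lang="js" class="language-js">let x = 1;</code></pre>';
console.log(listCodeLanguages(content)); // prints [ 'js' ]
```

When a DOM is available, querying for code[data-lang] elements is more robust than matching the serialized HTML.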
Inline references and footnotes are converted to a standard format using sup, a, and an ordered list with class="footnote".
Math elements, including MathJax and KaTeX, are converted to standard MathML with a data-latex attribute containing the original LaTeX source.
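The data-latex attribute makes it possible to recover the LaTeX source from the cleaned HTML. The sketch below rests on two assumptions that are not confirmed above: that the attribute sits on the math element itself and that it is double-quoted. extractLatex and decodeEntities are hypothetical helpers.

```javascript
// Decode the handful of entities that commonly appear in attribute values.
function decodeEntities(text) {
  return text
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&'); // decode &amp; last so "&amp;lt;" becomes "&lt;", not "<"
}

// Sketch: collect the LaTeX source stored in data-latex attributes.
// Assumption: the attribute is on the <math> element and is double-quoted.
function extractLatex(html) {
  const pattern = /<math\b[^>]*\bdata-latex="([^"]*)"/g;
  return [...html.matchAll(pattern)].map((match) => decodeEntities(match[1]));
}

const html = '<p><math data-latex="a &lt; b"><mi>a</mi><mo>&lt;</mo><mi>b</mi></math></p>';
console.log(extractLatex(html)); // prints [ 'a < b' ]
```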
Callout and alert elements from various sources are standardized to the Obsidian Publish callout format. When converting to Markdown, these become Obsidian-style callouts.
Supported sources:
- div.markdown-alert
- div.callout[data-callout]
- aside.callout-*
- div.alert.alert-*

<div data-callout="info" class="callout">
<div class="callout-title">
<div class="callout-title-inner">Info</div>
</div>
<div class="callout-content">
<p>This is an informational callout.</p>
</div>
</div>
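For reference, an Obsidian-style callout in Markdown is a blockquote whose first line carries the type in [!type] form, followed by the title. The sketch below produces that syntax from the three pieces in the HTML above. toObsidianCallout is a hypothetical helper, not Defuddle's converter, and the exact whitespace Defuddle emits may differ.

```javascript
// Sketch of the Obsidian callout syntax: a blockquote whose first line is
// "[!type] Title", followed by the quoted body lines.
function toObsidianCallout(type, title, body) {
  const header = `> [!${type}] ${title}`.trimEnd();
  const lines = body.split('\n').map((line) => (line ? `> ${line}` : '>'));
  return [header, ...lines].join('\n');
}

console.log(toObsidianCallout('info', 'Info', 'This is an informational callout.'));
// prints:
// > [!info] Info
// > This is an informational callout.
```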
When debug mode is enabled:
- Returns a debug field in the response with detailed information about content extraction
- Retains data-* attributes

const result = new Defuddle(document, { debug: true }).parse();
// CSS selector path of chosen main content element
console.log(result.debug.contentSelector);
// Array of removed elements with step, reason, selector, and text preview
console.log(result.debug.removals);
The debug field contains:
| Property | Type | Description |
|---|---|---|
| contentSelector | string | CSS selector path of the chosen main content element |
| removals | array | List of elements removed during processing |
Each removal entry contains:
| Property | Type | Description |
|---|---|---|
| step | string | Pipeline step (e.g. removeLowScoring, removeBySelector, removeHiddenElements) |
| selector | string | CSS selector or pattern that matched |
| reason | string | Why the element was removed (e.g. score: -20, display:none) |
| text | string | First 200 characters of removed element's text content |
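A quick way to see which pipeline step is responsible for most removals is to tally the entries by step. This is a minimal sketch: countRemovalsByStep is a hypothetical helper, and the sample entries are hand-written in the documented shape rather than taken from a real page.

```javascript
// Hypothetical helper: counts removal entries per pipeline step.
function countRemovalsByStep(removals) {
  const counts = {};
  for (const { step } of removals) {
    counts[step] = (counts[step] || 0) + 1;
  }
  return counts;
}

// Hand-written entries standing in for result.debug.removals.
const removals = [
  { step: 'removeLowScoring', selector: 'div.related', reason: 'score: -20', text: 'Related posts' },
  { step: 'removeHiddenElements', selector: 'span.sr-only', reason: 'display:none', text: 'Skip' },
  { step: 'removeLowScoring', selector: 'nav', reason: 'score: -20', text: 'Home About' },
];
console.log(countRemovalsByStep(removals));
// prints { removeLowScoring: 2, removeHiddenElements: 1 }
```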
Disable individual pipeline steps to diagnose content extraction issues:
// Skip content scoring
const withoutScoring = new Defuddle(document, { removeLowScoring: false }).parse();
// Skip hidden element removal
const withHiddenElements = new Defuddle(document, { removeHiddenElements: false }).parse();
// Skip small image removal
const withSmallImages = new Defuddle(document, { removeSmallImages: false }).parse();
// Skip HTML standardization
const withoutStandardizing = new Defuddle(document, { standardize: false }).parse();
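The same diagnosis can be automated by disabling one step at a time and comparing word counts against a baseline. The sketch below is an assumption-laden illustration: compareDisabledSteps is a hypothetical helper, and the parser is injected as a callback so the example runs on its own with a stub that returns made-up counts.

```javascript
// Hypothetical helper: re-parses with one pipeline step disabled at a time and
// records the word count of each run. `parseWith` is supplied by the caller
// and must return an object with a numeric wordCount.
function compareDisabledSteps(parseWith) {
  const steps = ['removeLowScoring', 'removeHiddenElements', 'removeSmallImages', 'standardize'];
  const report = { baseline: parseWith({}).wordCount };
  for (const step of steps) {
    report[step] = parseWith({ [step]: false }).wordCount;
  }
  return report;
}

// Stub parser so the sketch is self-contained; the word counts are made up.
const fakeParse = (options) => ({ wordCount: options.removeLowScoring === false ? 950 : 400 });
console.log(compareDisabledSteps(fakeParse));
```

In a browser, parseWith could be (options) => new Defuddle(document, options).parse(). A step whose count differs sharply from the baseline is the likely cause of missing content. Parsing a fresh copy of the document for each run is a cautious choice if results look inconsistent.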
Use contentSelector to bypass auto-detection and specify the main content element directly. Defuddle falls back to auto-detection if the selector doesn't match.
const result = new Defuddle(document, {
  contentSelector: 'article.post-content'
}).parse();