Docs

Defuddle extracts the main content from web pages, removing clutter like comments, sidebars, headers, and footers to return clean, readable HTML.

Installation

npm install defuddle

For Node.js use, install a DOM implementation:

npm install defuddle linkedom

Or use JSDOM:

npm install defuddle jsdom

To use the CLI, either install Defuddle globally with -g or run it through npx without a global install:

# Install globally
npm install -g defuddle

# Or use npx
npx defuddle parse https://example.com/article

Browser use

In the browser, create a Defuddle instance with a Document object and call parse().

import Defuddle from 'defuddle';

const result = new Defuddle(document).parse();

console.log(result.content);  // cleaned HTML string
console.log(result.title);    // page title
console.log(result.author);   // author name

You can also parse HTML strings using DOMParser:

const parser = new DOMParser();
const doc = parser.parseFromString(htmlString, 'text/html');
const result = new Defuddle(doc).parse();

Pass options as the second argument:

const result = new Defuddle(document, {
  url: 'https://example.com/article',
  debug: true
}).parse();

Node.js use

The Node.js API accepts a DOM Document from any implementation (JSDOM, linkedom, happy-dom, etc.) and returns a promise.

import { parseHTML } from 'linkedom';
import { Defuddle } from 'defuddle/node';

const { document } = parseHTML(htmlString);
const result = await Defuddle(document, 'https://example.com/article', {
  markdown: true
});

Or with JSDOM:

import { JSDOM } from 'jsdom';
import { Defuddle } from 'defuddle/node';

const dom = new JSDOM(htmlString, { url: 'https://example.com/article' });
const result = await Defuddle(dom.window.document, 'https://example.com/article');

Note: For defuddle/node to import properly, your package.json must have "type": "module".

CLI use

Defuddle includes a CLI for parsing web pages from the terminal. You can run it with npx or install it globally with npm install -g defuddle.

# Parse a local HTML file
npx defuddle parse page.html

# Parse a URL
npx defuddle parse https://example.com/article

# Output as markdown
npx defuddle parse page.html --markdown

# Output as JSON with metadata
npx defuddle parse page.html --json

# Extract a specific property
npx defuddle parse page.html --property title

# Save output to a file
npx defuddle parse page.html --output result.html

CLI options

| Option | Alias | Description |
| --- | --- | --- |
| --output <file> | -o | Write output to a file instead of stdout |
| --markdown | -m | Convert content to markdown |
| --md | | Alias for --markdown |
| --json | -j | Output as JSON with metadata and content |
| --property <name> | -p | Extract a specific property |
| --debug | | Enable debug mode |

Options

Options can be passed when creating a Defuddle instance (browser) or as the third argument (Node.js).

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | | URL of the page being parsed |
| markdown | boolean | false | Convert content to Markdown |
| separateMarkdown | boolean | false | Keep content as HTML and return Markdown in contentMarkdown |
| removeExactSelectors | boolean | true | Remove elements matching exact selectors (ads, social buttons, etc.) |
| removePartialSelectors | boolean | true | Remove elements matching partial selectors |
| removeHiddenElements | boolean | true | Remove elements hidden via CSS (display:none, visibility:hidden, etc.) |
| removeLowScoring | boolean | true | Remove non-content blocks by scoring (navigation, link lists, etc.) |
| removeSmallImages | boolean | true | Remove small images (icons, tracking pixels, etc.) |
| removeImages | boolean | false | Remove images from the output |
| useAsync | boolean | true | Allow async extractors to fetch from third-party APIs when no local content is available |
| standardize | boolean | true | Standardize HTML (footnotes, headings, code blocks, etc.) |
| contentSelector | string | | CSS selector to use as the main content element, bypassing auto-detection |
| language | string | | Preferred language (BCP 47 tag, e.g. en, fr). Sets Accept-Language header and selects transcript language. |
| includeReplies | boolean or 'extractors' | 'extractors' | Include replies: 'extractors' for site-specific extractors only, true for all, false for none |
| debug | boolean | false | Enable debug logging and return debug info in the response |
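
The defaults in the table can be written out as a plain object, which makes it easier to see what an override actually changes. The sketch below is illustrative only: DEFAULT_OPTIONS and withDefaults are hypothetical names, not part of Defuddle's API, and the values are copied from the table rather than from the library source.

```javascript
// Hypothetical mirror of the documented defaults; not exported by Defuddle.
// Options without a documented default (url, contentSelector, language) are omitted.
const DEFAULT_OPTIONS = Object.freeze({
  markdown: false,
  separateMarkdown: false,
  removeExactSelectors: true,
  removePartialSelectors: true,
  removeHiddenElements: true,
  removeLowScoring: true,
  removeSmallImages: true,
  removeImages: false,
  useAsync: true,
  standardize: true,
  includeReplies: 'extractors',
  debug: false,
});

// Merge caller overrides on top of the documented defaults.
function withDefaults(overrides = {}) {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

const options = withDefaults({
  url: 'https://example.com/article',
  separateMarkdown: true, // keep HTML in content, add contentMarkdown
  includeReplies: false,  // include no replies
});

console.log(options.separateMarkdown); // true
console.log(options.removeLowScoring); // true (unchanged default)
```

The merged object would be passed as the second argument in the browser or the third argument in Node.js. Merging like this is only useful when you want to log or compare the effective settings, since the table already describes what happens when an option is left out.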

Response

The parse() method returns an object with the following properties:

| Property | Type | Description |
| --- | --- | --- |
| content | string | Cleaned HTML string of the extracted content |
| contentMarkdown | string | Markdown version (when separateMarkdown is true) |
| title | string | Title of the article |
| description | string | Description or summary |
| author | string | Author of the article |
| site | string | Name of the website |
| domain | string | Domain name of the website |
| favicon | string | URL of the website's favicon |
| image | string | URL of the article's main image |
| language | string | Language of the page in BCP 47 format (e.g. en, en-US) |
| published | string | Publication date |
| wordCount | number | Number of words in the extracted content |
| parseTime | number | Time taken to parse in milliseconds |
| metaTags | object[] | Meta tags from the page |
| schemaOrgData | object | Schema.org data extracted from the page |
| extractorType | string | Type of site-specific extractor used, if any |
| debug | object | Debug info including content selector and removals (when debug: true) |
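
Because the response is a plain object, it is straightforward to reshape for downstream tools. As a sketch, the hypothetical helper below (toFrontMatter is not part of Defuddle) builds a YAML-style front matter block from a few of the fields above, skipping any that are empty. The sample object is made-up data in the documented shape, not real output.

```javascript
// Hypothetical helper: turn selected response fields into a front matter block.
function toFrontMatter(result, fields = ['title', 'author', 'site', 'published', 'wordCount']) {
  const lines = fields
    // Skip fields that are missing or empty.
    .filter((key) => result[key] !== undefined && result[key] !== null && result[key] !== '')
    // JSON.stringify quotes strings and escapes special characters.
    .map((key) => `${key}: ${JSON.stringify(result[key])}`);
  return ['---', ...lines, '---'].join('\n');
}

// Made-up sample in the documented response shape (not real output).
const sampleResult = {
  content: '<p>Hello world</p>',
  title: 'Example article',
  author: '',
  site: 'Example',
  published: '2024-01-01',
  wordCount: 2,
};

console.log(toFrontMatter(sampleResult));
```

The same pattern works for any subset of fields by passing a different list as the second argument.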

Bundles

Defuddle is available in three bundles:

| Bundle | Import | Description |
| --- | --- | --- |
| Core | defuddle | Browser usage. No dependencies. Handles math content but without MathML/LaTeX conversion fallbacks. |
| Full | defuddle/full | Includes math equation parsing (MathML ↔ LaTeX) and Markdown conversion via Turndown. |
| Node.js | defuddle/node | For Node.js. Accepts any DOM Document (linkedom, JSDOM, happy-dom, etc.). Includes full capabilities for math and Markdown conversion. |

The core bundle is recommended for most use cases.

HTML standardization

Defuddle standardizes HTML elements to provide a consistent input for downstream tools like Markdown converters.

Headings

Code blocks

Code blocks are standardized. Line numbers and syntax highlighting are removed, but the language is retained.

<pre>
  <code data-lang="js" class="language-js">
    // code
  </code>
</pre>

Footnotes

Inline references and footnotes are converted to a standard format using sup, a, and an ordered list with class="footnote".

Math

Math elements, including MathJax and KaTeX, are converted to standard MathML with a data-latex attribute containing the original LaTeX source.

Callouts

Callout and alert elements from various sources are standardized to the Obsidian Publish callout format. When converting to Markdown, these become Obsidian-style callouts.

Supported sources:

<div data-callout="info" class="callout">
  <div class="callout-title">
    <div class="callout-title-inner">Info</div>
  </div>
  <div class="callout-content">
    <p>This is an informational callout.</p>
  </div>
</div>

Debugging

Debug mode

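When debug mode is enabled, the response includes a debug field describing the chosen content element and the elements that were removed: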

const result = new Defuddle(document, { debug: true }).parse();

// CSS selector path of chosen main content element
console.log(result.debug.contentSelector);

// Array of removed elements with step, reason, selector, and text preview
console.log(result.debug.removals);

The debug field contains:

| Property | Type | Description |
| --- | --- | --- |
| contentSelector | string | CSS selector path of the chosen main content element |
| removals | array | List of elements removed during processing |

Each removal entry contains:

| Property | Type | Description |
| --- | --- | --- |
| step | string | Pipeline step (e.g. removeLowScoring, removeBySelector, removeHiddenElements) |
| selector | string | CSS selector or pattern that matched |
| reason | string | Why the element was removed (e.g. score: -20, display:none) |
| text | string | First 200 characters of removed element's text content |
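
A long removals list is easier to read once it is grouped. The sketch below counts removals per pipeline step; countRemovalsByStep is a hypothetical helper, and the sample debug object is made-up data that follows the documented shape rather than real Defuddle output.

```javascript
// Hypothetical helper: count removals per pipeline step.
function countRemovalsByStep(debug) {
  const counts = {};
  for (const removal of debug?.removals ?? []) {
    counts[removal.step] = (counts[removal.step] ?? 0) + 1;
  }
  return counts;
}

// Made-up debug data in the documented shape (not real output).
const sampleDebug = {
  contentSelector: 'body > main > article',
  removals: [
    { step: 'removeLowScoring', selector: 'div.related', reason: 'score: -20', text: 'Related posts' },
    { step: 'removeHiddenElements', selector: 'span.sr-only', reason: 'display:none', text: 'Skip to content' },
    { step: 'removeLowScoring', selector: 'ul.links', reason: 'score: -12', text: 'Home About' },
  ],
};

console.log(countRemovalsByStep(sampleDebug));
// { removeLowScoring: 2, removeHiddenElements: 1 }
```

A step with an unexpectedly high count is a reasonable first candidate to switch off when content goes missing.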

Pipeline toggles

Disable individual pipeline steps to diagnose content extraction issues:

// Each call uses its own variable so the snippet can run as a single script.

// Skip content scoring
const withoutScoring = new Defuddle(document, { removeLowScoring: false }).parse();

// Skip hidden element removal
const withHiddenElements = new Defuddle(document, { removeHiddenElements: false }).parse();

// Skip small image removal
const withSmallImages = new Defuddle(document, { removeSmallImages: false }).parse();

// Skip HTML standardization
const withoutStandardize = new Defuddle(document, { standardize: false }).parse();
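
When it is not obvious which step is dropping content, one approach is to try each toggle in turn. The sketch below generates one options object per toggle, each disabling a single step. PIPELINE_TOGGLES and singleToggleVariants are hypothetical names for illustration; the toggle names themselves come from the options table.

```javascript
// Pipeline toggles listed in the options table.
const PIPELINE_TOGGLES = [
  'removeExactSelectors',
  'removePartialSelectors',
  'removeHiddenElements',
  'removeLowScoring',
  'removeSmallImages',
  'standardize',
];

// Hypothetical helper: one options object per toggle, each disabling a single step.
function singleToggleVariants(baseOptions = {}) {
  return PIPELINE_TOGGLES.map((toggle) => ({
    toggle,
    options: { ...baseOptions, [toggle]: false },
  }));
}

for (const { toggle, options } of singleToggleVariants({ debug: true })) {
  console.log(toggle, JSON.stringify(options));
  // In the browser, each variant could then be parsed and compared, e.g.:
  // const result = new Defuddle(document, options).parse();
  // console.log(toggle, result.wordCount);
}
```

Comparing wordCount across the variants is one rough way to spot the step responsible for the missing content.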

Content selector

Use contentSelector to bypass auto-detection and specify the main content element directly. Falls back to auto-detection if the selector doesn't match.

const result = new Defuddle(document, {
  contentSelector: 'article.post-content'
}).parse();
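
The fallback behavior described above can be pictured as a short function. This is a sketch of the documented behavior only, not Defuddle's internal code: pickContentRoot, autoDetect, and the fake document are all hypothetical and exist so the example runs without a DOM.

```javascript
// Sketch of the documented fallback, not Defuddle's internal code.
// `autoDetect` is a hypothetical stand-in for the auto-detection step.
function pickContentRoot(doc, contentSelector, autoDetect) {
  const match = contentSelector ? doc.querySelector(contentSelector) : null;
  // Use the selected element when it exists, otherwise fall back.
  return match ?? autoDetect(doc);
}

// Minimal fake document so the sketch runs without a DOM.
const fakeDoc = {
  nodes: { 'article.post-content': { name: 'article' } },
  querySelector(selector) {
    return this.nodes[selector] ?? null;
  },
};
const detectMain = () => ({ name: 'auto-detected' });

console.log(pickContentRoot(fakeDoc, 'article.post-content', detectMain).name); // article
console.log(pickContentRoot(fakeDoc, 'div.missing', detectMain).name);          // auto-detected
```

In practice this means a wrong or outdated selector degrades to normal extraction rather than returning nothing.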