FireParse

A document parser playground built with Next.js 16, React 19, and the Firecrawl v2 API. Paste a URL to any PDF, Word document, or spreadsheet and get structured JSON, rendered markdown, or raw HTML back — with an interactive PDF viewer that overlays color-coded bounding boxes on extracted fields.

Features

Universal document parsing — PDFs, Excel (.xlsx, .xls), Word (.docx, .doc, .odt, .rtf), and web pages
Structured JSON extraction — LLM-powered extraction with a comprehensive default prompt that handles resumes, research papers, invoices, presentations, spreadsheets, legal documents, and more
Custom schema & prompt — Provide your own JSON Schema or natural language prompt to control extraction structure
PDF viewer with bounding boxes — After parsing, click "View on PDF" to see extracted fields highlighted directly on the PDF pages with color-coded overlays
Multiple output formats — JSON, Markdown, HTML, and Summary modes
Password-protected PDFs — Enter a PDF password in Advanced Config to parse encrypted documents
Auto file type detection — Detects document type from URL extension and shows format-specific options
API key persistence — Stored in localStorage, never sent to any server except Firecrawl directly
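
The extension-based detection mentioned above could look roughly like the sketch below. The name getFileType comes from src/lib/firecrawl.ts, but the category names, the extension map, and the fallback to "web" are assumptions made for illustration, not the project's actual code.

```typescript
// Hypothetical sketch: categories and extension map are assumptions based on
// the formats listed in the Features section.
type FileType = "pdf" | "excel" | "word" | "web";

const EXTENSION_MAP: Record<string, FileType> = {
  pdf: "pdf",
  xlsx: "excel",
  xls: "excel",
  docx: "word",
  doc: "word",
  odt: "word",
  rtf: "word",
};

function getFileType(rawUrl: string): FileType {
  let pathname: string;
  try {
    // Use the parsed pathname so query strings and fragments are ignored.
    pathname = new URL(rawUrl).pathname;
  } catch {
    // Not an absolute URL: strip ?query and #fragment by hand.
    pathname = rawUrl.split(/[?#]/)[0] ?? "";
  }
  const lastSegment = pathname.split("/").pop() ?? "";
  const dot = lastSegment.lastIndexOf(".");
  if (dot === -1) return "web"; // no extension: treat as a web page
  const ext = lastSegment.slice(dot + 1).toLowerCase();
  return EXTENSION_MAP[ext] ?? "web";
}
```

Matching on the pathname rather than the raw string means a URL such as https://example.com/report.PDF?download=1 is still recognized as a PDF.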

Architecture

src/
  app/
    layout.tsx              Server component — next/font (Inter + JetBrains Mono), metadata
    page.tsx                Client component — state orchestrator, wires sidebar + main panel
    globals.css             Tailwind v4 @theme, custom scrollbar, bbox overlay, result styles
  components/
    sidebar/
      sidebar.tsx           440px left panel with header + scrollable form
      api-key-input.tsx     Password input with key icon, localStorage persistence
      url-input.tsx         URL input with file upload + detected type badge
      format-grid.tsx       JSON / Markdown / HTML / Summary toggle grid
      advanced-config.tsx   Collapsible: schema, prompt, parse mode, toggles, PDF password
      parse-button.tsx      Orange CTA with loading / success states
      sidebar-items.tsx     Extracted items list for viewer mode
    main-panel/
      main-panel.tsx        Right panel orchestrator (empty / result / viewer views)
      toolbar.tsx           Tab row, View on PDF, status badge, copy
      empty-state.tsx       Server component — dot grid + placeholder
      result-area.tsx       Rendered / Markdown / JSON / Extra tabs
      pdf-viewer.tsx        Container for PDF.js canvas rendering
      floating-legend.tsx   Server component — bounding box category legend
    ui/
      status-badge.tsx      Reusable loading / success badge
      dot-grid.tsx          Server component — subtle dot pattern background
  lib/
    types.ts                TypeScript interfaces for all data shapes
    constants.ts            API base, viewer scale, file types, default extraction prompt + schema
    firecrawl.ts            Firecrawl v2 API client (buildRequestBody, parseDocument, getFileType)
    normalize-items.ts      Universal JSON normalizer for any Firecrawl output shape
    pdf-viewer-engine.ts    PDF.js text matching, line merging, bounding box positioning
  hooks/
    use-parse.ts            Parse lifecycle state machine (idle → loading → success / error)
    use-pdf-viewer.ts       PDF.js CDN loading, canvas rendering, bounding box matching
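
The parse lifecycle listed for use-parse.ts (idle → loading → success / error) can be modeled as a small reducer. The sketch below is one possible shape, not the hook's real implementation; the state and action names are hypothetical.

```typescript
// Hypothetical state and action shapes for the idle -> loading -> success/error lifecycle.
type ParseState =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: unknown }
  | { status: "error"; message: string };

type ParseAction =
  | { type: "start" }
  | { type: "resolve"; data: unknown }
  | { type: "reject"; message: string }
  | { type: "reset" };

function parseReducer(state: ParseState, action: ParseAction): ParseState {
  switch (action.type) {
    case "start":
      // A new request may begin from any state except an in-flight one.
      return state.status === "loading" ? state : { status: "loading" };
    case "resolve":
      // Ignore results that arrive when no request is in flight.
      return state.status === "loading"
        ? { status: "success", data: action.data }
        : state;
    case "reject":
      return state.status === "loading"
        ? { status: "error", message: action.message }
        : state;
    case "reset":
      return { status: "idle" };
  }
}
```

In a React hook this reducer could be passed to useReducer, with the parse button dispatching "start" and the API call dispatching "resolve" or "reject" when it settles.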

Design decisions

Server / Client split — layout.tsx, empty-state.tsx, dot-grid.tsx, and floating-legend.tsx are server components (static markup, no JS shipped). Everything interactive lives under a single "use client" boundary at page.tsx.

PDF.js via CDN — Loaded at runtime via a script tag to avoid bundling the canvas polyfill that breaks Turbopack. The worker is also loaded from CDN. This keeps the bundle small and avoids SSR issues.

Bounding box rendering — Uses PDF.js getTextContent() to extract precise text positions, then matches them against JSON extraction results using substring search with line-merging (groups adjacent text items on the same Y-coordinate into single boxes). Scale-aware tolerances ensure boxes align properly at any zoom level.
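
As a rough illustration of the line-merging and substring-matching steps, the sketch below groups text boxes whose Y-coordinates fall within a scale-dependent tolerance, then searches the merged lines for an extracted value. The TextBox shape, the function names, and the base tolerance of 2 are assumptions; real PDF.js text items carry a transform matrix and other fields that would first need converting to viewport coordinates.

```typescript
// Hypothetical, simplified box shape in viewport coordinates.
interface TextBox {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Group boxes that share (approximately) the same Y into one box per line.
function mergeLines(items: TextBox[], scale: number, baseTolerance = 2): TextBox[] {
  const tolerance = baseTolerance * scale; // tolerance grows with zoom level
  const groups: { anchorY: number; members: TextBox[] }[] = [];
  for (const item of [...items].sort((a, b) => a.y - b.y)) {
    const current = groups[groups.length - 1];
    // Compare against the group's first Y so small drifts cannot chain together.
    if (current !== undefined && Math.abs(item.y - current.anchorY) <= tolerance) {
      current.members.push(item);
    } else {
      groups.push({ anchorY: item.y, members: [item] });
    }
  }
  return groups.map(({ members }) => {
    const ordered = [...members].sort((a, b) => a.x - b.x); // reading order
    const left = Math.min(...ordered.map((m) => m.x));
    const top = Math.min(...ordered.map((m) => m.y));
    const right = Math.max(...ordered.map((m) => m.x + m.width));
    const bottom = Math.max(...ordered.map((m) => m.y + m.height));
    return {
      text: ordered.map((m) => m.text).join(" "),
      x: left,
      y: top,
      width: right - left,
      height: bottom - top,
    };
  });
}

// Case-insensitive, whitespace-normalized substring search over merged lines.
function findLine(lines: TextBox[], needle: string): TextBox | undefined {
  const norm = (s: string) => s.replace(/\s+/g, " ").trim().toLowerCase();
  const target = norm(needle);
  if (target === "") return undefined;
  return lines.find((line) => norm(line.text).includes(target));
}
```

This version returns the box of the whole matching line; narrowing the box to just the matched substring would need per-item offsets, which the sketch leaves out.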

Fonts — Inter and JetBrains Mono loaded via next/font/google with CSS variables, integrated into Tailwind v4 via @theme inline.

API calls — All Firecrawl API calls go directly from the browser to api.firecrawl.dev. No Next.js API routes or server-side proxying. The API key never touches any server except Firecrawl.
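
A direct browser-to-Firecrawl call might be assembled as in the sketch below. buildRequestBody is named in src/lib/firecrawl.ts, but its signature here is hypothetical, and the /v2/scrape path and the { type: "json", prompt, schema } format entry reflect my reading of the Firecrawl v2 API; check both against the official reference before relying on them. PDF password and parse-mode options are left out because their field names are not given in this README.

```typescript
type OutputFormat = "json" | "markdown" | "html" | "summary";

// Hypothetical option shape for this sketch.
interface ParseOptions {
  url: string;
  format: OutputFormat;
  prompt?: string;
  schema?: Record<string, unknown>;
}

interface JsonFormat {
  type: "json";
  prompt?: string;
  schema?: Record<string, unknown>;
}

type FormatEntry = string | JsonFormat;

const API_BASE = "https://api.firecrawl.dev/v2"; // assumed base path

function buildRequestBody(opts: ParseOptions): { url: string; formats: FormatEntry[] } {
  if (opts.format !== "json") {
    return { url: opts.url, formats: [opts.format] };
  }
  // JSON extraction carries the optional prompt and schema inside the format entry.
  const jsonFormat: JsonFormat = { type: "json" };
  if (opts.prompt) jsonFormat.prompt = opts.prompt;
  if (opts.schema) jsonFormat.schema = opts.schema;
  return { url: opts.url, formats: [jsonFormat] };
}

function buildRequest(apiKey: string, opts: ParseOptions) {
  return {
    endpoint: `${API_BASE}/scrape`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // key goes straight to Firecrawl
      },
      body: JSON.stringify(buildRequestBody(opts)),
    },
  };
}
```

In the browser the result would be passed to fetch(endpoint, init), so the key appears only in the Authorization header of a request to api.firecrawl.dev.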

Getting started

Prerequisites

Node.js 20.9+ (Next.js 16 no longer supports Node.js 18)
A Firecrawl API key (fc-...)

Install

git clone <repo-url>
cd fireparse-app
npm install

Development

npm run dev

Open http://localhost:3000.

Production build

npm run build
npm start

Deploy to Vercel

npx vercel

No additional configuration needed — the app is a standard Next.js project.

Usage

1. Enter your Firecrawl API key in the Authentication field
2. Paste a document URL (PDF, DOCX, XLSX, or any web page)
3. Select an output format (JSON is recommended for structured extraction)
4. Optionally configure Advanced Options: JSON schema, extraction prompt, parse mode, PDF password
5. Click Parse Document
6. View results in the Rendered, Markdown, or JSON tabs
7. For PDFs: click View on PDF in the toolbar to see bounding boxes overlaid on the actual document

Default extraction

When no custom prompt or schema is provided and JSON format is selected, FireParse uses a comprehensive default prompt and schema that handles:

Document type      What gets extracted
Resumes            Name, contact details, each job with responsibilities, skills, education, awards, talks
Research papers    Title, authors, abstract, keywords, sections, references, equations, figure captions
Invoices           Line items, amounts, tax, totals, dates, account numbers, company details
Spreadsheets       Every cell with column headers as field names, sheet names as context
Presentations      Slide titles, bullet points, speaker notes, slide numbers
Legal documents    Clauses, party names, dates, defined terms, signature blocks
Web pages          Headings, paragraphs, links, metadata, structured data

The schema uses 32 field types: section_header, contact, personal_info, work_experience, responsibility, education, skill, award, certification, project, publication, reference, citation, talk, table_header, table_cell, table_total, list_item, paragraph, figure_caption, equation, abstract, keyword, footnote, slide_content, financial_amount, line_item, legal_clause, date, identifier, metadata, other.
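
When a custom schema is used, the output will not necessarily follow the field types above, which is presumably why normalize-items.ts is described as handling any output shape. One way to approach that is a recursive walk that flattens arbitrary JSON into path/value pairs. The sketch below is an assumption about the technique, not the project's normalizer, and the NormalizedItem shape is hypothetical.

```typescript
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

// Hypothetical item shape: a dotted path plus the leaf value as text.
interface NormalizedItem {
  path: string;
  value: string;
}

function normalizeItems(input: Json, path = ""): NormalizedItem[] {
  if (input === null) return [];
  if (Array.isArray(input)) {
    // Arrays contribute an index segment, e.g. "jobs[0]".
    return input.flatMap((child, i) => normalizeItems(child, `${path}[${i}]`));
  }
  if (typeof input === "object") {
    return Object.entries(input).flatMap(([key, child]) =>
      normalizeItems(child, path === "" ? key : `${path}.${key}`),
    );
  }
  // Leaf: keep non-empty strings, numbers, and booleans as text.
  const value = String(input).trim();
  return value === "" ? [] : [{ path, value }];
}
```

Each resulting value can then be searched for in the PDF text, with the path serving as a label in the sidebar list.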

Tech stack

Next.js 16 with App Router and Turbopack
React 19 with Server Components
Tailwind CSS v4 with @theme inline
Firecrawl v2 API for document parsing and LLM extraction
PDF.js 3.11 for client-side PDF rendering
Marked for markdown-to-HTML rendering
next/font for Inter + JetBrains Mono

License

MIT
