A document parser playground built with Next.js 16, React 19, and the Firecrawl v2 API. Paste a URL to any PDF, Word document, or spreadsheet and get structured JSON, rendered markdown, or raw HTML back — with an interactive PDF viewer that overlays color-coded bounding boxes on extracted fields.
- Universal document parsing — PDFs, Excel (.xlsx, .xls), Word (.docx, .doc, .odt, .rtf), and web pages
- Structured JSON extraction — LLM-powered extraction with a comprehensive default prompt that handles resumes, research papers, invoices, presentations, spreadsheets, legal documents, and more
- Custom schema & prompt — Provide your own JSON Schema or natural language prompt to control extraction structure
- PDF viewer with bounding boxes — After parsing, click "View on PDF" to see extracted fields highlighted directly on the PDF pages with color-coded overlays
- Multiple output formats — JSON, Markdown, HTML, and Summary modes
- Password-protected PDFs — Enter a PDF password in Advanced Config to parse encrypted documents
- Auto file type detection — Detects document type from URL extension and shows format-specific options
- API key persistence — Stored in localStorage, never sent to any server except Firecrawl directly
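The extension-based file type detection listed above could be sketched roughly as follows. The category names, the `detectFileType` function, and the fallback to `"web"` are illustrative assumptions, not the app's actual `getFileType` implementation.

```typescript
// Hypothetical categories; the app's real list lives in lib/constants.ts.
type FileType = "pdf" | "spreadsheet" | "word" | "web";

// Extensions taken from the feature list above.
const EXTENSION_MAP = new Map<string, FileType>([
  ["pdf", "pdf"],
  ["xlsx", "spreadsheet"],
  ["xls", "spreadsheet"],
  ["docx", "word"],
  ["doc", "word"],
  ["odt", "word"],
  ["rtf", "word"],
]);

// Classify a URL by the extension of its last path segment,
// ignoring any query string or fragment.
function detectFileType(url: string): FileType {
  const path = url.split(/[?#]/)[0] ?? "";
  const lastSegment = path.slice(path.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  if (dot === -1) return "web";
  const ext = lastSegment.slice(dot + 1).toLowerCase();
  // Anything without a known document extension is treated as a web page.
  return EXTENSION_MAP.get(ext) ?? "web";
}
```

A URL such as `https://example.com/report.PDF?download=1` would be classified as `"pdf"` here, while a bare domain falls through to `"web"`.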
- src/
  - app/
    - `layout.tsx`: Server component; next/font (Inter + JetBrains Mono), metadata
    - `page.tsx`: Client component; state orchestrator, wires sidebar + main panel
    - `globals.css`: Tailwind v4 @theme, custom scrollbar, bbox overlay, result styles
  - components/
    - sidebar/
      - `sidebar.tsx`: 440px left panel with header + scrollable form
      - `api-key-input.tsx`: Password input with key icon, localStorage persistence
      - `url-input.tsx`: URL input with file upload + detected type badge
      - `format-grid.tsx`: JSON / Markdown / HTML / Summary toggle grid
      - `advanced-config.tsx`: Collapsible: schema, prompt, parse mode, toggles, PDF password
      - `parse-button.tsx`: Orange CTA with loading / success states
      - `sidebar-items.tsx`: Extracted items list for viewer mode
    - main-panel/
      - `main-panel.tsx`: Right panel orchestrator (empty / result / viewer views)
      - `toolbar.tsx`: Tab row, View on PDF, status badge, copy
      - `empty-state.tsx`: Server component; dot grid + placeholder
      - `result-area.tsx`: Rendered / Markdown / JSON / Extra tabs
      - `pdf-viewer.tsx`: Container for PDF.js canvas rendering
      - `floating-legend.tsx`: Server component; bounding box category legend
    - ui/
      - `status-badge.tsx`: Reusable loading / success badge
      - `dot-grid.tsx`: Server component; subtle dot pattern background
  - lib/
    - `types.ts`: TypeScript interfaces for all data shapes
    - `constants.ts`: API base, viewer scale, file types, default extraction prompt + schema
    - `firecrawl.ts`: Firecrawl v2 API client (buildRequestBody, parseDocument, getFileType)
    - `normalize-items.ts`: Universal JSON normalizer for any Firecrawl output shape
    - `pdf-viewer-engine.ts`: PDF.js text matching, line merging, bounding box positioning
  - hooks/
    - `use-parse.ts`: Parse lifecycle state machine (idle → loading → success / error)
    - `use-pdf-viewer.ts`: PDF.js CDN loading, canvas rendering, bounding box matching
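The parse lifecycle attributed to `use-parse.ts` (idle → loading → success / error) can be modeled as a small reducer. This is a minimal sketch under assumed names (`ParseState`, `ParseAction`, `parseReducer`); the actual hook may be structured differently.

```typescript
// Each state carries only the data that is valid in that state.
type ParseState<T> =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: T }
  | { status: "error"; message: string };

type ParseAction<T> =
  | { type: "start" }
  | { type: "resolve"; data: T }
  | { type: "reject"; message: string }
  | { type: "reset" };

function parseReducer<T>(state: ParseState<T>, action: ParseAction<T>): ParseState<T> {
  switch (action.type) {
    case "start":
      return { status: "loading" };
    case "resolve":
      // Ignore results that arrive when no request is in flight.
      return state.status === "loading" ? { status: "success", data: action.data } : state;
    case "reject":
      return state.status === "loading" ? { status: "error", message: action.message } : state;
    case "reset":
      return { status: "idle" };
  }
}
```

A reducer of this shape could be passed to React's `useReducer`; guarding `resolve` and `reject` on the `loading` state is one way to keep a stale response from overwriting a reset form.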
Server / Client split — layout.tsx, empty-state.tsx, dot-grid.tsx, and floating-legend.tsx are server components (static markup, no JS shipped). Everything interactive lives under a single "use client" boundary at page.tsx.
PDF.js via CDN — Loaded at runtime via a script tag to avoid bundling the canvas polyfill that breaks Turbopack. The worker is also loaded from CDN. This keeps the bundle small and avoids SSR issues.
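Loading a library at runtime through a script tag, as described above, might look like the following sketch. The `loadScriptOnce` helper and the structural `DocumentLike` / `ScriptLike` types are assumptions made so the example is self-contained; they are not taken from the app's source.

```typescript
// Minimal structural types so the sketch also type-checks outside a browser.
interface ScriptLike {
  src: string;
  async: boolean;
  onload: (() => void) | null;
  onerror: (() => void) | null;
}

interface DocumentLike {
  createElement(tag: "script"): ScriptLike;
  head: { appendChild(node: ScriptLike): unknown };
}

// One promise per URL, so repeated calls never inject the same script twice.
const pending = new Map<string, Promise<void>>();

function loadScriptOnce(src: string, doc: DocumentLike): Promise<void> {
  const cached = pending.get(src);
  if (cached) return cached;

  const promise = new Promise<void>((resolve, reject) => {
    const script = doc.createElement("script");
    script.src = src;
    script.async = true;
    script.onload = () => resolve();
    script.onerror = () => {
      pending.delete(src); // allow a retry after a failed load
      reject(new Error(`Failed to load ${src}`));
    };
    doc.head.appendChild(script);
  });

  pending.set(src, promise);
  return promise;
}
```

In a browser the real `document` would play the role of `doc` (a cast may be needed because the structural types here are simplified). A cdnjs URL such as `https://cdnjs.cloudflare.com/ajax/libs/pdf.js/3.11.174/pdf.min.js` is one plausible source, but that exact URL is an assumption; the README only states that PDF.js 3.11 and its worker come from a CDN.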
Bounding box rendering — Uses PDF.js getTextContent() to extract precise text positions, then matches them against JSON extraction results using substring search with line-merging (groups adjacent text items on the same Y-coordinate into single boxes). Scale-aware tolerances ensure boxes align properly at any zoom level.
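The line-merging step can be illustrated with a small pure function. This sketch assumes text items have already been converted to canvas coordinates with a top-left origin, and it merges every item on a shared line regardless of horizontal gap; the `TextBox` shape, `mergeLines`, and the base tolerance of 2 are hypothetical, not the engine's real values.

```typescript
interface TextBox {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

// Group items whose Y positions fall within a scale-aware tolerance,
// then collapse each group into one box spanning all of its items.
function mergeLines(items: TextBox[], scale: number, baseTolerance = 2): TextBox[] {
  const tolerance = baseTolerance * scale; // tolerance grows with zoom
  const sorted = [...items].sort((a, b) => a.y - b.y);

  const groups: TextBox[][] = [];
  for (const item of sorted) {
    const group = groups[groups.length - 1];
    const anchor = group?.[0];
    if (group && anchor && Math.abs(item.y - anchor.y) <= tolerance) {
      group.push(item);
    } else {
      groups.push([item]);
    }
  }

  return groups.map((group) => {
    // Order left to right so the merged text reads naturally.
    const ordered = [...group].sort((a, b) => a.x - b.x);
    const left = Math.min(...ordered.map((i) => i.x));
    const right = Math.max(...ordered.map((i) => i.x + i.width));
    const top = Math.min(...ordered.map((i) => i.y));
    const bottom = Math.max(...ordered.map((i) => i.y + i.height));
    return {
      text: ordered.map((i) => i.text).join(" "),
      x: left,
      y: top,
      width: right - left,
      height: bottom - top,
    };
  });
}
```

Multiplying the tolerance by the viewer scale is one way to keep grouping stable across zoom levels: a half-pixel baseline wobble at 1x becomes a full pixel at 2x, and the threshold widens with it.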
Fonts — Inter and JetBrains Mono loaded via next/font/google with CSS variables, integrated into Tailwind v4 via @theme inline.
API calls — All Firecrawl API calls go directly from the browser to api.firecrawl.dev. No Next.js API routes or server-side proxying. The API key never touches any server except Firecrawl.
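A direct browser-to-Firecrawl call could be assembled along these lines. The endpoint path, the `Bearer` header, and the object form of the JSON format entry reflect one reading of the v2 scrape API and should be checked against the Firecrawl documentation; `buildScrapeRequest` and its option names are hypothetical and are not the app's `buildRequestBody`.

```typescript
const API_BASE = "https://api.firecrawl.dev/v2"; // assumed base URL

type OutputFormat = "json" | "markdown" | "html" | "summary";

interface JsonFormat {
  type: "json";
  prompt?: string;
  schema?: Record<string, unknown>;
}

interface ParseOptions {
  url: string;
  format: OutputFormat;
  prompt?: string;
  schema?: Record<string, unknown>;
}

interface ScrapeRequest {
  endpoint: string;
  init: {
    method: "POST";
    headers: { Authorization: string; "Content-Type": string };
    body: string;
  };
}

// JSON extraction is sent as an object carrying the prompt and schema;
// the other formats are sent as plain strings.
function buildFormats(opts: ParseOptions): Array<string | JsonFormat> {
  if (opts.format !== "json") return [opts.format];
  const json: JsonFormat = { type: "json" };
  if (opts.prompt) json.prompt = opts.prompt;
  if (opts.schema) json.schema = opts.schema;
  return [json];
}

function buildScrapeRequest(apiKey: string, opts: ParseOptions): ScrapeRequest {
  return {
    endpoint: `${API_BASE}/scrape`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: opts.url, formats: buildFormats(opts) }),
    },
  };
}
```

The result can be handed straight to `fetch(req.endpoint, req.init)` in the browser. Keeping request construction separate from the network call makes the body easy to inspect and test without an API key.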
- Node.js 20.9+ (the minimum version supported by Next.js 16)
- A Firecrawl API key (`fc-...`)
git clone <repo-url>
cd fireparse-app
npm install
npm run dev

Open http://localhost:3000.
npm run build
npm start

npx vercel

No additional configuration is needed; the app is a standard Next.js project.
- Enter your Firecrawl API key in the Authentication field
- Paste a document URL (PDF, DOCX, XLSX, or any web page)
- Select an output format (JSON is recommended for structured extraction)
- Optionally configure Advanced Options: JSON schema, extraction prompt, parse mode, PDF password
- Click Parse Document
- View results in the Rendered, Markdown, or JSON tabs
- For PDFs: click View on PDF in the toolbar to see bounding boxes overlaid on the actual document
When no custom prompt or schema is provided and JSON format is selected, FireParse uses a comprehensive default prompt and schema that handles:
| Document Type | What gets extracted |
|---|---|
| Resumes | Name, contact details, each job with responsibilities, skills, education, awards, talks |
| Research papers | Title, authors, abstract, keywords, sections, references, equations, figure captions |
| Invoices | Line items, amounts, tax, totals, dates, account numbers, company details |
| Spreadsheets | Every cell with column headers as field names, sheet names as context |
| Presentations | Slide titles, bullet points, speaker notes, slide numbers |
| Legal documents | Clauses, party names, dates, defined terms, signature blocks |
| Web pages | Headings, paragraphs, links, metadata, structured data |
The schema uses 32 field types: section_header, contact, personal_info, work_experience, responsibility, education, skill, award, certification, project, publication, reference, citation, talk, table_header, table_cell, table_total, list_item, paragraph, figure_caption, equation, abstract, keyword, footnote, slide_content, financial_amount, line_item, legal_clause, date, identifier, metadata, other.
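The field types listed above can be captured as a TypeScript union with a runtime guard. This is a sketch rather than the contents of `lib/types.ts`; in particular, mapping unrecognized labels to `"other"` is an assumed fallback.

```typescript
// The 32 field types named in the schema description.
const FIELD_TYPES = [
  "section_header", "contact", "personal_info", "work_experience",
  "responsibility", "education", "skill", "award",
  "certification", "project", "publication", "reference",
  "citation", "talk", "table_header", "table_cell",
  "table_total", "list_item", "paragraph", "figure_caption",
  "equation", "abstract", "keyword", "footnote",
  "slide_content", "financial_amount", "line_item", "legal_clause",
  "date", "identifier", "metadata", "other",
] as const;

type FieldType = (typeof FIELD_TYPES)[number];

const FIELD_TYPE_SET: ReadonlySet<string> = new Set<string>(FIELD_TYPES);

// Runtime check for values coming back from LLM extraction.
function isFieldType(value: unknown): value is FieldType {
  return typeof value === "string" && FIELD_TYPE_SET.has(value);
}

// Assumed fallback: anything unrecognized is treated as "other".
function toFieldType(value: unknown): FieldType {
  return isFieldType(value) ? value : "other";
}
```

Deriving the union from a single `as const` array keeps the type and the runtime check in sync, which matters when the labels are produced by a model rather than by typed code.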
- Next.js 16 with App Router and Turbopack
- React 19 with Server Components
- Tailwind CSS v4 with `@theme inline`
- Firecrawl v2 API for document parsing and LLM extraction
- PDF.js 3.11 for client-side PDF rendering
- Marked for markdown-to-HTML rendering
- next/font for Inter + JetBrains Mono
MIT