adithyaakrishna/fireparse

FireParse

A document parser playground built with Next.js 16, React 19, and the Firecrawl v2 API. Paste a URL to any PDF, Word document, or spreadsheet and get structured JSON, rendered markdown, or raw HTML back — with an interactive PDF viewer that overlays color-coded bounding boxes on extracted fields.

Features

  • Universal document parsing — PDFs, Excel (.xlsx, .xls), Word (.docx, .doc, .odt, .rtf), and web pages
  • Structured JSON extraction — LLM-powered extraction with a comprehensive default prompt that handles resumes, research papers, invoices, presentations, spreadsheets, legal documents, and more
  • Custom schema & prompt — Provide your own JSON Schema or natural language prompt to control extraction structure
  • PDF viewer with bounding boxes — After parsing, click "View on PDF" to see extracted fields highlighted directly on the PDF pages with color-coded overlays
  • Multiple output formats — JSON, Markdown, HTML, and Summary modes
  • Password-protected PDFs — Enter a PDF password in Advanced Config to parse encrypted documents
  • Auto file type detection — Detects document type from URL extension and shows format-specific options
  • API key persistence — Stored in localStorage, never sent to any server except Firecrawl directly
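
The extension-based detection mentioned above can be sketched as a small pure function. This is a minimal illustration, not the repository's `getFileType`: the name `detectFileType`, the `FileKind` labels, and the fallback to `"web"` for unknown extensions are assumptions. Only the extension list comes from the Features list.

```typescript
// Hypothetical sketch: map a document URL to a coarse file kind by extension.
type FileKind = "pdf" | "spreadsheet" | "word" | "web";

// Extensions taken from the Features list; the grouping labels are assumed.
const EXTENSION_KINDS = new Map<string, FileKind>([
  ["pdf", "pdf"],
  ["xlsx", "spreadsheet"],
  ["xls", "spreadsheet"],
  ["docx", "word"],
  ["doc", "word"],
  ["odt", "word"],
  ["rtf", "word"],
]);

function detectFileType(rawUrl: string): FileKind {
  // Drop the query string and fragment so "report.pdf?dl=1" still matches.
  const path = rawUrl.split(/[?#]/)[0] ?? "";
  const lastSegment = path.substring(path.lastIndexOf("/") + 1);
  const dot = lastSegment.lastIndexOf(".");
  if (dot === -1) return "web";
  const ext = lastSegment.substring(dot + 1).toLowerCase();
  // Anything that is not a known document extension is treated as a web page.
  return EXTENSION_KINDS.get(ext) ?? "web";
}
```

A UI could use the returned kind to decide which format-specific options to show, for example the PDF password field only when the kind is `"pdf"`.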

Architecture

src/
  app/
    layout.tsx              Server component — next/font (Inter + JetBrains Mono), metadata
    page.tsx                Client component — state orchestrator, wires sidebar + main panel
    globals.css             Tailwind v4 @theme, custom scrollbar, bbox overlay, result styles
  components/
    sidebar/
      sidebar.tsx           440px left panel with header + scrollable form
      api-key-input.tsx     Password input with key icon, localStorage persistence
      url-input.tsx         URL input with file upload + detected type badge
      format-grid.tsx       JSON / Markdown / HTML / Summary toggle grid
      advanced-config.tsx   Collapsible: schema, prompt, parse mode, toggles, PDF password
      parse-button.tsx      Orange CTA with loading / success states
      sidebar-items.tsx     Extracted items list for viewer mode
    main-panel/
      main-panel.tsx        Right panel orchestrator (empty / result / viewer views)
      toolbar.tsx           Tab row, View on PDF, status badge, copy
      empty-state.tsx       Server component — dot grid + placeholder
      result-area.tsx       Rendered / Markdown / JSON / Extra tabs
      pdf-viewer.tsx        Container for PDF.js canvas rendering
      floating-legend.tsx   Server component — bounding box category legend
    ui/
      status-badge.tsx      Reusable loading / success badge
      dot-grid.tsx          Server component — subtle dot pattern background
  lib/
    types.ts                TypeScript interfaces for all data shapes
    constants.ts            API base, viewer scale, file types, default extraction prompt + schema
    firecrawl.ts            Firecrawl v2 API client (buildRequestBody, parseDocument, getFileType)
    normalize-items.ts      Universal JSON normalizer for any Firecrawl output shape
    pdf-viewer-engine.ts    PDF.js text matching, line merging, bounding box positioning
  hooks/
    use-parse.ts            Parse lifecycle state machine (idle → loading → success / error)
    use-pdf-viewer.ts       PDF.js CDN loading, canvas rendering, bounding box matching
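
The parse lifecycle listed for use-parse.ts (idle → loading → success / error) can be modeled as a reducer over a discriminated union. The sketch below is an assumption about how such a state machine might look. The names `ParseState`, `ParseEvent`, and `parseReducer`, and the `reset` event, are hypothetical and not taken from the repository.

```typescript
// Hypothetical sketch of the idle → loading → success / error lifecycle.
type ParseState =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; result: unknown }
  | { status: "error"; message: string };

type ParseEvent =
  | { type: "start" }
  | { type: "resolve"; result: unknown }
  | { type: "reject"; message: string }
  | { type: "reset" };

function parseReducer(state: ParseState, event: ParseEvent): ParseState {
  switch (event.type) {
    case "start":
      // A new parse may begin from any state except an in-flight one.
      return state.status === "loading" ? state : { status: "loading" };
    case "resolve":
      // Ignore results that arrive when nothing is loading.
      return state.status === "loading"
        ? { status: "success", result: event.result }
        : state;
    case "reject":
      return state.status === "loading"
        ? { status: "error", message: event.message }
        : state;
    case "reset":
      return { status: "idle" };
  }
}
```

A reducer of this shape can be handed to React's `useReducer`, which keeps every transition in one place and makes invalid states (such as a result without a finished request) unrepresentable.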

Design decisions

Server / Client split: layout.tsx, empty-state.tsx, dot-grid.tsx, and floating-legend.tsx are server components (static markup, no JS shipped). Everything interactive lives under a single "use client" boundary at page.tsx.

PDF.js via CDN — Loaded at runtime via a script tag to avoid bundling the canvas polyfill that breaks Turbopack. The worker is also loaded from CDN. This keeps the bundle small and avoids SSR issues.
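
Runtime loading through a script tag might look like the sketch below. It memoizes by URL so the library and its worker are each injected at most once. The helper names `loadScriptOnce` and `injectScriptTag` are hypothetical, and the README does not say which CDN or PDF.js version is used, so the URL is left as a parameter.

```typescript
// Hypothetical sketch: load an external script at runtime, at most once per URL.
type Injector = (src: string) => Promise<void>;

const pending = new Map<string, Promise<void>>();

// Default injector: appends a <script> tag in the browser.
// `document` is read through globalThis so the sketch also type-checks
// in environments without DOM typings.
const injectScriptTag: Injector = (src) =>
  new Promise<void>((resolve, reject) => {
    const doc = (globalThis as any).document;
    const tag = doc.createElement("script");
    tag.src = src;
    tag.async = true;
    tag.onload = () => resolve();
    tag.onerror = () => reject(new Error(`Failed to load ${src}`));
    doc.head.appendChild(tag);
  });

function loadScriptOnce(src: string, inject: Injector = injectScriptTag): Promise<void> {
  const existing = pending.get(src);
  if (existing) return existing;
  const promise = inject(src).catch((err) => {
    // Forget failed loads so a later call can retry.
    pending.delete(src);
    throw err;
  });
  pending.set(src, promise);
  return promise;
}
```

Because nothing here runs until it is called from the browser, a loader like this stays out of the server render path. The injector parameter exists only to make the caching logic testable without a DOM.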

Bounding box rendering — Uses PDF.js getTextContent() to extract precise text positions, then matches them against JSON extraction results using substring search with line-merging (groups adjacent text items on the same Y-coordinate into single boxes). Scale-aware tolerances ensure boxes align properly at any zoom level.
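
The line-merging step can be illustrated with a simplified model. The sketch below assumes each text item has already been reduced to a plain rectangle (`TextBox`), whereas PDF.js text items carry their position in a transform matrix. The function name `mergeLineItems` and both tolerance parameters are hypothetical.

```typescript
// Simplified text item; an assumption for illustration, not the PDF.js item shape.
interface TextBox {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
}

function mergeLineItems(items: TextBox[], yTolerance: number, maxGap: number): TextBox[] {
  // 1. Bucket items into lines: items whose y is within yTolerance of the line's first item.
  const byY = [...items].sort((a, b) => a.y - b.y);
  const lines: { y: number; items: TextBox[] }[] = [];
  for (const item of byY) {
    const line = lines[lines.length - 1];
    if (line !== undefined && Math.abs(item.y - line.y) <= yTolerance) {
      line.items.push(item);
    } else {
      lines.push({ y: item.y, items: [item] });
    }
  }

  // 2. Within each line, walk left to right and join neighbours separated by a small gap.
  const merged: TextBox[] = [];
  for (const line of lines) {
    const ordered = [...line.items].sort((a, b) => a.x - b.x);
    let current: TextBox | undefined = undefined;
    for (const item of ordered) {
      const gap: number = current ? item.x - (current.x + current.width) : Infinity;
      if (current && gap <= maxGap) {
        // Extend the current box so it covers this item as well.
        const right = Math.max(current.x + current.width, item.x + item.width);
        current.text += (gap > 0 ? " " : "") + item.text;
        current.width = right - current.x;
        current.height = Math.max(current.height, item.height);
      } else {
        current = { ...item };
        merged.push(current);
      }
    }
  }
  return merged;
}
```

To keep the tolerances scale-aware, a caller could multiply base values by the current viewer scale, for example `mergeLineItems(items, 2 * scale, 8 * scale)`. Those base numbers are placeholders, not values from the repository.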

Fonts — Inter and JetBrains Mono loaded via next/font/google with CSS variables, integrated into Tailwind v4 via @theme inline.

API calls — All Firecrawl API calls go directly from the browser to api.firecrawl.dev. No Next.js API routes or server-side proxying. The API key never touches any server except Firecrawl.
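
A browser-side request of this kind might be assembled as in the sketch below. Only the `api.firecrawl.dev` host comes from this README. The `/v2/scrape` path, the bearer `Authorization` header, the `formats` body shape (including the object form for JSON extraction), and the helper name `buildScrapeRequest` are assumptions that should be checked against the Firecrawl v2 API reference.

```typescript
type OutputFormat = "json" | "markdown" | "html" | "summary";

interface ScrapeOptions {
  url: string;
  format: OutputFormat;
  prompt?: string;
  schema?: Record<string, unknown>;
}

interface PreparedRequest {
  endpoint: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

const API_BASE = "https://api.firecrawl.dev"; // host named in the README
const SCRAPE_PATH = "/v2/scrape"; // assumed endpoint path

function buildScrapeRequest(apiKey: string, options: ScrapeOptions): PreparedRequest {
  // Assumed body shape: JSON extraction is an object entry carrying the
  // prompt and schema; the other formats are plain strings.
  const formatEntry =
    options.format === "json"
      ? {
          type: "json",
          ...(options.prompt ? { prompt: options.prompt } : {}),
          ...(options.schema ? { schema: options.schema } : {}),
        }
      : options.format;

  return {
    endpoint: API_BASE + SCRAPE_PATH,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`, // assumed auth scheme
      },
      body: JSON.stringify({ url: options.url, formats: [formatEntry] }),
    },
  };
}
```

In the browser the result would be passed straight to `fetch(request.endpoint, request.init)`, so the key is sent to that one host and to no intermediate server.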

Getting started

Prerequisites

  • Node.js and npm (a Node.js version supported by Next.js 16)
  • A Firecrawl API key

Install

git clone <repo-url>
cd fireparse
npm install

Development

npm run dev

Open http://localhost:3000.

Production build

npm run build
npm start

Deploy to Vercel

npx vercel

No additional configuration needed — the app is a standard Next.js project.

Usage

  1. Enter your Firecrawl API key in the Authentication field
  2. Paste a document URL (PDF, DOCX, XLSX, or any web page)
  3. Select an output format (JSON is recommended for structured extraction)
  4. Optionally configure Advanced Options: JSON schema, extraction prompt, parse mode, PDF password
  5. Click Parse Document
  6. View results in the Rendered, Markdown, or JSON tabs
  7. For PDFs: click View on PDF in the toolbar to see bounding boxes overlaid on the actual document

Default extraction

When no custom prompt or schema is provided and JSON format is selected, FireParse uses a comprehensive default prompt and schema that handles:

| Document type   | What gets extracted                                                                    |
|-----------------|----------------------------------------------------------------------------------------|
| Resumes         | Name, contact details, each job with responsibilities, skills, education, awards, talks |
| Research papers | Title, authors, abstract, keywords, sections, references, equations, figure captions   |
| Invoices        | Line items, amounts, tax, totals, dates, account numbers, company details              |
| Spreadsheets    | Every cell with column headers as field names, sheet names as context                  |
| Presentations   | Slide titles, bullet points, speaker notes, slide numbers                              |
| Legal documents | Clauses, party names, dates, defined terms, signature blocks                           |
| Web pages       | Headings, paragraphs, links, metadata, structured data                                 |

The schema uses 32 field types: section_header, contact, personal_info, work_experience, responsibility, education, skill, award, certification, project, publication, reference, citation, talk, table_header, table_cell, table_total, list_item, paragraph, figure_caption, equation, abstract, keyword, footnote, slide_content, financial_amount, line_item, legal_clause, date, identifier, metadata, other.
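
A closed set of field types like this maps naturally onto a string-literal union. The sketch below is one way to type it. The names `FIELD_TYPES`, `FieldType`, `isFieldType`, and `coerceFieldType` are hypothetical, and falling back to `"other"` for unrecognized values is an assumption rather than documented behavior.

```typescript
// The 32 field types listed above, as a readonly tuple.
const FIELD_TYPES = [
  "section_header", "contact", "personal_info", "work_experience",
  "responsibility", "education", "skill", "award",
  "certification", "project", "publication", "reference",
  "citation", "talk", "table_header", "table_cell",
  "table_total", "list_item", "paragraph", "figure_caption",
  "equation", "abstract", "keyword", "footnote",
  "slide_content", "financial_amount", "line_item", "legal_clause",
  "date", "identifier", "metadata", "other",
] as const;

// Union of the literal strings above.
type FieldType = (typeof FIELD_TYPES)[number];

const FIELD_TYPE_SET = new Set<string>(FIELD_TYPES);

// Runtime guard for values coming back from an LLM extraction.
function isFieldType(value: unknown): value is FieldType {
  return typeof value === "string" && FIELD_TYPE_SET.has(value);
}

// Assumed policy: anything unrecognized is bucketed as "other".
function coerceFieldType(value: unknown): FieldType {
  return isFieldType(value) ? value : "other";
}
```

Because LLM output is not guaranteed to respect the schema, a guard like this at the boundary keeps downstream code (for example, a color lookup per category) exhaustive over `FieldType`.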

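Tech stack

  • Next.js 16 and React 19
  • TypeScript
  • Tailwind CSS v4
  • PDF.js (loaded from a CDN at runtime)
  • Firecrawl v2 API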

License

MIT
