unpdf

A collection of utilities to work with PDFs. Uses Mozilla's PDF.js under the hood and lazily initializes the library.

unpdf takes advantage of export conditions to circumvent build issues in serverless environments. For example, PDF.js depends on the optional canvas module, which doesn't work inside worker threads.

This library is also intended as a modern alternative to the unmaintained but still popular pdf-parse.

Features

🏗️ Conditional exports for Node.js, worker and browser environments
💬 Extract text and images from PDFs
🧱 Opt in to the legacy PDF.js build

Installation

Run one of the following commands to add unpdf to your project.

# pnpm
pnpm add unpdf

# npm
npm install unpdf

# yarn
yarn add unpdf

Usage

import { readFile } from 'node:fs/promises'
import { extractPDFText } from 'unpdf'

// Fetch a PDF file from the web
const pdf = await fetch('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')
  .then(res => res.arrayBuffer())

// Or load it from the filesystem instead (use one of the two, not both)
// const pdf = await readFile('./dummy.pdf')

// Pass the PDF buffer to the relevant method
const { totalPages, text } = await extractPDFText(
  new Uint8Array(pdf), { mergePages: true }
)

Use Legacy Or Custom PDF.js Build

// Before using any other methods, define the PDF.js module
import { defineUnPDFConfig } from 'unpdf'

// Use the legacy build (defineUnPDFConfig returns a promise, so await it)
await defineUnPDFConfig({
  pdfjs: () => import('pdfjs-dist/legacy/build/pdf.js')
})

// Now, you can use the other methods
// …

Access the PDF.js Module

import { getResolvedPDFJS } from 'unpdf'

const { version } = await getResolvedPDFJS()

Config

interface UnPDFConfiguration {
  /**
   * By default, UnPDF will use the latest version of PDF.js. If you want to
   * use an older version or the legacy build, provide a function that returns
   * a promise resolving to the PDF.js module.
   *
   * @example
   * () => import('pdfjs-dist/legacy/build/pdf.js')
   */
  pdfjs?: () => Promise<typeof PDFJS>
}

Methods

`defineUnPDFConfig`

Define a custom PDF.js module, like the legacy build. Make sure to call this method before using any other methods.

function defineUnPDFConfig(config: UnPDFConfiguration): Promise<void>

`getResolvedPDFJS`

Returns the resolved PDF.js module. If no build is defined, the latest version will be initialized.

function getResolvedPDFJS(): Promise<typeof import('pdfjs-dist')>
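
To picture how a configurable, lazily initialized module could behave, here is a minimal sketch. It is not unpdf's actual implementation: `createResolver`, `PdfModuleLike` and `ModuleLoader` are hypothetical names used only for illustration, and the loader is a stand-in for a dynamic `import()` of PDF.js.

```typescript
// Hypothetical stand-in for a PDF.js-like module; only `version` is modeled.
interface PdfModuleLike {
  version: string
}

// Hypothetical loader type, mirroring the `pdfjs` option in the config.
type ModuleLoader = () => Promise<PdfModuleLike>

function createResolver(defaultLoader: ModuleLoader) {
  let loader: ModuleLoader = defaultLoader
  let resolved: Promise<PdfModuleLike> | undefined

  return {
    // Swap in a custom loader and discard any module resolved so far.
    define(custom: ModuleLoader): void {
      loader = custom
      resolved = undefined
    },
    // Nothing is loaded until the first call; later calls reuse the promise.
    resolve(): Promise<PdfModuleLike> {
      if (!resolved) {
        resolved = loader()
      }
      return resolved
    },
  }
}
```

In this sketch the default loader never runs if `define` is called first, which is one reason to configure a custom build before calling any other method.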

`getPDFMeta`

function getPDFMeta(
  data: BinaryData | PDFDocumentProxy
): Promise<{
  info: Record<string, any>
  metadata: Record<string, any>
}>
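
Because `info` and `metadata` are typed as `Record<string, any>`, callers may want to read fields defensively. The sketch below shows one way to do that. `PDFMeta` and `readInfoField` are hypothetical names, not part of unpdf, and the assumption that `info` carries keys such as `Title` or `Author` (the standard PDF document information entries) is not confirmed by this README.

```typescript
// Shape of the value resolved by getPDFMeta, copied from the signature above.
interface PDFMeta {
  info: Record<string, any>
  metadata: Record<string, any>
}

// Hypothetical helper: return a non-empty string field from `info`,
// or the fallback when the key is missing or holds another type.
function readInfoField(meta: PDFMeta, key: string, fallback = ''): string {
  const value: unknown = meta.info[key]
  return typeof value === 'string' && value.trim() !== '' ? value : fallback
}

// Possible usage, assuming `meta` came from `await getPDFMeta(data)`:
// const title = readInfoField(meta, 'Title', 'Untitled')
```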

`extractPDFText`

function extractPDFText(
  data: BinaryData | PDFDocumentProxy,
  { mergePages }?: { mergePages?: boolean }
): Promise<{
  totalPages: number
  text: string | string[]
}>
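
Since `text` is typed as `string | string[]`, calling code has to handle both shapes. The following sketch normalizes the result. It assumes that `mergePages: true` yields a single string and that the array form holds one entry per page, which the signature suggests but does not state. `ExtractedText`, `toPageArray` and `toSingleString` are hypothetical helpers, not part of unpdf.

```typescript
// Shape of the value resolved by extractPDFText, copied from the signature above.
interface ExtractedText {
  totalPages: number
  text: string | string[]
}

// Hypothetical helper: always return an array, wrapping a merged string.
function toPageArray(result: ExtractedText): string[] {
  return Array.isArray(result.text) ? result.text : [result.text]
}

// Hypothetical helper: always return one string, with entries
// separated by a blank line.
function toSingleString(result: ExtractedText): string {
  return toPageArray(result).join('\n\n')
}
```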

`getImagesFromPage`

function getImagesFromPage(
  data: BinaryData | PDFDocumentProxy,
  pageNumber: number
): Promise<ArrayBuffer[]>
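
`getImagesFromPage` works on one page at a time, so collecting images from a whole document means looping over page numbers. The sketch below assumes page numbers are 1-based, as in PDF.js's own `getPage`. This README does not confirm that. `ImageExtractor`, `pageNumbers` and `collectImages` are hypothetical names, and the extractor is injected so the loop stays independent of unpdf.

```typescript
// Hypothetical callback type: given a page number, resolve that page's images.
type ImageExtractor = (pageNumber: number) => Promise<ArrayBuffer[]>

// Build the list 1..totalPages (assumes 1-based page numbering).
function pageNumbers(totalPages: number): number[] {
  return Array.from({ length: totalPages }, (_, index) => index + 1)
}

// Visit pages sequentially and keep the images grouped per page.
async function collectImages(
  totalPages: number,
  extract: ImageExtractor
): Promise<ArrayBuffer[][]> {
  const perPage: ArrayBuffer[][] = []
  for (const pageNumber of pageNumbers(totalPages)) {
    perPage.push(await extract(pageNumber))
  }
  return perPage
}

// Possible usage, assuming `data` holds the PDF bytes:
// const images = await collectImages(totalPages, n => getImagesFromPage(data, n))
```

Processing pages one after another keeps memory use predictable; running the calls in parallel with `Promise.all` is an alternative when the document is small.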
