Commit 4f0d60f

feat: upgrade PDF.js to v5.6.205
1 parent c488cf8 commit 4f0d60f

11 files changed

+2533
-2036
lines changed

README.md

Lines changed: 22 additions & 38 deletions
@@ -1,40 +1,25 @@
 # unpdf
 
-A collection of utilities for PDF extraction and rendering. Designed specifically for serverless environments, but it also works in Node.js, Deno, Bun and the browser. `unpdf` is particularly useful for serverless AI applications, especially for summarizing PDF documents in document analysis workflows.
+Utilities for PDF extraction and rendering across all JavaScript runtimes – Node.js, Deno, Bun, the browser, and serverless environments like Cloudflare Workers. Especially useful for AI applications that need to summarize or analyze PDF documents.
 
-This library ships with a serverless build/redistribution of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js) that is optimized for edge environments. Some string replacements, global mocks and inlining the PDF.js worker allow the browser code to become platform agnostic. See [`pdfjs.rollup.config.ts`](./pdfjs.rollup.config.ts) for the details.
-
-This library is also intended as a modern alternative to the unmaintained but still popular [`pdf-parse`](https://www.npmjs.com/package/pdf-parse).
+Ships with a serverless build of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js), optimized for edge environments. If you're coming from [`pdf-parse`](https://www.npmjs.com/package/pdf-parse), `unpdf` is a modern, actively maintained alternative with broader runtime support.
 
 ## Features
 
-- 🏗️ Made for Node.js, browser and serverless environments
+- 🏗️ Works in Node.js, browser and serverless environments
 - 🪭 Includes serverless build of PDF.js ([`unpdf/pdfjs`](./package.json#L34))
 - 💬 Extract [text](#extract-text-from-pdf), [links](#extractlinks), and [images](#extractimages) from PDF files
 - 🧠 Perfect for AI applications and PDF summarization
-- 🧱 Opt-in to legacy PDF.js build
-- 💨 Zero dependencies
-
-## PDF.js Compatibility
-
-> [!Tip]
-> The serverless PDF.js bundle provided by `unpdf` is built from PDF.js v5.4.394.
-
-You can use an [official PDF.js build](#official-or-legacy-pdfjs-build) by using the [`definePDFJSModule`](#definepdfjsmodule) method. This is useful if you want to use a specific version or a custom build of PDF.js.
+- 🧱 Opt-in to official or legacy PDF.js build
 
 ## Installation
 
-Run the following command to add `unpdf` to your project.
-
 ```bash
 # pnpm
-pnpm add -D unpdf
+pnpm add unpdf
 
 # npm
-npm install -D unpdf
-
-# yarn
-yarn add -D unpdf
+npm install unpdf
 ```
 
 ## Usage
@@ -44,15 +29,11 @@ yarn add -D unpdf
 ```ts
 import { extractText, getDocumentProxy } from 'unpdf'
 
-// Either fetch a PDF file from the web or load it from the file system
+// Fetch a PDF from the web or load it from the file system
 const buffer = await fetch('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')
 .then(res => res.arrayBuffer())
-const buffer = await readFile('./dummy.pdf')
 
-// Then, load the PDF file into a PDF.js document
 const pdf = await getDocumentProxy(new Uint8Array(buffer))
-
-// Finally, extract the text from the PDF file
 const { totalPages, text } = await extractText(pdf, { mergePages: true })
 
 console.log(`Total pages: ${totalPages}`)
@@ -64,9 +45,9 @@ console.log(text)
 Usually you don't need to worry about the PDF.js build. `unpdf` ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.
 
 > [!WARNING]
-> PDF.js v5.x uses `Promise.withResolvers`, which may not be supported in all environments, such as Node < 22. Consider to use the bundled serverless build, which includes a polyfill, or use an older version of PDF.js.
+> PDF.js v5.x uses `Promise.withResolvers`, which may not be supported in all environments, such as Node < 22. Consider using the bundled serverless build, which includes a polyfill, or use an older version of PDF.js.
 
-For example, if you want to use the official PDF.js build, you can do the following:
+For example, if you want to use the official PDF.js build:
 
 ```ts
 import { definePDFJSModule, extractText, getDocumentProxy } from 'unpdf'
@@ -107,6 +88,17 @@ const document = await getDocument(new Uint8Array(data)).promise
 console.log(await document.getMetadata())
 ```
 
+## How It Works
+
+> [!NOTE]
+> The serverless PDF.js bundle is built from PDF.js v5.6.205.
+
+Heart and soul of this package is the [`pdfjs.rollup.config.ts`](./pdfjs.rollup.config.ts) file. It uses [Rollup](https://rollupjs.org/) to bundle PDF.js into a single file for serverless environments. The key techniques:
+
+- **String replacements** strip browser-specific references from the PDF.js source.
+- **Worker inlining** embeds the PDF.js worker directly into the main bundle, since serverless runtimes can't load separate worker files.
+- **Global polyfills** provide missing APIs like `FinalizationRegistry` (unavailable in Cloudflare Workers).
+
 ## API
 
 ### `definePDFJSModule`
@@ -209,15 +201,7 @@ for (const link of links) console.log(link)
 
 ### `extractImages`
 
-Extracts images from a specific page of a PDF document, including necessary metadata such as width, height, and calculated color channels.
-
-> [!NOTE]
-> This method will only work in Node.js and browser environments.
-
-In order to use this method, make sure to meet the following requirements:
-
-- Use the official PDF.js build (see below for details).
-- Install the [`@napi-rs/canvas`](https://github.com/Brooooooklyn/canvas) package if you are using Node.js. This package is required to render the PDF page as an image.
+Extracts images from a specific page of a PDF document, including necessary metadata such as width, height, and calculated color channels. Works with both the serverless and official PDF.js build.
 
 **Type Declaration**
 
@@ -285,7 +269,7 @@ To render a PDF page as an image, you can use the `renderPageAsImage` method. Th
 
 In order to use this method, make sure to meet the following requirements:
 
-- Use the official PDF.js build (see below for details).
+- Use the official PDF.js build (see [Official or Legacy PDF.js Build](#official-or-legacy-pdfjs-build)).
 - Install the [`@napi-rs/canvas`](https://github.com/Brooooooklyn/canvas) package if you are using Node.js. This package is required to render the PDF page as an image.
 
 **Type Declaration**
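
Note on the `Promise.withResolvers` warning in the README changes above: the README says the bundled serverless build includes a polyfill for it. The following is a minimal sketch of what such a polyfill can look like. It is an illustration only, not the code shipped in `unpdf`; the `Resolvers` type and the `withResolvers` function are hypothetical names, and unlike the native method this simplified version always creates a plain `Promise`.

```typescript
// Shape of the object returned by Promise.withResolvers().
interface Resolvers<T> {
  promise: Promise<T>
  resolve: (value: T | PromiseLike<T>) => void
  reject: (reason?: unknown) => void
}

// Hypothetical helper: captures the executor's resolve/reject
// so they can be called from outside the promise.
function withResolvers<T>(): Resolvers<T> {
  let resolve!: Resolvers<T>['resolve']
  let reject!: Resolvers<T>['reject']
  const promise = new Promise<T>((res, rej) => {
    resolve = res
    reject = rej
  })
  return { promise, resolve, reject }
}

// Install the fallback only when the runtime lacks the native method
// (the README names Node < 22 as an example).
const PromiseCtor = Promise as unknown as {
  withResolvers?: <T>() => Resolvers<T>
}
if (typeof PromiseCtor.withResolvers !== 'function') {
  PromiseCtor.withResolvers = withResolvers
}

// Usage: settle the promise from outside its executor.
const deferred = withResolvers<string>()
deferred.promise.then(value => console.log(value))
deferred.resolve('done')
```

Guarding the assignment with a `typeof` check keeps the native implementation in place wherever it already exists.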

examples/cloudflare/package.json

Lines changed: 2 additions & 2 deletions
@@ -6,7 +6,7 @@
 "dev": "esbuild --bundle --platform=neutral --outfile=build/index.js index.ts && wrangler dev build/index.js"
 },
 "devDependencies": {
-"esbuild": "^0.25.12",
-"wrangler": "^4.51.0"
+"esbuild": "^0.28.0",
+"wrangler": "^4.81.1"
 }
 }

package.json

Lines changed: 21 additions & 17 deletions
@@ -2,7 +2,7 @@
 "name": "unpdf",
 "type": "module",
 "version": "1.4.0",
-"packageManager": "pnpm@10.24.0",
+"packageManager": "pnpm@10.33.0",
 "description": "PDF extraction and rendering across all JavaScript runtimes",
 "author": "Johann Schopplich <[email protected]>",
 "license": "MIT",
@@ -15,10 +15,17 @@
 "url": "https://github.com/unjs/unpdf/issues"
 },
 "keywords": [
+"cloudflare",
+"edge",
+"extract",
 "parse",
-"pdfjs-dist",
 "pdf",
-"serverless"
+"pdf.js",
+"pdfjs-dist",
+"rendering",
+"serverless",
+"text-extraction",
+"workers"
 ],
 "sideEffects": false,
 "exports": {
@@ -70,24 +77,21 @@
 }
 },
 "devDependencies": {
-"@antfu/eslint-config": "^6.2.0",
-"@napi-rs/canvas": "^0.1.83",
-"@rollup/plugin-alias": "^6.0.0",
-"@rollup/plugin-inject": "^5.0.5",
+"@antfu/eslint-config": "^8.2.0",
+"@napi-rs/canvas": "^0.1.97",
 "@rollup/plugin-node-resolve": "^16.0.3",
 "@rollup/plugin-replace": "^6.0.3",
-"@rollup/plugin-terser": "^0.4.4",
+"@rollup/plugin-terser": "^1.0.0",
 "@rollup/plugin-typescript": "^12.3.0",
-"@types/node": "^24.10.1",
-"bumpp": "^10.3.2",
-"eslint": "^9.39.1",
-"fast-glob": "^3.3.3",
-"pdfjs-dist": "~5.4.394",
-"rollup": "^4.53.3",
-"tinyglobby": "^0.2.15",
+"@types/node": "^24.12.2",
+"bumpp": "^11.0.1",
+"eslint": "^10.2.0",
+"pdfjs-dist": "~5.6.205",
+"rollup": "^4.60.1",
+"tinyglobby": "^0.2.16",
 "tslib": "^2.8.1",
-"typescript": "^5.9.3",
+"typescript": "^6.0.2",
 "unbuild": "^3.6.1",
-"vitest": "^4.0.14"
+"vitest": "^4.1.4"
 }
 }

pdfjs.rollup.config.ts

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ export default defineConfig({
 preventAssignment: true,
 values: {
 // Force inlining the PDF.js worker.
-'await import(/*webpackIgnore: true*/this.workerSrc)': '__pdfjsWorker__',
+'await import(\n /*webpackIgnore: true*/\n /*@vite-ignore*/\n this.workerSrc)': '__pdfjsWorker__',
 // Force setting up fake PDF.js worker.
 '#isWorkerDisabled = false': '#isWorkerDisabled = true',
 // Remove WASM code from the worker.
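
Note on the changed search key above: the replacement appears to rely on matching the source text exactly, so a key written for a single-line `import(...)` would presumably stop matching once the PDF.js output spreads that call over several lines and adds a `/*@vite-ignore*/` comment. The sketch below illustrates the effect with a hypothetical `replaceExact` helper, which is a stand-in and not the `@rollup/plugin-replace` implementation, and a made-up `sampleSource` string. Both keys are copied from the diff as displayed, so their exact whitespace is an assumption.

```typescript
// Hypothetical stand-in for exact-string replacement.
// Reports keys that never matched so a silent miss becomes visible.
function replaceExact(
  source: string,
  values: Record<string, string>,
): { code: string, misses: string[] } {
  let code = source
  const misses: string[] = []
  for (const [needle, replacement] of Object.entries(values)) {
    if (!code.includes(needle)) {
      misses.push(needle)
      continue
    }
    // Replace every literal occurrence of the key.
    code = code.split(needle).join(replacement)
  }
  return { code, misses }
}

// Old and new search keys, as shown in the diff above.
const oldNeedle = 'await import(/*webpackIgnore: true*/this.workerSrc)'
const newNeedle = 'await import(\n /*webpackIgnore: true*/\n /*@vite-ignore*/\n this.workerSrc)'

// Made-up source fragment that uses the multi-line form.
const sampleSource = `const worker = ${newNeedle};`

const withOld = replaceExact(sampleSource, { [oldNeedle]: '__pdfjsWorker__' })
const withNew = replaceExact(sampleSource, { [newNeedle]: '__pdfjsWorker__' })

console.log(withOld.misses.length) // 1: the old key no longer matches
console.log(withNew.code) // const worker = __pdfjsWorker__;
```

Collecting misses is a design choice for the sketch: a build step that warns on unmatched keys would surface this kind of upstream formatting change at upgrade time.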
