README.md: 22 additions & 38 deletions
@@ -1,40 +1,25 @@
 # unpdf

-A collection of utilities for PDF extraction and rendering. Designed specifically for serverless environments, but it also works in Node.js, Deno, Bun and the browser. `unpdf` is particularly useful for serverless AI applications, especially for summarizing PDF documents in document analysis workflows.
+Utilities for PDF extraction and rendering across all JavaScript runtimes – Node.js, Deno, Bun, the browser, and serverless environments like Cloudflare Workers. Especially useful for AI applications that need to summarize or analyze PDF documents.

-This library ships with a serverless build/redistribution of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js) that is optimized for edge environments. Some string replacements, global mocks and inlining the PDF.js worker allow the browser code to become platform agnostic. See [`pdfjs.rollup.config.ts`](./pdfjs.rollup.config.ts) for the details.
-
-This library is also intended as a modern alternative to the unmaintained but still popular [`pdf-parse`](https://www.npmjs.com/package/pdf-parse).
+Ships with a serverless build of Mozilla's [PDF.js](https://github.com/mozilla/pdf.js), optimized for edge environments. If you're coming from [`pdf-parse`](https://www.npmjs.com/package/pdf-parse), `unpdf` is a modern, actively maintained alternative with broader runtime support.

 ## Features

-- 🏗️ Made for Node.js, browser and serverless environments
+- 🏗️ Works in Node.js, browser and serverless environments
 - 🪭 Includes serverless build of PDF.js ([`unpdf/pdfjs`](./package.json#L34))
 - 💬 Extract [text](#extract-text-from-pdf), [links](#extractlinks), and [images](#extractimages) from PDF files
 - 🧠 Perfect for AI applications and PDF summarization
-- 🧱 Opt-in to legacy PDF.js build
-- 💨 Zero dependencies
-
-## PDF.js Compatibility
-
-> [!Tip]
-> The serverless PDF.js bundle provided by `unpdf` is built from PDF.js v5.4.394.
-
-You can use an [official PDF.js build](#official-or-legacy-pdfjs-build) by using the [`definePDFJSModule`](#definepdfjsmodule) method. This is useful if you want to use a specific version or a custom build of PDF.js.
+- 🧱 Opt-in to official or legacy PDF.js build

 ## Installation

-Run the following command to add `unpdf` to your project.
 const pdf = await getDocumentProxy(new Uint8Array(buffer))
-
-// Finally, extract the text from the PDF file
 const { totalPages, text } = await extractText(pdf, { mergePages: true })

 console.log(`Total pages: ${totalPages}`)
@@ -64,9 +45,9 @@ console.log(text)
 Usually you don't need to worry about the PDF.js build. `unpdf` ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.

 > [!WARNING]
-> PDF.js v5.x uses `Promise.withResolvers`, which may not be supported in all environments, such as Node < 22. Consider to use the bundled serverless build, which includes a polyfill, or use an older version of PDF.js.
+> PDF.js v5.x uses `Promise.withResolvers`, which may not be supported in all environments, such as Node < 22. Consider using the bundled serverless build, which includes a polyfill, or use an older version of PDF.js.
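The `Promise.withResolvers` gap named in the warning can be illustrated with a small polyfill. This is a minimal sketch for illustration only, not the polyfill shipped in the `unpdf` bundle; the names `Resolvers`, `withResolvers` and `installWithResolversPolyfill` are hypothetical.

```typescript
// Shape of the object returned by Promise.withResolvers (ES2024).
interface Resolvers<T> {
  promise: Promise<T>
  resolve: (value: T | PromiseLike<T>) => void
  reject: (reason?: unknown) => void
}

// Hypothetical helper (not part of unpdf): creates a promise and exposes
// its resolve/reject functions, like Promise.withResolvers does.
function withResolvers<T>(): Resolvers<T> {
  let resolve!: (value: T | PromiseLike<T>) => void
  let reject!: (reason?: unknown) => void
  // The executor runs synchronously, so both functions are assigned
  // before the constructor returns.
  const promise = new Promise<T>((res, rej) => {
    resolve = res
    reject = rej
  })
  return { promise, resolve, reject }
}

// Installs the helper only when the runtime has no native implementation.
// Simplification: unlike the native method, the helper ignores `this`,
// so it always builds plain Promise instances.
function installWithResolversPolyfill(): void {
  const target = Promise as unknown as { withResolvers?: unknown }
  if (typeof target.withResolvers !== 'function') {
    target.withResolvers = withResolvers
  }
}
```

Calling `installWithResolversPolyfill()` once, before an official PDF.js v5.x build is loaded, is one way to approach the problem on older runtimes. The bundled serverless build is still the simpler option.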

-For example, if you want to use the official PDF.js build, you can do the following:
+For example, if you want to use the official PDF.js build:
+> The serverless PDF.js bundle is built from PDF.js v5.6.205.
+
+Heart and soul of this package is the [`pdfjs.rollup.config.ts`](./pdfjs.rollup.config.ts) file. It uses [Rollup](https://rollupjs.org/) to bundle PDF.js into a single file for serverless environments. The key techniques:
+
+- **String replacements** strip browser-specific references from the PDF.js source.
+- **Worker inlining** embeds the PDF.js worker directly into the main bundle, since serverless runtimes can't load separate worker files.
+- **Global polyfills** provide missing APIs like `FinalizationRegistry` (unavailable in Cloudflare Workers).
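Two of the techniques listed above, string replacements and global polyfills, can be sketched in a few lines. This is an assumption about their general shape, not the contents of `pdfjs.rollup.config.ts`; `applyReplacements`, `NoopFinalizationRegistry` and `ensureFinalizationRegistry` are hypothetical names, and the replacement pair used in the usage note is made up.

```typescript
// Build-time idea: rewrite browser-specific snippets in a source string.
// Every occurrence of each search string is replaced literally (no regex).
function applyReplacements(
  source: string,
  replacements: Record<string, string>,
): string {
  let output = source
  for (const [search, replacement] of Object.entries(replacements)) {
    output = output.split(search).join(replacement)
  }
  return output
}

// Runtime idea: a no-op stand-in for FinalizationRegistry. It accepts the
// same calls but never runs the cleanup callback.
class NoopFinalizationRegistry<T> {
  constructor(_cleanup: (heldValue: T) => void) {}
  register(_target: object, _heldValue: T, _token?: object): void {}
  unregister(_token: object): boolean {
    return false
  }
}

// Adds the stand-in to a global-like object only when the API is missing.
function ensureFinalizationRegistry(scope: Record<string, unknown>): void {
  if (typeof scope.FinalizationRegistry === 'undefined') {
    scope.FinalizationRegistry = NoopFinalizationRegistry
  }
}
```

For example, `applyReplacements('typeof window', { 'typeof window': '"undefined"' })` returns `'"undefined"'`. Passing `globalThis` (cast to `Record<string, unknown>`) to `ensureFinalizationRegistry` leaves a native implementation untouched.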
+
 ## API

 ### `definePDFJSModule`
@@ -209,15 +201,7 @@ for (const link of links) console.log(link)
-Install the [`@napi-rs/canvas`](https://github.com/Brooooooklyn/canvas) package if you are using Node.js. This package is required to render the PDF page as an image.
-Install the [`@napi-rs/canvas`](https://github.com/Brooooooklyn/canvas) package if you are using Node.js. This package is required to render the PDF page as an image.