Parse client is a library for interacting with the [OneOffTech Parse](https://parse.oneofftech.de) service. OneOffTech Parse is designed to extract text from PDF files while preserving the [structure of the document](#document-structure), in order to improve interaction with Large Language Models (LLMs).

OneOffTech Parse is based on [PDF Text extractor](https://github.com/data-house/pdf-text-extractor). The client can also connect to self-hosted versions of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor).
> [!INFO]
@@ -13,22 +15,109 @@ Parse client is a library to interact with OneOffTech PDF Parsing service based
## Installation
You can install the package via Composer:
```bash
composer require oneofftech/parse-client
```
## Usage
The Parse client can connect either to a self-hosted instance of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor) service or to the cloud-hosted [OneOffTech Parse](https://parse.oneofftech.de) service.
### Use with self-hosted instance
Before proceeding, you need a running instance of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor). Once the instance is running, create the connector client, passing the URL on which your instance is listening.
```php
use OneOffTech\Parse\Client\Connectors\ParseConnector;

$client = new ParseConnector(baseUrl: "http://localhost:5000");
```
> - The URL of the document must be accessible without authentication.
> - Documents are downloaded for the duration of processing, and the file is deleted immediately afterwards.
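
Because the document URL has to be reachable without authentication, it can help to reject obviously unsuitable URLs before sending a request. The following is a minimal sketch, not part of the Parse client: the function name `looksPubliclyFetchable` is hypothetical, and the check only inspects the URL string, so it cannot prove that the document is actually reachable.

```php
<?php

/**
 * Rough client-side sanity check for a document URL.
 *
 * Hypothetical helper (not part of the Parse client). It rejects URLs
 * that are malformed, that do not use HTTP(S), or that carry embedded
 * credentials. It does not perform any network request.
 */
function looksPubliclyFetchable(string $url): bool
{
    $parts = parse_url($url);

    // parse_url() returns false on seriously malformed input.
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }

    // Only plain web URLs are considered.
    if (!in_array(strtolower($parts['scheme']), ['http', 'https'], true)) {
        return false;
    }

    // "user:pass@host" style URLs imply that authentication is expected.
    return !isset($parts['user']) && !isset($parts['pass']);
}

$ok = looksPubliclyFetchable('https://example.com/report.pdf');
$withCredentials = looksPubliclyFetchable('https://user:secret@example.com/report.pdf');
```

A URL that passes this check can still fail on the service side, for example when the host requires a session cookie or a token header.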
### Specify the preferred extraction method
The Parse service supports different processors: [`pymupdf`](https://github.com/pymupdf/PyMuPDF) or [`pdfact`](https://github.com/data-house/pdfact). You can specify the preferred processor for each request.
```php
use OneOffTech\Parse\Client\ParseOption;
use OneOffTech\Parse\Client\DocumentProcessor;
use OneOffTech\Parse\Client\Connectors\ParseConnector;

// ...
    options: new ParseOption(DocumentProcessor::PYMUPDF)
);
```
### PDFAct vs PyMuPDF
PDFAct offers more flexibility than PyMuPDF. You should evaluate which extraction method best suits your application. Here is a brief comparison of the two methods.
| Text styles (e.g. bold or italic) |:white_check_mark:| - |
| Page header |:white_check_mark:| - |
| Page footer |:white_check_mark:| - |
## Document structure
Parse is designed to preserve the document's structure, so the content is returned as a hierarchy.
```
Document
├─Page
│ ├─Text (category: heading)
│ └─Text (category: body)
└─Page
  ├─Text (category: heading)
  └─Text (category: body)
```
For a more in-depth explanation of the structure, see [Parse Document Model](https://github.com/OneOffTech/parse-document-model-python).
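
To illustrate how such a hierarchy might be traversed, here is a minimal sketch that uses plain PHP arrays as a stand-in for the Document, Page and Text levels. The array keys (`pages`, `items`, `category`, `content`) and the `collectByCategory` helper are assumptions made for this example; the objects returned by the client may be typed differently and use other property names.

```php
<?php

// Hypothetical, simplified stand-in for a parsed document.
// The real client may return typed objects instead of arrays.
$document = [
    'pages' => [
        [
            'items' => [
                ['category' => 'heading', 'content' => 'Introduction'],
                ['category' => 'body', 'content' => 'First paragraph.'],
            ],
        ],
        [
            'items' => [
                ['category' => 'heading', 'content' => 'Methods'],
                ['category' => 'body', 'content' => 'Second paragraph.'],
            ],
        ],
    ],
];

/**
 * Collect every text node of the given category together with the
 * 1-based number of the page it appears on.
 *
 * @return array<int, array{page: int, content: string}>
 */
function collectByCategory(array $document, string $category): array
{
    $matches = [];

    foreach ($document['pages'] as $index => $page) {
        foreach ($page['items'] as $text) {
            if ($text['category'] === $category) {
                $matches[] = ['page' => $index + 1, 'content' => $text['content']];
            }
        }
    }

    return $matches;
}

// Gather all headings, for example to build an outline for an LLM prompt.
$headings = collectByCategory($document, 'heading');
```

Keeping the page number next to each text node is one way to let downstream code cite where a passage came from.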
## Testing
Parse client is tested using [PEST](https://pestphp.com/). Tests run for each commit and pull request.

To execute the test suite, run:
```bash
composer test
```
@@ -39,7 +128,7 @@ Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed re
## Contributing
Thank you for considering contributing to the Parse client! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.