Commit 070007c

committed
Document parsing
1 parent 28f0602 commit 070007c

26 files changed

+970
-16
lines changed

.github/FUNDING.yml

Lines changed: 0 additions & 1 deletion
This file was deleted.

README.md

Lines changed: 93 additions & 4 deletions
Original file line number | Diff line number | Diff line change
@@ -4,7 +4,9 @@
44
[![Tests](https://img.shields.io/github/actions/workflow/status/oneofftech/oneofftech-parse-client/run-tests.yml?branch=main&label=tests&style=flat-square)](https://github.com/oneofftech/oneofftech-parse-client/actions/workflows/run-tests.yml)
55
[![Total Downloads](https://img.shields.io/packagist/dt/oneofftech/oneofftech-parse-client.svg?style=flat-square)](https://packagist.org/packages/oneofftech/oneofftech-parse-client)
66

7-
Parse client is a library to interact with OneOffTech PDF Parsing service based on [PDFAct](https://github.com/data-house/pdfact). OneOffTech Parse is designed to extract text from PDF files maintaining the structure of the document to improve interaction with Large Language Models (LLMs).
7+
Parse client is a library to interact with the [OneOffTech Parse](https://parse.oneofftech.de) service. OneOffTech Parse is designed to extract text from PDF files while preserving the [structure of the document](#document-structure) to improve interaction with Large Language Models (LLMs).
8+
9+
OneOffTech Parse is based on [PDF Text extractor](https://github.com/data-house/pdf-text-extractor). The client can also connect to self-hosted versions of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor).
810

911

1012
> [!INFO]
@@ -13,22 +15,109 @@ Parse client is a library to interact with OneOffTech PDF Parsing service based
1315

1416
## Installation
1517

16-
You can install the package via composer:
18+
You can install the package via Composer:
1719

1820
```bash
1921
composer require oneofftech/parse-client
2022
```
2123

2224
## Usage
2325

26+
The Parse client can connect to self-hosted instances of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor) service or to the cloud-hosted [OneOffTech Parse](https://parse.oneofftech.de) service.
27+
28+
### Use with self-hosted instance
29+
30+
A running instance of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor) is required before proceeding. Once it is running, create an instance of the connector client, passing the URL on which your instance is listening.
31+
32+
```php
33+
use OneOffTech\Parse\Client\Connectors\ParseConnector;
34+
35+
$client = new ParseConnector(baseUrl: "http://localhost:5000");
36+
37+
/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
38+
$document = $client->parse("https://domain.internal/document.pdf");
39+
```
40+
41+
> [!INFO]
42+
> - The URL of the document must be accessible without authentication.
43+
> - Documents are downloaded only for the duration of processing; the file is deleted immediately afterwards.
44+
45+
46+
### Use the cloud hosted service
47+
48+
Go to [parse.oneofftech.de](https://parse.oneofftech.de) and obtain an access token. Instantiate the client and provide a URL of a PDF document.
49+
50+
```php
51+
use OneOffTech\Parse\Client\Connectors\ParseConnector;
52+
53+
$client = new ParseConnector("token");
54+
55+
/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
56+
$document = $client->parse("https://domain.internal/document.pdf");
57+
```
58+
59+
> [!INFO]
60+
> - The URL of the document must be accessible without authentication.
61+
> - Documents are downloaded only for the duration of processing; the file is deleted immediately afterwards.
62+
63+
64+
### Specify the preferred extraction method
65+
66+
The Parse service supports two processors: [`pymupdf`](https://github.com/pymupdf/PyMuPDF) and [`pdfact`](https://github.com/data-house/pdfact). You can specify the preferred processor for each request.
67+
2468
```php
25-
...
69+
use OneOffTech\Parse\Client\ParseOption;
70+
use OneOffTech\Parse\Client\DocumentProcessor;
71+
use OneOffTech\Parse\Client\Connectors\ParseConnector;
72+
73+
$client = new ParseConnector("token");
74+
75+
/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
76+
$document = $client->parse(
77+
url: "https://domain.internal/document.pdf",
78+
options: new ParseOption(DocumentProcessor::PYMUPDF)
79+
);
80+
```
81+
82+
### PDFAct vs PyMuPDF
83+
84+
PDFAct offers more flexibility than PyMuPDF. You should evaluate which extraction method is best suited to your application. Here is a small comparison of the two methods.
85+
86+
| feature | PDFAct | PyMuPDF |
87+
|-----------------------------------|--------|---------|
88+
| Text extraction | :white_check_mark: | :white_check_mark: |
89+
| Pagination | :white_check_mark: | :white_check_mark: |
90+
| Headings identification | :white_check_mark: | - |
91+
| Text styles (e.g. bold or italic) | :white_check_mark: | - |
92+
| Page header | :white_check_mark: | - |
93+
| Page footer | :white_check_mark: | - |
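
The comparison table can also be read as a lookup: given the features an application needs, which processors qualify? The following is a minimal sketch of that idea, not part of the client. The `Processor` enum, its string values, the feature keys, and the `processorsSupporting()` function are all hypothetical names introduced for illustration.

```php
<?php

// Hypothetical stand-in for the client's DocumentProcessor enum.
// The string values are assumptions, not taken from the library.
enum Processor: string
{
    case PDFACT = 'pdfact';
    case PYMUPDF = 'pymupdf';
}

/**
 * Return the processors that support every requested feature.
 *
 * @param  string[]  $required  Feature keys, e.g. ['headings', 'pagination']
 * @return Processor[]
 */
function processorsSupporting(array $required): array
{
    // Mirrors the comparison table: feature => processors that support it.
    $support = [
        'text_extraction' => [Processor::PDFACT, Processor::PYMUPDF],
        'pagination' => [Processor::PDFACT, Processor::PYMUPDF],
        'headings' => [Processor::PDFACT],
        'text_styles' => [Processor::PDFACT],
        'page_header' => [Processor::PDFACT],
        'page_footer' => [Processor::PDFACT],
    ];

    $candidates = Processor::cases();

    foreach ($required as $feature) {
        // An unknown feature is treated as supported by no processor.
        $supported = $support[$feature] ?? [];

        $candidates = array_filter(
            $candidates,
            fn (Processor $p) => in_array($p, $supported, true)
        );
    }

    return array_values($candidates);
}

$forOutline = processorsSupporting(['headings', 'pagination']);
$forPlainText = processorsSupporting(['text_extraction']);
```

With the table as written, asking for headings narrows the choice to PDFAct, while plain text extraction leaves both processors available.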
94+
95+
96+
97+
98+
## Document structure
99+
100+
Parse is designed to preserve the document's structure, so the content is returned as a hierarchy.
101+
102+
```
103+
Document
104+
├─Page
105+
│ ├─Text (category: heading)
106+
│ └─Text (category: body)
107+
└─Page
108+
├─Text (category: heading)
109+
└─Text (category: body)
26110
```
27111

112+
For a more in-depth explanation of the structure see [Parse Document Model](https://github.com/OneOffTech/parse-document-model-python).
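
As an illustration of the hierarchy, the tree can be pictured as a nested associative array. The sketch below is an assumption-laden example rather than the service's actual payload: only the `category` and `content` keys on the document and the `text` leaf key appear in this commit's code; the page and text node fields, and the `headingsByPage()` helper, are hypothetical.

```php
<?php

// Sample payload shaped like the tree above. The fields on page and text
// nodes are assumptions made for this example.
$document = [
    'category' => 'doc',
    'attributes' => [],
    'content' => [
        [
            'category' => 'page',
            'content' => [
                ['category' => 'heading', 'text' => 'Introduction'],
                ['category' => 'body', 'text' => 'First paragraph.'],
            ],
        ],
        [
            'category' => 'page',
            'content' => [
                ['category' => 'heading', 'text' => 'Methods'],
                ['category' => 'body', 'text' => 'Second paragraph.'],
            ],
        ],
    ],
];

/**
 * Collect the heading texts of each page, keyed by 1-based page number.
 *
 * @return array<int, string[]>
 */
function headingsByPage(array $document): array
{
    $outline = [];

    foreach ($document['content'] as $index => $page) {
        // Keep only the text nodes categorised as headings.
        $headings = array_filter(
            $page['content'],
            fn (array $node) => $node['category'] === 'heading'
        );

        // array_column returns a re-indexed list of the `text` values.
        $outline[$index + 1] = array_column($headings, 'text');
    }

    return $outline;
}

$outline = headingsByPage($document);
```

Because pages and text nodes keep their nesting, an outline like this can be built without re-detecting headings from the flat text.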
113+
114+
28115
## Testing
29116

30117
Parse client is tested using [PEST](https://pestphp.com/). Tests run for each commit and pull request.
31118

119+
To execute the test suite run:
120+
32121
```bash
33122
composer test
34123
```
@@ -39,7 +128,7 @@ Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed re
39128

40129
## Contributing
41130

42-
Thank you for considering contributing to the Librarian client! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
131+
Thank you for considering contributing to the Parse client! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
43132

44133
## Security Vulnerabilities
45134

composer.json

Lines changed: 8 additions & 5 deletions
@@ -1,9 +1,11 @@
11
{
22
"name": "oneofftech/parse-client",
3-
"description": "This is my package oneofftech-parse-client",
3+
"description": "Parse PDF document keeping the structure.",
44
"keywords": [
5-
"OneOffTech",
6-
"oneofftech-parse-client"
5+
"pdf",
6+
"parse",
7+
"parsing",
8+
"text-extract"
79
],
810
"homepage": "https://github.com/oneofftech/oneofftech-parse-client",
911
"license": "MIT",
@@ -19,8 +21,9 @@
1921
"saloonphp/saloon": "^3.10"
2022
},
2123
"require-dev": {
22-
"pestphp/pest": "^2.20",
23-
"laravel/pint": "^1.0"
24+
"jonpurvis/lawman": "^1.2",
25+
"laravel/pint": "^1.0",
26+
"pestphp/pest": "^2.20"
2427
},
2528
"autoload": {
2629
"psr-4": {

src/Connectors/ParseConnector.php

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
1+
<?php
2+
3+
namespace OneOffTech\Parse\Client\Connectors;
4+
5+
use OneOffTech\Parse\Client\DocumentProcessor;
6+
use OneOffTech\Parse\Client\Dto\DocumentDto;
7+
use OneOffTech\Parse\Client\ParseOption;
8+
use OneOffTech\Parse\Client\Requests\ExtractTextRequest;
9+
use OneOffTech\Parse\Client\Responses\ParseResponse;
10+
use Saloon\Contracts\Authenticator;
11+
use Saloon\Http\Auth\NullAuthenticator;
12+
use Saloon\Http\Auth\TokenAuthenticator;
13+
use Saloon\Http\Connector;
14+
use Saloon\Http\Response;
15+
use Saloon\Traits\Plugins\AcceptsJson;
16+
use Saloon\Traits\Plugins\AlwaysThrowOnErrors;
17+
use Saloon\Traits\Plugins\HasTimeout;
18+
use SensitiveParameter;
19+
20+
class ParseConnector extends Connector
21+
{
22+
use AcceptsJson;
23+
use AlwaysThrowOnErrors;
24+
use HasTimeout;
25+
26+
protected int $connectTimeout = 30;
27+
28+
protected int $requestTimeout = 120;
29+
30+
protected ?string $response = ParseResponse::class;
31+
32+
public function __construct(
33+
34+
/**
35+
* The authentication token
36+
*/
37+
#[SensitiveParameter]
38+
public readonly ?string $token = null,
39+
40+
/**
41+
* The base URL on which the API listens
42+
*/
43+
protected readonly string $baseUrl = 'https://parse.oneofftech.de/api/v0',
44+
) {
45+
//
46+
}
47+
48+
public function resolveBaseUrl(): string
49+
{
50+
return $this->baseUrl;
51+
}
52+
53+
protected function defaultAuth(): Authenticator
54+
{
55+
if (is_null($this->token)) {
56+
return new NullAuthenticator;
57+
}
58+
59+
return new TokenAuthenticator($this->token);
60+
}
61+
62+
/**
63+
* Determine if the request has failed.
64+
*/
65+
public function hasRequestFailed(Response $response): ?bool
66+
{
67+
return $response->serverError() || $response->clientError();
68+
}
69+
70+
// Resources and helper methods
71+
72+
/**
73+
* Parse a document hosted on a web server
74+
*
75+
* @param string $url The URL under which the document is accessible
76+
* @param string $mimeType The mime type of the document. Default application/pdf
77+
* @param \OneOffTech\Parse\Client\ParseOption|null $options Specify additional options for the specific parsing processor
78+
*/
79+
public function parse(string $url, string $mimeType = 'application/pdf', ?ParseOption $options = null): DocumentDto
80+
{
81+
return $this
82+
->send((new ExtractTextRequest(
83+
url: $url,
84+
mimeType: $mimeType,
85+
preferredDocumentProcessor: $options?->processor?->value ?? DocumentProcessor::PDFACT->value,
86+
))->validate())
87+
->dto();
88+
}
89+
}
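
The `parse()` method falls back to PDFAct through a null-safe chain: `$options?->processor?->value ?? DocumentProcessor::PDFACT->value`. The sketch below isolates that expression so its three cases are easy to see. `DemoProcessor`, `DemoOption`, and `resolveProcessor()` are hypothetical stand-ins, and both the enum's string values and the nullable `processor` property are assumptions about the real classes.

```php
<?php

// Hypothetical stand-in for DocumentProcessor; string values are assumed.
enum DemoProcessor: string
{
    case PDFACT = 'pdfact';
    case PYMUPDF = 'pymupdf';
}

// Hypothetical stand-in for ParseOption, assuming a nullable processor.
final class DemoOption
{
    public function __construct(
        public readonly ?DemoProcessor $processor = null,
    ) {}
}

/**
 * Resolve the processor name to send with the request.
 *
 * The null-safe operator short-circuits to null when either the options
 * object or its processor is missing, and ?? then supplies the default.
 */
function resolveProcessor(?DemoOption $options = null): string
{
    return $options?->processor?->value ?? DemoProcessor::PDFACT->value;
}

$noOptions = resolveProcessor();
$emptyOptions = resolveProcessor(new DemoOption());
$explicit = resolveProcessor(new DemoOption(DemoProcessor::PYMUPDF));
```

Only an explicitly set processor overrides the default; omitting the options or leaving the processor unset both resolve to PDFAct.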
Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
1+
<?php
2+
3+
namespace OneOffTech\Parse\Client\DocumentFormat;
4+
5+
use Countable;
6+
use OneOffTech\Parse\Client\Exceptions\EmptyDocumentException;
7+
use OneOffTech\Parse\Client\Exceptions\InvalidDocumentFormatException;
8+
use RecursiveArrayIterator;
9+
use RecursiveIteratorIterator;
10+
11+
class DocumentNode implements Countable
12+
{
13+
14+
public function __construct(
15+
public readonly array $content,
16+
public readonly array $attributes = [],
17+
) {}
18+
19+
20+
public function type(): string
21+
{
22+
return 'doc';
23+
}
24+
25+
26+
/**
27+
* The number of pages in this document as extracted by the parser.
28+
*/
29+
public function count(): int
30+
{
31+
return count($this->content);
32+
}
33+
34+
/**
35+
* Test if the document is empty, i.e. contains no pages or has no textual content on any of the pages
36+
*/
37+
public function isEmpty(): bool
38+
{
39+
return $this->count() === 0 || !$this->hasContent();
40+
}
41+
42+
/**
43+
* Test if the document has discernible textual content on any of the pages
44+
*/
45+
public function hasContent(): bool
46+
{
47+
foreach (new RecursiveIteratorIterator(new RecursiveArrayIterator($this->content), RecursiveIteratorIterator::LEAVES_ONLY) as $key => $value) {
48+
if($key === 'text' && !empty($value)){
49+
return true;
50+
}
51+
}
52+
53+
return false;
54+
}
55+
56+
57+
/**
58+
* The pages in this document
59+
*
60+
* @return \OneOffTech\Parse\Client\DocumentFormat\PageNode[]
61+
*/
62+
public function pages(): array
63+
{
64+
return array_map(fn($page) => PageNode::fromArray($page), $this->content);
65+
}
66+
67+
public function text(): string
68+
{
69+
$text = [];
70+
71+
foreach (new RecursiveIteratorIterator(new RecursiveArrayIterator($this->content), RecursiveIteratorIterator::LEAVES_ONLY) as $key => $value) {
72+
if($key === 'text' && !empty($value)){
73+
$text[] = $value;
74+
}
75+
}
76+
77+
return join(PHP_EOL, $text);
78+
}
79+
80+
81+
/**
82+
* Throw exception if document has no textual content
83+
*
84+
* @throws \OneOffTech\Parse\Client\Exceptions\EmptyDocumentException when document has no textual content
85+
*/
86+
public function throwIfNoContent(): self
87+
{
88+
if(!$this->hasContent()){
89+
throw new EmptyDocumentException("Document has no textual content.");
90+
}
91+
92+
return $this;
93+
}
94+
95+
96+
/**
97+
* Create a document node from associative array
98+
*/
99+
public static function fromArray(array $data): DocumentNode
100+
{
101+
if(!(isset($data['category']) && isset($data['content']))){
102+
throw new InvalidDocumentFormatException("Unexpected document structure. Missing category or content.");
103+
}
104+
105+
if($data['category'] !== 'doc'){
106+
throw new InvalidDocumentFormatException("Unexpected node category. Expecting [doc] found [{$data['category']}].");
107+
}
108+
109+
if(!is_array($data['content'])){
110+
throw new InvalidDocumentFormatException("Unexpected content format. Expecting [array].");
111+
}
112+
113+
return new DocumentNode($data['content'] ?? [], $data['attributes'] ?? []);
114+
}
115+
}
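
Both `hasContent()` and `text()` rely on the same technique: a leaves-only recursive walk over the nested content array that picks out values stored under a `text` key. The following standalone sketch reproduces that walk on sample data. The `collectText()` function and the sample node fields other than `text` are assumptions made for illustration.

```php
<?php

/**
 * Collect every non-empty leaf stored under a `text` key, in document order.
 *
 * @return string[]
 */
function collectText(array $content): array
{
    // LEAVES_ONLY skips the intermediate arrays and yields scalar values
    // together with the key they have in their immediate parent.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveArrayIterator($content),
        RecursiveIteratorIterator::LEAVES_ONLY
    );

    $text = [];

    foreach ($iterator as $key => $value) {
        if ($key === 'text' && ! empty($value)) {
            $text[] = $value;
        }
    }

    return $text;
}

// Sample pages; the `category` and `content` fields on these nodes are assumed.
$pages = [
    ['category' => 'page', 'content' => [
        ['category' => 'heading', 'text' => 'Title'],
        ['category' => 'body', 'text' => ''],
    ]],
    ['category' => 'page', 'content' => [
        ['category' => 'body', 'text' => 'Hello'],
    ]],
];

$lines = collectText($pages);
$joined = implode(PHP_EOL, $lines);
```

One consequence of the `empty()` check is that a text node whose value is the string `"0"` is skipped, because PHP treats `"0"` as empty.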
