OneOffTech
diff --git a/‎.github/FUNDING.yml‎
Lines changed: 0 additions & 1 deletion b/‎.github/FUNDING.yml‎
diff --git a/‎README.md‎
Lines changed: 93 additions & 4 deletions b/‎README.md‎
diff --git a/‎composer.json‎
Lines changed: 8 additions & 5 deletions b/‎composer.json‎
diff --git a/‎src/Connectors/ParseConnector.php‎
Lines changed: 89 additions & 0 deletions b/‎src/Connectors/ParseConnector.php‎
diff --git a/‎src/DocumentFormat/DocumentNode.php‎
Lines changed: 115 additions & 0 deletions b/‎src/DocumentFormat/DocumentNode.php‎
@@ -4,7 +4,9 @@
 [![Tests](https://img.shields.io/github/actions/workflow/status/oneofftech/oneofftech-parse-client/run-tests.yml?branch=main&label=tests&style=flat-square)](https://github.com/oneofftech/oneofftech-parse-client/actions/workflows/run-tests.yml)
 [![Total Downloads](https://img.shields.io/packagist/dt/oneofftech/oneofftech-parse-client.svg?style=flat-square)](https://packagist.org/packages/oneofftech/oneofftech-parse-client)
 
-Parse client is a library to interact with OneOffTech PDF Parsing service based on [PDFAct](https://github.com/data-house/pdfact). OneOffTech Parse is designed to extract text from PDF files maintaining the structure of the document to improve interaction with Large Language Models (LLMs).
+Parse client is a library to interact with the [OneOffTech Parse](https://parse.oneofftech.de) service. OneOffTech Parse is designed to extract text from PDF files while preserving the [structure of the document](#document-structure), to improve interaction with Large Language Models (LLMs).
+
+OneOffTech Parse is based on [PDF Text extractor](https://github.com/data-house/pdf-text-extractor). The client can also connect to self-hosted versions of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor).
 
 
 > [!INFO]  
@@ -13,22 +15,109 @@ Parse client is a library to interact with OneOffTech PDF Parsing service based
 
 ## Installation
 
-You can install the package via composer:
+You can install the package via Composer:
 
 ```bash
 composer require oneofftech/parse-client
 ```
 
 ## Usage
 
+The Parse client can connect to self-hosted instances of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor) service or to the cloud-hosted [OneOffTech Parse](https://parse.oneofftech.de) service.
+
+### Use with self-hosted instance
+
+A running instance of the [PDF Text extractor](https://github.com/data-house/pdf-text-extractor) is required before proceeding. Once it is running, create an instance of the connector client, passing the URL on which your instance is listening.
+
+```php
+use OneOffTech\Parse\Client\Connectors\ParseConnector;
+
+$client = new ParseConnector(baseUrl: "http://localhost:5000");
+
+/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
+$document = $client->parse("https://domain.internal/document.pdf");
+```
+
+> [!INFO]  
+> - The URL of the document must be accessible without authentication.
+> - The document is downloaded only for the duration of processing and is deleted immediately afterwards.
+
+
+### Use the cloud-hosted service
+
+Go to [parse.oneofftech.de](https://parse.oneofftech.de) and obtain an access token. Instantiate the client with the token and provide the URL of a PDF document.
+
+```php
+use OneOffTech\Parse\Client\Connectors\ParseConnector;
+
+$client = new ParseConnector("token");
+
+/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
+$document = $client->parse("https://domain.internal/document.pdf");
+```
+
+> [!INFO]  
+> - The URL of the document must be accessible without authentication.
+> - The document is downloaded only for the duration of processing and is deleted immediately afterwards.
+
+
+### Specify the preferred extraction method
+
+The Parse service supports different processors: [`pymupdf`](https://github.com/pymupdf/PyMuPDF) and [`pdfact`](https://github.com/data-house/pdfact). You can specify the preferred processor for each request. When no processor is specified, the client requests `pdfact`.
+
 ```php
-...
+use OneOffTech\Parse\Client\ParseOption;
+use OneOffTech\Parse\Client\DocumentProcessor;
+use OneOffTech\Parse\Client\Connectors\ParseConnector;
+
+$client = new ParseConnector("token");
+
+/** @var \OneOffTech\Parse\Client\Dto\DocumentDto */
+$document = $client->parse(
+    url: "https://domain.internal/document.pdf", 
+    options: new ParseOption(DocumentProcessor::PYMUPDF)
+);
+```
+
+### PDFAct vs PyMuPDF
+
+PDFAct offers more flexibility than PyMuPDF. Evaluate which extraction method is best suited to your application. The following table compares the two methods.
+
+| Feature                           | PDFAct | PyMuPDF |
+|-----------------------------------|--------|---------|
+| Text extraction                   | :white_check_mark: | :white_check_mark: |
+| Pagination                        | :white_check_mark: | :white_check_mark: |
+| Headings identification           | :white_check_mark: | - |
+| Text styles (e.g. bold or italic) | :white_check_mark: | - |
+| Page header                       | :white_check_mark: | - |
+| Page footer                       | :white_check_mark: | - |
+
+
+
+
+## Document structure
+
+Parse is designed to preserve the document's structure, hence the content is returned in a hierarchical fashion.
+
+```
+Document
+ ├─Page
+ │  ├─Text (category: heading)
+ │  └─Text (category: body)
+ └─Page
+    ├─Text (category: heading)
+    └─Text (category: body)
 ```
 
+For a more in-depth explanation of the structure, see [Parse Document Model](https://github.com/OneOffTech/parse-document-model-python).
+
+
 ## Testing
 
 Parse client is tested using [PEST](https://pestphp.com/). Tests run for each commit and pull request.
 
+To execute the test suite, run:
+
 ```bash
 composer test
 ```
@@ -39,7 +128,7 @@ Please see [CHANGELOG](CHANGELOG.md) for more information on what has changed re
 
 ## Contributing
 
-Thank you for considering contributing to the Librarian client! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
+Thank you for considering contributing to the Parse client! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.
 
 ## Security Vulnerabilities
 
 
@@ -1,9 +1,11 @@
 {
     "name": "oneofftech/parse-client",
-    "description": "This is my package oneofftech-parse-client",
+    "description": "Parse PDF documents while keeping their structure.",
     "keywords": [
-        "OneOffTech",
-        "oneofftech-parse-client"
+        "pdf",
+        "parse",
+        "parsing",
+        "text-extract"
     ],
     "homepage": "https://github.com/oneofftech/oneofftech-parse-client",
     "license": "MIT",
@@ -19,8 +21,9 @@
         "saloonphp/saloon": "^3.10"
     },
     "require-dev": {
-        "pestphp/pest": "^2.20",
-        "laravel/pint": "^1.0"
+        "jonpurvis/lawman": "^1.2",
+        "laravel/pint": "^1.0",
+        "pestphp/pest": "^2.20"
     },
     "autoload": {
         "psr-4": {
 
@@ -0,0 +1,89 @@
+<?php
+
+namespace OneOffTech\Parse\Client\Connectors;
+
+use OneOffTech\Parse\Client\DocumentProcessor;
+use OneOffTech\Parse\Client\Dto\DocumentDto;
+use OneOffTech\Parse\Client\ParseOption;
+use OneOffTech\Parse\Client\Requests\ExtractTextRequest;
+use OneOffTech\Parse\Client\Responses\ParseResponse;
+use Saloon\Contracts\Authenticator;
+use Saloon\Http\Auth\NullAuthenticator;
+use Saloon\Http\Auth\TokenAuthenticator;
+use Saloon\Http\Connector;
+use Saloon\Http\Response;
+use Saloon\Traits\Plugins\AcceptsJson;
+use Saloon\Traits\Plugins\AlwaysThrowOnErrors;
+use Saloon\Traits\Plugins\HasTimeout;
+use SensitiveParameter;
+
+class ParseConnector extends Connector
+{
+    use AcceptsJson;
+    use AlwaysThrowOnErrors;
+    use HasTimeout;
+
+    protected int $connectTimeout = 30;
+
+    protected int $requestTimeout = 120;
+
+    protected ?string $response = ParseResponse::class;
+
+    public function __construct(
+
+        /**
+         * The authentication token
+         */
+        #[SensitiveParameter]
+        public readonly ?string $token = null,
+
+        /**
+         * The base URL on which the API listens
+         */
+        protected readonly string $baseUrl = 'https://parse.oneofftech.de/api/v0',
+    ) {
+        //
+    }
+
+    public function resolveBaseUrl(): string
+    {
+        return $this->baseUrl;
+    }
+
+    protected function defaultAuth(): Authenticator
+    {
+        if (is_null($this->token)) {
+            return new NullAuthenticator;
+        }
+
+        return new TokenAuthenticator($this->token);
+    }
+
+    /**
+     * Determine if the request has failed.
+     */
+    public function hasRequestFailed(Response $response): ?bool
+    {
+        return $response->serverError() || $response->clientError();
+    }
+
+    // Resources and helper methods
+
+    /**
+     * Parse a document hosted on a web server
+     *
+     * @param  string  $url  The URL under which the document is accessible
+     * @param  string  $mimeType  The mime type of the document. Default application/pdf
+     * @param  \OneOffTech\Parse\Client\ParseOption|null  $options  Specify additional options for the specific parsing processor
+     */
+    public function parse(string $url, string $mimeType = 'application/pdf', ?ParseOption $options = null): DocumentDto
+    {
+        return $this
+            ->send((new ExtractTextRequest(
+                url: $url,
+                mimeType: $mimeType,
+                preferredDocumentProcessor: $options?->processor?->value ?? DocumentProcessor::PDFACT->value,
+            ))->validate())
+            ->dto();
+    }
+}
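
The `parse()` method above falls back to PDFAct when no option, or an option without a processor, is given. The following is a minimal, self-contained sketch of that null-safe fallback. `Processor`, `Option`, and `resolveProcessor()` are hypothetical stand-ins for the package's `DocumentProcessor` and `ParseOption`, and the string values `pdfact` and `pymupdf` are assumed from the processor names in the README, not confirmed by this diff. It requires PHP 8.1 or later.

```php
<?php

// Hypothetical stand-in for DocumentProcessor; the backing values are assumed.
enum Processor: string
{
    case PDFACT = 'pdfact';
    case PYMUPDF = 'pymupdf';
}

// Hypothetical stand-in for ParseOption: the processor is optional.
final class Option
{
    public function __construct(public readonly ?Processor $processor = null) {}
}

// Same expression shape as in parse(): the null-safe operators short-circuit
// to null when $options or its processor is missing, and ?? supplies the default.
function resolveProcessor(?Option $options = null): string
{
    return $options?->processor?->value ?? Processor::PDFACT->value;
}

echo resolveProcessor(), PHP_EOL;                               // pdfact
echo resolveProcessor(new Option()), PHP_EOL;                   // pdfact
echo resolveProcessor(new Option(Processor::PYMUPDF)), PHP_EOL; // pymupdf
```

Chaining `?->` with `??` keeps the default in one place, so callers never need to pass an option object just to get the default processor.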
@@ -0,0 +1,115 @@
+<?php
+
+namespace OneOffTech\Parse\Client\DocumentFormat;
+
+use Countable;
+use OneOffTech\Parse\Client\Exceptions\EmptyDocumentException;
+use OneOffTech\Parse\Client\Exceptions\InvalidDocumentFormatException;
+use RecursiveArrayIterator;
+use RecursiveIteratorIterator;
+
+class DocumentNode implements Countable
+{
+
+    public function __construct(
+        public readonly array $content,
+        public readonly array $attributes = [],
+    ) {}
+
+
+    public function type(): string
+    {
+        return 'doc';
+    }
+
+
+    /**
+     * The number of pages in this document as extracted by the parser.
+     */
+    public function count(): int
+    {
+        return count($this->content);
+    }
+
+    /**
+     * Test if the document is empty, i.e. contains no pages or has no textual content on any of the pages
+     */
+    public function isEmpty(): bool
+    {
+        return $this->count() === 0 || !$this->hasContent();
+    }
+
+    /**
+     * Test if the document has discernible textual content on any of the pages
+     */
+    public function hasContent(): bool
+    {
+        foreach (new RecursiveIteratorIterator(new RecursiveArrayIterator($this->content), RecursiveIteratorIterator::LEAVES_ONLY) as $key => $value) {
+            if($key === 'text' && !empty($value)){
+                return true;
+            }
+        }
+
+        return false;
+    }
+
+
+    /**
+     * The pages in this document
+     * 
+     * @return \OneOffTech\Parse\Client\DocumentFormat\PageNode[]
+     */
+    public function pages(): array
+    {
+        return array_map(fn($page) => PageNode::fromArray($page), $this->content);
+    }
+
+    public function text(): string
+    {
+        $text = []; 
+
+        foreach (new RecursiveIteratorIterator(new RecursiveArrayIterator($this->content), RecursiveIteratorIterator::LEAVES_ONLY) as $key => $value) {
+            if($key === 'text' && !empty($value)){
+                $text[] = $value;
+            }
+        }
+
+        return join(PHP_EOL, $text);
+    }
+
+
+    /**
+     * Throw exception if document has no textual content
+     * 
+     * @throws \OneOffTech\Parse\Client\Exceptions\EmptyDocumentException when document has no textual content
+     */
+    public function throwIfNoContent(): self
+    {
+        if(!$this->hasContent()){
+            throw new EmptyDocumentException("Document has no textual content.");
+        }
+
+        return $this;
+    }
+
+
+    /**
+     * Create a document node from an associative array
+     */
+    public static function fromArray(array $data): DocumentNode
+    {
+        if(!(isset($data['category']) && isset($data['content']))){
+            throw new InvalidDocumentFormatException("Unexpected document structure. Missing category or content.");
+        }
+
+        if($data['category'] !== 'doc'){
+            throw new InvalidDocumentFormatException("Unexpected node category. Expecting [doc] found [{$data['category']}].");
+        }
+        
+        if(!is_array($data['content'])){
+            throw new InvalidDocumentFormatException("Unexpected content format. Expecting [array].");
+        }
+        
+        return new DocumentNode($data['content'] ?? [], $data['attributes'] ?? []);
+    }
+}
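
To illustrate how `hasContent()` and `text()` walk the nested content, here is a minimal sketch that applies the same `RecursiveIteratorIterator` traversal to a sample array. The payload layout (pages containing nodes with a `text` key) and the helper name `collectText()` are assumptions for illustration; only the `category`, `content`, and `attributes` keys of the top-level node are confirmed by `fromArray()`.

```php
<?php

// Hypothetical payload; the nested page/text layout is an assumption.
$document = [
    'category' => 'doc',
    'content' => [
        [
            'category' => 'page',
            'attributes' => ['page' => 1],
            'content' => [
                ['category' => 'heading', 'text' => 'Introduction'],
                ['category' => 'body', 'text' => 'First paragraph.'],
            ],
        ],
        [
            'category' => 'page',
            'attributes' => ['page' => 2],
            'content' => [
                ['category' => 'body', 'text' => ''],
            ],
        ],
    ],
];

// Collect every non-empty leaf stored under a 'text' key, at any depth.
function collectText(array $content): array
{
    $leaves = new RecursiveIteratorIterator(
        new RecursiveArrayIterator($content),
        RecursiveIteratorIterator::LEAVES_ONLY
    );

    $text = [];
    foreach ($leaves as $key => $value) {
        // $key is the leaf's key within its immediate parent array.
        if ($key === 'text' && !empty($value)) {
            $text[] = $value;
        }
    }

    return $text;
}

$text = collectText($document['content']);

echo count($document['content']), PHP_EOL; // 2 (pages, as in count())
echo implode(PHP_EOL, $text), PHP_EOL;     // Introduction, then First paragraph.
```

Because the check relies on `empty()`, a text leaf containing only the string `"0"` would also be skipped; whether that matters depends on the documents being parsed.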