python

pdf-strings

Extract text from PDFs with position data.

Installation

pip install pdf-strings

Quick Start

from pdf_strings import from_path

# Extract text from a PDF
output = from_path("document.pdf")
print(output)  # Plain text

API Reference

Functions

`from_path(path: str, *, password: str | None = None) -> TextOutput`

Extract text from a PDF file.

Parameters:

path (str): Path to the PDF file
password (str, optional): Password for encrypted PDFs

Returns: TextOutput object containing structured lines and spans

Example:

from pdf_strings import from_path

# Basic usage
output = from_path("document.pdf")

# With password
output = from_path("encrypted.pdf", password="secret")

`from_bytes(data: bytes, *, password: str | None = None) -> TextOutput`

Extract text from PDF bytes.

Parameters:

data (bytes): PDF file contents as bytes
password (str, optional): Password for encrypted PDFs

Returns: TextOutput object containing structured lines and spans

Example:

from pdf_strings import from_bytes

with open("document.pdf", "rb") as f:
    data = f.read()

output = from_bytes(data)

Classes

`TextOutput`

Container for extracted text with structured data.

Attributes:

lines (List[List[TextSpan]]): Lines of text, each containing multiple spans

Methods:

`to_string() -> str`

Get plain text output (concatenates all text with spaces).

output = from_path("document.pdf")
plain_text = output.to_string()
# or simply:
plain_text = str(output)

`to_string_pretty() -> str`

Get formatted text that preserves spatial layout using a character grid.

output = from_path("document.pdf")
formatted_text = output.to_string_pretty()
# or using format spec:
formatted_text = f"{output:#}"

Magic Methods:

__str__(): Returns plain text (same as to_string())
__format__(format_spec): Use # for pretty formatting: f"{output:#}"

`TextSpan`

A span of text with position and metadata.

Attributes:

text (str): The text content
bbox (BoundingBox): Bounding box coordinates
font_size (float): Font size in points
page (int): Page number (0-indexed)

Example:

output = from_path("document.pdf")
for line in output.lines:
    for span in line:
        print(f"'{span.text}' at size {span.font_size}pt on page {span.page}")
        print(f"  Position: {span.bbox}")

`BoundingBox`

Bounding box coordinates for a text span.

Attributes:

top (float): Top coordinate
right (float): Right coordinate
bottom (float): Bottom coordinate
left (float): Left coordinate

String representation: (top, right, bottom, left) following HTML margin convention.

Example:

bbox = span.bbox
print(f"Top-left: ({bbox.left}, {bbox.top})")
print(f"Width: {bbox.right - bbox.left}")
print(f"Height: {bbox.top - bbox.bottom}")

Usage Examples

Extract all text

from pdf_strings import from_path

output = from_path("document.pdf")
print(output.to_string())

Preserve layout

from pdf_strings import from_path

output = from_path("invoice.pdf")
# Character grid rendering preserves columns and spacing
print(output.to_string_pretty())

Access structured data

from pdf_strings import from_path

output = from_path("document.pdf")

for line_idx, line in enumerate(output.lines):
    print(f"Line {line_idx}:")
    for span in line:
        print(f"  {span.text}")
        print(f"    Font size: {span.font_size}")
        print(f"    Position: ({span.bbox.left}, {span.bbox.top})")
        print(f"    Page: {span.page}")

Find text in specific regions

from pdf_strings import from_path

output = from_path("document.pdf")

# Find text in the top-right corner
for line in output.lines:
    for span in line:
        if span.bbox.top < 100 and span.bbox.left > 400:
            print(f"Top-right text: {span.text}")

Extract tables by position

from pdf_strings import from_path

output = from_path("table.pdf")

# Group spans by their vertical position (rows)
rows = {}
for line in output.lines:
    for span in line:
        row_key = round(span.bbox.top / 10) * 10  # Group by ~10pt vertical bands
        if row_key not in rows:
            rows[row_key] = []
        rows[row_key].append((span.bbox.left, span.text))

# Print rows sorted by vertical position
for y_pos in sorted(rows.keys(), reverse=True):
    # Sort spans in each row by horizontal position
    row_spans = sorted(rows[y_pos], key=lambda x: x[0])
    print(" | ".join(text for _, text in row_spans))

Features

Plain text extraction
Spatial layout preservation via character grid
Bounding box coordinates for every text span
Font size and page information
Password-protected PDF support
Handles complex fonts, rotated text, and multi-column layouts
Works with all Python 3.11+ versions

License

MIT

Name		Name	Last commit message	Last commit date
parent directory ..
pdf_strings		pdf_strings
README.md		README.md
hatch_build.py		hatch_build.py
pyproject.toml		pyproject.toml
test_complete_api.py		test_complete_api.py
test_python.py		test_python.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

pdf-strings

Installation

Quick Start

API Reference

Functions

`from_path(path: str, *, password: str | None = None) -> TextOutput`

`from_bytes(data: bytes, *, password: str | None = None) -> TextOutput`

Classes

`TextOutput`

`to_string() -> str`

`to_string_pretty() -> str`

`TextSpan`

`BoundingBox`

Usage Examples

Extract all text

Preserve layout

Access structured data

Find text in specific regions

Extract tables by position

Features

License

FilesExpand file tree

python

Directory actions

More options

Directory actions

More options

Latest commit

History

python

Folders and files

parent directory

README.md

pdf-strings

Installation

Quick Start

API Reference

Functions

from_path(path: str, *, password: str | None = None) -> TextOutput

from_bytes(data: bytes, *, password: str | None = None) -> TextOutput

Classes

TextOutput

to_string() -> str

to_string_pretty() -> str

TextSpan

BoundingBox

Usage Examples

Extract all text

Preserve layout

Access structured data

Find text in specific regions

Extract tables by position

Features

License

`from_path(path: str, *, password: str | None = None) -> TextOutput`

`from_bytes(data: bytes, *, password: str | None = None) -> TextOutput`

`TextOutput`

`to_string() -> str`

`to_string_pretty() -> str`

`TextSpan`

`BoundingBox`