Adobe PDF Heading Extractor

This project extracts structured headings (H1, H2, H3) from .pdf documents using Python and PyMuPDF.
It is containerized using Docker, following Adobe's constraints:

Processes all PDFs inside /app/input/
Outputs structured .json to /app/output/
Runs without network access
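
The batch behaviour listed above can be sketched as follows. This is a minimal illustration, not the contents of extractor.py: the names `process_folder` and `placeholder_extract` are hypothetical, and the placeholder only stands in for the real heading extraction.

```python
import json
from pathlib import Path
from typing import Callable


def process_folder(input_dir: Path, output_dir: Path,
                   extract: Callable[[Path], dict]) -> list[Path]:
    """Run `extract` on every PDF in input_dir and write one JSON file per PDF."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # sorted() keeps the processing order deterministic across runs
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        result = extract(pdf_path)
        # sample.pdf -> sample.json, matching the folder structure in this README
        out_path = output_dir / f"{pdf_path.stem}.json"
        out_path.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(out_path)
    return written


def placeholder_extract(pdf_path: Path) -> dict:
    # Hypothetical stand-in for the real extraction logic in extractor.py
    return {"title": pdf_path.stem, "outline": []}
```

Inside the container, a call such as `process_folder(Path("/app/input"), Path("/app/output"), placeholder_extract)` would cover the two mounted folders. Note that `glob("*.pdf")` is case-sensitive on Linux, so files named `.PDF` would be skipped by this sketch.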

Folder Structure

.
├── extractor.py          # Main Python script
├── requirements.txt      # Dependencies
├── Dockerfile            # Docker container setup
├── input/                # Put your PDFs here
│   └── sample.pdf
└── output/               # JSON files appear here
    └── sample.json

Build the Docker Image

docker build --platform=linux/amd64 -t adobe-solution:harshi123 .

Run the Extractor

docker run --rm -v "${PWD}/input:/app/input" -v "${PWD}/output:/app/output" --network none adobe-solution:harshi123

On Windows CMD, use %cd% instead of ${PWD}
On macOS/Linux, ${PWD} works fine

Input

Place all .pdf files you want to process into the input/ folder at the project root. The run command mounts it at /app/input inside the container.

Output

Each PDF produces a .json file with the same base name in the output/ folder (/app/output/ inside the container), for example:

{
  "title": "Abstract:",
  "outline": [
    { "level": "H1", "text": "Abstract:", "page": 2 },
    { "level": "H1", "text": "Existing System:", "page": 5 }
    // ...
  ]
}
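
This README does not describe how extractor.py assigns heading levels. One common heuristic, shown here purely as an assumption, ranks text by font size: the most frequent size is treated as body text, and the largest sizes above it become H1, H2, and H3. The function name `build_outline` and the `(text, font_size, page)` tuple format are hypothetical; with PyMuPDF, such tuples could plausibly be gathered from the spans returned by `page.get_text("dict")`. Using the first heading as the title is also an assumption, chosen only because it matches the sample output above.

```python
from collections import Counter


def build_outline(spans):
    """Build a {"title", "outline"} dict from (text, font_size, page) tuples.

    Tuples are expected in reading order. Blank text is ignored.
    """
    spans = [(t.strip(), round(s, 1), p) for t, s, p in spans if t.strip()]
    if not spans:
        return {"title": "", "outline": []}

    # Assumption: the most frequent font size belongs to body text
    body_size = Counter(size for _, size, _ in spans).most_common(1)[0][0]

    # The (up to) three largest sizes above body text map to H1, H2, H3
    heading_sizes = sorted(
        {size for _, size, _ in spans if size > body_size}, reverse=True
    )[:3]
    level_of = {size: f"H{i}" for i, size in enumerate(heading_sizes, start=1)}

    outline = [
        {"level": level_of[size], "text": text, "page": page}
        for text, size, page in spans
        if size in level_of
    ]
    # Assumption: the first detected heading doubles as the document title
    title = outline[0]["text"] if outline else ""
    return {"title": title, "outline": outline}
```

A size-only heuristic is fragile for documents that mark headings with bold text or numbering instead of larger type, so a real extractor would likely combine several signals.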

Constraints Met

No external network usage (--network none)
Input/output handled via volume mounts only
Lightweight base image (python:3.10-slim)
Batch processes all .pdf files in /app/input/

Authors

Acknowledgements

This project was developed as part of Adobe's Hackathon Challenge – Round 1A.
