Harshini2410/Adobe-PDF-Heading-Extractor

Adobe PDF Heading Extractor

This project extracts structured headings (H1, H2, H3) from .pdf documents using Python and PyMuPDF.
It is containerized using Docker, following Adobe's constraints:

  • Processes all PDFs inside /app/input/
  • Outputs structured .json to /app/output/
  • Runs without network access
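The contents of extractor.py are not shown here, so the following is only a minimal sketch of one common way to assign H1/H2/H3 levels: rank font sizes that are larger than the body text. It assumes each text span is available as a (text, font_size, page) tuple; with PyMuPDF, such values can be read from the span dictionaries returned by `page.get_text("dict")`. The function name `classify_headings` and the "most frequent size is body text" rule are illustrative assumptions, not the project's confirmed method.

```python
from collections import Counter


def classify_headings(spans, max_levels=3):
    """Map text spans to H1..H3 by font size (hypothetical heuristic).

    spans: iterable of (text, font_size, page_number) tuples.
    Assumption: the most frequent font size is body text, and the
    distinct larger sizes, largest first, are H1, H2, H3.
    """
    # Drop empty spans and round sizes so 13.98 and 14.02 compare equal.
    cleaned = [(t.strip(), round(s, 1), p) for t, s, p in spans if t.strip()]
    if not cleaned:
        return []

    # The size that occurs most often is treated as body text.
    body_size = Counter(size for _, size, _ in cleaned).most_common(1)[0][0]

    # Keep at most max_levels distinct sizes above the body size.
    heading_sizes = sorted(
        {size for _, size, _ in cleaned if size > body_size}, reverse=True
    )[:max_levels]
    level_of = {size: f"H{i + 1}" for i, size in enumerate(heading_sizes)}

    return [
        {"level": level_of[size], "text": text, "page": page}
        for text, size, page in cleaned
        if size in level_of
    ]
```

A size-only rule is easy to reason about but can misfire on documents that use bold body-sized headings or large decorative text, so treat it as a starting point.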

Folder Structure

.
├── extractor.py          # Main Python script
├── requirements.txt      # Dependencies
├── Dockerfile            # Docker container setup
├── input/                # Put your PDFs here
│   └── sample.pdf
├── output/               # JSON files appear here
│   └── sample.json

Build the Docker Image

docker build --platform=linux/amd64 -t adobe-solution:harshi123 .

Run the Extractor

docker run --rm -v "${PWD}/input:/app/input" -v "${PWD}/output:/app/output" --network none adobe-solution:harshi123
  • On Windows CMD, use %cd% instead of ${PWD}
  • On macOS/Linux and in PowerShell, ${PWD} works as written

Input

Place all .pdf files you want to process into the input/ folder in the project root. The run command mounts it at /app/input/ inside the container.
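As a rough illustration of the batch behavior, the sketch below walks an input folder and writes one .json file per .pdf into an output folder. The function name `process_folder` and the `extract` callable are hypothetical; `extract` stands in for whatever extractor.py actually does and is assumed to return a dictionary with "title" and "outline" keys.

```python
import json
from pathlib import Path


def process_folder(input_dir, output_dir, extract):
    """Write <name>.json to output_dir for every <name>.pdf in input_dir.

    extract is a hypothetical callable: it takes a Path to a PDF and
    returns a dict such as {"title": ..., "outline": [...]}.
    """
    in_path, out_path = Path(input_dir), Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    written = []
    # sorted() keeps the processing order stable between runs.
    for pdf in sorted(in_path.glob("*.pdf")):
        result = extract(pdf)
        target = out_path / (pdf.stem + ".json")
        target.write_text(
            json.dumps(result, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(target)
    return written
```

Inside the container this would be called as `process_folder("/app/input", "/app/output", extract)`. Note that `glob("*.pdf")` is case-sensitive on Linux, so a file named REPORT.PDF would be skipped by this sketch.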


Output

For each PDF, a .json file with the same base name is written to the output/ folder (mounted at /app/output/ inside the container), e.g.:

{
  "title": "Abstract:",
  "outline": [
    { "level": "H1", "text": "Abstract:", "page": 2 },
    { "level": "H1", "text": "Existing System:", "page": 5 }
    // ...
  ]
}
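To check that a generated file has the shape shown above, a small validator along these lines may help. It is a sketch based only on the sample: the name `validate_outline` is hypothetical, and the allowed levels are taken from the H1, H2, H3 set this README mentions. Real output files must be plain JSON, so the `// ...` placeholder in the sample would not appear in them.

```python
ALLOWED_LEVELS = {"H1", "H2", "H3"}


def validate_outline(doc):
    """Return a list of problems found in a parsed output document.

    An empty list means the document matches the sample's shape:
    a string "title" and an "outline" list of level/text/page entries.
    """
    problems = []
    if not isinstance(doc.get("title"), str):
        problems.append("title must be a string")

    outline = doc.get("outline")
    if not isinstance(outline, list):
        problems.append("outline must be a list")
        return problems

    for i, entry in enumerate(outline):
        if not isinstance(entry, dict):
            problems.append(f"outline[{i}] must be an object")
            continue
        if entry.get("level") not in ALLOWED_LEVELS:
            problems.append(f"outline[{i}].level must be H1, H2 or H3")
        if not isinstance(entry.get("text"), str):
            problems.append(f"outline[{i}].text must be a string")
        if not isinstance(entry.get("page"), int):
            problems.append(f"outline[{i}].page must be an integer")
    return problems
```

A typical use is to load a file with `json.load` and assert that `validate_outline(doc)` returns an empty list.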

Constraints Met

  • No external network usage (--network none)
  • Input/output handled via volume mounts only
  • Lightweight base image (python:3.10-slim)
  • Batch processes all .pdf files in /app/input/

Authors


Acknowledgements

This project was developed as part of Adobe's Hackathon Challenge – Round 1A.
