This project provides a robust, end-to-end pipeline for parsing resumes (.pdf, .docx, .doc) into a clean, structured JSON format.
It intelligently handles complex layouts, anonymizes all Personally Identifiable Information (PII) before processing, and uses a Large Language Model (Gemini) to extract and categorize information with high accuracy. The output is then validated against a strict schema (Pydantic) and securely "re-hydrated" with the original data.
Additional approaches and architectural explorations are detailed in the accompanying Google Document:
👉 View Document
- 🚀 Resume Parser Server (Production Pipeline): https://github.com/tejas-jm/resume-parser-server
- 🧠 LayoutLM-based Resume Parser: https://github.com/tejas-jm/resume-parser-layoutLM
- Broad File Support: Parses `.pdf`, `.docx`, and legacy `.doc` files.
- Layout-Aware Parsing: Uses `unstructured` with a "hi_res" (OCR) strategy to understand document layout, columns, and text blocks, preventing data jumbling.
- 🔒 Secure PII Anonymization: Leverages Presidio to find and mask all sensitive data (names, emails, phone numbers, custom links) before it's sent to any API.
- 🧠 Intelligent Extraction: Uses Google's Gemini model via LangChain to understand the context of the resume and extract data relationally (e.g., knowing which job description belongs to which company).
- ✅ Strict Schema Validation: Employs Pydantic to define the target JSON schema and automatically validate the LLM's output, so every result conforms to the same structure.
- 💧 Secure Re-hydration: The final step securely swaps the anonymized masks (e.g., `<PERSON_1>`) back to their original values (e.g., "Alex Johnson") after processing, so the final JSON is complete.
The entire process is a secure, multi-step pipeline:
```mermaid
graph TD
A["Start: User Uploads File (.pdf, .doc, .docx)"] --> B{Step 1: Parse File};
B --> C["get_file_text() using 'unstructured' hi-res"];
C --> D{Step 2: Anonymize Text};
D --> E["anonymize_and_get_lookup() using Presidio"];
E --> F["AnalyzerEngine finds PII (PERSON, EMAIL, etc.)"];
F --> G["AnonymizerEngine creates unique masks (e.g., <PERSON_1>)"];
G --> H["Output 1: anonymized_text"];
G --> I["Output 2: lookup_dict {mask: original_value}"];
H --> J{Step 3: Parse with LLM};
J --> K["LangChain chain.invoke() with anonymized_text"];
K --> L["Gemini Model generates JSON string"];
L --> M{Step 4: Validate Schema};
M --> N["PydanticOutputParser validates JSON against ResumeSchema"];
N -- Valid --> O["parsed_masked_model (Pydantic Object)"];
N -- Invalid --> P["Error: Validation Failed"];
O --> Q{Step 5: Re-hydrate Data};
I --> Q;
Q --> R["rehydrate_model() swaps masks with original values"];
R --> S["final_resume_model (Pydantic Object)"];
S --> T{Step 6: Final Output};
T --> U["Print final_resume_model as clean JSON"];
```
- LLM & Orchestration: Google Gemini, LangChain
- File Parsing: `unstructured[all-docs]` (with Detectron2 for layout detection)
- PII Anonymization: `presidio-analyzer`, `presidio-anonymizer`
- Schema Validation: `pydantic`
- System Dependencies: `libreoffice`, `antiword` (for `.doc` support)
This project is designed to run end-to-end in a Google Colab notebook.
For this notebook to work, you need to provide a Google AI Studio API key.
- Get your key:
  - Go to Google AI Studio.
  - Click "Create API key" and copy the key.
- Store it in Colab Secrets:
  - In your Colab notebook, click the "Key" icon (🔑) in the left-hand sidebar.
  - Click "Add new secret".
  - For the Name, enter `GOOGLE_API_KEY`.
  - For the Value, paste your API key.
  - Make sure the "Notebook access" toggle is turned on.
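Inside the notebook, the key is then read from Colab Secrets and exposed to the Gemini client. A minimal sketch of that step (assuming the secret name above; the notebook's exact code may differ):

```python
import os
from google.colab import userdata  # available inside Colab notebooks

# Read the secret stored above and expose it to LangChain / Gemini.
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
```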
Run the first cell in the notebook to install all required Python packages and system libraries (`libreoffice`, `antiword`).
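That cell will look something like the following Colab cell, a sketch based on the stack listed above rather than the notebook's verbatim contents:

```python
# Colab cell: system packages for .doc support, then the Python stack.
!apt-get install -y libreoffice antiword
!pip install -q "unstructured[all-docs]" presidio-analyzer presidio-anonymizer \
    langchain langchain-google-genai pydantic
```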
Run the second cell to pre-load the `unstructured` OCR and layout models. This makes the first parse much faster.
Execute the final "Run" cell. It will ask you to upload a resume file. The script will then perform every step of the workflow above and print the final, clean JSON at the end.
Uses `unstructured.partition()` with `strategy="hi_res"`. This function automatically detects the file type (`.pdf`, `.doc`, `.docx`) and uses an OCR model to analyze the document's visual layout, preventing text from different columns from being mixed.
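A minimal sketch of what `get_file_text()` (the wrapper named in the flowchart) likely does; the joining logic here is an assumption:

```python
from unstructured.partition.auto import partition

def get_file_text(path: str) -> str:
    # strategy="hi_res" runs layout detection + OCR so multi-column
    # resumes are read in visual order instead of interleaved.
    elements = partition(filename=path, strategy="hi_res")
    return "\n".join(el.text for el in elements if el.text)
```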
Uses a custom-configured Presidio `AnalyzerEngine` with a clean `RecognizerRegistry` that searches only for specific PII (`PERSON`, `EMAIL`, `PHONE_NUMBER`, `GITHUB_LINK`, `LINKEDIN_LINK`), which prevents false positives. The function returns two things (sketched after this list):
- The fully anonymized text (e.g., `"My name is <PERSON_1>"`)
- A `lookup_dict` (e.g., `{"<PERSON_1>": "Alex Johnson"}`)
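A minimal sketch of this step using Presidio's built-in recognizers. Note that Presidio's built-in entity name for emails is `EMAIL_ADDRESS`, the custom `GITHUB_LINK`/`LINKEDIN_LINK` recognizers are omitted, and the numbered masks are built manually from the analyzer's results rather than through `AnonymizerEngine` operators:

```python
from presidio_analyzer import AnalyzerEngine

def anonymize_and_get_lookup(text: str):
    analyzer = AnalyzerEngine()
    results = sorted(
        analyzer.analyze(
            text=text,
            entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
            language="en",
        ),
        key=lambda r: r.start,
    )
    lookup_dict, counters, spans = {}, {}, []
    for res in results:
        # Number each entity type: <PERSON_1>, <PERSON_2>, <EMAIL_ADDRESS_1>, ...
        counters[res.entity_type] = counters.get(res.entity_type, 0) + 1
        mask = f"<{res.entity_type}_{counters[res.entity_type]}>"
        lookup_dict[mask] = text[res.start:res.end]
        spans.append((res.start, res.end, mask))
    anonymized = text
    # Replace right-to-left so earlier character offsets stay valid.
    for start, end, mask in reversed(spans):
        anonymized = anonymized[:start] + mask + anonymized[end:]
    return anonymized, lookup_dict
```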
A set of Pydantic models that define the exact target JSON structure, including all nested objects like `Contact`, `Experience`, and `Skills`. This schema is used both to instruct the LLM and to validate its output.
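An illustrative slice of such a schema; field names mirror the sample output at the end of this README, but the real `ResumeSchema` has more fields:

```python
from typing import List, Optional
from pydantic import BaseModel

class Contact(BaseModel):
    email: Optional[str] = None
    phone: Optional[str] = None
    linkedin: Optional[str] = None
    github: Optional[str] = None

class Experience(BaseModel):
    company: str
    role: str
    dates: Optional[str] = None
    responsibilities: List[str] = []

class ResumeSchema(BaseModel):
    name: str
    title: Optional[str] = None
    contact: Contact
    experience: List[Experience] = []
```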
A LangChain pipeline (sketched below) that:
- Takes the `anonymized_text`.
- Inserts it into a `PromptTemplate`.
- Sends the prompt to the Gemini model.
- Takes the model's raw JSON string output.
- Runs it through a `PydanticOutputParser` to validate it and convert it into a Python object.
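A plausible wiring of that chain. The model name and prompt wording are placeholders; `ResumeSchema` and `anonymized_text` come from the sketches above:

```python
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

parser = PydanticOutputParser(pydantic_object=ResumeSchema)
prompt = PromptTemplate(
    template=(
        "Extract the resume below into JSON.\n"
        "{format_instructions}\n\nResume:\n{resume_text}"
    ),
    input_variables=["resume_text"],
    # Bake the schema's format instructions into every prompt.
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
chain = prompt | llm | parser  # LCEL: prompt -> Gemini -> validated Pydantic object

parsed_masked_model = chain.invoke({"resume_text": anonymized_text})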
A recursive function that walks through the Pydantic object, finds any string that matches a mask in the `lookup_dict`, and replaces it with the original, sensitive value.
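A minimal sketch of such a recursive walk, assuming Pydantic v2 models; the project's actual `rehydrate_model()` may differ in detail:

```python
from pydantic import BaseModel

def rehydrate_model(value, lookup_dict: dict):
    # Strings: swap every mask (e.g. <PERSON_1>) back to its original value.
    if isinstance(value, str):
        for mask, original in lookup_dict.items():
            value = value.replace(mask, original)
        return value
    # Containers: recurse into each element.
    if isinstance(value, list):
        return [rehydrate_model(item, lookup_dict) for item in value]
    if isinstance(value, dict):
        return {k: rehydrate_model(v, lookup_dict) for k, v in value.items()}
    # Nested Pydantic models: recurse into each field in place.
    if isinstance(value, BaseModel):
        for name in type(value).model_fields:
            setattr(value, name, rehydrate_model(getattr(value, name), lookup_dict))
        return value
    return value

# Tying the sketches together (Step 5 -> Step 6 of the flowchart):
final_resume_model = rehydrate_model(parsed_masked_model, lookup_dict)
print(final_resume_model.model_dump_json(indent=2))
```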
The final, re-hydrated output will look like this:
```json
{
"name": "Alex Johnson",
"title": "Full Stack Engineer",
"contact": {
"email": "[email protected]",
"phone": "+1 (555) 123-4567",
"address": "123 Main St, Anytown, USA 12345",
"website": "alexjohnson.dev",
"linkedin": "linkedin.com/in/alex-j-dev",
"github": "github.com/AlexJDev"
},
"summary": null,
"education": [
{
"institution": "State University",
"degree": "B.S. in Computer Science",
"dates": "Aug 2019 - May 2023",
"score": "3.8 GPA",
"location": null
}
],
"experience": [
{
"company": "Tech Solutions Inc.",
"role": "Software Engineer Intern",
"dates": "May 2022 - Aug 2022",
"duration": null,
"location": null,
"responsibilities": [
"Developed new features for a client-facing web application using React and Node.js.",
"Collaborated with senior developers in an Agile environment to fix bugs and improve code quality."
]
}
],
"skills": {
"languages": ["JavaScript", "Python", "SQL"],
"frameworks": ["React", "Node.js", "Express"],
"developer_tools": ["Git", "GitHub", "Docker"],
"databases": ["PostgreSQL", "MongoDB"],
"other": ["Agile Methodologies", "Cloud Computing"]
}
}
```