This project provides a robust, end-to-end pipeline for parsing resumes (.pdf, .docx, .doc) into a clean, structured JSON format.
It intelligently handles complex layouts, anonymizes all Personally Identifiable Information (PII) before processing, and uses a Large Language Model (Gemini) to extract and categorize information with high accuracy. The output is then validated against a strict schema (Pydantic) and securely "re-hydrated" with the original data.
Additional approaches and architectural explorations are detailed in the accompanying Google Document:
👉 View Document
- 🚀 Resume Parser Server (Production Pipeline): https://github.com/tejas-jm/resume-parser-server
- 🧠 LayoutLM-based Resume Parser: https://github.com/tejas-jm/resume-parser-layoutLM
- Broad File Support: Parses `.pdf`, `.docx`, and legacy `.doc` files.
- Layout-Aware Parsing: Uses `unstructured` with a "hi_res" (OCR) strategy to understand document layout, columns, and text blocks, preventing data jumbling.
- 🔒 Secure PII Anonymization: Leverages Presidio to find and mask all sensitive data (names, emails, phone numbers, custom links) before it's sent to any API.
- 🧠 Intelligent Extraction: Uses Google's Gemini model via LangChain to understand the context of the resume and extract data relationally (e.g., knowing which job description belongs to which company).
- ✅ Strict Schema Validation: Employs Pydantic to define the target JSON schema and automatically validate the LLM's output, so every result conforms to the same structure.
- 💧 Secure Re-hydration: The final step securely swaps the anonymized masks (e.g., `<PERSON_1>`) back to their original values (e.g., "Alex Johnson") after processing, so the final JSON is complete.
The entire process is a secure, multi-step pipeline:
```mermaid
graph TD
A["Start: User Uploads File (.pdf, .doc, .docx)"] --> B{Step 1: Parse File};
B --> C["get_file_text() using 'unstructured' hi-res"];
C --> D{Step 2: Anonymize Text};
D --> E["anonymize_and_get_lookup() using Presidio"];
E --> F["AnalyzerEngine finds PII (PERSON, EMAIL, etc.)"];
F --> G["AnonymizerEngine creates unique masks (e.g., <PERSON_1>)"];
G --> H["Output 1: anonymized_text"];
G --> I["Output 2: lookup_dict {mask: original_value}"];
H --> J{Step 3: Parse with LLM};
J --> K["LangChain chain.invoke() with anonymized_text"];
K --> L["Gemini Model generates JSON string"];
L --> M{Step 4: Validate Schema};
M --> N["PydanticOutputParser validates JSON against ResumeSchema"];
N -- Valid --> O["parsed_masked_model (Pydantic Object)"];
N -- Invalid --> P["Error: Validation Failed"];
O --> Q{Step 5: Re-hydrate Data};
I --> Q;
Q --> R["rehydrate_model() swaps masks with original values"];
R --> S["final_resume_model (Pydantic Object)"];
S --> T{Step 6: Final Output};
T --> U["Print final_resume_model as clean JSON"];
```
- LLM & Orchestration: Google Gemini, LangChain
- File Parsing: `unstructured[all-docs]` (with Detectron2 for layout detection)
- PII Anonymization: `presidio-analyzer`, `presidio-anonymizer`
- Schema Validation: `pydantic`
- System Dependencies: `libreoffice`, `antiword` (for `.doc` support)
This project is designed to run end-to-end in a Google Colab notebook.
For this notebook to work, you need to provide a Google AI Studio API key.
- Get your key:
  - Go to Google AI Studio.
  - Click "Create API key" and copy the key.
- Store it in Colab Secrets:
  - In your Colab notebook, click the "Key" icon (🔑) in the left-hand sidebar.
  - Click "Add new secret".
  - For the Name, enter `GOOGLE_API_KEY`.
  - For the Value, paste your API key.
  - Make sure the "Notebook access" toggle is turned on.
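Inside the notebook, the key is then read from Colab Secrets and exposed to the Gemini client. A minimal sketch of that step (assuming the secret name above; the notebook's exact code may differ):

```python
import os
from google.colab import userdata  # available inside Colab notebooks

# Read the secret stored above and expose it to LangChain / Gemini.
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
```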
Run the first cell in the notebook to install all required Python packages and system libraries (`libreoffice`, `antiword`).
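That cell will look something like the following Colab cell, a sketch based on the stack listed above rather than the notebook's verbatim contents:

```python
# Colab cell: system packages for .doc support, then the Python stack.
!apt-get install -y libreoffice antiword
!pip install -q "unstructured[all-docs]" presidio-analyzer presidio-anonymizer \
    langchain langchain-google-genai pydantic
```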
Run the second cell to pre-load the `unstructured` OCR and layout models. This makes the first parse much faster.
Execute the final "Run" cell. It will ask you to upload a resume file. The script will then perform every step of the workflow above and print the final, clean JSON at the end.
Uses `unstructured.partition()` with `strategy="hi_res"`. This function automatically detects the file type (`.pdf`, `.doc`, `.docx`) and uses an OCR model to analyze the document's visual layout, preventing text from different columns from being mixed.
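A minimal sketch of what `get_file_text()` (the wrapper named in the flowchart) likely does; the joining logic here is an assumption:

```python
from unstructured.partition.auto import partition

def get_file_text(path: str) -> str:
    # strategy="hi_res" runs layout detection + OCR so multi-column
    # resumes are read in visual order instead of interleaved.
    elements = partition(filename=path, strategy="hi_res")
    return "\n".join(el.text for el in elements if el.text)
```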
Uses a custom-configured Presidio `AnalyzerEngine` with a clean `RecognizerRegistry` that searches only for specific PII (`PERSON`, `EMAIL`, `PHONE_NUMBER`, `GITHUB_LINK`, `LINKEDIN_LINK`), which prevents false positives. The function returns two things (sketched after this list):
- The fully anonymized text (e.g., `"My name is <PERSON_1>"`)
- A `lookup_dict` (e.g., `{"<PERSON_1>": "Alex Johnson"}`)
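A minimal sketch of this step using Presidio's built-in recognizers. Note that Presidio's built-in entity name for emails is `EMAIL_ADDRESS`, the custom `GITHUB_LINK`/`LINKEDIN_LINK` recognizers are omitted, and the numbered masks are built manually from the analyzer's results rather than through `AnonymizerEngine` operators:

```python
from presidio_analyzer import AnalyzerEngine

def anonymize_and_get_lookup(text: str):
    analyzer = AnalyzerEngine()
    results = sorted(
        analyzer.analyze(
            text=text,
            entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER"],
            language="en",
        ),
        key=lambda r: r.start,
    )
    lookup_dict, counters, spans = {}, {}, []
    for res in results:
        # Number each entity type: <PERSON_1>, <PERSON_2>, <EMAIL_ADDRESS_1>, ...
        counters[res.entity_type] = counters.get(res.entity_type, 0) + 1
        mask = f"<{res.entity_type}_{counters[res.entity_type]}>"
        lookup_dict[mask] = text[res.start:res.end]
        spans.append((res.start, res.end, mask))
    anonymized = text
    # Replace right-to-left so earlier character offsets stay valid.
    for start, end, mask in reversed(spans):
        anonymized = anonymized[:start] + mask + anonymized[end:]
    return anonymized, lookup_dict
```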
A set of Pydantic models that define the exact target JSON structure, including all nested objects like `Contact`, `Experience`, and `Skills`. This schema is used both to instruct the LLM and to validate its output.
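An illustrative slice of such a schema; field names mirror the sample output at the end of this README, but the real `ResumeSchema` has more fields:

```python
from typing import List, Optional
from pydantic import BaseModel

class Contact(BaseModel):
    email: Optional[str] = None
    phone: Optional[str] = None
    linkedin: Optional[str] = None
    github: Optional[str] = None

class Experience(BaseModel):
    company: str
    role: str
    dates: Optional[str] = None
    responsibilities: List[str] = []

class ResumeSchema(BaseModel):
    name: str
    title: Optional[str] = None
    contact: Contact
    experience: List[Experience] = []
```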
A LangChain pipeline (sketched below) that:
- Takes the `anonymized_text`.
- Inserts it into a `PromptTemplate`.
- Sends the prompt to the Gemini model.
- Takes the model's raw JSON string output.
- Runs it through a `PydanticOutputParser` to validate it and convert it into a Python object.
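A plausible wiring of that chain. The model name and prompt wording are placeholders; `ResumeSchema` and `anonymized_text` come from the sketches above:

```python
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_genai import ChatGoogleGenerativeAI

parser = PydanticOutputParser(pydantic_object=ResumeSchema)
prompt = PromptTemplate(
    template=(
        "Extract the resume below into JSON.\n"
        "{format_instructions}\n\nResume:\n{resume_text}"
    ),
    input_variables=["resume_text"],
    # Bake the schema's format instructions into every prompt.
    partial_variables={"format_instructions": parser.get_format_instructions()},
)
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash")
chain = prompt | llm | parser  # LCEL: prompt -> Gemini -> validated Pydantic object

parsed_masked_model = chain.invoke({"resume_text": anonymized_text})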
A recursive function that walks through the Pydantic object, finds any string that matches a mask in the `lookup_dict`, and replaces it with the original, sensitive value.
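A minimal sketch of such a recursive walk, assuming Pydantic v2 models; the project's actual `rehydrate_model()` may differ in detail:

```python
from pydantic import BaseModel

def rehydrate_model(value, lookup_dict: dict):
    # Strings: swap every mask (e.g. <PERSON_1>) back to its original value.
    if isinstance(value, str):
        for mask, original in lookup_dict.items():
            value = value.replace(mask, original)
        return value
    # Containers: recurse into each element.
    if isinstance(value, list):
        return [rehydrate_model(item, lookup_dict) for item in value]
    if isinstance(value, dict):
        return {k: rehydrate_model(v, lookup_dict) for k, v in value.items()}
    # Nested Pydantic models: recurse into each field in place.
    if isinstance(value, BaseModel):
        for name in type(value).model_fields:
            setattr(value, name, rehydrate_model(getattr(value, name), lookup_dict))
        return value
    return value

# Tying the sketches together (Step 5 -> Step 6 of the flowchart):
final_resume_model = rehydrate_model(parsed_masked_model, lookup_dict)
print(final_resume_model.model_dump_json(indent=2))
```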
The final, re-hydrated output will look like this:
```json
{
"name": "Alex Johnson",
"title": "Full Stack Engineer",
"contact": {
"email": "[email protected]",
"phone": "+1 (555) 123-4567",
"address": "123 Main St, Anytown, USA 12345",
"website": "alexjohnson.dev",
"linkedin": "linkedin.com/in/alex-j-dev",
"github": "github.com/AlexJDev"
},
"summary": null,
"education": [
{
"institution": "State University",
"degree": "B.S. in Computer Science",
"dates": "Aug 2019 - May 2023",
"score": "3.8 GPA",
"location": null
}
],
"experience": [
{
"company": "Tech Solutions Inc.",
"role": "Software Engineer Intern",
"dates": "May 2022 - Aug 2022",
"duration": null,
"location": null,
"responsibilities": [
"Developed new features for a client-facing web application using React and Node.js.",
"Collaborated with senior developers in an Agile environment to fix bugs and improve code quality."
]
}
],
"skills": {
"languages": ["JavaScript", "Python", "SQL"],
"frameworks": ["React", "Node.js", "Express"],
"developer_tools": ["Git", "GitHub", "Docker"],
"databases": ["PostgreSQL", "MongoDB"],
"other": ["Agile Methodologies", "Cloud Computing"]
}
}
```