Welcome to the repository for my final project for the CM3070 module at the University of London.
For this capstone, the university requires choosing from a predefined set of project briefs. That constrains the “idea” space, so I treated it as an engineering exercise.
The system uses a fine-tuned T5 model to translate natural-language job descriptions into optimized Boolean search strings. It then applies semantic similarity scoring to rank candidate resumes against those queries. Everything is packaged in a PyQt5 GUI.
In HR and recruitment technology, data privacy is non-negotiable.
To ensure zero risk of exposing PII (Personally Identifiable Information) or compromising real user data, the entire candidate database used in this project is 100% synthetic.
I used Meta's Llama 3B model to generate a realistic, diverse synthetic dataset of resumes and candidate profiles. You can run, test, and stress the application knowing that no real personal data is involved.
Any similarity or match to real-world individuals or real candidate data, if it occurs, is accidental, unintended, and purely coincidental.
⚠️ CRITICAL NOTICE: EXTERNAL ASSET DOWNLOAD REQUIRED
Because of the size of the models and datasets, the complete project environment comes to roughly 6-7 GB. To keep the repository lightweight and stay within GitHub's storage limits, the heavy assets are hosted outside this repository.
This repository does NOT contain:
- The original base T5 model
- The fine-tuned T5 model weights
- The complete, uncompressed training dataset
- The Llama-generated local database files
To experience the complete application:
- Clone this repository to your local machine.
- Access the Official Public Google Drive Repository.
- Download the missing directories and files.
- Mount them directly into the root directory of your cloned repository.
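After copying the Drive assets into place, a quick check can confirm nothing is missing. The following is a minimal sketch, not part of the project: the helper name `missing_assets` is hypothetical, and the list of expected entries is an assumption taken from the file names that appear in this README's project layout.

```python
from pathlib import Path

# Assumption: these are the assets the application expects in the repo root,
# based on the names listed in this README's project layout.
REQUIRED_ASSETS = [
    "fine-tuned-t5",          # fine-tuned model checkpoints (directory)
    "local_t5_base",          # base T5 weights (directory)
    "candidate_resumes.csv",  # synthetic candidate database
    "Cleaned_Dataset.csv",    # training corpus
]


def missing_assets(root="."):
    """Return the names from REQUIRED_ASSETS that do not exist under root."""
    root_path = Path(root)
    return [name for name in REQUIRED_ASSETS if not (root_path / name).exists()]
```

Calling `missing_assets()` from the cloned repository's root returns an empty list when every expected file and directory is present, and otherwise returns the names still to be downloaded.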
- 🧠 Algorithmic Boolean Generation: Transforms dense, natural-language job descriptions into precise, optimized Boolean search parameters using a custom fine-tuned T5 architecture.
- 📐 Hybrid Semantic Ranking: Merges rigid Boolean logic with deep-learning semantic similarity metrics to evaluate, score, and rank candidate resumes with high fidelity.
- 🖥️ Qt-Powered Graphical Interface: A sleek, low-latency PyQt5 application granting users the ability to:
- Input or paste raw job descriptions.
- Generate, audit, and manually refine Boolean strings.
- Execute searches against the synthetic local dataset (candidate_resumes.csv).
- Preview and export optimized candidate shortlists.
- 🔄 Extensible Training Pipeline: Includes native scripts to further fine-tune the T5 model using proprietary datasets (Cleaned_Dataset.csv).
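To illustrate the hybrid ranking idea, here is a minimal sketch that first filters resumes with a Boolean query and then orders the survivors by similarity to the job description. It is not the project's implementation: all function names are hypothetical, the query is assumed to be a list of OR-groups joined by AND, and a bag-of-words cosine similarity stands in for the deep-learning similarity model so that the example runs with the standard library alone.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9+#]+", text.lower())


def matches_boolean(query_groups, text):
    """Each inner list is an OR-group; all groups must match (AND).

    Terms are matched as single tokens, so multi-word phrases are not handled.
    """
    tokens = set(tokenize(text))
    return all(any(term.lower() in tokens for term in group) for group in query_groups)


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity (a stand-in for embedding similarity)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def rank_candidates(query_groups, job_description, resumes):
    """Filter resumes by the Boolean query, then sort by similarity, best first."""
    scored = [
        (cosine_similarity(job_description, text), name)
        for name, text in resumes.items()
        if matches_boolean(query_groups, text)
    ]
    return [(name, round(score, 3)) for score, name in sorted(scored, reverse=True)]
```

For example, the query `[["python"], ["sql", "postgres"]]` reads as `python AND (sql OR postgres)`. Resumes that fail the filter never reach the scoring step, which keeps the Boolean constraint strict while the similarity score decides the order.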
A consistent environment is critical. Ensure Python 3.7+ is active, then install the requirements:
pip install -r requirements.txt
Once the Drive assets are mounted in the root folder, execute the main application:
python automated_search_strategy_generator.py
- Input: Paste your target job description.
- Process: Click Generate Boolean Query to run T5 inference.
- Refine: Adjust the output parameters manually if required.
- Execute: Click Search Local Candidates to cross-reference against the Llama-generated database.
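The generation step behind the GUI button can be sketched roughly as follows, using the Hugging Face `transformers` API for T5. This is an illustration under assumptions, not the project's code: the function names are hypothetical, the task prefix `"generate boolean query: "` is invented for the example (the prefix used during fine-tuning is not documented here), and the generation settings are arbitrary. It assumes the fine-tuned weights and tokenizer files are in `fine-tuned-t5/`.

```python
from pathlib import Path

MODEL_DIR = Path("fine-tuned-t5")  # directory name taken from the project layout


def build_prompt(job_description, prefix="generate boolean query: "):
    """Collapse whitespace and prepend a task prefix (the prefix is hypothetical)."""
    return prefix + " ".join(job_description.split())


def generate_boolean_query(job_description, model_dir=MODEL_DIR, max_length=128):
    """Run one inference pass; requires transformers, torch, and sentencepiece."""
    # Imported lazily so build_prompt can be used without the heavy dependencies.
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(str(model_dir))
    model = T5ForConditionalGeneration.from_pretrained(str(model_dir))
    inputs = tokenizer(
        build_prompt(job_description),
        return_tensors="pt",
        truncation=True,
        max_length=512,
    )
    # Beam search settings are illustrative, not tuned values from the project.
    output_ids = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Loading the model on every call is slow; a GUI application would normally load the tokenizer and model once at startup and reuse them for each query.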
To fine-tune the T5 model further on new data:
- Ensure your structured data is located at Cleaned_Dataset.csv.
- Run the training command:
python automated_search_strategy_generator.py train
Checkpoints are written automatically to the fine-tuned-t5/ directory.
├── automated_search_strategy_generator.py # Core application & training logic
├── Cleaned_Dataset.csv # Training corpus for T5 fine-tuning
├── candidate_resumes.csv # Synthetic Llama-generated search DB
├── fine-tuned-t5/ # ☁️ Model checkpoints (Drive)
├── local_t5_base/ # ☁️ Base model weights (Drive)
├── banner.png # UI hero asset
├── favicon.ico # PyQt5 application icon
└── requirements.txt # Environment dependencies
Use this project however it helps: study it, adapt it, or build on top of it. If you’d like to collaborate professionally (research, consulting, or implementation), reach out.
If you reuse significant parts of the work, please give appropriate credit and follow the license terms.
