A Python-based toolkit for analyzing potentially malicious PDF files using static analysis, IOC extraction, YARA scanning, and threat intelligence integrations.
- 📌 Metadata extraction (author, creator, timestamps)
- 🔍 Suspicious keyword detection (/JavaScript, /OpenAction, etc.)
- 📦 Embedded object extraction
- ⚠️ JavaScript analysis inside PDFs
- 🌐 IOC extraction (IPs, domains, URLs)
- 🧬 YARA rule scanning
- 🛡️ CVE pattern detection
- 🧠 Risk scoring engine
- 🔎 VirusTotal lookup (hash-based)
- ☁️ Hybrid Analysis sandbox integration
- 📊 Automated report generation (DOCX)
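
As an illustration of the suspicious keyword detection feature, here is a minimal sketch that counts risky PDF name tokens in a file's raw bytes. The keyword list and the function name `count_suspicious_keywords` are assumptions made for this example and may not match what the toolkit actually uses.

```python
import re
from collections import Counter

# Assumed keyword list for illustration; the toolkit's real list may differ.
SUSPICIOUS_KEYWORDS = [
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/EmbeddedFile",
]


def count_suspicious_keywords(data: bytes) -> Counter:
    """Count occurrences of each suspicious keyword in raw PDF bytes."""
    counts = Counter()
    for keyword in SUSPICIOUS_KEYWORDS:
        # Reject matches followed by a letter or digit, so that /JS
        # does not also match a longer name such as /JScript.
        pattern = re.escape(keyword) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, data))
        if hits:
            counts[keyword.decode("ascii")] = hits
    return counts


if __name__ == "__main__":
    sample = b"<< /Type /Catalog /OpenAction 5 0 R >> << /S /JavaScript /JS (app.alert(1)) >>"
    print(dict(count_suspicious_keywords(sample)))
```

A raw byte scan like this only sees uncompressed objects; keywords inside compressed object streams would need to be decompressed first.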
PDF-Malware-Analysis-Toolkit/
│
├── analyzer/
├── yara_rules/
├── samples/
├── screenshots/
├── logs/
├── reports/
│
├── main.py
├── requirements.txt
├── README.md
├── LICENSE
└── .gitignore
git clone https://github.com/sandeep0428/pdf-malware-analysis-toolkit.git
cd pdf-malware-analysis-toolkit
python -m venv venv
venv\Scripts\activate
pip install -r requirements.txt

Create a .env file:
VT_API_KEY=your_virustotal_api_key
HA_API_KEY=your_hybrid_analysis_api_key
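
One way these keys could be read at runtime is sketched below, using only the standard library. The helper names `load_env_file` and `get_api_keys` are hypothetical; the toolkit may load the file differently (for example, through a dotenv package listed in requirements.txt).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=value lines and export them to os.environ.

    Variables already set in the environment are not overwritten.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.is_file():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip('"').strip("'")
        loaded[key] = value
        os.environ.setdefault(key, value)
    return loaded


def get_api_keys() -> dict:
    """Return the two API keys named in the .env file, or None if unset."""
    return {
        "virustotal": os.environ.get("VT_API_KEY"),
        "hybrid_analysis": os.environ.get("HA_API_KEY"),
    }


if __name__ == "__main__":
    load_env_file()
    keys = get_api_keys()
    # Report only whether each key is present, never the key itself.
    print({name: value is not None for name, value in keys.items()})
```

Keeping the .env file out of version control (it is a natural candidate for .gitignore) avoids leaking the keys.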
python main.py samples/sample_cve.pdf
python main.py samples/

[+] Analyzing: samples/sample_cve.pdf
--- CVE Detection ---
['CVE-2010-0188 Exploit']
--- VirusTotal ---
Malicious: 1
[!] WARNING: File is malicious!
--- Risk Score ---
{'score': 75, 'level': 'HIGH'}
[+] Analysis Complete
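
To show how a result shaped like `{'score': 75, 'level': 'HIGH'}` could be produced, here is a minimal weighted scoring sketch. The finding names, weights, and thresholds are all assumptions chosen for illustration, not the toolkit's actual values.

```python
# Hypothetical weights per finding type; not taken from the toolkit.
WEIGHTS = {
    "javascript": 20,
    "open_action": 15,
    "embedded_file": 10,
    "yara_match": 20,
    "cve_match": 30,
    "vt_malicious": 25,
}


def risk_level(score: int) -> str:
    """Map a 0-100 score to a level using assumed thresholds."""
    if score >= 70:
        return "HIGH"
    if score >= 40:
        return "MEDIUM"
    return "LOW"


def calculate_risk(findings) -> dict:
    """Sum the weights of the distinct findings and cap the total at 100."""
    score = min(100, sum(WEIGHTS.get(name, 0) for name in set(findings)))
    return {"score": score, "level": risk_level(score)}


if __name__ == "__main__":
    # JavaScript + CVE pattern + a VirusTotal detection: 20 + 30 + 25 = 75
    print(calculate_risk(["javascript", "cve_match", "vt_malicious"]))
```

An additive model like this is easy to explain in a report, since each point in the score traces back to a specific finding.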
- Extract metadata from PDF
- Scan for suspicious keywords
- Extract embedded objects
- Analyze JavaScript content
- Extract IOCs (URLs, IPs, domains)
- Apply YARA rules
- Detect CVE patterns
- Query VirusTotal / Hybrid Analysis (if enabled)
- Calculate risk score
- Generate final report
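
The IOC extraction step in the workflow above can be sketched with a few regular expressions. The patterns and the function name `extract_iocs` are assumptions for this example; they are deliberately loose, so file names such as `payload.js` would also match the domain pattern and would need filtering in practice.

```python
import re

# Illustrative patterns; real-world extraction usually needs more filtering.
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+", re.IGNORECASE)
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,24}\b",
    re.IGNORECASE,
)


def extract_iocs(text: str) -> dict:
    """Return de-duplicated, sorted URLs, IPv4 addresses, and domains."""
    return {
        "urls": sorted(set(URL_RE.findall(text))),
        "ips": sorted(set(IPV4_RE.findall(text))),
        "domains": sorted({d.lower() for d in DOMAIN_RE.findall(text)}),
    }


if __name__ == "__main__":
    decoded = "open http://evil.example.com/dl then beacon to 192.168.1.10"
    print(extract_iocs(decoded))
```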
- Static analysis only
- May not detect obfuscated payloads
- API results depend on availability
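
As one concrete example of the obfuscation limitation: the PDF format allows any character in a name to be written as `#` followed by two hex digits, so `/J#61vaScript` is equivalent to `/JavaScript` but is missed by a plain keyword match. The sketch below decodes such escapes before matching. The function name is hypothetical, and this is not a description of what the toolkit currently does.

```python
import re

# A '#' followed by exactly two hex digits, as used in PDF name objects.
_HEX_ESCAPE = re.compile(rb"#([0-9A-Fa-f]{2})")


def decode_pdf_name_escapes(data: bytes) -> bytes:
    """Replace every #xx hex escape with the byte it encodes.

    This is applied blindly to all bytes, so the result is suitable for
    keyword matching only and should not be parsed as a PDF again.
    """
    return _HEX_ESCAPE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), data
    )


if __name__ == "__main__":
    raw = b"<< /S /J#61vaScript /JS (app.alert(1)) >>"
    print(b"/JavaScript" in raw)
    print(b"/JavaScript" in decode_pdf_name_escapes(raw))
```

This handles only one obfuscation technique; encoded or compressed streams and obfuscated JavaScript remain out of reach for a simple static pass.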
This project was developed as part of a cybersecurity problem statement and reflects my own implementation, design decisions, and enhancements for practical SOC use cases.
AI-assisted tools were used to support development, optimization, and code refinement.
MIT License
Sandeep Kumar





