Xscrap supports proxy usage, user-agent customization, web service enumeration, robots.txt scraping, harvesting of site-related email addresses, and subdomain enumeration.
Install Deps:

Virtual env:

```shell
python3 -m venv venv && source venv/bin/activate && pip install bs4 dnspython lxml PySocks requests
```

Pip:

```shell
pip install bs4 dnspython lxml PySocks requests
```
Usage:

```shell
python3 xscrap.py <url> --proxy <proxy-url>
```
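For illustration, the `--proxy` and user-agent options map onto `requests` roughly like this (the function names and default agent string here are placeholders, not the tool's actual API):

```python
from typing import Optional

import requests


def build_proxies(proxy: Optional[str]) -> Optional[dict]:
    """Map one proxy URL onto both schemes, the shape requests expects."""
    return {"http": proxy, "https": proxy} if proxy else None


def fetch(url: str, proxy: Optional[str] = None,
          user_agent: str = "xscrap/1.0") -> str:
    """Fetch a page, optionally through an HTTP or SOCKS proxy.

    SOCKS proxy URLs (e.g. socks5h://127.0.0.1:9050) require PySocks,
    which is already in the dependency list above.
    """
    resp = requests.get(
        url,
        headers={"User-Agent": user_agent},
        proxies=build_proxies(proxy),
        timeout=15,
    )
    resp.raise_for_status()
    return resp.text
```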
A Python-based web scraper designed to gather publicly available information from specified web sources for cybersecurity intelligence, threat monitoring, and open-source intelligence (OSINT) gathering.
Disclaimer: This tool is intended for educational and ethical purposes only. Ensure you have explicit permission before scraping any website, respect robots.txt, and comply with all applicable laws and terms of service. The developers assume no liability and are not responsible for any misuse or damage caused by this tool.
- Description
- Purpose
- Features
- Technology Stack
- Installation
- Usage
- Ethical Considerations
- Contributing
- License
## Description

This project provides a configurable web scraping tool focused on extracting data relevant to cybersecurity professionals. It can be adapted to monitor various online sources for indicators of compromise (IoCs), mentions of vulnerabilities, potential data leaks (on public sites like pastebins), or other security-related keywords and patterns.
## Purpose

In the realm of cybersecurity, staying informed about emerging threats, vulnerabilities, and potential exposures is crucial. Manually monitoring countless websites, forums, and paste sites is inefficient. This scraper aims to automate the collection of publicly accessible data from pre-defined sources to aid in:
- Threat Intelligence: Identifying discussions or posts related to new threats, malware, or attack vectors.
- Vulnerability Monitoring: Tracking mentions of specific CVEs or software weaknesses.
- OSINT Gathering: Collecting public information related to specific domains, IPs, or organizations.
- Brand Protection: Monitoring for mentions that might indicate phishing campaigns or reputational risks.
- Data Leak Detection: Searching public paste sites or forums for potential exposure of sensitive keywords (e.g., company-specific terms, email formats - use responsibly!).
## Features

- Scrapes content from specified URLs or lists of URLs.
- Searches for user-defined keywords, regular expressions, or patterns (e.g., CVE IDs, email formats, specific terms).
- Extracts relevant text snippets or data points associated with matches.
- Configurable scraping depth and scope (within ethical limits).
- Outputs findings to structured formats (e.g., CSV, JSON, console).
- Basic mechanisms to handle common scraping challenges (e.g., user-agent rotation, configurable delays - respect target sites!).
- Modular design for easier extension to new data sources or parsing logic.
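A minimal sketch of the keyword and pattern matching described above — the two patterns here are illustrative defaults, not the tool's actual configuration:

```python
import re

# Illustrative built-in patterns; in practice users would supply their own.
PATTERNS = {
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,7}\b", re.IGNORECASE),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


def find_matches(text: str) -> dict:
    """Return every pattern match found in a page's text."""
    return {name: pat.findall(text) for name, pat in PATTERNS.items()}


page_text = "Patch now: CVE-2024-3094 was reported to security@example.com"
print(find_matches(page_text))
```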
## Technology Stack

- Language: Python 3.x
- Core Libraries:
  - `requests` (for fetching web pages)
  - `BeautifulSoup4` or `lxml` (for parsing HTML)
  - (Add others specific to your project, e.g., `Scrapy`, `Selenium` if used, `regex`, `argparse`, `configparser`/`pyyaml`)
- Data Handling: `csv`, `json`
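Output handling needs nothing beyond the standard library; a rough sketch, where the finding fields shown are assumptions:

```python
import csv
import json


def write_findings(findings: list, path: str, fmt: str = "json") -> None:
    """Persist a list of flat finding dicts as JSON or CSV."""
    if fmt == "json":
        with open(path, "w", encoding="utf-8") as f:
            json.dump(findings, f, indent=2)
    elif fmt == "csv":
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(findings[0]))
            writer.writeheader()
            writer.writerows(findings)
    else:
        raise ValueError(f"unsupported output format: {fmt!r}")
```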
## Installation

1. Clone the repository:

   ```shell
   git clone https://github.com/nylar357/VulnScraper.git
   cd VulnScraper
   ```

2. Create a virtual environment (recommended):

   ```shell
   python -m venv venv
   source venv/bin/activate  # On Windows use `venv\Scripts\activate`
   ```

3. Install dependencies:

   ```shell
   pip install requests beautifulsoup4 lxml
   ```
## Usage

(Provide clear examples of how to run the script.)

Example 1: Using the built-in configuration:

```shell
python3 vulnscraper.py
```

Example 2: Specifying targets and keywords via the command line (not yet supported):

```shell
python scraper.py --urls "https://site1.com,https://site2.org" --keywords "keyword1,keyword2" --output results.json
```
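If and when the command-line flags shown in Example 2 land, they could be parsed with `argparse` along these lines — a sketch of the proposed interface, not current behavior:

```python
import argparse


def parse_args(argv=None):
    """Parse the proposed (not yet supported) CLI flags."""
    parser = argparse.ArgumentParser(
        description="Scrape public pages for security-related keywords")
    parser.add_argument("--urls", required=True,
                        help="comma-separated target URLs")
    parser.add_argument("--keywords", default="",
                        help="comma-separated keywords to search for")
    parser.add_argument("--output", default="results.json",
                        help="output file path")
    args = parser.parse_args(argv)
    # Split the comma-separated values into clean lists.
    args.urls = [u.strip() for u in args.urls.split(",") if u.strip()]
    args.keywords = [k.strip() for k in args.keywords.split(",") if k.strip()]
    return args
```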
## Ethical Considerations

Using this tool responsibly is paramount.

- Legality & Permissions: Only scrape websites where you have explicit permission or where the `robots.txt` file permits scraping the intended sections. Always comply with the website's Terms of Service. Scraping private forums or restricted areas is illegal and unethical.
- Server Load: Implement significant delays between requests (`time.sleep()`). Do not overload the target servers. Set a descriptive and truthful User-Agent string that allows website administrators to identify your bot, potentially including contact information.
- Data Privacy: Be extremely cautious when searching for or handling potentially sensitive information (PII, credentials). Do not collect, store, or distribute private data found inadvertently. Focus on publicly acknowledged threats and vulnerabilities.
- Purpose: Use the gathered information ethically, for defensive cybersecurity purposes only. Do not use it to facilitate unauthorized access, harassment, or any illegal activities.
- Transparency: If using this in an organizational context, ensure its use aligns with company policies and ethical guidelines.

Misuse of this tool can lead to legal consequences and blocking of your IP address.
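The robots.txt and rate-limiting points above can be enforced in code with the standard library alone; a sketch, where the agent string and delay value are placeholders you should set honestly for your own bot:

```python
import time
import urllib.robotparser

BOT_AGENT = "xscrap-bot/1.0 (contact: you@example.com)"  # placeholder identity
REQUEST_DELAY = 5.0  # placeholder; seconds to wait between requests


def allowed_by_robots(robots_txt: str, url: str, agent: str = BOT_AGENT) -> bool:
    """Check an already-fetched robots.txt body against a candidate URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)


def polite_pause() -> None:
    """Sleep between requests so target servers are not overloaded."""
    time.sleep(REQUEST_DELAY)
```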
## Contributing

Contributions are welcome! If you'd like to improve the scraper, please follow these steps:

1. Fork the repository.
2. Create a new branch (`git checkout -b feature/YourFeatureName`).
3. Make your changes.
4. Commit your changes (`git commit -m 'Add some feature'`).
5. Push to the branch (`git push origin feature/YourFeatureName`).
6. Open a Pull Request.
Please ensure your code adheres to basic coding standards and includes comments where necessary. Add a clear description of your changes in the Pull Request.
## License

This project is licensed under the MIT License - see the LICENSE file for details.
Generated by [BryceG]

