This project pairs a Scrapy spider with Selenium to scrape and interact with dynamic web pages. It navigates a web portal, retrieves specific documents, and performs automated browser actions such as clicking and printing.
Ensure you have Python installed on your system. This project requires setting up a virtual environment and installing several dependencies.
- Scrapy
- scrapy-selenium
- selenium
- pyperclip
- beautifulsoup4
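Once a page's HTML has been retrieved (by Scrapy, or from Selenium via `driver.page_source`), beautifulsoup4 is what pulls the document links out of it. A minimal sketch, where the HTML snippet and the `a.document` selector are illustrative assumptions rather than the portal's real markup:

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

# Illustrative HTML standing in for a portal page the spider fetched
html = """
<div class="documents">
  <a class="document" href="/docs/report-1.pdf">Report 1</a>
  <a class="document" href="/docs/report-2.pdf">Report 2</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# Collect the href of every link marked as a document
links = [a["href"] for a in soup.select("a.document")]
print(links)  # → ['/docs/report-1.pdf', '/docs/report-2.pdf']
```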
- Set up a Python virtual environment:
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```
- Install the required libraries:
  ```bash
  pip install Scrapy scrapy-selenium selenium pyperclip beautifulsoup4
  ```
- Clone the project:
  ```bash
  git clone https://your-github-repo-link.git
  cd your-project-directory
  ```
Run Chrome in debugging mode to allow Selenium to control it:
```bash
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
  --remote-debugging-port=9222 --user-data-dir="/tmp/chrome_dev_session"
```

(The path above is for macOS; use your platform's Chrome executable elsewhere.)

Start the Scrapy project (if not already created):
```bash
scrapy startproject webportal_scraper
```

List available spiders (to verify setup):
```bash
scrapy list
```

Run the spider:
```bash
scrapy runspider portal_spider.py
```