A Python-based web scraping tool that extracts article titles, translates them, and performs word frequency analysis. The project uses Selenium WebDriver with BrowserStack integration for robust web scraping.
- Automated web scraping with configurable article limits
- Title translation capability
- Word frequency analysis of translated titles
- Rate limiting to avoid overwhelming the API
- BrowserStack integration for reliable testing
- Comprehensive error handling and logging
- Automated session status reporting
- Python 3.x
- Selenium WebDriver
- BrowserStack account
- Required Python packages:
  - selenium (third-party)
  - json, time, re, collections, logging (part of the Python standard library, no installation needed)
- Clone the repository:
git clone https://github.com/tejaspawar/python-selenium-browserstack-assignment.git
- Install dependencies:
pip install -r requirements.txt
- Set up BrowserStack credentials:
- Option 1: In browserstack.yml:
userName: "your_username"
accessKey: "your_access_key"
- Option 2: Export them as the environment variables BROWSERSTACK_USERNAME and BROWSERSTACK_ACCESS_KEY.
For Linux/macOS:
export BROWSERSTACK_USERNAME=<browserstack-username>
export BROWSERSTACK_ACCESS_KEY=<browserstack-access-key>
For Windows (setx takes the name and value as separate arguments and applies to new terminal sessions):
setx BROWSERSTACK_USERNAME <browserstack-username>
setx BROWSERSTACK_ACCESS_KEY <browserstack-access-key>
- Set the API key for the translation service by exporting it as the environment variable TRANSLATOR_API_KEY.
For Linux/macOS:
export TRANSLATOR_API_KEY=<your_translator_api_key>
For Windows:
setx TRANSLATOR_API_KEY <your_translator_api_key>
- Configure scraping parameters in elpais_scrapper.py:
MAX_ARTICLE_TO_SCRAPE = 5 # Adjust as needed

The scraper performs the following operations:
- Ensures the website loads in Spanish
- Extracts titles from web pages
- Translates the extracted titles
- Implements rate limiting (1-second delay between requests)
- Analyzes word frequency in translated titles
- Reports words that appear more than twice
- Updates BrowserStack session status
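
The translation and rate-limiting steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: `translate_titles` and `fake_translate` are hypothetical names, and the real scraper would pass in a function that calls its translation service using TRANSLATOR_API_KEY.

```python
import time
from typing import Callable, Iterable, List


def translate_titles(
    titles: Iterable[str],
    translate: Callable[[str], str],
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Translate titles one by one, pausing between consecutive requests."""
    translated = []
    for index, title in enumerate(titles):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        translated.append(translate(title))
    return translated


def fake_translate(text: str) -> str:
    # Hypothetical stand-in for a real translation API call.
    return text.upper()


if __name__ == "__main__":
    print(translate_titles(["hola", "mundo"], fake_translate, delay_seconds=0.1))
```

Passing `sleep` as a parameter keeps the delay easy to replace in tests, so the 1-second pause does not slow them down.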
To run the sample test across the platforms defined in browserstack.yml, run:
browserstack-sdk src/elpais_scrapper.py

The script handles various exceptions:
- StaleElementReferenceException for dynamic page elements
- General exceptions with detailed logging
- BrowserStack session status updates for both success and failure cases
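
A rough sketch of how this error handling could be organized is shown below. The helper names (`retry`, `session_status_script`, `run_with_status`) are hypothetical and not taken from the project. The retry helper accepts the exception types to retry on, so in the scraper one would pass Selenium's `StaleElementReferenceException`; the status update uses BrowserStack's `browserstack_executor` script with the `setSessionStatus` action, sent through the driver's `execute_script` method.

```python
import json
import logging
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
logger = logging.getLogger("scraper")  # hypothetical logger name


def retry(
    action: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 3,
) -> T:
    """Call action again when one of the retry_on exceptions is raised."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            logger.debug("Attempt %d failed: %r", attempt, exc)
            if attempt == attempts:
                raise  # Out of attempts: let the caller see the error.
    raise ValueError("attempts must be at least 1")


def session_status_script(status: str, reason: str) -> str:
    """Build the BrowserStack executor script that marks the session."""
    payload = {
        "action": "setSessionStatus",
        "arguments": {"status": status, "reason": reason},
    }
    return "browserstack_executor: " + json.dumps(payload)


def run_with_status(driver, task: Callable[[], None]) -> None:
    """Run task and report 'passed' or 'failed' to BrowserStack."""
    try:
        task()
    except Exception as exc:
        logger.error("Scrape failed: %s", exc)
        driver.execute_script(session_status_script("failed", str(exc)))
        raise
    else:
        driver.execute_script(session_status_script("passed", "Scrape completed"))
```

With a real WebDriver, a stale element lookup could then be wrapped as `retry(lambda: element_getter().text, (StaleElementReferenceException,))`, where `element_getter` is whatever function re-locates the element.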
The script logs at three levels:
- Debug level: Detailed operation tracking
- Info level: Important milestones and results
- Error level: Exception details and failures
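
One possible way to set up these levels with the standard `logging` module is sketched here. The logger name, message format, and example messages are assumptions for illustration; the project's own configuration may differ.

```python
import logging


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a logger that writes timestamped messages to the console."""
    logger = logging.getLogger("elpais_scraper")  # hypothetical logger name
    logger.setLevel(level)
    if not logger.handlers:
        # Attach the handler only once, even if this function is called again.
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = configure_logging(logging.DEBUG)
logger.debug("Located %d article links", 5)              # detailed tracking
logger.info("Translated title: %s", "Example title")     # milestones and results
logger.error("Failed to translate title: %s", "timeout")  # failures
```

Switching the argument to `logging.INFO` hides the debug messages while keeping milestones and errors.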
Example output:
Repeated words in translated headers:
- word1: 3
- word2: 4
- ...
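
A report in this format could be produced with `collections.Counter`, as in the sketch below. The function names and the tokenization rule (lowercase ASCII letters and apostrophes) are assumptions made for this example, not the project's confirmed behavior. "More than twice" is interpreted as a count of at least 3.

```python
import re
from collections import Counter
from typing import Dict, Iterable


def repeated_words(titles: Iterable[str], min_count: int = 3) -> Dict[str, int]:
    """Count words across all titles and keep those seen at least min_count times."""
    counts = Counter()
    for title in titles:
        # Lowercase so that "The" and "the" are counted together.
        counts.update(re.findall(r"[a-z']+", title.lower()))
    return {word: n for word, n in counts.most_common() if n >= min_count}


def format_report(words: Dict[str, int]) -> str:
    """Render the counts as a bulleted report."""
    lines = ["Repeated words in translated headers:"]
    lines += [f"- {word}: {n}" for word, n in words.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = ["The vote and the law", "The law of the land", "A new law"]
    print(format_report(repeated_words(sample)))
```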