A powerful web scraping and AI content extraction tool built with Streamlit, Selenium, and BeautifulSoup. This application allows you to scrape websites and extract structured information using AI.
Features:
- Web scraping with Selenium for JavaScript-rendered sites
- DOM cleaning and processing with BeautifulSoup
- AI-powered content extraction with Llama 3.2 (served by Ollama) and LangChain
- User-friendly Streamlit interface
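
The DOM-cleaning feature listed above can be illustrated with a minimal sketch. It uses BeautifulSoup to drop script and style elements and reduce the page to its non-empty text lines. The function name clean_dom and the sample HTML are assumptions made for illustration; the project's actual logic lives in scrape.py and may differ.

```python
from bs4 import BeautifulSoup


def clean_dom(html: str) -> str:
    """Return the visible text of an HTML document, one non-empty line each.

    Hypothetical helper for illustration; not taken from scrape.py.
    """
    soup = BeautifulSoup(html, "html.parser")

    # Remove elements whose contents are not useful to the AI parser.
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Put each text node on its own line, then drop blank lines.
    text = soup.get_text(separator="\n")
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Title</h1><script>var x=1;</script>"
    "<p>Hello <b>world</b></p></body></html>"
)
cleaned = clean_dom(sample)
print(cleaned)
```

Stripping scripts and styles before parsing keeps the text handed to the model shorter, which usually matters more than preserving the exact page layout.
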
Prerequisites:
- Python 3.9 or higher
- Google Chrome browser
- Ollama server running locally (for AI parsing)
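
The three prerequisites above can be checked with a short script. This is a sketch under stated assumptions, not a replacement for debug_chrome.py. It assumes Ollama listens on its default local address (http://localhost:11434) and that Chrome is discoverable on PATH under one of a few common executable names, which is often not the case on macOS or Windows. All function names are hypothetical.

```python
import shutil
import sys
import urllib.request

# Assumption: Ollama is listening on its default local address.
OLLAMA_URL = "http://localhost:11434"

# Assumption: common executable names; Chrome may be installed elsewhere.
CHROME_NAMES = ("google-chrome", "google-chrome-stable", "chrome", "chromium")


def python_ok(minimum=(3, 9)) -> bool:
    """True if the running interpreter meets the minimum version."""
    return sys.version_info >= minimum


def find_chrome():
    """Return the path of the first Chrome executable found on PATH, or None."""
    for name in CHROME_NAMES:
        path = shutil.which(name)
        if path:
            return path
    return None


def ollama_running(url=OLLAMA_URL, timeout=2.0) -> bool:
    """True if an HTTP server answers with status 200 at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # Covers URLError, refused connections, and timeouts.
        return False


def check_prerequisites() -> dict:
    """Collect the result of each check under a short label."""
    return {
        "python": python_ok(),
        "chrome": find_chrome() is not None,
        "ollama": ollama_running(),
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'not found'}")
```

A "not found" result for Chrome only means it is not on PATH; the debugging tool described in the setup steps remains the authoritative check.
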
Local setup:
- Clone the repository
    git clone <your-repo-url>
    cd <repository-folder>
- Install dependencies
    pip install -r requirements.txt
- Ensure Chrome is installed
  The web scraper requires Google Chrome. If it is not installed, download it from google.com/chrome.
- Run the debugging tool to verify setup
    python debug_chrome.py
  This checks whether Chrome is properly configured on your system.
- Start the application
    streamlit run main.py
Deploying to Render:
- Create a new Web Service on Render
- Connect your GitHub/GitLab repository
- Configure the following settings:
  - Build Command: bash build.sh
  - Start Command: streamlit run main.py --server.port $PORT --server.address 0.0.0.0
- Set Environment Variables:
  - Add RENDER=true
- Deploy
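
To show why the RENDER=true variable matters, here is a minimal sketch of how an application could use it to choose Chrome launch flags. The function names and the chosen flags are assumptions; scrape.py may read the variable differently. The flags themselves (--headless=new, --no-sandbox, --disable-dev-shm-usage) are standard Chrome switches commonly used in containerized environments.

```python
import os


def running_on_render(environ=None) -> bool:
    """True when the RENDER environment variable is set to 'true'.

    Hypothetical helper; pass a dict to test without touching os.environ.
    """
    env = os.environ if environ is None else environ
    return env.get("RENDER", "").strip().lower() == "true"


def chrome_arguments(environ=None) -> list:
    """Return Chrome command-line flags suited to the current environment."""
    args = []
    if running_on_render(environ):
        # Assumed flags for a server without a display or a large /dev/shm.
        args += ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]
    return args


if __name__ == "__main__":
    print(chrome_arguments())
```

Each returned string could be passed to Selenium's ChromeOptions.add_argument before the driver is created.
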
If you encounter errors when running the scraper locally:
- Chrome crashes or won't start
  - Close all Chrome windows and processes
  - Update Chrome to the latest version
  - Run python debug_chrome.py to diagnose issues
  - Temporarily disable your antivirus software
  - Restart your computer
- Permission errors with chromedriver
  - The application should handle this automatically, but if problems persist, try running the debug tool.
If you encounter errors on Render:
- Chrome binary errors
  - Check the build logs to ensure Chrome was installed properly
  - Verify that the environment variable RENDER=true is set
- Timeout errors
  - Increase the timeout duration in scrape.py if needed
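
For the timeout case, the sketch below shows one way a scraper could make its timeout adjustable. It is illustrative only: the SCRAPE_TIMEOUT variable, the 60-second default, and both function names are assumptions, not settings that scrape.py is known to expose. The two driver methods it calls, set_page_load_timeout and set_script_timeout, are part of Selenium's WebDriver API.

```python
import os

# Assumption: a default chosen for illustration, not the project's value.
DEFAULT_TIMEOUT_SECONDS = 60.0


def timeout_from_env(environ=None, name="SCRAPE_TIMEOUT",
                     default=DEFAULT_TIMEOUT_SECONDS) -> float:
    """Read a positive timeout in seconds from the environment.

    Falls back to the default when the variable is missing, not a number,
    or not positive. SCRAPE_TIMEOUT is a hypothetical variable name.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    return value if value > 0 else default


def apply_timeouts(driver, seconds: float) -> None:
    """Apply the same limit to page loads and asynchronous scripts."""
    driver.set_page_load_timeout(seconds)
    driver.set_script_timeout(seconds)
```

With a Selenium driver in hand, calling apply_timeouts(driver, timeout_from_env()) before driver.get(url) would raise the limit without editing the source again.
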
Project files:
- main.py - Streamlit application entry point
- scrape.py - Web scraping functionality
- parse.py - AI parsing with Ollama
- debug_chrome.py - Diagnostic tool for Chrome issues
- requirements.txt - Python dependencies
- build.sh - Setup script for Render
- render.yaml - Render configuration
License: MIT
Your contact information or link to portfolio