AI Web Scraper

A web scraping and AI content extraction tool built with Streamlit, Selenium, and BeautifulSoup. It scrapes websites, including JavaScript-rendered pages, and uses a locally hosted LLM to extract structured information from the results.

Features

Web scraping with Selenium for JavaScript-rendered sites
DOM cleaning and processing with BeautifulSoup
AI-powered content extraction with Llama 3.2 (served through Ollama) and LangChain
User-friendly Streamlit interface
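
To illustrate the DOM-cleaning feature, here is a minimal sketch. It is not the code in scrape.py: the function name `clean_body_text` is hypothetical, and treating "cleaning" as removing `<script>` and `<style>` tags plus blank lines is an assumption.

```python
def clean_body_text(html: str) -> str:
    """Return the visible text of <body>, one non-empty line per row."""
    # Imported lazily so this module still loads where beautifulsoup4 is absent.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(html, "html.parser")
    body = soup.body
    if body is None:
        return ""

    # Scripts and stylesheets carry no readable content, so remove them.
    for tag in body.find_all(["script", "style"]):
        tag.decompose()

    # Put each text node on its own line, then drop whitespace-only lines.
    text = body.get_text(separator="\n")
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)
```

With Selenium, the input would typically be `driver.page_source` after the page has finished loading.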

Requirements

Python 3.9 or higher
Google Chrome browser
Ollama server running locally (for AI parsing)
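
If you want to check these requirements before installing anything, a small standard-library script along the following lines may help. It is a sketch and is separate from debug_chrome.py. The function names are hypothetical, the list of Chrome executable names is a guess, and the URL assumes Ollama is listening on its default port (11434).

```python
import shutil
import sys
import urllib.error
import urllib.request

# Ollama listens on port 11434 by default; change this if your server differs.
OLLAMA_URL = "http://localhost:11434"

# Executable names vary by platform and install method; this list is a guess.
CHROME_CANDIDATES = ("google-chrome", "google-chrome-stable", "chrome", "chromium")


def python_ok(min_version=(3, 9)) -> bool:
    """True if the running interpreter is at least min_version."""
    return sys.version_info >= min_version


def find_chrome():
    """Return the path of the first Chrome-like executable on PATH, or None."""
    for name in CHROME_CANDIDATES:
        path = shutil.which(name)
        if path:
            return path
    return None


def ollama_reachable(url: str = OLLAMA_URL, timeout: float = 2.0) -> bool:
    """True if an HTTP GET to the Ollama server root returns status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("Python 3.9+:", python_ok())
    print("Chrome on PATH:", find_chrome() or "not found")
    print("Ollama reachable:", ollama_reachable())
```

On Windows and macOS, Chrome is usually installed outside PATH, so "not found" from this script does not necessarily mean Chrome is missing.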

Setup Instructions

Local Setup

Clone the repository

git clone <your-repo-url>
cd <repository-folder>

Install dependencies
```
pip install -r requirements.txt
```
Ensure Chrome is installed

The scraper requires Google Chrome. If it is not installed, download it from google.com/chrome.
Run the debugging tool to verify setup
```
python debug_chrome.py
```
This checks whether Chrome is correctly configured on your system.
Start the application
```
streamlit run main.py
```

Deploying to Render

Create a new Web Service on Render
Connect your GitHub/GitLab repository
Configure the following settings:
- Build Command: bash build.sh
- Start Command: streamlit run main.py --server.port $PORT --server.address 0.0.0.0
Set Environment Variables:
- Add RENDER=true
Deploy
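
The RENDER=true variable gives the application a way to detect that it is running on Render rather than on your machine. This README does not show how scrape.py uses it, so the following is only a sketch of one plausible approach: switching Chrome to headless, container-friendly flags when the variable is set. The function names are hypothetical and the window size is an arbitrary choice.

```python
import os
from typing import List, Mapping, Optional


def running_on_render(env: Optional[Mapping[str, str]] = None) -> bool:
    """True if the RENDER environment variable is set to 'true' (any case)."""
    env = os.environ if env is None else env
    return env.get("RENDER", "").strip().lower() == "true"


def chrome_arguments(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return Chrome command-line flags suited to the current environment."""
    args = ["--window-size=1920,1080"]
    if running_on_render(env):
        # Hosted containers have no display and limited shared memory.
        args += ["--headless=new", "--no-sandbox", "--disable-dev-shm-usage"]
    return args
```

Each returned string can be passed to `add_argument` on a Selenium `ChromeOptions` object before the driver is created.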

Troubleshooting

Common Issues - Local Environment

If you encounter errors when running the scraper locally:

Chrome crashes or won't start
- Close all Chrome windows and processes
- Update Chrome to the latest version
- Run python debug_chrome.py to diagnose issues
- Temporarily disable your antivirus software
- Restart your computer
Permission errors with chromedriver
- The application should handle this automatically. If problems persist, run python debug_chrome.py.
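
On macOS and Linux, a chromedriver permission error often means the binary is missing its execute bit. The sketch below shows one way to restore it. The helper name `ensure_executable` is hypothetical, and this is not necessarily what the application does internally. On Windows the execute bits have no effect, so this does not apply there.

```python
import os
import stat


def ensure_executable(path: str) -> bool:
    """Add execute permission if the owner's execute bit is missing.

    Returns True if the mode was changed, False if it was already executable.
    """
    mode = os.stat(path).st_mode
    if mode & stat.S_IXUSR:
        return False
    # Grant execute to owner, group, and others, keeping all existing bits.
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    return True
```

Call it with the path to your chromedriver binary before starting the scraper.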

Common Issues - Render Deployment

Chrome binary errors
- Check the build logs to ensure Chrome was installed properly
- Verify that the environment variable RENDER=true is set
Timeout errors
- Increase the timeout duration in scrape.py if needed
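
Rather than editing the timeout by hand for each environment, you could read it from an environment variable. The sketch below assumes a hypothetical variable named SCRAPE_TIMEOUT and a 30-second default; neither comes from scrape.py. `set_page_load_timeout` is the Selenium WebDriver method that sets the page-load limit.

```python
import os

DEFAULT_TIMEOUT_SECONDS = 30.0  # assumed default, not taken from scrape.py


def timeout_from_env(env=None, name="SCRAPE_TIMEOUT") -> float:
    """Read a positive timeout in seconds from env, else return the default."""
    env = os.environ if env is None else env
    try:
        value = float(env.get(name, ""))
    except ValueError:
        # Unset or non-numeric values fall back to the default.
        return DEFAULT_TIMEOUT_SECONDS
    return value if value > 0 else DEFAULT_TIMEOUT_SECONDS


def apply_page_load_timeout(driver, seconds: float) -> None:
    """Tell a Selenium WebDriver how long to wait for a page to load."""
    driver.set_page_load_timeout(seconds)
```

On Render, the variable would be set alongside RENDER=true in the service's environment settings.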

Project Structure

main.py - Streamlit application entry point
scrape.py - Web scraping functionality
parse.py - AI parsing with Ollama
debug_chrome.py - Diagnostic tool for Chrome issues
requirements.txt - Python dependencies
build.sh - Setup script for Render
render.yaml - Render configuration

License

MIT

Contact

Your contact information or link to portfolio
