This is a Scrapy project for R18 web scraping, and also serves as an example of Scrapy techniques and CI tools from the GitHub Marketplace.
- Python 3.6+
- Scrapy 1.6.0
- Fully tested on Linux; it should also work on Windows, macOS, and BSD
Run docker-compose in the docker folder to initialize a MongoDB server:
docker-compose up -d
If you also want to follow the log messages:
docker-compose up -d && docker-compose logs --follow
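The actual compose file lives in the docker folder; a minimal sketch of what such a MongoDB service could look like (the service name, port mapping, and volume path here are assumptions, not the project's real configuration):

```yaml
version: "3"
services:
  mongo:
    image: mongo
    ports:
      - "27017:27017"        # default MongoDB port
    volumes:
      - ./mongo-data:/data/db  # persist data on the host
```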
Initialize Postgres with Sentry first:
1. Generate a secret key:
docker run --rm sentry config generate-secret-key
2. Start temporary Redis and Postgres containers, then use the secret key to initialize the database:
docker run --detach \
--name sentry-redis-init \
--volume $PWD/redis-data:/data \
redis
docker run --detach \
--name sentry-postgres-init \
--env POSTGRES_PASSWORD=secret \
--env POSTGRES_USER=sentry \
--volume $PWD/postgres-data:/var/lib/postgresql/data \
postgres
docker run --interactive --tty --rm \
--env SENTRY_SECRET_KEY='<secret-key>' \
--link sentry-postgres-init:postgres \
--link sentry-redis-init:redis \
sentry upgrade
Then enter the superuser name and password when prompted.
3. Stop and remove the Redis and Postgres containers:
docker stop sentry-postgres-init sentry-redis-init && docker rm sentry-postgres-init sentry-redis-init
4. Edit the env files to add the superuser name, password, and database-related information
5. Start Sentry with docker-compose.yml:
docker-compose up --detach && docker-compose logs --follow
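The env files referenced above are part of the repository; as a hypothetical illustration, a Sentry env file following the official Sentry Docker image's variable names might look like this (all values are placeholders, not the project's real credentials):

```
# .env - placeholder values only
SENTRY_SECRET_KEY=<secret-key>
SENTRY_DB_USER=sentry
SENTRY_DB_PASSWORD=secret
SENTRY_POSTGRES_HOST=postgres
SENTRY_REDIS_HOST=redis
```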
Pipenv is used for virtual environment management. Create the virtual environment and activate it:
pipenv install && pipenv shell
Go to the project root and run the command:
cd run && python run.py
Run the following command to stop MongoDB and remove its containers and volumes:
docker-compose down --volumes
- SitemapSpider
- Stats Collection
- Requests and Responses
- Item Loader
- Spider Contracts
- Downloading and processing files and images
- [X] Move the zh-to-en page redirection into a downloader middleware
- [X] Docker configurations for MongoDB