A multi-process, asynchronous, concurrent web crawler with good performance
Currently implemented:
Epoll-based event module, with a high-performance red-black-tree timer
Asynchronous DNS resolver with caching, and a Redis-based Bloom filter
Multi-process asynchronous concurrency to take full advantage of multiple cores; the number of processes is configurable and processes can be bound to CPUs
Redis-based task queue; addresses are hashed with MurmurHash for load balancing
Asynchronous HTTP client with HTTPS support; currently only GET requests are supported, and cookies are not (improvements planned)
http-parser, a powerful parser used to parse URLs
gumbo-parser, Google's HTML parsing library, used to parse crawled pages
git clone https://github.com/DevJasper/web-crawler.git
cd crawler
make
./crawler
On CentOS 7, if the dependent libraries are installed under /usr/local/lib, the program may fail to find them at runtime.
You can set the LD_LIBRARY_PATH environment variable before running:
LD_LIBRARY_PATH=/usr/local/lib ./crawler
Alternatively, add /usr/local/lib to the default library search path:
echo "/usr/local/lib" > /etc/ld.so.conf.d/usrlocallib.conf
ldconfig
settings.lua is the configuration file and follows Lua syntax. Field descriptions:
work_processes is the number of worker processes; setting it to the number of CPU cores is recommended
seed is the list of seed URLs; multiple entries can be configured, separated by commas
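A minimal settings.lua might look like the following. The field names come from the descriptions above; the values and URLs are only examples.

```lua
-- settings.lua: crawler configuration (Lua syntax; values are illustrative)
work_processes = 4          -- recommended: number of CPU cores

seed = {                    -- seed URLs, separated by commas
    "http://example.com/",
    "http://example.org/",
}
```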