DevJasper/web-crawler
Asynchronous concurrent crawler

1. Introduction

A multi-process, asynchronous concurrent crawler with decent performance.
Current implementation:
Epoll-based event module with a high-performance red-black tree timer
Asynchronous DNS resolver with caching, plus a Redis-based Bloom filter for URL deduplication
Multi-process asynchronous concurrency to take full advantage of multiple cores; the number of processes is configurable and each process can be bound to a CPU
Redis-based task queue; addresses are hashed with MurmurHash for load balancing
Asynchronous HTTP client with HTTPS support; currently only GET is supported, and cookies are not (improvements planned)

2. Dependencies

http-parser: used to parse URLs
gumbo-parser: Google's HTML parsing library, used to parse crawled pages

3. Compile and run

git clone https://github.com/DevJasper/web-crawler.git
cd web-crawler
make
./crawler

On CentOS 7, if the dependent libraries are installed under /usr/local/lib, the program may fail to find them at runtime.
You can set the LD_LIBRARY_PATH environment variable, then run:

LD_LIBRARY_PATH=/usr/local/lib ./crawler

Alternatively, add /usr/local/lib to the default library search path:

echo "/usr/local/lib" > /etc/ld.so.conf.d/usrlocallib.conf
ldconfig

4. Configuration

settings.lua is the configuration file and follows Lua syntax. Field descriptions:
work_processes: the number of worker processes; setting it to the number of CPU cores is recommended
seed: the list of seed URLs; multiple entries can be configured, separated by commas
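A minimal sketch of what the two fields above might look like in settings.lua, assuming the field names match the descriptions; the values shown are placeholders, not project defaults:

```lua
-- settings.lua: hypothetical example configuration

-- Number of worker processes; set to the CPU core count.
work_processes = 4

-- Seed URLs; multiple entries, separated by commas.
seed = {
    "http://example.com/",
    "http://example.org/",
}
```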

About

Super fast C web crawler