A multi-process, asynchronous, concurrent web crawler with good performance
Currently implemented:
Epoll-based event module, with a high-performance red-black-tree timer
Asynchronous DNS resolver with caching, and a Redis-based Bloom filter
Multi-process asynchronous concurrency to take full advantage of multiple cores; the number of processes is configurable and processes can be bound to CPUs
Redis-based task queue; addresses are hashed with MurmurHash for load balancing
Asynchronous HTTP client with HTTPS support; currently only GET requests are supported, and cookies are not (improvements planned)
http-parser, a powerful parser used to parse URLs
gumbo-parser, Google's HTML parsing library, used to parse crawled pages
git clone https://github.com/DevJasper/web-crawler.git
cd crawler
make
./crawler
On CentOS 7, if the dependent libraries are installed under /usr/local/lib, the program may fail to find them at runtime.
You can set the LD_LIBRARY_PATH environment variable before running:
LD_LIBRARY_PATH=/usr/local/lib ./crawler
Alternatively, add /usr/local/lib to the default library search path:
echo "/usr/local/lib" > /etc/ld.so.conf.d/usrlocallib.conf
ldconfig
settings.lua is the configuration file and follows Lua syntax. Field descriptions:
work_processes is the number of worker processes; setting it to the number of CPU cores is recommended
seed is the list of seed URLs; multiple entries can be configured, separated by commas
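A minimal settings.lua might look like the following. The field names come from the descriptions above; the values and URLs are only examples.

```lua
-- settings.lua: crawler configuration (Lua syntax; values are illustrative)
work_processes = 4          -- recommended: number of CPU cores

seed = {                    -- seed URLs, separated by commas
    "http://example.com/",
    "http://example.org/",
}
```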