- http协议 与 https协议 (https://langgithub.github.io/2019/06/13/http%E4%B8%8Ehttps/)
- Cookie池
- User-Agent池 查看文件 (https://github.com/langgithub/python_spider/blob/master/User-Agent.txt)
- ip代理池 短效代理=>站大爷 长效代理=>购买服务器,装adsl服务
- DNS缓存 爬虫框架会涉及
- 抓包 fiddler,charles
- 在线爬虫 (淘宝,运营商)
- 后台控制逻辑
- 单独启动线程轮询完成爬虫任务(读取与修改redis中存放爬虫阶段)
- 主线程修改爬虫阶段,轮询等待结果(读取与修改redis中存放爬虫阶段)
- 离线爬虫 (requests模块,scrapy模块)
-
接口爬虫 (抓取破解)
-
selenium自动化爬虫
- 启动hub集群(需要其他参数自行看)
java -jar selenium-server-standalone-3.8.1.jar -role hub -browserTimeout 60
- 启动node节点
- node 是firfox。注意webdriver.gecko.driver路径 java -jar selenium-server-standalone-3.8.1.jar -role node -hub http://192.168.176.1:4444/grid/register -browser "browserName=firefox,webdriver.gecko.driver=/usr/local/bin/geckodriver"
- node 是chrome。注意Dwebdriver.chrome.driver路径 java -Dwebdriver.chrome.driver=/Users/yuanlang/work/javascript/chromedriver -jar selenium-server-standalone-3.8.1.jar -role node -hub http://192.168.176.1:4444/grid/register -browser browserName=chrome
- node 是IE。注意Dwebdriver.ie.driver路径 java -Dwebdriver.ie.driver=D:/IEDrvierServer.ext -jar selenium-server-standalone-3.8.1.jar -role node -hub http://192.168.176.1:4444/grid/register -browser browserName=ie