Note: if you still cannot run scrapy, you may need to fix your PATH with "export PATH=~/anaconda2/bin:$PATH"
Install xlwt (used to save results to an Excel file) with "sudo apt-get update" and then "sudo apt-get install python-xlwt"
If that doesn't work, try "pip install xlwt" (if you have pip)
Setup
Edit the file paths in Path.java to point to the src folder
Include your list of starting URLs in resources, in a file named "crawl_list.txt"
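The crawl list is one reddit URL per line. A hypothetical example of what "crawl_list.txt" might contain (the actual subreddits are up to you):

```
https://www.reddit.com/r/programming/
https://www.reddit.com/r/java/
https://www.reddit.com/r/distributed/
```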
Running The Distributed Program
Compile the code with "make"
You need at least 2 terminals/machines
Make one the master by running "java Master"
Run the rest as slaves with "java Slave"
Each slave must then enter its master's address as "host,port", e.g. "127.0.1.1,8000"
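The master's address is entered as a single comma-separated "host,port" string. A minimal sketch of how that string can be parsed (the class and field names here are illustrative, not the actual Slave implementation):

```java
// Parse a "host,port" string such as "127.0.1.1,8000" into its parts.
// Hypothetical helper; the real Slave class may handle this differently.
public class MasterAddress {
    public final String host;
    public final int port;

    public MasterAddress(String spec) {
        String[] parts = spec.split(",");
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected host,port but got: " + spec);
        }
        this.host = parts[0].trim();
        this.port = Integer.parseInt(parts[1].trim());
    }

    public static void main(String[] args) {
        MasterAddress addr = new MasterAddress("127.0.1.1,8000");
        System.out.println(addr.host + ":" + addr.port);
    }
}
```

The slave would then open a socket to that host and port to register with the master.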
After all slaves are added, type the command "start" on the master
Wait until all slaves have finished; the master will print the time logs and exit
Clean up files with "make clean"
Running the Solo Program
Compile the code with "make"
Run command "java Solo"
About
We created a master-slave architecture for a distributed crawling system that crawls a given list of reddit pages. The master uses a custom weighted round-robin algorithm to distribute URLs to the slave components.
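A weighted round-robin distributor can be sketched as follows. This is an illustrative sketch only, not the actual Master implementation: the slave names and weights are assumptions, and how the real master derives weights is not specified here.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative weighted round robin: each slave receives a run of URLs
// proportional to its weight before the master moves on to the next slave.
public class WeightedRoundRobin {
    private final List<String> slaves = new ArrayList<>();
    private final List<Integer> weights = new ArrayList<>();
    private int index = -1;     // which slave is currently being served
    private int remaining = 0;  // URLs left in the current slave's turn

    public void addSlave(String name, int weight) {
        slaves.add(name);
        weights.add(weight);
    }

    // Return the slave that should receive the next URL.
    public String next() {
        if (slaves.isEmpty()) {
            throw new IllegalStateException("no slaves registered");
        }
        if (remaining == 0) {
            index = (index + 1) % slaves.size();
            remaining = weights.get(index);
        }
        remaining--;
        return slaves.get(index);
    }
}
```

With hypothetical weights of 2 for a "fast" slave and 1 for a "slow" slave, the assignment sequence is fast, fast, slow, fast, fast, slow, and so on, so the faster machine handles twice as many URLs.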