Master's student majoring in Computer Engineering, seeking a summer internship for 2017. If you would like to hire me or have any questions, feel free to contact me at [email protected]
The script works by scraping the HTML of the question list page, extracting each question's title and link with XPath, and then moving on to the next page. The scraped data is stored in a MySQL database. When the crawl finishes, the script builds an inverted index automatically. You can then run either single-word queries or phrase queries. Randomized user-agents and proxies are used for the requests.
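To make the indexing and query step concrete, here is a minimal sketch of a positional inverted index that answers both single-word and phrase queries. It is not the code from the script itself: the function names (`tokenize`, `build_index`, `word_query`, `phrase_query`) and the sample titles are assumptions made for illustration, and the index is kept in memory rather than in MySQL.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(titles):
    """Map each token to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, title in enumerate(titles):
        for pos, token in enumerate(tokenize(title)):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def word_query(index, word):
    """Return the sorted ids of the documents that contain the word."""
    tokens = tokenize(word)
    if not tokens:
        return []
    return sorted(index.get(tokens[0], {}))


def phrase_query(index, phrase):
    """Return the sorted ids of documents containing the words consecutively."""
    tokens = tokenize(phrase)
    if not tokens:
        return []
    # Candidate documents must contain every word of the phrase.
    postings = [index.get(token, {}) for token in tokens]
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    hits = []
    for doc_id in candidates:
        # A start position matches if word i appears at position start + i.
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id]
                   for i in range(1, len(tokens))):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical question titles standing in for the scraped data.
titles = [
    "How to reverse a linked list in Python",
    "Python list comprehension with condition",
    "Reverse a string in Java",
]
index = build_index(titles)
print(word_query(index, "python"))
print(phrase_query(index, "linked list"))
```

Storing word positions, not just document ids, is what makes the phrase query possible: "linked list" matches only where "list" directly follows "linked", so the reversed phrase "list linked" returns nothing.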
pip and virtualenv
$ sudo apt-get install python-pip python-dev build-essential
$ sudo pip install --upgrade pip
$ sudo pip install --upgrade virtualenv
The following packages can be installed inside a virtualenv; do not use sudo for them.
nltk
$ pip install -U nltk
Then, in a Python interpreter, download the required NLTK data:
>>> import nltk
>>> nltk.download('maxent_treebank_pos_tagger')  # for pos_tag
>>> nltk.download('stopwords')
>>> nltk.download('averaged_perceptron_tagger')  # for pos_tag
mysql
$ sudo apt-get update
$ sudo apt-get install mysql-server
$ mysql_secure_installation
$ mysql_install_db
create database and table
$ mysql -u root -p
mysql> CREATE DATABASE testdb;
mysql> CREATE USER 'testuser'@'localhost' IDENTIFIED BY 'test623';
mysql> USE testdb;
mysql> GRANT ALL ON testdb.* TO 'testuser'@'localhost';
selenium
$ pip install selenium
beautifulsoup
$ pip install beautifulsoup
Before running the script, change the MySQL username and password in scraper.py (line 30). This project is based on Python 2.7.4. To run it from the command line:
$ python scraper.py