Scraper-for-Stackoverflow

I am a master's student majoring in Computer Engineering, seeking a summer internship for 2017. If you would like to hire me or have any questions, feel free to contact me at [email protected]

HOW IT WORKS

The script scrapes the HTML of the question list page, extracts each question's title and link with XPath, and then moves on to the next page. The scraped data is stored in a MySQL database. When the crawling process finishes, the script builds an inverted index automatically. You can then run either a single-word query or a phrase query. Randomized user agents and proxies are used.
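The indexing and query step described above can be illustrated with a minimal sketch. This is not the script's actual code: the function names (`tokenize`, `build_index`, `word_query`, `phrase_query`) and the sample titles are hypothetical, the index is kept in an in-memory dictionary rather than MySQL, and no NLTK stop-word or part-of-speech filtering is applied. It only shows how a positional inverted index can answer both one-word and phrase queries.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(titles):
    """Map each token to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, title in enumerate(titles):
        for pos, token in enumerate(tokenize(title)):
            index[token].setdefault(doc_id, []).append(pos)
    return index


def word_query(index, word):
    """Return the sorted ids of the documents that contain the word."""
    tokens = tokenize(word)
    if not tokens:
        return []
    return sorted(index.get(tokens[0], {}))


def phrase_query(index, phrase):
    """Return the sorted ids of the documents that contain the words consecutively."""
    tokens = tokenize(phrase)
    if not tokens:
        return []
    postings = [index.get(t, {}) for t in tokens]
    # Only documents that contain every word can contain the phrase.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    hits = []
    for doc_id in candidates:
        # A start position matches if word i appears at position start + i.
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(tokens))):
                hits.append(doc_id)
                break
    return sorted(hits)


# Hypothetical question titles standing in for the scraped data.
titles = [
    "How to parse HTML with XPath in Python",
    "Python MySQL connection error",
    "XPath in Python returns empty list",
]
index = build_index(titles)
```

With these sample titles, `word_query(index, "python")` matches all three documents, while `phrase_query(index, "xpath in python")` matches only the first and third, because the phrase query also requires the words to appear at consecutive positions.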

INSTALLATION

pip and virtualenv

$ sudo apt-get install python-pip python-dev build-essential 
$ sudo pip install --upgrade pip 
$ sudo pip install --upgrade virtualenv

The following packages can be installed inside the virtualenv; do not use sudo for them.

nltk

$ pip install -U nltk
$ python -m nltk.downloader maxent_treebank_pos_tagger     # for pos_tag
$ python -m nltk.downloader stopwords
$ python -m nltk.downloader averaged_perceptron_tagger     # for pos_tag

mysql

$ sudo apt-get update
$ sudo apt-get install mysql-server
$ sudo mysql_secure_installation
$ sudo mysql_install_db

create database and user

$ mysql -u root -p
mysql> CREATE DATABASE testdb;
mysql> CREATE USER 'testuser'@'localhost' IDENTIFIED BY 'test623';
mysql> USE testdb;
mysql> GRANT ALL ON testdb.* TO 'testuser'@'localhost';

selenium

$ pip install selenium

beautifulsoup

$ pip install beautifulsoup4

USAGE

First, change the database username and password in scrapy.py (line 30). This project is based on Python 2.7.4. To run it from the command line: python scraper.py
