
TwitterClassification

TwitterClassification is a set of scripts and methods that collect and analyze Twitter user data. It was created for Hazel Kwon's research on "Distance effect on Twitter users' perception of terrorism news".

Dependencies:

Twython
Scikit-learn: bag-of-words model, Random Forest model
Natural Language Toolkit: Snowball stemmer, English Stop-words dictionary
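
As a rough illustration of how the Scikit-learn pieces listed above can fit together, here is a minimal sketch that chains a bag-of-words model into a Random Forest. It is not taken from account_classification.py: the toy descriptions, the labels, and the parameter values are all made up, and the Snowball stemming step is left out (Scikit-learn's built-in English stop-word list stands in for the NLTK one).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Made-up toy data: 1 = personal account, 0 = non-personal account.
descriptions = [
    "coffee lover, dad, weekend hiker",
    "Breaking news and live updates around the clock",
    "student. opinions are my own",
    "Official account of a city transit agency",
]
labels = [1, 0, 1, 0]

# CountVectorizer builds the bag-of-words features;
# RandomForestClassifier learns from those word counts.
model = make_pipeline(
    CountVectorizer(stop_words="english"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(descriptions, labels)

# Classify an unseen description; the result is an array of 0/1 labels.
predictions = model.predict(["Official account of a local newspaper"])
```

With only four training examples the predictions are not meaningful; the point is the shape of the pipeline.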

Scripts:

get_twitter_user_descriptions.py: collects the descriptions of up to 300 Twitter users and writes them to a .csv file.
get_user_information.py: a more general version of get_twitter_user_descriptions.py. It collects more data than just descriptions and timezone information (which is all that get_twitter_user_descriptions.py collects).
account_classification.py: classes and methods that implement supervised machine learning protocols to classify Twitter profiles as either a personal (layman's) account or a non-personal account (such as that of a business, public figure, or news reporter). For details on how to use this code, see below and the example.txt document.
concreteness.py: measures the concreteness of a tweet. How concrete the language in a user's tweets is gives us a proximal coefficient, which indicates how close the tweeter is to the event. The function calculates the average concreteness rating of a sentence (total concreteness score divided by the number of concrete words). The concreteness ratings are taken from Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman's work (2014), found here (http://crr.ugent.be/papers/Brysbaert_Warriner_Kuperman_BRM_Concreteness_ratings.pdf)
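
The averaging described for concreteness.py can be sketched as follows. This is an illustration only, not the script's actual code: the function name, the tokenization, and the three ratings are hypothetical placeholders (they are not values from the published norms), and "concrete words" is assumed to mean words that appear in the ratings table.

```python
import re

# Hypothetical excerpt of a ratings table: word -> concreteness rating.
# Placeholder values, NOT the published Brysbaert et al. ratings.
EXAMPLE_RATINGS = {"bomb": 4.5, "street": 4.8, "fear": 2.0}


def average_concreteness(sentence, ratings):
    """Return the mean rating of the words in `sentence` found in `ratings`."""
    words = re.findall(r"[a-z']+", sentence.lower())
    scores = [ratings[word] for word in words if word in ratings]
    if not scores:
        return None  # no rated words, so the average is undefined
    # Total concreteness score divided by the number of rated words.
    return sum(scores) / len(scores)


score = average_concreteness("Fear on the street", EXAMPLE_RATINGS)
```

Returning None for a sentence with no rated words is one possible design choice; it avoids a division by zero and lets the caller decide how to treat such tweets.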

Additional information:

To use account_classification.py, the data must be structured as below.

training data:

user_ID'\t'Description'\t'Personal
some id'\t'some description'\t'1
some id'\t'some description'\t'0
some id'\t'some description'\t'1
...

testing data:

user_ID'\t'Description
some id'\t'some description
some id'\t'some description
some id'\t'some description

Where 1 = personal, 0 = nonpersonal. user_ID can be either a string or an int. Furthermore, both the training and testing data should be saved as .txt files with a tab ('\t') separator. Make sure all data is saved in UTF-8 format. To encode entire data frames, we recommend using R.
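
For readers who want to check their files before running the classifier, the layout above can be parsed with Python's standard csv module. This is a sketch under assumptions: load_rows and load_file are hypothetical helpers, not part of account_classification.py, and user_ID is kept as a string.

```python
import csv
import io


def load_rows(handle, labeled):
    """Parse tab-separated rows into dicts, skipping the header line."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the header row
    rows = []
    for fields in reader:
        if not fields:
            continue  # ignore blank lines
        row = {"user_id": fields[0], "description": fields[1]}
        if labeled:
            # Training files carry a third column: 1 = personal, 0 = nonpersonal.
            row["personal"] = int(fields[2])
        rows.append(row)
    return rows


def load_file(path, labeled):
    """Open a UTF-8 encoded .txt file and parse it with load_rows."""
    with open(path, encoding="utf-8", newline="") as handle:
        return load_rows(handle, labeled)


# In-memory stand-in for a small training file.
sample = io.StringIO(
    "user_ID\tDescription\tPersonal\n"
    "42\tcoffee lover, dad\t1\n"
    "43\tBreaking news 24/7\t0\n"
)
training_rows = load_rows(sample, labeled=True)
```

For a testing file, pass labeled=False so that only the user_ID and description columns are read.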
