CourseProject

Please fork this repository and paste the github link of your fork on Microsoft CMT. Detailed instructions are on Coursera under Week 1: Course Project Overview/Week 9 Activities.

Please read Project_Documentation for more details. The video could be found at : https://mediaspace.illinois.edu/media/t/1_n26fno8k

To run the scraper code, run “jupyter notebook”.

The code could be run by command “python model.py -h”.

To run the first example, “python model.py --document wars_news.txt --clusterNumber 5 --collectionNumber 2 --c 0.25 --b 0.91”

To run the second example, “python model.py --document laptop_reviews.txt --clusterNumber 4 --collectionNumber 3 --c 0.7 --b 0.96”

The class Corpus consists of the following functions: init(document_path): initialize a Corpus object.

build_corpus(): read the document in document path, store the collection number and the document in self.collection and self.documents.

build_vocabulary(): read the documents and build the vocabulary for the whole dataset.

build_background_model(): build the background model from the whole dataset.

build_term_doc_matrix(): Construct the term-document matrix where each row represents a document, each column represents a vocabulary term.self.term_doc_matrix[i][j] is the count of term j in document i.

initialize(self, number_of_collections, number_of_clusters, random=True): initialize the matrices document_topic_prob , topic_word_prob and collection_topic_word_prob.

expectation_step(number_of_collections,number_of_clusters,lambda_b, lambda_c): the E-step updates the P(zd ci w = j), i.e. the topic_prob matrix, p(zd,Ci,w = B) i.e. the bg_prob matrix, and p(zd,Ci ,j,w = C ), i.e. the common_topic_prob matrix.

maximization_step(number_of_collections, number_of_clusters ): the M-step updates the matrices document_topic_prob , topic_word_prob and collection_topic_word_prob

calculate_likelihood(number_of_collections,lambda_b, lambda_c): Calculate the current log-likelihood of the model using the model's updated probability matrices.

ccmm(number_of_collections, number_of_clusters, max_iter, lambda_b, lambda_c, epsilon): execute the text mining on the document passed in in max_iter times of iteration.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
.idea		.idea
.ipynb_checkpoints		.ipynb_checkpoints
.DS_Store		.DS_Store
Project Documentation.pdf		Project Documentation.pdf
Project Proposal.docx		Project Proposal.docx
Project_Progress_Report.pdf		Project_Progress_Report.pdf
README.md		README.md
chromedriver		chromedriver
laptop_reviews.txt		laptop_reviews.txt
model.py		model.py
news_scraper.ipynb		news_scraper.ipynb
news_url.txt		news_url.txt
results.txt		results.txt
review_scraper.ipynb		review_scraper.ipynb
reviews_url.txt		reviews_url.txt
wars_news.txt		wars_news.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CourseProject

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CourseProject

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages