Ehaly/CourseProject
CourseProject

Please fork this repository and paste the GitHub link of your fork on Microsoft CMT. Detailed instructions are on Coursera under Week 1: Course Project Overview/Week 9 Activities.

Please read Project_Documentation for more details. The video can be found at: https://mediaspace.illinois.edu/media/t/1_n26fno8k

To run the scraper code, start Jupyter with `jupyter notebook`.

The command-line options can be listed with `python model.py -h`.

To run the first example: `python model.py --document wars_news.txt --clusterNumber 5 --collectionNumber 2 --c 0.25 --b 0.91`

To run the second example: `python model.py --document laptop_reviews.txt --clusterNumber 4 --collectionNumber 3 --c 0.7 --b 0.96`
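The flag names above suggest a command-line interface along these lines. This is a hypothetical sketch using Python's `argparse`, not the actual parsing code in `model.py`:

```python
import argparse

def build_arg_parser():
    """Sketch of a parser for the flags shown in the examples above.

    Flag names are taken from the README examples; help texts and the
    mapping of --c/--b to lambda_c/lambda_b are assumptions.
    """
    parser = argparse.ArgumentParser(description="Cross-collection mixture model")
    parser.add_argument("--document", required=True,
                        help="input text file, e.g. wars_news.txt")
    parser.add_argument("--clusterNumber", type=int, required=True,
                        help="number of topic clusters")
    parser.add_argument("--collectionNumber", type=int, required=True,
                        help="number of collections")
    parser.add_argument("--c", type=float, required=True,
                        help="lambda_c: weight of the collection-specific mixture (assumed)")
    parser.add_argument("--b", type=float, required=True,
                        help="lambda_b: weight of the background model (assumed)")
    return parser
```

For instance, parsing the first example's arguments would yield `clusterNumber=5` and `b=0.91` as typed values rather than strings.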

The class Corpus consists of the following methods:

__init__(document_path): initialize a Corpus object.

build_corpus(): read the document at document_path; store the collection number in self.collection and the documents in self.documents.

build_vocabulary(): read the documents and build the vocabulary for the whole dataset.
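A step like build_vocabulary() could look roughly as follows; this is an illustrative sketch assuming documents arrive as lists of tokens, not the repository's actual implementation:

```python
def build_vocabulary(documents):
    """Collect the set of unique terms across all documents.

    Assumes each document is already tokenized into a list of words;
    returns a sorted list so term indices are deterministic.
    """
    vocab = set()
    for doc in documents:
        vocab.update(doc)
    return sorted(vocab)
```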

build_background_model(): build the background model from the whole dataset.
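A common way to build such a background model is to normalize term frequencies over the entire dataset; the sketch below assumes that approach and tokenized documents (the repository's actual method may differ):

```python
from collections import Counter

def build_background_model(documents):
    """Background word distribution p(w | B): term frequency over the
    whole dataset, normalized to sum to 1 (assumed formulation).
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}
```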

build_term_doc_matrix(): construct the term-document matrix, where each row represents a document and each column represents a vocabulary term; self.term_doc_matrix[i][j] is the count of term j in document i.
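The term-document matrix described above can be sketched as follows (a minimal standalone version using plain lists; the repository presumably operates on the Corpus object's own fields):

```python
def build_term_doc_matrix(documents, vocabulary):
    """Build a matrix where entry [i][j] is the count of vocabulary
    term j in document i, matching the layout described above.
    """
    word_index = {word: j for j, word in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in documents]
    for i, doc in enumerate(documents):
        for word in doc:
            j = word_index.get(word)
            if j is not None:  # skip out-of-vocabulary tokens
                matrix[i][j] += 1
    return matrix
```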

initialize(self, number_of_collections, number_of_clusters, random=True): initialize the matrices document_topic_prob, topic_word_prob and collection_topic_word_prob.

expectation_step(number_of_collections, number_of_clusters, lambda_b, lambda_c): the E-step updates p(z_{d,Ci,w} = j), i.e. the topic_prob matrix; p(z_{d,Ci,w} = B), i.e. the bg_prob matrix; and p(z_{d,Ci,j,w} = C), i.e. the common_topic_prob matrix.
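One of the E-step quantities, the posterior that a word came from the background model rather than any topic, can be sketched for a single word as below. The function name, parameter shapes, and the exact mixture form are assumptions for illustration, not the repository's E-step:

```python
def background_posterior(word_prob_topics, doc_topic_prob, background_prob, lambda_b):
    """Posterior p(z = B | w): probability that word w was generated
    by the background model rather than the topic mixture.

    word_prob_topics[j] = p(w | topic j), doc_topic_prob[j] = weight of
    topic j in the document (both hypothetical inputs for this sketch).
    """
    topic_mix = sum(pi_j * p_wj
                    for pi_j, p_wj in zip(doc_topic_prob, word_prob_topics))
    numerator = lambda_b * background_prob
    return numerator / (numerator + (1 - lambda_b) * topic_mix)
```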

maximization_step(number_of_collections, number_of_clusters): the M-step updates the matrices document_topic_prob, topic_word_prob and collection_topic_word_prob.

calculate_likelihood(number_of_collections, lambda_b, lambda_c): calculate the current log-likelihood of the model using the model's updated probability matrices.
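The log-likelihood has the usual form sum over documents and terms of count times log probability. A minimal sketch, assuming the per-document mixture probabilities have already been computed (a hypothetical `word_mixture_prob` input, not a field of the actual Corpus class):

```python
import math

def log_likelihood(term_doc_matrix, word_mixture_prob):
    """Compute sum_d sum_w c(w, d) * log p(w | d).

    term_doc_matrix[i][j] is the count of term j in document i;
    word_mixture_prob[i][j] is the full mixture probability of term j
    in document i (assumed precomputed by the E/M steps).
    """
    ll = 0.0
    for i, row in enumerate(term_doc_matrix):
        for j, count in enumerate(row):
            if count > 0:
                ll += count * math.log(word_mixture_prob[i][j])
    return ll
```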

ccmm(number_of_collections, number_of_clusters, max_iter, lambda_b, lambda_c, epsilon): run the text-mining algorithm on the input document for up to max_iter iterations.
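The overall EM loop that ccmm drives can be sketched generically as below. The `run_em` driver and its early-stopping rule (stop when the log-likelihood changes by less than epsilon) are assumptions about how max_iter and epsilon interact; the three callables stand in for the Corpus methods described above:

```python
def run_em(expectation_step, maximization_step, calculate_likelihood,
           max_iter, epsilon):
    """Generic EM driver (sketch): alternate E and M steps until the
    log-likelihood improves by less than epsilon, or until max_iter
    iterations have run. Returns (iterations_used, final_likelihood).
    """
    prev = None
    for it in range(max_iter):
        expectation_step()
        maximization_step()
        current = calculate_likelihood()
        if prev is not None and abs(current - prev) < epsilon:
            return it + 1, current  # converged early
        prev = current
    return max_iter, prev
```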
