This folder contains the visualization results from running our code and a short analysis of those results. We compare our results to those in "Mining Causal Topics in Text Data: Iterative Topic Modeling with Time Series Feedback" by Kim et al.
The paper experimented with the parameter mu, which weights the word prior distribution by controlling the pseudo-counts. However, because gensim's LDA implementation effectively black-boxes this part of the method, we instead experiment with the decay parameter as a similar alternative to mu. To test the decay parameter, we run LDA with 30 topics and decay values in the range (0.5, 1].
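
A minimal sketch of what such a decay sweep could look like with gensim. The input `texts` (a list of token lists), the helper names, and the `passes` and `random_state` values are illustrative assumptions, not the exact settings used for the results in this folder.

```python
def decay_grid():
    """Decay values in (0.5, 1] that are compared in this analysis."""
    return [0.51, 0.6, 0.7, 0.8, 0.9, 1.0]


def train_lda_per_decay(texts, num_topics=30, decays=None):
    """Train one gensim LdaModel per decay value.

    `texts` is a hypothetical list of tokenized documents,
    e.g. [["tax", "budget"], ["debate", "execution"]].
    Returns a dict mapping each decay value to its trained model.
    """
    # Imported lazily so decay_grid() is usable without gensim installed.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(texts)
    corpus = [dictionary.doc2bow(doc) for doc in texts]

    models = {}
    for decay in (decays or decay_grid()):
        models[decay] = LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            decay=decay,
            passes=5,        # illustrative value (assumption)
            random_state=0,  # fixed seed so runs are comparable (assumption)
        )
    return models
```

Fixing `random_state` across the sweep keeps the decay value as the only thing that changes between runs.
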
The paper found that higher values of mu (we use the decay parameter provided in gensim's LDA implementation as a similar alternative) were not always associated with higher topic purity and in fact tended to decrease it. In our runs, however, the best decay value appears to be somewhere around 0.7. At decay=0.7 and 0.8, there seems to be a general increase in purity, while there is a general decrease for decay=0.51, 0.6, and 0.9. The purity level for decay=1 seems to be relatively constant.
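
For readers unfamiliar with the purity measure, here is a minimal sketch of how it could be computed. It assumes the entropy-based definition as we understand it from the paper (purity = 100 + 100 × entropy over the positive-impact and negative-impact word mass of a topic, with log base 2). The function name and inputs are hypothetical, and the exact definition should be checked against the paper.

```python
import math


def topic_purity(pos_weight, neg_weight):
    """Purity of one topic from its positive- and negative-impact word mass.

    Assumed definition: 100 + 100 * (p * log2(p) + n * log2(n)),
    where p and n are the normalized positive and negative shares.
    Returns 100 for a topic whose words all point one way and 0 for an even split.
    """
    total = pos_weight + neg_weight
    if total <= 0:
        raise ValueError("topic has no significant words")

    entropy = 0.0
    for weight in (pos_weight, neg_weight):
        share = weight / total
        if share > 0:  # treat 0 * log(0) as 0
            entropy += share * math.log2(share)
    return 100 + 100 * entropy
```

Under this definition, a topic whose significant words all have the same impact direction scores 100, and a topic split evenly between directions scores 0.
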
The paper also tried different numbers of topics with mu=1000. We do a similar comparison by running LDA with k=10, 20, 30, 40, and a varying number of topics, with the decay parameter set to 1. The paper saw a larger increase in purity over the iterations with a varying number of topics and with k=40. In our runs, however, by the end of the 5th iteration purity for k=10 was consistently much higher than for the other k values. Varying the number of topics each iteration did not seem to improve the average purity of the causal topics we found.
We found that most decay values did not improve the causal confidence of our topics. While there was a slight bump in confidence at iteration 3 with decay=0.51, it immediately returned to around 97.75% confidence. Only with decay=0.7 and 0.8 was there a significant increase in confidence.
We see similar confidence results across the different values of k. Varying the number of topics each iteration did not seem to improve the confidence over time, but it did give a higher confidence overall for most iterations. No number of topics showed a steadily increasing confidence level.
We can see below that the topics seem to be relevant to presidential elections.
- intervention, humanitarian, trade, theaters, breakup
- identity, rats, jews, spells, security -> Antisemitism
- women, prescription, sullivan, regulatory, standardized -> Medical Care
- surpluses, budget, tax, medicare, russian -> Tax Budget
- mcginn, execution, page, repreive, debate -> Death Penalty
- copied, lieberman, incorrectly, referred, article
- aug, lieberman, drug, abortion, beneficiaries -> Medical Care/Abortion
- powell, opening, institute, enterprise, cuba -> Foreign Diplomacy
- bushnell, candace, sex, city, fox
- tax, plan, trillion, social, surplus -> Tax Budget
We think that the majority of the differences lie in the fact that we are not discounting the word priors exactly as the paper does. The decay parameter represents [the percentage of the previous lambda value that is forgotten when each new document is examined](https://radimrehurek.com/gensim/models/ldamodel.html). The paper's mu, however, adds a pseudo-word count to the M-step of the topic model. Essentially, the paper weights the word prior more heavily at the beginning of model generation, to give words in a topic a better chance of staying in that topic, and decreases that weight as the algorithm progresses to allow more movement of words between topics. Decay, on the other hand, seems to weight the entire document and not just the words. We believe that the differences between our solution and the one shown in the paper can be attributed to this difference in how the parameters are used.
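
The contrast between the two parameters can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's code or gensim's internals. The first function is a generic pseudo-count (mu) smoothed M-step estimate. The other two follow the online LDA update of Hoffman et al., on which gensim's implementation is based, with a step size of `(offset + t) ** (-decay)`. All function and argument names are hypothetical.

```python
def prior_smoothed_word_probs(expected_counts, prior_probs, mu):
    """M-step word distribution for one topic with mu pseudo-counts.

    expected_counts: word -> expected count from the E-step.
    prior_probs: word -> prior probability (assumed to sum to 1).
    A larger mu pulls the estimate toward the prior for those words only.
    """
    total = sum(expected_counts.values()) + mu
    vocab = set(expected_counts) | set(prior_probs)
    return {
        word: (expected_counts.get(word, 0.0) + mu * prior_probs.get(word, 0.0)) / total
        for word in vocab
    }


def online_step_size(t, decay, offset=1.0):
    """Step size rho_t = (offset + t) ** (-decay) for update number t."""
    return (offset + t) ** (-decay)


def blend_lambda(old_lambda, new_lambda, rho):
    """Blend old and new topic-word parameters; rho is the share that is replaced.

    Unlike mu, this acts on every word of the topic at once.
    """
    return [(1 - rho) * old + rho * new for old, new in zip(old_lambda, new_lambda)]
```

In this sketch, mu only affects the words that carry prior mass, while decay changes how quickly all topic-word parameters are overwritten by new documents, which is consistent with the difference described above.
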
To see the pyLDAvis interactive visualization, go to GitHub's HTML Preview and paste in the link to the HTML file. Note that a larger bubble does not mean the topic has a higher probability of appearing; it simply means more words were assigned to it. A good topic model will have fewer overlapping bubbles, which corresponds to more distinct topics.
