Comparative Visualization of Document Clustering Models

Models in comparison: LDA (Topic Modeling), K-Means, Deep Embedded Clustering

A visualization tool for comparing the document clustering results of different clustering models.

Overview

Juxtapose three models that apply different techniques to topic modeling.

Demo Video

Project Description

Document clustering is one of the most widely researched tasks in natural language processing. However, because it relies on unsupervised learning, there is no guarantee that the clustering results are accurate. The results therefore need to be validated by domain experts, which can be tedious and repetitive.

For this reason, we developed this tool, which visualizes clustering results from different techniques to assist in evaluating individual models and the quality of their clusters.

Clustering Algorithms

Clustering models fall into several types according to the algorithm they use. We chose the following three models: Latent Dirichlet Allocation (LDA), K-Means, and Deep Embedded Clustering (DEC).

Latent Dirichlet Allocation (probabilistic method): LDA treats each item as a mixture of clusters, described by a probability distribution over them.
K-Means (centroid-based method): K-Means assigns each item to a cluster so as to minimize the within-cluster sum of squares (WCSS, i.e. variance).
Deep Embedded Clustering (low-dimensional embedding method): DEC uses a deep neural network to learn features that represent each cluster.
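
To make the WCSS objective concrete, here is a minimal pure-Python sketch of K-Means (Lloyd's algorithm). It is an illustration only, not the code used in this project; the function names (squared_distance, wcss, kmeans) and the toy points are assumptions made for the example. LDA and DEC are not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def wcss(points, labels, centroids):
    """Within-cluster sum of squares: the quantity K-Means tries to minimize."""
    return sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))


def kmeans(points, k, n_iter=20, seed=0):
    """Alternate between assigning points and recomputing centroids."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Toy data (hypothetical): two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels, wcss(points, labels, centroids))
```

In the tool itself the points would be document feature vectors rather than 2-D coordinates.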

Dimension Reduction

UMAP vs t-SNE

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction (McInnes, L., & Healy, J., 2018)

To visualize documents, we use Uniform Manifold Approximation and Projection (UMAP). t-SNE is the most popular traditional technique for reducing dimensions for preprocessing and projection, but it requires substantial computing power and can be time-consuming. UMAP is a more recent technique that, according to its authors, preserves more of the global structure and runs faster than t-SNE.

Keyword Extraction & Entity Recognition

Keyword Extraction

To verify clusters, we extract the main keywords from each cluster. A keyword's score is a probability-based measure of how frequent the word is within the cluster relative to its frequency across all documents. We use the pyLDAvis library to extract keywords for the LDA model, and kmeans to pyLDAvis for the K-Means and DEC models.
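
The scoring idea can be sketched in plain Python. The example below follows the relevance measure described for LDAvis (a weighted mix of the log probability of a word in a cluster and the log of its lift over the whole corpus). It does not call pyLDAvis, and the function name top_keywords, the weight lam, and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter


def top_keywords(cluster_docs, all_docs, n=3, lam=0.6):
    """Rank words in a cluster by relevance.

    relevance = lam * log p(w | cluster) + (1 - lam) * log(p(w | cluster) / p(w))

    cluster_docs must be a subset of all_docs so that every cluster word
    also has a non-zero corpus frequency.
    """
    cluster_counts = Counter(w for doc in cluster_docs for w in doc)
    corpus_counts = Counter(w for doc in all_docs for w in doc)
    cluster_total = sum(cluster_counts.values())
    corpus_total = sum(corpus_counts.values())

    scores = {}
    for word, count in cluster_counts.items():
        p_cluster = count / cluster_total                # frequency inside the cluster
        p_corpus = corpus_counts[word] / corpus_total    # frequency over all documents
        scores[word] = (lam * math.log(p_cluster)
                        + (1 - lam) * math.log(p_cluster / p_corpus))
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Toy tokenized documents (hypothetical); the first two form one cluster.
docs = [
    ["oil", "price", "market"],
    ["oil", "barrel", "market"],
    ["match", "goal", "market"],
    ["goal", "team", "market"],
]
print(top_keywords(docs[:2], docs, n=2))
```

With lam close to 1 the ranking favors words that are simply frequent in the cluster; with lam close to 0 it favors words that are distinctive to the cluster, so a word such as "market" that appears everywhere drops down the list.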

Entity Recognition

Entity recognition provides further information for verifying cluster quality. We show not only the entities in each document but also their document frequency within the same cluster. To recognize entities, we use the spaCy library with its English model (en_core_web_sm, which includes vocabulary, syntax, and entities).
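
One reading of the per-cluster document frequency is sketched below. To keep the sketch runnable without the spaCy model, the entities are hard-coded as (text, label) pairs; in the tool they would come from spaCy. The function name entity_document_frequency and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def entity_document_frequency(doc_entities, doc_clusters):
    """Count, per cluster, how many documents mention each entity.

    doc_entities: {doc_id: [(entity_text, entity_label), ...]}
    doc_clusters: {doc_id: cluster_id}
    """
    freq = defaultdict(Counter)
    for doc_id, entities in doc_entities.items():
        cluster = doc_clusters[doc_id]
        # set() so an entity repeated within one document counts only once.
        freq[cluster].update(set(entities))
    return freq


# Hypothetical entities, as an entity recognizer might return them.
doc_entities = {
    "d1": [("OPEC", "ORG"), ("OPEC", "ORG"), ("Vienna", "GPE")],
    "d2": [("OPEC", "ORG")],
    "d3": [("Vienna", "GPE")],
}
doc_clusters = {"d1": 0, "d2": 0, "d3": 1}

freq = entity_document_frequency(doc_entities, doc_clusters)
print(freq[0].most_common())
```

An entity that appears in most documents of a cluster and rarely elsewhere is a hint that the cluster is coherent.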

Implementation

Overall architecture of application

The application is built with Electron as a cross-platform desktop app and packaged with PyInstaller. The backend is implemented with Flask (model training, feature reduction, and entity recognition), while JavaScript and D3.js are used for visualization.

Usage

For a development environment

Install all required packages: pip install -r ./requirements.txt
Copy the spaCy data file:
- Download the data file: python -m spacy download en
- Copy the data file ("<Spacy Data Path>/en_core_web_sm-2.0.0") to the "npl_data/en/en_core_web_sm-2.0.0" folder
Test the local server by running: python3 py_source/run_app.py
Open the web app in a browser at localhost:5000

Packaging for 64-bit Windows

PyInstaller
- Package the Python code with PyInstaller: pyinstaller -Fw --distpath ./ ./packaging_task/run_app.spec
- Run the binary file: ./run_flask.exe, or npm start
