Clustering_TextData_With_Document_Vectorization

Conducted a clustering analysis on textual data using K-means, an unsupervised machine learning algorithm. Preprocessed the data and vectorized the text first, because K-means computes the Euclidean distance between each record and the cluster centroids across the records' numeric fields, and Euclidean distance is not defined for raw text.
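To illustrate why numeric vectors are needed, here is a minimal sketch of K-means (Lloyd's algorithm) in plain NumPy. It is not the code from `main.py`, which is not shown here. The function name `kmeans`, the farthest-first initialisation, and the toy data are assumptions made for this example.

```python
import numpy as np

def kmeans(X, k, n_iter=50):
    """Plain Lloyd's algorithm. Returns (labels, centroids)."""
    X = np.asarray(X, dtype=float)
    # Assumption: deterministic farthest-first initialisation, starting at row 0.
    centroids = [X[0]]
    while len(centroids) < k:
        # Distance from each row to its nearest already-chosen centroid.
        nearest = np.min([np.linalg.norm(X - c, axis=1) for c in centroids], axis=0)
        centroids.append(X[nearest.argmax()])
    centroids = np.array(centroids)

    for _ in range(n_iter):
        # Euclidean distance from every row to every centroid: shape (n, k).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy numeric data: two well-separated groups of three points each.
X = np.array([[0.0, 0.1], [0.1, 0.0], [0.0, 0.0],
              [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]])
labels, centroids = kmeans(X, k=2)
```

Every step operates on rows of numbers, which is why the text has to be converted to vectors before clustering.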

Vectorized the text by calculating tf-idf (term frequency-inverse document frequency) values, producing a document-term matrix. Filtered out terms that are too frequent or too rare, keeping the terms that add meaning to the data.
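The vectorization and filtering step can be sketched with the standard library alone. This uses the textbook weighting tf × log(N / df), which may differ from the variant used in the project. The function name `tfidf_matrix`, the thresholds `min_df` and `max_df_ratio`, and the sample documents are all hypothetical.

```python
import math
from collections import Counter

def tfidf_matrix(docs, min_df=2, max_df_ratio=0.9):
    """Build a document-term matrix of tf-idf values.

    Terms found in fewer than `min_df` documents (too rare) or in more than
    `max_df_ratio` of all documents (too frequent) are dropped.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c / n_docs <= max_df_ratio)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        rows.append([(counts[t] / total) * math.log(n_docs / df[t])
                     for t in vocab])
    return vocab, rows

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "the stock market fell",
]
vocab, matrix = tfidf_matrix(docs)
```

In this toy corpus "the" appears in every document and is dropped as too frequent, while words that occur in a single document are dropped as too rare. Each row of `matrix` is one document and each column is one surviving term.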

Reduced the dimensionality of the document-term matrix with PCA in order to visualize the cluster solution across the principal components.
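As a rough sketch of the projection step, PCA can be computed from the singular value decomposition of the mean-centered matrix. This NumPy version is an illustration rather than the project's implementation. The function name `pca_project` and the small input matrix are assumptions.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the top principal components.

    Returns (coords, explained_ratio), where coords has shape
    (n_samples, n_components).
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)  # PCA requires mean-centered columns
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained_ratio[:n_components]

# Hypothetical stand-in for a small document-term matrix (6 docs, 4 terms).
X = np.array([
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.2, 0.1, 0.0],
    [0.7, 0.0, 0.0, 0.1],
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
    [0.0, 0.2, 0.7, 0.7],
])
coords, explained_ratio = pca_project(X, n_components=2)
```

The two columns of `coords` can then be used as the x and y axes of a scatter plot, with each point colored by its cluster label.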

Requirements: Python

