Group 8 Data Science Applications Clustering Assignment Project 2

A simple, well-designed structure is essential for any machine learning project. This project follows a template that combines simplicity with best practices for code structure and good code design. The main idea is that much of the work is the same every time you start a machine learning project, so wrapping the shared parts lets you change only the core idea each time.

So, here is a simple README template that helps you get into our project faster and focus on your own notes and explanations.

To reduce repeated code chunks, shorten the time needed to read the code, and improve flexibility and reusability, we used a functional programming structure: each problem in the project is split into functions, and those functions are reused in many places without duplicating code.

In this project, we selected books from different categories in the Gutenberg library, sampled random paragraphs from them, and labeled each paragraph with its book name as ground truth. After creating the dataset, we used several transformation techniques to turn the text into numbers for the modeling process (Word Embedding, LDA, TF-IDF, BOW, Coherence).
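The dataset and transformation steps described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the `sample_paragraphs` helper, the toy `books` dictionary, and its titles and text are all hypothetical, and only the BOW and TF-IDF transformations are shown, using scikit-learn's vectorizers.

```python
import random

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


def sample_paragraphs(books, n_per_book, seed=0):
    """Pick n_per_book random paragraphs per book and label each by its title.

    Hypothetical helper for illustration; `books` maps title -> list of paragraphs.
    """
    rng = random.Random(seed)
    texts, labels = [], []
    for title, paragraphs in books.items():
        for paragraph in rng.sample(paragraphs, n_per_book):
            texts.append(paragraph)
            labels.append(title)  # the book name serves as ground truth
    return texts, labels


# Toy stand-in for the Gutenberg books (made-up titles and text).
books = {
    "book_a": [
        "the whale swam in the sea",
        "the ship sailed the sea",
        "the captain hunted the whale",
    ],
    "book_b": [
        "the detective found a clue",
        "the clue led to the thief",
        "the thief fled the city",
    ],
}

texts, labels = sample_paragraphs(books, n_per_book=2)

# Bag of words: raw term counts per paragraph.
bow = CountVectorizer().fit_transform(texts)

# TF-IDF: term counts reweighted by how rare each term is across paragraphs.
tfidf = TfidfVectorizer().fit_transform(texts)

print(bow.shape, tfidf.shape)
```

Both vectorizers return sparse matrices with one row per sampled paragraph, which can be passed directly to most scikit-learn clustering estimators.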

After this, we tried several clustering algorithms (K-means, Expectation-Maximization (EM), and Hierarchical) and chose as champion the one that achieved the best kappa and silhouette scores.
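A comparison of this kind might look like the sketch below. It rests on several assumptions that are not taken from the notebook: synthetic blob data stands in for the text features, EM is represented by scikit-learn's `GaussianMixture`, hierarchical clustering by `AgglomerativeClustering`, and cluster ids are aligned to the true labels with Hungarian matching (the hypothetical `align_clusters` helper) before computing kappa.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import cohen_kappa_score, confusion_matrix, silhouette_score
from sklearn.mixture import GaussianMixture


def align_clusters(y_true, y_pred):
    """Relabel cluster ids so they agree as much as possible with y_true.

    Hypothetical helper: cluster ids are arbitrary, so kappa is only
    meaningful after matching them to the ground-truth labels.
    Assumes both arrays use integer labels 0..k-1.
    """
    cm = confusion_matrix(y_true, y_pred)    # rows: true, columns: predicted
    rows, cols = linear_sum_assignment(-cm)  # negate to maximize agreement
    mapping = {int(c): int(r) for r, c in zip(rows, cols)}
    return np.array([mapping[int(c)] for c in y_pred])


# Synthetic, well-separated data standing in for the embedded paragraphs.
X, y_true = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

models = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=0),
    "em": GaussianMixture(n_components=3, random_state=0),
    "hierarchical": AgglomerativeClustering(n_clusters=3),
}

results = {}
for name, model in models.items():
    y_pred = model.fit_predict(X)
    results[name] = {
        "kappa": cohen_kappa_score(y_true, align_clusters(y_true, y_pred)),
        "silhouette": silhouette_score(X, y_pred),
    }

# Assumed selection rule: rank by kappa, break ties with silhouette.
champion = max(results, key=lambda n: (results[n]["kappa"], results[n]["silhouette"]))
print(champion, results[champion])
```

Kappa needs the ground-truth labels, while the silhouette score depends only on the features and the cluster assignment, so the two measure different things and are reported side by side.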

Using a GPU is recommended to run the code much faster, but it works on a CPU too.

A GPU run takes around 40 minutes, while a CPU run may take hours.

Requirements

numpy (The fundamental package for scientific computing with Python)
pandas (A fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language)
Pytorch-Transformers (Library of pretrained transformer models for natural language processing in PyTorch)
sklearn (scikit-learn, a machine learning and data analysis library in Python)
nltk (The fundamental package for Natural Language Processing with Python)
matplotlib (A comprehensive library for creating static, animated, and interactive visualizations in Python)
seaborn (A Python data visualization library based on matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics)
gensim (Library for training vector embeddings, topic modelling, document indexing and similarity retrieval with large corpora)
pyldavis (Helps users interpret the topics in a topic model fitted to a corpus of text data; it extracts information from a fitted LDA topic model to build an interactive web-based visualization)
yellowbrick (Machine learning visualization)
wordcloud (A little word cloud generator in Python)
spacy (A free open-source library for Natural Language Processing in Python)

Run the Code

Upload the .ipynb notebook to "Google Colab" or open it in Anaconda "Jupyter Notebook".
Press "Run All" in the control panel, or "Restart Kernel and Run All", to run all of the code.
To run a single code cell on its own, press the run button that appears next to that cell.

Contributing

Any kind of enhancement or contribution is welcome.

