sanjay2097/Topic-Modeling-on-News-Articles

Topic-Modeling-on-News-Articles

Objective :

In this project, we applied Latent Dirichlet Allocation (LDA), an unsupervised machine learning algorithm, to a corpus of more than 2,200 BBC news articles. The articles cover a wide range of topics, including law, government, sports, entertainment and technology, so we built an LDA model capable of grouping them by topic.

LDA is a generative probabilistic model, which means it attempts to model the joint distribution of inputs and outputs in terms of latent variables. This is in contrast to discriminative models, which attempt to learn directly how inputs map to outputs.
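To make the generative view concrete, here is a minimal sketch of the process LDA assumes for a single document, using only the Python standard library. The vocabulary, the topic-word probabilities and the `alpha` values are made up for illustration and are not taken from the project's notebook.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet sample by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topic_word, vocabulary, alpha, length, rng):
    """Generate one document following LDA's generative story."""
    # 1. Draw the document's topic mixture theta ~ Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    words = []
    for _ in range(length):
        # 2. Draw a latent topic for this word position from theta.
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 3. Draw the observed word from that topic's word distribution.
        words.append(rng.choices(vocabulary, weights=topic_word[z])[0])
    return theta, words


# Hypothetical toy vocabulary and three hand-written topics (rows sum to 1).
vocabulary = ["match", "goal", "film", "award", "software", "phone"]
topic_word = [
    [0.45, 0.45, 0.03, 0.03, 0.02, 0.02],  # a sports-like topic
    [0.03, 0.03, 0.45, 0.45, 0.02, 0.02],  # an entertainment-like topic
    [0.02, 0.02, 0.03, 0.03, 0.45, 0.45],  # a technology-like topic
]

rng = random.Random(0)
theta, words = generate_document(
    topic_word, vocabulary, alpha=[0.5, 0.5, 0.5], length=12, rng=rng
)
print(theta)
print(words)
```

Fitting an LDA model runs this story in reverse: given only the words, it infers the topic mixtures and the topic-word distributions that plausibly produced them.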

Schematic-of-LDA-algorithm

You can use LDA for a variety of tasks, from clustering customers based on product purchases to automatic harmonic analysis in music. However, it is most commonly associated with topic modeling in text corpora. In that setting, observations are referred to as documents, the feature set as the vocabulary, a single feature as a word, and the resulting categories as topics.
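The terminology above can be illustrated with a small bag-of-words example. This is a sketch on three invented sentences, not the project's data: each document becomes a row of word counts over a shared vocabulary, which is the kind of input an LDA implementation typically consumes.

```python
from collections import Counter

# Three toy documents (observations); invented for illustration.
documents = [
    "the striker scored a late goal",
    "the film won an award",
    "the goal won the match",
]

# Split each document into words (features).
tokenized = [doc.split() for doc in documents]

# The vocabulary (feature set) is every distinct word in the corpus.
vocabulary = sorted({word for doc in tokenized for word in doc})
word_index = {word: i for i, word in enumerate(vocabulary)}


def to_count_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts.get(word, 0) for word in vocabulary]


# One row per document, one column per vocabulary word.
doc_term_matrix = [to_count_vector(tokens) for tokens in tokenized]

for row in doc_term_matrix:
    print(row)
```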

Project Files:

  • News Articles.zip - This file contains the dataset used for this project. It includes 2200 news articles grouped into categories.
  • Topic Modeling on News Articles - This is the PowerPoint presentation for the project. It includes various EDA plots made with Seaborn and Matplotlib, and a results chart for the implemented algorithms.
  • Topic Modeling on News Articles.ipynb - This notebook includes the feature descriptions, exploratory data analysis, data preprocessing and the LDA implementation.

Project Details :

Unprocessed Data

Screenshot 2022-06-12 170913

Processed Data

Screenshot 2022-06-12 170900
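The exact preprocessing steps used in the notebook are not listed here. As a rough sketch of what turning unprocessed text into processed tokens usually involves, the example below lowercases the text, keeps alphabetic tokens, and drops stopwords and very short words. The `STOPWORDS` set and the length threshold are illustrative assumptions.

```python
import re

# A tiny, hypothetical stopword list; real projects use a much longer one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was", "for", "on"}


def preprocess(text):
    """Lowercase, tokenize on letters, and drop stopwords and short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


print(preprocess("The Prime Minister was in talks on a new tax law."))
```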

Distribution of topics

download

Average word count for each topic

download (3)

Most frequent word distribution

download (3)

WordCloud of most frequent words

download (2)

Coherence score for number of topics

download (4)
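The coherence measure behind the plot is not stated here. As one illustration of how topic coherence can be scored, the sketch below implements the UMass measure, which rewards topics whose top words tend to appear in the same documents. The toy documents and topics are invented, and the project may well have used a different measure.

```python
import math


def umass_coherence(top_words, tokenized_docs):
    """UMass coherence for one topic; top_words are ordered most probable first."""
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            co_occurrences = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrences + 1) / denom)
    return score


# Hypothetical preprocessed documents.
docs = [
    ["goal", "match", "striker"],
    ["goal", "match", "referee"],
    ["film", "award", "actor"],
    ["film", "award", "director"],
]

coherent_topic = umass_coherence(["goal", "match"], docs)
mixed_topic = umass_coherence(["goal", "film"], docs)
print(coherent_topic, mixed_topic)
```

To choose the number of topics, one would typically train a model for each candidate count, average a coherence score over its topics, and prefer the count where the score peaks or levels off.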

Model References :

image

Conclusion :

We improved the classification of the topics. We were initially provided with 5 major topics, and using the LDA model we further divided them into major subtopics, making the choice of a topic more reliable. We clustered the given categories into 10 major sub-categories, for which we achieved a coherence score of 60%.

Scope :

Topic modelling covers a range of use cases. A few real-world examples are annotation, eDiscovery, content recommendation, search engine optimization and word sense disambiguation. This project provides an approach to using topic modelling to classify documents, and the resulting topics can be used in supervised learning models to make topic recommendations. Furthermore, we can use both kinds of information to build an NLP model in the future.

Credits :

  • Sanjay Yadav | Data Scientist

References :

https://cbail.github.io/SICSS_Topic_Modeling.html

https://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

About

Capstone Project 4
