Sentiment analysis using Supervised Deep Learning model | Devpost

Inspiration

How a machine processes data is a question I have had since childhood, but due to a lack of knowledge and resources, I could not explore this part of technology. I had only a rough idea about machine learning before enrolling in Ignition Hacks. I thought this competition would expand my tech stack and help me learn more about machine learning and deep learning. Machine learning will soon play a vital role in computer science, and having it as a skill will help me face a competitive world.

What it does

Sentiment Analysis is a program that interprets a sentence given by the user and tells us whether that sentence is positive or negative. It first applies pre-processing to remove inconsistencies from the text. A machine learning model, trained on the provided data set, then predicts the most likely sentiment.

How I built it

  1. Pre-processing. Before applying machine learning algorithms to the data, we need to make sure the text is free from ambiguity and noise. For example, in the sentence "Hello Adam!! Are you feeling good today??", punctuation such as "!" and "?" tells us nothing about the sentiment of the sentence but creates ambiguity in our program, so it must be removed. Stopwords such as "say" and "me" also play no role in deciding sentiment, so they are removed as well. Pre-processing is essential because it strips unnecessary data and cleans the text, reducing inconsistencies. To convert every line into its processed form, I wrote a function that uses bs4 to remove HTML tags and the contractions library to expand contractions in the text.
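A minimal sketch of this cleaning step, using only the standard library (the actual project uses bs4 and the contractions package; the tiny stopword list here is illustrative, not the one the project used):

```python
import re

# Small illustrative stopword list; a real pipeline would use a fuller one (e.g. NLTK's).
STOPWORDS = {"say", "me", "is", "are", "you", "a", "the"}

def preprocess(text: str) -> str:
    """Lowercase, strip HTML tags and punctuation, and drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text)   # crude HTML tag removal (bs4 does this robustly)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits
    tokens = [w for w in text.split() if w not in STOPWORDS]
    return " ".join(tokens)

print(preprocess("Hello Adam!! Are you feeling good today??"))
# hello adam feeling good today
```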

  2. Creating the tokenizer. After pre-processing, I tokenized the data using Keras's built-in Tokenizer. Individual words are called tokens, and splitting text into tokens is called tokenization. These tokens help the NLP model understand context: tokenization lets us interpret the meaning of the text by analyzing the sequence of words. For example, the text "It is raining" can be tokenized into 'It,' 'is,' and 'raining.'
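The core of what Keras's Tokenizer does can be sketched in plain Python (this toy class only mirrors the word-index behavior; the real Tokenizer also handles filters, OOV tokens, and counts):

```python
class SimpleTokenizer:
    """Toy re-implementation of the core of Keras's Tokenizer:
    builds a word -> integer index and turns texts into index sequences."""

    def __init__(self):
        self.word_index = {}

    def fit_on_texts(self, texts):
        for text in texts:
            for word in text.lower().split():
                if word not in self.word_index:
                    # Keras reserves index 0 for padding, so indices start at 1.
                    self.word_index[word] = len(self.word_index) + 1

    def texts_to_sequences(self, texts):
        return [[self.word_index[w] for w in t.lower().split() if w in self.word_index]
                for t in texts]

tok = SimpleTokenizer()
tok.fit_on_texts(["it is raining", "it is sunny"])
print(tok.texts_to_sequences(["it is raining"]))  # [[1, 2, 3]]
```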

  3. Embedding data. The purpose of the embedding layer is to map each token index to a dense vector of a fixed size. Compared with sparse one-hot encodings, these dense, learned vectors let the model capture similarity between words, which makes training more efficient.
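Conceptually, an embedding layer is a trainable lookup table from token indices to dense vectors. A stdlib-only sketch with made-up random vectors (Keras's Embedding layer learns these values during training):

```python
import random

random.seed(0)
EMBED_DIM = 4  # illustrative; real models often use 50-300 dimensions

# One dense vector per token index (index 0 reserved for padding).
vocab = {"it": 1, "is": 2, "raining": 3}
embedding_table = {i: [random.uniform(-1, 1) for _ in range(EMBED_DIM)]
                   for i in range(len(vocab) + 1)}

def embed(sequence):
    """Replace each token index with its dense vector."""
    return [embedding_table[i] for i in sequence]

vectors = embed([1, 2, 3])  # the tokenized "it is raining"
print(len(vectors), len(vectors[0]))  # 3 4
```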

  4. Creation of a neural network using Keras. We build a neural network that processes all of the collected data and predicts the output. The model uses an LSTM; LSTMs are a special kind of RNN, capable of learning long-term dependencies.
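The long-term memory of an LSTM comes from its gating equations. One cell step for scalar input and state, sketched in plain Python with untrained placeholder weights (Keras's LSTM layer implements the vectorized, trained version of exactly these equations):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One LSTM step for scalar input/state. w holds the weights for the
    forget (f), input (i), candidate (g), and output (o) gates."""
    f = sigmoid(w["f_x"] * x + w["f_h"] * h_prev + w["f_b"])    # forget gate
    i = sigmoid(w["i_x"] * x + w["i_h"] * h_prev + w["i_b"])    # input gate
    g = math.tanh(w["g_x"] * x + w["g_h"] * h_prev + w["g_b"])  # candidate values
    o = sigmoid(w["o_x"] * x + w["o_h"] * h_prev + w["o_b"])    # output gate
    c = f * c_prev + i * g   # cell state: old memory kept + new information added
    h = o * math.tanh(c)     # hidden state exposed to the next layer
    return h, c

# Illustrative constant weights, not learned values.
w = {k: 0.5 for k in ["f_x", "f_h", "f_b", "i_x", "i_h", "i_b",
                      "g_x", "g_h", "g_b", "o_x", "o_h", "o_b"]}
h, c = 0.0, 0.0
for x in [1.0, -1.0, 1.0]:   # a tiny input sequence
    h, c = lstm_step(x, h, c, w)
print(h, c)
```

The forget gate is what lets the cell carry information across many time steps, which is why LSTMs handle long-term dependencies better than plain RNNs.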

Challenges I ran into

I faced some issues while building the program, but I researched the problems and learned how to overcome them. When I tested the data file, I realized it was large, and Google Colab threw a runtime error, so I switched to the desktop version of Visual Studio Code, which solved the problem. While using the Universal Sentence Encoder in Visual Studio Code, I learned that Google had not released a Windows version of tensorflow_text, so I switched to a different encoder that does the same job.

Accomplishments that I'm proud of

I am delighted that I could learn and complete an entirely new project in two days.

What I learned

I learned about machine learning algorithms, including supervised and unsupervised learning, and how to train a machine learning model. This project gave me insight into NLP and its application in various areas of technology.

What's next for Sentiment analysis using the Supervised Deep Learning model

I would explore new models like ensemble stacking methods to improve accuracy. The model currently uses neural networks, and I want to try NN variants such as 1D CNNs and bidirectional LSTMs, as well as other time-series and NLP models, e.g., Hidden Markov Models, for better prediction. The TF and glove.6B sentence encoders were a bit slow for 600,000 tuples, so I want to try distributed computing frameworks like Hadoop for faster pre-processing.
