Details- Use a given dataset to build a model to predict the category using description. Write code in python. Using Jupyter notebook is encouraged.
- Show how you would clean and process the data
- Show how you would visualize this data
- Show how you would measure the accuracy of the model
- What ideas do you have to improve the accuracy of the model? What other algorithms would you try?
You have to clean this data, In the product category tree separate all the categories, figure out the primary category, and then use the model to predict this. If you want to remove some categories for lack of data, you are also free to do that.
- Goal is to predict the product category.
- Description should be the main feature. Feel free to use other features if it'd improve the model.
- Include a Readme.pdf file with approach in detail and report the accuracy and what models were used.
This project is about multi class classification using NLP. Here we are provided with flipkart dataset from which we have to predict the primary category using product description.
In order to reproduce the results produced by the notebook the following needs to be installed. Use of virtual environment while installing these libraries is preferable.
pip install -q tensorflow-text
pip install -q tf-models-official
pip install wordcloud
pip install gensim
pip install nltk
pip install spacy
pip install transformers
pip install wget
pip install transformers
pip install wget
pip install pandas
pip install numpy
pip install seaborn
pip install matplotlib
pip install torch
From the product_category_tree we consider the primary catergory as the root of this tree.
Example: For the category given tree '["Footwear >> Women's Footwear >> Ballerinas >> AW Bellies"]' the primary category is Footwear
Visulazation of Dataset is being done under this section
Data is being cleaned before going through Data Preparation
Data preprocessing is done in the following fashion.
Tokenization of the description of the product. (Post cleaning the description for the product).
Lemmatizing the tokenized data in-order to prepare it for usagein the model.
Here we are removing all the product categories for which less than 10 products are present.
So,after detailed analysis I have used BERT(Bidirectional Encoder Representations from Transformers) Model. It is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context.


