Inspiration
Keeping up with new information on COVID-19 can be a challenge for health professionals because of the vast volume of literature and the rapid spread of the virus. Is it possible to simplify the search for related publications by clustering similar research articles together? In what ways can the cluster content be qualified?
What it does
Using clustering for labeling and dimensionality reduction for visualization, we represent the collection of literature as a scatter plot. On this plot, publications on highly similar topics share a label and are plotted near each other. To find meaning in the clusters, topic modeling is performed to extract the keywords of each cluster.
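As a rough illustration of how keywords can be found for each cluster, the sketch below scores terms with a class-based TF-IDF weighting, similar in spirit to the weighting BERTopic uses. The function name `cluster_keywords`, the toy documents, and the labels are hypothetical and are not taken from our notebook.

```python
import math
from collections import Counter, defaultdict


def cluster_keywords(docs, labels, top_n=3):
    """Return the top_n highest-scoring terms for each cluster label."""
    # Treat every cluster as one large document and count its terms.
    per_cluster = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        per_cluster[label].update(doc.lower().split())

    # Frequency of each term across all clusters combined.
    total = Counter()
    for counts in per_cluster.values():
        total.update(counts)

    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(per_cluster)

    keywords = {}
    for label, counts in per_cluster.items():
        n_words = sum(counts.values())
        # Term frequency inside the cluster, damped by how common
        # the term is across the whole collection.
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[label] = ranked[:top_n]
    return keywords


# Toy example (made-up text, not from the dataset).
docs = [
    "vaccine trial efficacy vaccine",
    "vaccine dose trial",
    "mask transmission aerosol mask",
    "aerosol transmission indoor",
]
labels = [0, 0, 1, 1]
print(cluster_keywords(docs, labels, top_n=2))
```

Terms that are frequent inside one cluster but rare elsewhere rise to the top, which is what makes them usable as a short description of that cluster.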
How we built it
We used Kaggle for the datasets and notebooks. We used a sentence transformer to embed the articles, UMAP for dimensionality reduction, and K-Means for clustering. We also used BERTopic for topic modeling and visualization.
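The clustering step can be sketched without external libraries. The example below runs a plain k-means loop on toy 2-D points that stand in for UMAP-reduced sentence embeddings. It is not the code from our notebook, where library implementations did this work; the function name `kmeans`, the first-k initialization, and the data are illustrative assumptions.

```python
import math


def kmeans(points, k, n_iter=20):
    """Plain k-means on a list of coordinate tuples.

    Returns (labels, centroids).
    """
    # Assumption: the first k points serve as initial centroids,
    # which keeps the example deterministic.
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Toy 2-D points standing in for reduced embeddings (made-up values).
points = [(0.0, 0.1), (5.0, 5.1), (0.2, 0.0), (5.2, 4.9), (0.1, 0.2), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

Points that share a label here correspond to publications that would share a color on the scatter plot.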
Challenges we ran into
Some of the challenges we ran into were:
- The dataset was too large for us to process in full, so we selected 20,000 entries for the demonstration.
- A number of entries in the dataset were not in English.
- The articles contained many stopwords, and some other words appeared very frequently across articles.
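One way to handle these three issues is sketched below: draw a fixed-size random sample, drop entries that do not look English, and strip stopwords along with overly common words. The tiny stopword list, the `DOMAIN_COMMON` set, the stopword-ratio heuristic for detecting English, and all function names are simplifying assumptions for illustration. A real pipeline would more likely use a language-detection library and a fuller stopword list.

```python
import random
import re

# Tiny illustrative stopword list (assumption, far from complete).
STOPWORDS = {"the", "of", "and", "in", "a", "to", "is", "for", "with", "on"}
# Hypothetical set of words that are too common in this corpus to be useful.
DOMAIN_COMMON = {"covid", "19", "coronavirus", "study"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def looks_english(text, min_ratio=0.2):
    """Crude heuristic: enough of the tokens are English stopwords."""
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = sum(t in STOPWORDS for t in tokens)
    return hits / len(tokens) >= min_ratio


def clean(text):
    """Remove stopwords and overly common domain words."""
    kept = [
        t for t in tokenize(text)
        if t not in STOPWORDS and t not in DOMAIN_COMMON
    ]
    return " ".join(kept)


def preprocess(entries, sample_size, seed=0):
    """Sample entries, keep the English-looking ones, and clean them."""
    rng = random.Random(seed)
    sample = rng.sample(entries, min(sample_size, len(entries)))
    return [clean(e) for e in sample if looks_english(e)]


# Toy entries (made-up text, not from the dataset).
entries = [
    "The efficacy of the vaccine in a clinical trial",
    "Estudio sobre la transmisión del virus en hospitales",
    "Transmission of the coronavirus in hospitals",
]
print(sorted(preprocess(entries, sample_size=3)))
```

With the real data, `sample_size` would be 20000; the seed only makes the sample reproducible.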
Accomplishments that we're proud of
We worked with a huge dataset and built a tool that can help healthcare professionals. It was also our first time using BERTopic for modeling and visualization.
What we learned
We learnt data preprocessing for NLP. We also learnt how to use BERTopic for topic modeling.
What's next for SCLUVID
- We intend to deploy this as a web-based application for wider reach.
- We can also incorporate Elasticsearch to perform document searching based on clusters.
- Papers written in languages other than English can also be accounted for in the future.
Built With
- bertopic
- kaggle
- natural-language-processing
- python
- scikit-learn
- transformer
- umap