Inspiration

Keeping up with new information on COVID-19 can be a challenge for health professionals because of the vast volume of literature and the rapid spread of the virus. Is it possible to simplify the search for related publications by clustering similar research articles together? In what ways can the cluster content be qualified?

What it does

Using clustering for labeling and dimensionality reduction for visualization, the collection of literature can be represented as a scatter plot. On this plot, publications on highly similar topics share a label and are plotted near each other. To find meaning in the clusters, topic modeling is performed to extract the keywords of each cluster.
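
As a rough illustration of how keywords can be pulled out of each cluster, here is a minimal sketch of a class-based TF-IDF score, the kind of weighting BERTopic applies to clusters. It is a simplified stand-in written with the standard library only, not the code from our notebook; the function name `cluster_keywords` and the toy inputs are made up for the example.

```python
import math
import re
from collections import Counter


def cluster_keywords(docs, labels, top_n=3):
    """Return the top_n highest-scoring terms for each cluster label."""
    # Merge all documents of a cluster into a single bag of words.
    class_counts = {}
    for doc, label in zip(docs, labels):
        tokens = re.findall(r"[a-z]+", doc.lower())
        class_counts.setdefault(label, Counter()).update(tokens)

    # Frequency of each term over all clusters, and average words per cluster.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    avg_words = sum(total_counts.values()) / len(class_counts)

    keywords = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        # Term frequency inside the cluster, damped for terms common everywhere.
        scores = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords[label] = ranked[:top_n]
    return keywords


if __name__ == "__main__":
    toy_docs = [
        "vaccine trial vaccine efficacy",
        "vaccine dose trial",
        "mask transmission aerosol mask",
        "aerosol transmission mask study",
    ]
    print(cluster_keywords(toy_docs, [0, 0, 1, 1], top_n=2))
```

Terms that are frequent inside one cluster but rare in the others get the highest scores, which is what makes them usable as cluster labels.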

How we built it

We used Kaggle for the dataset and the notebook. We used a sentence transformer and UMAP along with K-Means for clustering. We also used BERTopic for topic modeling and visualization.
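
The clustering steps named above can be sketched roughly as follows. This is an outline under assumptions rather than the notebook itself: the model name `all-MiniLM-L6-v2`, the number of clusters, the 5-dimensional intermediate reduction, and the function name `cluster_abstracts` are all illustrative choices. It needs the `sentence-transformers`, `umap-learn` and `scikit-learn` packages.

```python
def cluster_abstracts(texts, n_clusters=10, model_name="all-MiniLM-L6-v2", seed=42):
    """Embed texts, reduce them with UMAP, and label them with K-Means.

    Returns (labels, coords_2d): one cluster label per text, and 2-D
    coordinates that can be drawn as a scatter plot.
    """
    if len(texts) < n_clusters:
        raise ValueError("need at least as many texts as clusters")

    # Third-party imports are local so the module loads without them installed.
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans
    import umap

    # One dense vector per text (model name is an assumption).
    embeddings = SentenceTransformer(model_name).encode(texts, show_progress_bar=False)

    # Reduce to a few dimensions before clustering (5 is an arbitrary choice).
    reduced = umap.UMAP(n_components=5, random_state=seed).fit_transform(embeddings)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(reduced)

    # A separate 2-D projection, used only for plotting.
    coords_2d = umap.UMAP(n_components=2, random_state=seed).fit_transform(embeddings)
    return labels, coords_2d
```

Calling `cluster_abstracts(list_of_abstracts, n_clusters=20)` would return labels for coloring the points and coordinates for placing them on the plot.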

Challenges we ran into

Some of the challenges we ran into were:

  • The dataset was too large for us to process in full, so we chose 20,000 entries for the demonstration.
  • A number of entries in the dataset were not in English.
  • The articles contained many stopwords, and some other words were very common across the articles.
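
One simple way to handle these three issues is sketched below, using only the standard library. It is an illustration, not the filtering used in the notebook: the word lists are tiny placeholders, the English check is a crude stopword-ratio heuristic, and the names `looks_english` and `preprocess` are made up for the example.

```python
import re

# Tiny illustrative lists; a real run would use much fuller ones.
STOPWORDS = {"the", "of", "and", "in", "to", "a", "is", "for", "with", "on", "that", "was", "were"}
DOMAIN_WORDS = {"covid", "coronavirus", "sars", "cov", "virus", "patients", "study"}


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


def looks_english(text, min_ratio=0.2):
    """Heuristic: English prose contains a fair share of English stopwords."""
    tokens = tokenize(text)
    if not tokens:
        return False
    hits = sum(token in STOPWORDS for token in tokens)
    return hits / len(tokens) >= min_ratio


def preprocess(texts, limit=20000):
    """Keep up to `limit` English-looking texts, minus stopwords and overly common words."""
    cleaned = []
    for text in texts:
        if len(cleaned) >= limit:
            break
        if not looks_english(text):
            continue
        kept = [t for t in tokenize(text) if t not in STOPWORDS and t not in DOMAIN_WORDS]
        cleaned.append(" ".join(kept))
    return cleaned


if __name__ == "__main__":
    sample = [
        "The spread of the virus in the city was rapid.",
        "El virus se propaga rápidamente en la ciudad.",
    ]
    print(preprocess(sample))
```

The stripped text is meant for keyword extraction; a dedicated language-detection library would be more reliable than the stopword ratio used here.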

Accomplishments that we're proud of:

We worked with a huge dataset and built a tool that can help healthcare professionals. It was also our first time using BERTopic for modeling and visualization.

What we learned

We learnt data preprocessing for NLP. We also learnt how to use BERTopic for topic modeling.

What's next for SCLUVID:

  • Further, we intend to deploy this as a web-based application for better reach.
  • We can also incorporate Elasticsearch to perform document searching based on clusters.
  • Papers written in languages other than English can also be accounted for in the future.
