Shubham37204/Skimlit

🧠🔬 SkimLit: Medical Abstract Sentence Classification (End-to-End NLP System)

A production-style Deep Learning NLP + Streamlit application that reads medical research abstracts and classifies each sentence into its functional role:

BACKGROUND • OBJECTIVE • METHODS • RESULTS • CONCLUSIONS

Unlike typical ML projects, this system:

  • Trains the model from scratch inside the app
  • Downloads and processes real-world data (~200K sentences)
  • Provides a fully interactive UI for inference

🚀 Key Highlights

  • 🔥 End-to-end pipeline (data → training → inference → UI)
  • 🧠 Advanced Tribrid Architecture (Token + Character + Positional)
  • 📊 Real-world dataset (PubMed RCT)
  • 🌐 Interactive Streamlit interface
  • ⚡ No pre-trained classifier checkpoint required: the app trains its own model on first run (the pre-trained Universal Sentence Encoder is still downloaded)
  • 🧩 Multi-input deep learning model (rare in beginner projects)

📌 Problem Statement

Medical abstracts are structured but unlabelled in raw form.

Example:

"This study evaluates..."
"We conducted a randomized trial..."
"Results showed improvement..."

➡️ Hard to scan quickly.

✅ Goal:

Automatically classify each sentence into:

  • BACKGROUND
  • OBJECTIVE
  • METHODS
  • RESULTS
  • CONCLUSIONS

📊 Dataset

  • 📚 PubMed 200K RCT Dataset (the app works with the 20k-abstract subset, PubMed_20k_RCT_numbers_replaced_with_at_sign)
  • 🔢 ~200,000 labeled sentences
  • 🏷️ 5 structured classes

Dataset Structure

pubmed-rct/
└── PubMed_20k_RCT_numbers_replaced_with_at_sign/
    ├── train.txt   (~180K samples)
    ├── dev.txt     (~30K samples)
    └── test.txt
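
As a rough illustration of the parsing step, the sketch below turns raw lines from one of these files into one record per sentence. It assumes the usual PubMed RCT layout: a `###<id>` line opens each abstract, each sentence line is `LABEL<TAB>text`, and a blank line closes the abstract. The function name `parse_rct_lines`, the record keys, and the sample lines are illustrative and not taken from `app.py`.

```python
def parse_rct_lines(lines):
    """Turn raw PubMed RCT lines into one dict per sentence (illustrative sketch)."""
    samples = []
    abstract = []  # (label, text) pairs of the abstract currently being read

    def flush():
        # Emit every sentence of the finished abstract with its position features.
        total = len(abstract)
        for number, (label, text) in enumerate(abstract):
            samples.append({
                "target": label,
                "text": text,
                "line_number": number,   # zero-based position in the abstract
                "total_lines": total,    # sentence count of the abstract
            })
        abstract.clear()

    for raw in lines:
        line = raw.strip()
        if line.startswith("###"):       # abstract ID line: start a new abstract
            flush()
        elif not line:                   # blank line: abstract is complete
            flush()
        elif "\t" in line:               # "LABEL<TAB>sentence text"
            label, text = line.split("\t", 1)
            abstract.append((label, text))
    flush()                              # handle a file without a trailing blank line
    return samples


# Made-up sample lines in the assumed format (not real dataset content).
sample = [
    "###12345678\n",
    "OBJECTIVE\tTo test whether treatment A improves outcome B .\n",
    "METHODS\tA total of @ patients were randomized .\n",
    "\n",
]
rows = parse_rct_lines(sample)
```

The `line_number` and `total_lines` fields are what the positional branch of the model consumes; the notebook may index them differently (for example, a zero-based total).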

🛠️ Models Implemented

🔹 Traditional Baseline

  • TF-IDF + Logistic Regression

🔹 Deep Learning Models

  • 1D CNN
  • LSTM
  • Hybrid (Token + Character embeddings)

🔹 Final Model (Best Performing)

  • ✅ Tribrid Model

    • Token embeddings (semantic understanding)
    • Character embeddings (morphological patterns)
    • Positional embeddings (structure awareness)

🧬 Tribrid Model: Detailed Architecture

Input: Sentence
        │
        ├── Token Branch
        │     └── Universal Sentence Encoder (512-d)
        │     └── Dense Layer
        │
        ├── Character Branch
        │     └── TextVectorization
        │     └── Embedding (char-level)
        │     └── BiLSTM
        │
        └── Positional Branch
              ├── Line Number (one-hot)
              └── Total Lines (one-hot)

                     ↓
              Concatenation
                     ↓
              Dense → Dropout
                     ↓
              Softmax (5 classes)
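
To make the four inputs in the diagram concrete, here is a minimal sketch of how one sentence could be turned into the values each branch expects. The helper names, the dictionary keys, and the one-hot depths (15 and 20) are assumptions for illustration; the actual values in `app.py` may differ. Out-of-range positions are clipped into the last slot here, which is a design choice of this sketch.

```python
LINE_NUMBER_DEPTH = 15   # assumed one-hot size for the line number
TOTAL_LINES_DEPTH = 20   # assumed one-hot size for the total line count


def one_hot(index, depth):
    """One-hot list of length `depth`; out-of-range indices are clipped."""
    clipped = min(max(index, 0), depth - 1)
    return [1.0 if i == clipped else 0.0 for i in range(depth)]


def split_chars(text):
    """Space-separate characters so a text vectorizer can treat each as a token."""
    return " ".join(text)


def build_inputs(text, line_number, total_lines):
    """Assemble the per-sentence inputs for the token, character and positional branches."""
    return {
        "token_input": text,                       # raw string for the sentence encoder
        "char_input": split_chars(text),           # character-level view of the same text
        "line_number_input": one_hot(line_number, LINE_NUMBER_DEPTH),
        "total_lines_input": one_hot(total_lines, TOTAL_LINES_DEPTH),
    }


example = build_inputs("Results showed improvement.", line_number=2, total_lines=3)
```

Feeding the position of a sentence as a separate input is what lets the model learn, for instance, that early sentences tend to be BACKGROUND and late ones CONCLUSIONS.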

⚙️ Training Configuration

  • Loss: CategoricalCrossentropy (label_smoothing=0.2)
  • Optimizer: Adam
  • Output: Multi-class classification (5 labels)
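
The effect of `label_smoothing=0.2` can be worked through by hand. With 5 classes, each one-hot target `y` becomes `y * (1 - 0.2) + 0.2 / 5`, so the true class gets 0.84 and every other class gets 0.04. The plain-Python sketch below reproduces that arithmetic without TensorFlow; the function names are illustrative.

```python
import math


def smooth_labels(one_hot, label_smoothing=0.2):
    """Apply label smoothing: y * (1 - s) + s / num_classes."""
    num_classes = len(one_hot)
    return [y * (1.0 - label_smoothing) + label_smoothing / num_classes
            for y in one_hot]


def cross_entropy(target, probs, eps=1e-7):
    """Categorical cross-entropy between a target distribution and predictions."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, probs))


target = smooth_labels([0, 0, 1, 0, 0])        # true class is index 2
confident = [0.01, 0.01, 0.96, 0.01, 0.01]     # made-up model output
loss = cross_entropy(target, confident)
```

Because the smoothed target never reaches 1.0, the loss keeps penalizing predictions that are extremely confident, which tends to discourage overconfident outputs.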

🌐 Application Workflow

🧪 Phase 1: Training

When no model exists:

  1. Clone dataset from GitHub (~177MB)
  2. Parse and preprocess ~200K sentences
  3. Encode labels (sklearn)
  4. Build vectorization layers
  5. Download Universal Sentence Encoder (~1GB, cached)
  6. Train tribrid model
  7. Save model → skimlit_tribrid_model.keras

⏱️ CPU Training Time: ~10–20 min per epoch

🔍 Phase 2: Inference

  1. Paste medical abstract

  2. Click Classify Sentences

  3. Pipeline:

    • Sentence splitting (spaCy / fallback regex)
    • Multi-input preprocessing
    • Model prediction
  4. Output:

    • Label per sentence
    • Confidence score
    • Clean UI display
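
The inference steps above can be sketched as follows: split the abstract into sentences with spaCy when the `en_core_web_sm` model is available, fall back to a simple regex otherwise, and turn a probability vector into a label plus confidence. The regex, the function names, and the label order are assumptions for illustration. The order shown is the alphabetical order that scikit-learn's `LabelEncoder` produces, so it should be checked against the encoder actually used in training.

```python
import re

# Split after ., ! or ? when followed by whitespace and a capital letter or digit.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9])")


def split_regex(text):
    """Fallback splitter; it can mis-split after abbreviations such as 'e.g.'."""
    return [s.strip() for s in _SENTENCE_END.split(text.strip()) if s.strip()]


def split_sentences(text):
    """Use spaCy if it and its English model are installed, else the regex."""
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
    except (ImportError, OSError):   # spaCy missing, or model not downloaded
        return split_regex(text)
    return [s.text.strip() for s in nlp(text).sents if s.text.strip()]


# Assumed class order (alphabetical, as produced by sklearn's LabelEncoder).
LABELS = ["BACKGROUND", "CONCLUSIONS", "METHODS", "OBJECTIVE", "RESULTS"]


def top_label(probs, labels=LABELS):
    """Return (label, confidence) for one softmax output vector."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


abstract = ("This study evaluates a new therapy. We conducted a randomized trial. "
            "Results showed improvement.")
sentences = split_regex(abstract)
label, confidence = top_label([0.05, 0.05, 0.70, 0.10, 0.10])  # made-up probabilities
```

In a real app the spaCy pipeline would be loaded once and reused rather than on every call.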

📂 Project Structure

project/
│
├── app.py                        # Streamlit app (training + inference)
├── SkimLit_1.ipynb               # Original model development
├── requirements.txt
├── README.md
├── sample_abstracts.txt
│
├── skimlit_tribrid_model.keras   # Generated after training
│
└── pubmed-rct/                   # Auto-downloaded dataset

⚙️ Installation Guide

1️⃣ Clone Repository

git clone <repository-url>
cd <project-folder>

2️⃣ Install Dependencies

pip install -r requirements.txt

3️⃣ Install spaCy Model (Recommended)

python -m spacy download en_core_web_sm

4️⃣ Run Application

streamlit run app.py

⚡ Runtime Behavior

First Run

  • Dataset download
  • Model training
  • Model saved locally

Subsequent Runs

  • Saved model loads from disk (no dataset download)
  • No retraining
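
The first-run versus later-run behavior comes down to a check for the saved model file. A minimal sketch of that pattern is below. `get_model`, `load_fn` and `train_fn` are hypothetical names, and the two callables stand in for whatever loading and training code `app.py` actually uses.

```python
from pathlib import Path

MODEL_PATH = Path("skimlit_tribrid_model.keras")


def get_model(load_fn, train_fn, path=MODEL_PATH):
    """Load the saved model if present; otherwise train, save and return it.

    load_fn(path)  -> model             (hypothetical loader)
    train_fn(path) -> model, saved to `path` as a side effect (hypothetical trainer)
    Returns (model, "loaded") or (model, "trained").
    """
    if path.exists():
        return load_fn(path), "loaded"
    model = train_fn(path)
    return model, "trained"
```

In a Streamlit app, a function like this is typically wrapped in a caching decorator such as `st.cache_resource` so the model is not reloaded on every rerun; whether `app.py` does so is not stated here.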

📈 Results & Insights

  • ✅ Deep learning models outperform traditional ML
  • ✅ Character embeddings improve robustness
  • ✅ Positional embeddings add structural understanding
  • ✅ Tribrid model achieves best performance

🧪 Sample Inputs

Available in:

sample_abstracts.txt

Includes real-world examples:

  • Diabetes intervention
  • COVID-19 vaccine study
  • Hypertension trial
  • Depression treatment
  • Sleep study

🏷️ Label Reference

| Label       | Meaning              |
|-------------|----------------------|
| BACKGROUND  | Context / motivation |
| OBJECTIVE   | Research goal        |
| METHODS     | Experiment design    |
| RESULTS     | Findings             |
| CONCLUSIONS | Interpretation       |

💻 Tech Stack

  • Python
  • TensorFlow
  • TensorFlow Hub
  • Scikit-learn
  • NumPy, Pandas
  • Streamlit
  • spaCy

⚠️ Requirements

  • Python 3.10 / 3.11
  • TensorFlow 2.15+
  • Git installed

About

Deep learning NLP project to classify ~200K+ PubMed abstract sentences into structured sections using TensorFlow (CNN, LSTM, hybrid embeddings).
