A production-style Deep Learning NLP + Streamlit application that reads medical research abstracts and classifies each sentence into its functional role:
BACKGROUND • OBJECTIVE • METHODS • RESULTS • CONCLUSIONS
Unlike typical ML projects, this system:
- Trains the model from scratch inside the app
- Downloads and processes real-world data (~200K sentences)
- Provides a fully interactive UI for inference
- 🔥 End-to-end pipeline (data → training → inference → UI)
- 🧠 Advanced tribrid architecture (token + character + positional embeddings)
- Real-world dataset (PubMed RCT)
- Interactive Streamlit interface
- ⚡ No pre-trained local model required (self-training system)
- 🧩 Multi-input deep learning model (rare in beginner projects)
Medical abstracts are structured but unlabelled in raw form.
Example:
"This study evaluates..."
"We conducted a randomized trial..."
"Results showed improvement..."
➡️ Hard to scan quickly.
Automatically classify each sentence into:
- BACKGROUND
- OBJECTIVE
- METHODS
- RESULTS
- CONCLUSIONS
- PubMed 200K RCT Dataset
- ~200,000 labeled sentences
- 🏷️ 5 structured classes
```
pubmed-rct/
└── PubMed_20k_RCT_numbers_replaced_with_at_sign/
    ├── train.txt   (~180K samples)
    ├── dev.txt     (~30K samples)
    └── test.txt
```
- TF-IDF + Logistic Regression
- 1D CNN
- LSTM
- Hybrid (Token + Character embeddings)
- ✅ Tribrid Model
- Token embeddings (semantic understanding)
- Character embeddings (morphological patterns)
- Positional embeddings (structure awareness)
```
Input: Sentence
   │
   ├── Token Branch
   │     ├── Universal Sentence Encoder (512-d)
   │     └── Dense Layer
   │
   ├── Character Branch
   │     ├── TextVectorization
   │     ├── Embedding (char-level)
   │     └── BiLSTM
   │
   └── Positional Branch
         ├── Line Number (one-hot)
         └── Total Lines (one-hot)
   │
   Concatenation
   │
   Dense → Dropout
   │
   Softmax (5 classes)
```
- Loss: `CategoricalCrossentropy` with `label_smoothing=0.2`
- Optimizer: `Adam`
- Output: multi-class classification (5 labels)
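The three branches and the training configuration above can be sketched in Keras roughly as follows. Layer widths, the character vocabulary size, and the one-hot depths are illustrative assumptions, and the 512-d token embeddings are assumed to be precomputed (e.g. by the Universal Sentence Encoder) so the sketch stays self-contained:

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 5       # BACKGROUND, OBJECTIVE, METHODS, RESULTS, CONCLUSIONS
MAX_LINE_NUMBER = 15  # assumed one-hot depth for the line-number feature
MAX_TOTAL_LINES = 20  # assumed one-hot depth for the total-lines feature

# Token branch: 512-d sentence embeddings (assumed precomputed by USE).
token_in = layers.Input(shape=(512,), name="token_embeddings")
token_out = layers.Dense(128, activation="relu")(token_in)

# Character branch: pre-vectorized char-id sequences -> Embedding -> BiLSTM.
char_in = layers.Input(shape=(None,), dtype="int32", name="char_ids")
char_emb = layers.Embedding(input_dim=70, output_dim=25, mask_zero=True)(char_in)
char_out = layers.Bidirectional(layers.LSTM(32))(char_emb)

# Positional branch: one-hot line number and total lines.
line_in = layers.Input(shape=(MAX_LINE_NUMBER,), name="line_number_onehot")
total_in = layers.Input(shape=(MAX_TOTAL_LINES,), name="total_lines_onehot")

# Concatenate all branches, then Dense -> Dropout -> Softmax.
merged = layers.Concatenate()([token_out, char_out, line_in, total_in])
x = layers.Dense(256, activation="relu")(merged)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = tf.keras.Model([token_in, char_in, line_in, total_in], out)
model.compile(
    loss=tf.keras.losses.CategoricalCrossentropy(label_smoothing=0.2),
    optimizer=tf.keras.optimizers.Adam(),
    metrics=["accuracy"],
)
```

In the real app the `TextVectorization` layer sits inside the model; it is factored out here only to keep the sketch short.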
When no model exists:
- Clone dataset from GitHub (~177MB)
- Parse and preprocess ~200K sentences
- Encode labels (sklearn)
- Build vectorization layers
- Download Universal Sentence Encoder (~1GB, cached)
- Train tribrid model
- Save model → `skimlit_tribrid_model.keras`

⏱️ CPU Training Time: ~10–20 min per epoch
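The parsing step above works on the dataset's plain-text layout: each abstract starts with a `###ID` header line, is followed by one `LABEL<TAB>sentence` line per sentence, and ends with a blank line. A minimal parser that also derives the positional features (line number, total lines) might look like this; it is a sketch, not the app's exact code:

```python
def parse_pubmed_rct(text):
    """Parse raw PubMed RCT file text into per-sentence records with
    positional features (line number and total lines per abstract)."""
    samples = []
    abstract = []
    for line in text.splitlines():
        if line.startswith("###"):      # abstract ID -> start a new abstract
            abstract = []
        elif line.strip() == "":        # blank line -> abstract is complete
            for i, (label, sentence) in enumerate(abstract):
                samples.append({
                    "label": label,
                    "text": sentence,
                    "line_number": i,
                    "total_lines": len(abstract),
                })
            abstract = []
        else:                           # "LABEL\tsentence" record
            label, _, sentence = line.partition("\t")
            abstract.append((label, sentence))
    return samples
```

A file that does not end with a blank line would drop its last abstract; the dataset files terminate each abstract with one, so the sketch ignores that edge case.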
1. Paste a medical abstract
2. Click **Classify Sentences**
3. Pipeline runs:
   - Sentence splitting (spaCy / fallback regex)
   - Multi-input preprocessing
   - Model prediction
4. Output:
   - Label per sentence
   - Confidence score
   - Clean UI display
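The spaCy-with-regex-fallback splitting step can be sketched as follows; the fallback pattern is an illustrative heuristic, not necessarily the app's exact regex:

```python
import re

def split_sentences(abstract_text):
    """Split an abstract into sentences, preferring spaCy and falling
    back to a simple regex when the spaCy model is unavailable."""
    try:
        import spacy
        nlp = spacy.load("en_core_web_sm")
        return [s.text.strip() for s in nlp(abstract_text).sents]
    except Exception:
        # Fallback: split after ., !, or ? when followed by whitespace
        # and a capital letter (avoids splitting on decimals like "p<0.05").
        parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", abstract_text.strip())
        return [p.strip() for p in parts if p.strip()]
```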
```
project/
│
├── app.py                          # Streamlit app (training + inference)
├── SkimLit_1.ipynb                 # Original model development
├── requirements.txt
├── README.md
├── sample_abstracts.txt
│
├── skimlit_tribrid_model.keras    # Generated after training
│
└── pubmed-rct/                    # Auto-downloaded dataset
```
```bash
git clone <repo-url>
cd <project-folder>
pip install -r requirements.txt
python -m spacy download en_core_web_sm
streamlit run app.py
```

First run:
- Dataset download
- Model training
- Model saved locally

Later runs:
- Model loads instantly
- No retraining
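The first-run vs. later-runs behaviour comes down to checking whether the saved model file exists. A framework-agnostic sketch of that caching rule, where `load_fn` and `train_and_save_fn` are placeholders for the real Keras load/train calls:

```python
from pathlib import Path

MODEL_PATH = Path("skimlit_tribrid_model.keras")

def load_or_train(load_fn, train_and_save_fn, model_path=MODEL_PATH):
    """Load the saved model if present; otherwise run the full
    pipeline (dataset download + training) and save the result."""
    if model_path.exists():
        return load_fn(model_path)         # later runs: instant load
    return train_and_save_fn(model_path)   # first run: train and save
```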
- ✅ Deep learning models outperform traditional ML
- ✅ Character embeddings improve robustness
- ✅ Positional embeddings add structural understanding
- ✅ Tribrid model achieves the best performance
Available in:
sample_abstracts.txt
Includes real-world examples:
- Diabetes intervention
- COVID-19 vaccine study
- Hypertension trial
- Depression treatment
- Sleep study
| Label | Meaning |
|---|---|
| BACKGROUND | Context / motivation |
| OBJECTIVE | Research goal |
| METHODS | Experiment design |
| RESULTS | Findings |
| CONCLUSIONS | Interpretation |
- Python
- TensorFlow
- TensorFlow Hub
- Scikit-learn
- NumPy, Pandas
- Streamlit
- spaCy
- Python 3.10 / 3.11
- TensorFlow 2.15+
- Git installed