End-to-end sentiment analysis benchmark comparing three ML approaches on customer feedback data: Classical ML (TF-IDF + SVM/LogReg/NB), Fine-tuned BERT (HuggingFace Transformers), and LLM zero/few-shot (Azure OpenAI GPT-4o). Includes training pipelines, evaluation framework, REST API, and interactive dashboard.
Evaluated on two datasets: German Sentiment (8.7K samples, 3-class) and Yelp Reviews (650K English reviews, 5-star → 3-class).
Live Dashboard | GitHub Repo — Interactive results explorer with model comparison, confusion matrices, error analysis, and live sentiment classifier.
All models evaluated on the same test set (stratified split, seed=42).
| Model | F1 (weighted) | Accuracy | Latency (ms/sample) | Cost/1K ($) |
|---|---|---|---|---|
| gbert-base (fine-tuned) | 0.9119 | 0.9128 | 1.3 | 0 |
| SVM (TF-IDF) | 0.8562 | 0.8725 | 0.2 | 0 |
| Logistic Regression (TF-IDF) | 0.8505 | 0.8503 | 0.1 | 0 |
| Naive Bayes (TF-IDF) | 0.7956 | 0.8584 | 0.1 | 0 |
| GPT-4o-mini (zero-shot) | 0.7517 | 0.6700 | 966 | 0.19 |
| GPT-4o (zero-shot) | 0.6808 | 0.5850 | 1050 | 3.22 |
| GPT-4o (few-shot) | 0.5359 | 0.4450 | 1878 | 2.80 |
| GPT-4o-mini (few-shot) | 0.5054 | 0.4200 | 1889 | 0.17 |
Yelp: 5-star ratings mapped to 3 classes (1-2★ negative, 3★ neutral, 4-5★ positive); 10K stratified training subset.
| Model | F1 (weighted) | Accuracy | ROC-AUC | Latency (ms/sample) |
|---|---|---|---|---|
| bert-base-uncased (fine-tuned) | 0.7607 | 0.7576 | 0.9135 | 1.4 |
| Logistic Regression (TF-IDF) | 0.7431 | 0.7393 | 0.9022 | 0.4 |
| SVM (TF-IDF) | 0.7375 | 0.7574 | 0.9028 | 0.3 |
| Naive Bayes (TF-IDF) | 0.6340 | 0.7129 | 0.8671 | 0.3 |
- Fine-tuned BERT wins on both datasets: German F1=0.912 (+5.6pp vs. best classical), Yelp F1=0.761 (+1.8pp)
- Classical ML with TF-IDF features outperforms zero-shot LLMs on the German dataset
- On Yelp, BERT's advantage over classical ML shrinks — Logistic Regression (F1=0.743) nearly matches BERT (F1=0.761)
- Few-shot prompting surprisingly hurts LLM performance vs. zero-shot
- GPT-4o is not significantly better than GPT-4o-mini for this task but costs 16x more
- BERT achieves near-real-time latency (1.3ms/sample on T4 GPU) at zero marginal cost
- Neutral class is hardest across all models on Yelp (BERT neutral F1=0.496, LogReg neutral F1=0.489) due to class imbalance (~20%)
- Identical test sets: All models evaluated on the exact same test split (stratified sampling with seed=42)
- Multi-seed evaluation: Classical ML supports training across 3 seeds (42, 123, 456) to report mean +/- std
- Learning curves: Validate that 10K training samples are sufficient (F1 plateaus beyond 5K for most classifiers)
- Consistent label mapping: Yelp 5-star ratings mapped to 3-class (1-2: negative, 3: neutral, 4-5: positive) following Zhang et al. (2015). This is a deliberate design choice for consistency with the German Sentiment dataset (also 3-class). Trade-off: star-3 is only ~20% of Yelp data, creating class imbalance.
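The label mapping and seeded stratified split described above can be sketched with pandas and scikit-learn. The column names (`text`, `stars`) and toy rows are illustrative, not the project's actual schema:

```python
# Sketch of the 5-star -> 3-class mapping and the stratified split (seed=42).
import pandas as pd
from sklearn.model_selection import train_test_split

def map_stars_to_sentiment(stars: int) -> str:
    """Map Yelp 1-5 star ratings to 3 sentiment classes."""
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"

df = pd.DataFrame({
    "text": ["Great!", "Awful.", "It was okay.", "Loved it", "Terrible", "Meh"],
    "stars": [5, 1, 3, 4, 2, 3],
})
df["label"] = df["stars"].apply(map_stars_to_sentiment)

# Stratified split with a fixed seed so every model sees the same test set
train_df, test_df = train_test_split(
    df, test_size=0.5, stratify=df["label"], random_state=42
)
```

Stratifying on the mapped label keeps the ~20% neutral share identical across train and test, so the class-imbalance trade-off affects all models equally.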
```bash
# Train with 3 seeds and report mean +/- std
python run_training.py --model classical --classifier svm --dataset german --multi-seed

# Compute learning curve (F1 at 500/1K/2K/5K/10K training samples)
python run_training.py --model classical --classifier svm --dataset german --learning-curve
```

```
customer-feedback-intelligence/
├── src/
│   ├── data/            # Dataset loading + preprocessing
│   ├── models/          # Classical ML, BERT, LLM classifiers
│   ├── training/        # Training orchestration + Optuna tuning
│   ├── evaluation/      # Metrics, comparison, error analysis
│   ├── inference/       # Production predictor
│   └── api.py           # FastAPI REST API
├── config/              # Central configuration
├── notebooks/           # Jupyter/Colab notebooks
├── tests/               # pytest test suite (39 tests)
├── results/metrics/     # Saved evaluation metrics (JSON)
├── app.py               # Streamlit dashboard
├── run_training.py      # CLI: train models
└── run_evaluation.py    # CLI: evaluate + compare models
```
```bash
# Setup
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements-full.txt

# Train classical models on German Sentiment
python run_training.py --model all --dataset german

# Evaluate LLM (requires Azure OpenAI key in .env)
python run_evaluation.py --model llm --deployment gpt-4o-mini --mode zero_shot

# Compare all models
python run_evaluation.py --model compare

# Run tests
pytest tests/ -v

# Launch dashboard
streamlit run app.py

# Start API
uvicorn src.api:app --reload
```

- Classical ML (`src/models/classical.py`): TF-IDF vectorization (unigrams + bigrams, 50K features) with Logistic Regression, LinearSVC, or Multinomial Naive Bayes. Hyperparameter tuning via Optuna.
- Fine-tuned BERT (`src/models/bert_classifier.py`): `deepset/gbert-base` (German, 110M params) and `bert-base-uncased` (English, 110M params) fine-tuned with the HuggingFace Trainer. Training notebooks for Google Colab included.
- LLM Zero/Few-Shot (`src/models/llm_classifier.py`): Azure OpenAI GPT-4o and GPT-4o-mini with structured JSON output. Includes retry logic, cost tracking, and response caching.
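The classical baseline amounts to a TF-IDF + linear-classifier pipeline. A minimal scikit-learn sketch (toy German training data is illustrative only; the project's actual code lives in `src/models/classical.py`):

```python
# TF-IDF (unigrams + bigrams, capped at 50K features) feeding a LinearSVC,
# matching the feature setup described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), max_features=50_000)),
    ("clf", LinearSVC()),
])

# Toy examples standing in for the German Sentiment training set
texts = ["sehr gut", "ganz schlecht", "es war in ordnung",
         "wirklich toll", "furchtbar", "geht so"]
labels = ["positive", "negative", "neutral",
          "positive", "negative", "neutral"]
pipeline.fit(texts, labels)
pred = pipeline.predict(["das war wirklich toll"])[0]
```

Swapping `LinearSVC` for `LogisticRegression` or `MultinomialNB` reuses the same vectorizer, which is why the three classical models share identical features in the benchmark.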
All models share the same evaluation pipeline (src/evaluation/metrics.py):
- Accuracy, Weighted F1, Macro F1, Precision, Recall
- Confusion Matrix (3x3)
- ROC-AUC (One-vs-Rest)
- Inference latency benchmarks
- API cost tracking (LLM models)
- Multi-seed aggregation (mean +/- std)
- Learning curves
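The core metrics above map directly onto scikit-learn equivalents; a sketch of how `src/evaluation/metrics.py` might compute them (the toy labels and probabilities below are fabricated for illustration, and the project's exact implementation may differ):

```python
# Accuracy, weighted/macro F1, 3x3 confusion matrix, and one-vs-rest ROC-AUC
# for a 3-class problem (0=negative, 1=neutral, 2=positive).
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, roc_auc_score

y_true = np.array([0, 1, 2, 2, 0, 1, 2, 0])
y_pred = np.array([0, 2, 2, 2, 0, 1, 1, 0])
# Per-class probabilities, e.g. softmax outputs from BERT (rows sum to 1)
y_prob = np.array([
    [0.8, 0.1, 0.1], [0.2, 0.3, 0.5], [0.1, 0.2, 0.7], [0.1, 0.1, 0.8],
    [0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.2, 0.5, 0.3], [0.6, 0.3, 0.1],
])

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1_weighted": f1_score(y_true, y_pred, average="weighted"),
    "f1_macro": f1_score(y_true, y_pred, average="macro"),
    "roc_auc_ovr": roc_auc_score(y_true, y_prob, multi_class="ovr"),
}
cm = confusion_matrix(y_true, y_pred)  # 3x3 matrix
```

Note ROC-AUC needs per-class probabilities, which is why the tables report it only for models that expose them.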
- German Sentiment (primary): 6.4K train / 772 val / 1.5K test, 3-class
- Yelp Reviews (secondary): 650K English reviews, 5-star mapped to 3-class (10K train / 5K test subsets)
```bash
# Predict single text
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "Das Produkt ist ausgezeichnet!", "model": "classical"}'

# Batch prediction
curl -X POST http://localhost:8000/predict/batch \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Sehr gut!", "Schlecht."], "model": "classical"}'
```

- ML: scikit-learn, transformers, torch, optuna
- Data: pandas, datasets (HuggingFace)
- API: FastAPI, pydantic
- Dashboard: Streamlit, Plotly
- LLM: openai (Azure OpenAI SDK)
- Tracking: MLflow
- Testing: pytest (39 tests)
- CI/CD: GitHub Actions
MIT