Best Model: Voting Regressor (XGBoost + LightGBM + Random Forest) — R² = 0.96
- Overview
- Research Question & Hypothesis
- Dataset
- ML Pipeline
- Project Phases
- Results
- Key Findings
- Tech Stack
- How to Run
- Repository Structure
- Author
This project builds a complete end-to-end machine learning pipeline to predict downlink throughput (Quality of Service) in a 5G Radio Access Network (RAN).
Using real-world drive-test measurements from the Berlin V2X vehicular scenario, the project progresses through five structured phases — raw data collection, cleaning, exploratory analysis, feature engineering, and model training — culminating in a high-accuracy ensemble model.
| Metric | Value |
|---|---|
| Task | Supervised Regression |
| Target | Downlink data rate (bits/sec) |
| Training samples | Drive-test rows with Carrier Aggregation enabled |
| Final R² | 0.960 |
| Best Model | Voting Regressor (XGB + LGB + RF) |
Can we use 5G radio, mobility, and context features to predict the downlink throughput of a vehicular UE under Carrier Aggregation (CA)?
H1 (Testable): If radio link quality and allocated resources improve — indicated by higher PCell_SNR, RSRP, and Downlink_Num_RBs — then the downlink data rate (target) will increase significantly.
Result: Hypothesis confirmed. Radio signal metrics and resource block allocation are the dominant predictors of 5G throughput.
| Property | Details |
|---|---|
| Source | Kaggle — QoS Prediction Challenge (AI/ML in 5G) |
| Origin | Berlin V2X drive-test measurements |
| Files | Train.csv, Test.csv, VariableDefinitions.csv |
| Target variable | target — Downlink data rate in bits/second (continuous) |
| Execution env | Google Colab (T4 GPU) |
| Group | Features | Strength |
|---|---|---|
| Radio Signal | RSRP, RSRQ, RSSI, SNR (PCell & SCell) | Strong |
| Resource Allocation | Downlink Num_RBs, Bandwidth (MHz) | Strong |
| Mobility & GPS | Latitude, Longitude, Altitude, Speed | Moderate |
| Context / Environment | Area type, Cell Identity | Moderate |
| Temporal | hour_sin, hour_cos (diurnal encoding) | Moderate |
| Weather | Temperature, Humidity, Wind, Precipitation | Weak |
```text
Raw Drive-Test CSV
         │
         ▼
┌─────────────────────┐
│ Phase 1             │ Data ingestion, shape & type exploration
│ Data Collection     │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 2             │ Duplicate removal, KNN + domain-driven imputation,
│ Cleaning & Prep     │ 3GPP-bounded outlier clipping
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 3             │ Distributions, correlation heatmaps,
│ EDA                 │ mutual information, Simpson's paradox, seasonality
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 4             │ PCA (95% variance), LassoCV feature selection,
│ Feature Eng.        │ signal ratios, temporal cyclical encoding
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 5             │ Ridge, SVR, Decision Tree, XGBoost, LightGBM,
│ Modeling            │ Random Forest, Voting Regressor + GridSearchCV
└────────┬────────────┘
         │
         ▼
     R² = 0.960
```
Phase 1 — Data Collection
- Loaded `Train.csv` and `Test.csv` from Google Drive
- Inspected shape, dtypes (`object`, `int64`, `float64`) and variable definitions
- Understood the population: drive-test observations from a metropolitan vehicular network with Carrier Aggregation enabled
- Identified the target variable: continuous downlink throughput in bits/sec
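The Phase 1 first pass amounts to loading the CSVs and inspecting shape and dtype mix. A minimal sketch with a toy in-memory stand-in for `Train.csv` (the column names here are illustrative, not the full schema):

```python
import io
import pandas as pd

# Toy stand-in for Train.csv; the notebook loads the real file from a mounted Drive folder
csv = io.StringIO(
    "PCell_SNR,Downlink_Num_RBs,target\n"
    "12.5,50,2.1e7\n"
    "18.0,75,4.8e7\n"
)
train = pd.read_csv(csv)

# First-pass inspection used in Phase 1: shape and dtype mix
shape = train.shape                       # (rows, columns)
dtype_counts = train.dtypes.value_counts()
```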
Phase 2 — Data Cleaning & Preprocessing
- Duplicates: 0 duplicate rows found
- Missing values: Null patterns across signal and context columns
- Imputation strategy (layered):
- Bandwidth → RB count derivation using domain formula
- Cell-ID grouped median imputation for radio metrics
- KNN Imputer for residual nulls
- Outlier handling: Z-score (threshold = 3) + box plots; clipped to 3GPP standard bounds — extreme values retained as they carry real telecom signal
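The layered cleaning strategy above can be sketched as follows. The toy frame, column names, and clipping bounds are illustrative (the RSRP range matches the 3GPP reporting range in dBm), not the project's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy drive-test frame with nulls
df = pd.DataFrame({
    "cell_id":   [1, 1, 1, 2, 2, 2],
    "PCell_SNR": [10.0, np.nan, 12.0, 20.0, 22.0, np.nan],
    "RSRP":      [-90.0, -92.0, np.nan, -80.0, np.nan, -82.0],
})

radio_cols = ["PCell_SNR", "RSRP"]

# Layer: per-cell median imputation for radio metrics
for col in radio_cols:
    df[col] = df.groupby("cell_id")[col].transform(lambda s: s.fillna(s.median()))

# Layer: KNN imputation for any residual nulls
df[radio_cols] = KNNImputer(n_neighbors=2).fit_transform(df[radio_cols])

# Outlier step: clip to standard bounds instead of dropping rows,
# so extreme-but-real telecom values are retained
df["RSRP"] = df["RSRP"].clip(lower=-140, upper=-44)
```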
Phase 3 — Exploratory Data Analysis
- Descriptive statistics (5-number summaries) across all splits
- Distribution plots for all numeric and categorical features (saved as `postproc_freq_plots/`)
- Correlation heatmap: Strong positive block — SNR, RSRP, RSRQ, RSSI, Num_RBs → `target`
- Mutual information + chi-squared: Radio metrics dominate; weather near-zero
- Simpson's Paradox: Aggregated area-level trends reverse within-group trends for certain features
- Seasonality: Clear diurnal pattern in mean throughput, encoded via `hour_sin` / `hour_cos`
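The diurnal encoding mentioned above is the standard cyclical transform: hour of day is mapped onto the unit circle so 23:00 and 00:00 end up as neighbors rather than 23 apart. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Hours of day (toy values; in the notebook these come from the timestamp column)
hours = pd.Series([0, 6, 12, 18, 23])

# Cyclical encoding: one period per 24 hours
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Distance between encoded 23:00 and 00:00 is small, unlike the raw gap of 23
d = np.hypot(hour_sin[4] - hour_sin[0], hour_cos[4] - hour_cos[0])
```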
Phase 4 — Feature Engineering & Dimensionality Reduction
- Engineered features: signal ratios, clipped radio bounds, temporal cyclical encoding
- PCA → retained 95% variance in significantly fewer components (highest reduction)
- LassoCV → eliminated near-zero weather coefficients
- Evaluated all three dataset variants (Full / PCA / LassoCV)
- Full Features selected as final training set for maximum predictive information
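The two reduction variants compared above can be sketched on synthetic data (scikit-learn only; the synthetic matrix stands in for the project's actual feature set, so the exact component counts will differ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

# PCA variant: keep the smallest number of components explaining 95% of variance
X_pca = PCA(n_components=0.95).fit_transform(Xs)

# LassoCV variant: drop features whose cross-validated coefficients shrink to ~0
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)
keep = np.abs(lasso.coef_) > 1e-6
X_lasso = Xs[:, keep]
```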
Phase 5 — Model Training & Evaluation
- Trained 4 base models with hyperparameter tuning via GridSearchCV / RandomizedSearchCV
- Added ensemble models (Random Forest, LightGBM, Voting Regressor)
- Evaluated using R², MAE, RMSE, and learning curves
- No significant overfitting — training and validation scores converge well
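A hedged sketch of the ensemble setup: scikit-learn estimators stand in for XGBoost/LightGBM, the grid is deliberately tiny, and the synthetic data replaces the drive-test features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Tune one base learner via grid search (grid kept tiny for illustration)
rf = GridSearchCV(RandomForestRegressor(random_state=0),
                  {"n_estimators": [50, 100]}, cv=3).fit(X_tr, y_tr).best_estimator_

# Voting Regressor averages the predictions of diverse learners
vote = VotingRegressor([
    ("rf", rf),
    ("gb", GradientBoostingRegressor(random_state=0)),
    ("ridge", Ridge()),
]).fit(X_tr, y_tr)

r2 = r2_score(y_te, vote.predict(X_te))
```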
| Model | Features | R² Score |
|---|---|---|
| Ridge Regression | Full Features | Baseline |
| SVR | Full Features | Moderate |
| Decision Tree | Full Features | Moderate |
| XGBoost | Full Features | 0.935 |
| Random Forest | Full Features | High |
| LightGBM | Full Features | High |
| Voting Regressor (XGB + LGB + RF) | Full Features | **0.960** |
| Method | Effect | Dimensionality Reduction |
|---|---|---|
| PCA | Retained 95% of variance | High (most compact) |
| LassoCV | Eliminated near-zero weather coefficients | Moderate |
- Radio signal quality drives throughput — SNR, RSRP, and allocated resource blocks (Num_RBs) are the top predictors
- Weather is a near-zero predictor — confirmed by both mutual information analysis and LassoCV coefficient shrinkage
- Ensemble beats individuals — Voting Regressor (R²=0.96) outperforms standalone XGBoost (R²=0.935) by combining diverse learners
- No overfitting — learning curves show train/validation convergence across all models
- Simpson's Paradox observed — area-level aggregation can reverse individual feature-target trends, emphasizing the need for stratified analysis
| Library | Version | Purpose |
|---|---|---|
| `pandas` | latest | Data manipulation |
| `numpy` | latest | Numerical computing |
| `scikit-learn` | latest | ML models, PCA, LassoCV, CV |
| `xgboost` | latest | Gradient boosted trees |
| `lightgbm` | latest | Fast gradient boosting |
| `matplotlib` | latest | Plotting |
| `seaborn` | latest | Statistical visualization |
| `scipy` | latest | Z-score, statistics |
Run environment: Google Colab with T4 GPU
- Open `602.p5.SriRammSekarSasirekha.ipynb` in Google Colab
- Mount Google Drive
- Upload `Train.csv`, `Test.csv`, and `VariableDefinitions.csv` to your Drive
- Update the `BASE` path at the top of the notebook:
  ```python
  BASE = "/content/drive/MyDrive/YOUR_FOLDER_NAME"
  ```
- Set runtime to T4 GPU (Runtime → Change runtime type → GPU)
- Run all cells sequentially (Runtime → Run all)
```bash
# Clone the repo
git clone https://github.com/SriRammSS/5GQoS-Analysis.git
cd 5GQoS-Analysis

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook 602.p5.SriRammSekarSasirekha.ipynb
```

Note: The notebook mounts Google Drive for data loading. For local runs, update the `BASE` variable to a local path containing the CSV files.
```text
5GQoS-Analysis/
│
├── 602.p5.SriRammSekarSasirekha.ipynb   # Full 5-phase project notebook
├── requirements.txt                     # Python dependencies
├── LICENSE                              # MIT License
└── README.md                            # Project documentation
```
**Sri Ramm Sekar Sasirekha**
UID: 121949820 | Course: 602 — Principles of Data Science