
5G QoS Prediction in Radio Access Networks

Using Machine Learning to Predict Downlink Throughput from Drive-Test Measurements



Best Model: Voting Regressor (XGBoost + LightGBM + Random Forest) — R² = 0.96


Overview

This project builds a complete end-to-end machine learning pipeline to predict downlink throughput (Quality of Service) in a 5G Radio Access Network (RAN).

Using real-world drive-test measurements from the Berlin V2X vehicular scenario, the project progresses through five structured phases — raw data collection, cleaning, exploratory analysis, feature engineering, and model training — culminating in a high-accuracy ensemble model.

| Metric | Value |
| --- | --- |
| Task | Supervised regression |
| Target | Downlink data rate (bits/sec) |
| Training samples | Drive-test rows with Carrier Aggregation enabled |
| Final R² | 0.960 |
| Best model | Voting Regressor (XGB + LGB + RF) |

Research Question & Hypothesis

Can we use 5G radio, mobility, and context features to predict the downlink throughput of a vehicular UE under Carrier Aggregation (CA)?

H1 (Testable): If radio link quality and allocated resources improve — indicated by higher PCell_SNR, RSRP, and Downlink_Num_RBs — then the downlink data rate (target) will increase significantly.

Result: Hypothesis confirmed. Radio signal metrics and resource block allocation are the dominant predictors of 5G throughput.


Dataset

| Property | Details |
| --- | --- |
| Source | Kaggle — QoS Prediction Challenge (AI/ML in 5G) |
| Origin | Berlin V2X drive-test measurements |
| Files | Train.csv, Test.csv, VariableDefinitions.csv |
| Target variable | target — downlink data rate in bits/second (continuous) |
| Execution env | Google Colab (T4 GPU) |

Feature Groups

| Group | Features | Strength |
| --- | --- | --- |
| Radio signal | RSRP, RSRQ, RSSI, SNR (PCell & SCell) | Strong |
| Resource allocation | Downlink Num_RBs, Bandwidth (MHz) | Strong |
| Mobility & GPS | Latitude, Longitude, Altitude, Speed | Moderate |
| Context / environment | Area type, Cell Identity | Moderate |
| Temporal | hour_sin, hour_cos (diurnal encoding) | Moderate |
| Weather | Temperature, Humidity, Wind, Precipitation | Weak |

ML Pipeline

Raw Drive-Test CSV
 │
 ▼
┌─────────────────────┐
│ Phase 1 │ Data ingestion, shape & type exploration
│ Data Collection │
└────────┬────────────┘
 │
 ▼
┌─────────────────────┐
│ Phase 2 │ Duplicate removal, KNN + domain-driven imputation,
│ Cleaning & Prep │ 3GPP-bounded outlier clipping
└────────┬────────────┘
 │
 ▼
┌─────────────────────┐
│ Phase 3 │ Distributions, correlation heatmaps,
│ EDA │ mutual information, Simpson's paradox, seasonality
└────────┬────────────┘
 │
 ▼
┌─────────────────────┐
│ Phase 4 │ PCA (95% variance), LassoCV feature selection,
│ Feature Eng. │ signal ratios, temporal cyclical encoding
└────────┬────────────┘
 │
 ▼
┌─────────────────────┐
│ Phase 5 │ Ridge, SVR, Decision Tree, XGBoost, LightGBM,
│ Modeling │ Random Forest, Voting Regressor + GridSearchCV
└────────┬────────────┘
 │
 ▼
 R² = 0.960 

Project Phases

Phase 1 — Data Collection
  • Loaded Train.csv and Test.csv from Google Drive
  • Inspected shape, dtypes (object, int64, float64) and variable definitions
  • Understood the population: drive-test observations from a metropolitan vehicular network with Carrier Aggregation enabled
  • Identified the target variable: continuous downlink throughput in bits/sec
Phase 2 — Data Cleaning & Preprocessing
  • Duplicates: 0 duplicate rows found
  • Missing values: null patterns concentrated in the signal and context columns
  • Imputation strategy (layered):
      1. Bandwidth → RB-count derivation using a domain formula
      2. Cell-ID-grouped median imputation for radio metrics
      3. KNN Imputer for residual nulls
  • Outlier handling: Z-score screening (threshold = 3) plus box plots; values beyond 3GPP standard bounds were clipped, while the remaining extremes were retained because they carry genuine telecom signal
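The layered strategy above can be sketched as follows; the column names (Cell_ID, PCell_SNR, RSRP) and the toy frame are illustrative stand-ins for the actual drive-test schema, not the project's code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame standing in for the drive-test data (column names illustrative)
df = pd.DataFrame({
    "Cell_ID":   [1, 1, 1, 2, 2, 2],
    "PCell_SNR": [10.0, np.nan, 12.0, 20.0, 22.0, np.nan],
    "RSRP":      [-90.0, -92.0, np.nan, -80.0, np.nan, -82.0],
})

# Layer 1: per-cell median imputation for the radio metrics
for col in ["PCell_SNR", "RSRP"]:
    df[col] = df.groupby("Cell_ID")[col].transform(lambda s: s.fillna(s.median()))

# Layer 2: KNN imputation mops up any residual nulls
df[df.columns] = KNNImputer(n_neighbors=2).fit_transform(df)

# Outliers: clip anything beyond |z| > 3 back to the 3-sigma bound
for col in ["PCell_SNR", "RSRP"]:
    mu, sigma = df[col].mean(), df[col].std()
    df[col] = df[col].clip(mu - 3 * sigma, mu + 3 * sigma)
```

The grouped-median pass runs first so that KNN only has to fill cells whose entire group is null, which keeps the imputed values anchored to each cell's local radio conditions.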
Phase 3 — Exploratory Data Analysis
  • Descriptive statistics (5-number summaries) across all splits
  • Distribution plots for all numeric and categorical features (saved as postproc_freq_plots/)
  • Correlation heatmap: Strong positive block — SNR, RSRP, RSRQ, RSSI, Num_RBs → target
  • Mutual information + chi-squared: Radio metrics dominate; weather near-zero
  • Simpson's Paradox: Aggregated area-level trends reverse within-group trends for certain features
  • Seasonality: Clear diurnal pattern in mean throughput encoded via hour_sin / hour_cos
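The diurnal encoding mentioned above maps the hour of day onto a circle so that 23:00 and 00:00 end up as neighbors, which a raw 0-23 integer cannot express. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Hour of day per measurement (values illustrative)
df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

# Cyclical encoding: each hour becomes a point on the unit circle
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
```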
Phase 4 — Feature Engineering & Dimensionality Reduction
  • Engineered features: signal ratios, clipped radio bounds, temporal cyclical encoding
  • PCA → retained 95% variance in significantly fewer components (highest reduction)
  • LassoCV → eliminated near-zero weather coefficients
  • Evaluated all three dataset variants (Full / PCA / LassoCV)
  • Full Features selected as final training set for maximum predictive information
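A sketch of the two reduction techniques on synthetic data; the feature counts here are illustrative, not the project's actual dimensionality:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 5 informative
# (mimicking the near-useless weather columns)
X, y = make_regression(n_samples=500, n_features=20, n_informative=5,
                       random_state=0)
X = StandardScaler().fit_transform(X)

# PCA keeping 95% of the variance
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X)
print(f"PCA: {X.shape[1]} -> {X_pca.shape[1]} components")

# LassoCV shrinks uninformative coefficients toward zero
lasso = LassoCV(cv=5, random_state=0).fit(X, y)
kept = int(np.sum(np.abs(lasso.coef_) > 1e-6))
print(f"LassoCV kept {kept} of {X.shape[1]} features")
```

Evaluating the model on all three variants (full / PCA / LassoCV), as the project does, then makes the accuracy-vs-compactness trade-off explicit.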
Phase 5 — Model Training & Evaluation
  • Trained 4 base models with hyperparameter tuning via GridSearchCV / RandomizedSearchCV
  • Added ensemble models (Random Forest, LightGBM, Voting Regressor)
  • Evaluated using R², MAE, RMSE, and learning curves
  • No significant overfitting — training and validation scores converge well
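A minimal Voting Regressor sketch on synthetic data; scikit-learn's GradientBoostingRegressor and RidgeCV stand in here for the project's XGBoost and LightGBM estimators, which plug into VotingRegressor the same way:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor,
                              RandomForestRegressor, VotingRegressor)
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=800, n_features=10, noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Voting Regressor averages the predictions of diverse base learners
ensemble = VotingRegressor([
    ("gbr",   GradientBoostingRegressor(random_state=0)),
    ("rf",    RandomForestRegressor(n_estimators=200, random_state=0)),
    ("ridge", RidgeCV()),
])
ensemble.fit(X_tr, y_tr)
r2 = r2_score(y_te, ensemble.predict(X_te))
print(f"Ensemble R2 = {r2:.3f}")
```

Averaging predictions from learners with different biases is what lets the ensemble edge out its strongest member, mirroring the 0.960-vs-0.935 gap reported below.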

Results

Model Comparison

| Model | Features | R² Score |
| --- | --- | --- |
| Ridge Regression | Full features | Baseline |
| SVR | Full features | Moderate |
| Decision Tree | Full features | Moderate |
| XGBoost | Full features | 0.935 |
| Random Forest | Full features | High |
| LightGBM | Full features | High |
| **Voting Regressor (XGB + LGB + RF)** | Full features | **0.960** |

Dimensionality Reduction

| Method | Outcome | Dimensionality Reduction |
| --- | --- | --- |
| PCA | Retained 95% of variance | High |
| LassoCV | Eliminated near-zero weather features | Moderate |

Winner: PCA (most compact representation)

Key Findings

  1. Radio signal quality drives throughput — SNR, RSRP, and allocated resource blocks (Num_RBs) are the top predictors
  2. Weather is a near-zero predictor — confirmed by both mutual information analysis and LassoCV coefficient shrinkage
  3. Ensemble beats individuals — Voting Regressor (R²=0.96) outperforms standalone XGBoost (R²=0.935) by combining diverse learners
  4. No overfitting — learning curves show train/validation convergence across all models
  5. Simpson's Paradox observed — area-level aggregation can reverse individual feature-target trends, emphasizing the need for stratified analysis
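Finding 5 can be reproduced on a toy frame (values illustrative): within each area the signal-throughput trend is positive, but pooling the areas flips the sign of the correlation.

```python
import pandas as pd

# Two areas with different baselines: each has a positive
# SNR-throughput trend, but the pooled trend is negative
df = pd.DataFrame({
    "area": ["urban"] * 4 + ["rural"] * 4,
    "snr":  [5, 6, 7, 8, 20, 21, 22, 23],
    "tput": [40, 42, 44, 46, 10, 12, 14, 16],
})

pooled = df["snr"].corr(df["tput"])   # negative: aggregation reverses the trend
within = df.groupby("area")[["snr", "tput"]].apply(
    lambda g: g["snr"].corr(g["tput"]))  # positive in every area
print(pooled, within.tolist())
```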

Tech Stack

| Library | Version | Purpose |
| --- | --- | --- |
| pandas | latest | Data manipulation |
| numpy | latest | Numerical computing |
| scikit-learn | latest | ML models, PCA, LassoCV, CV |
| xgboost | latest | Gradient-boosted trees |
| lightgbm | latest | Fast gradient boosting |
| matplotlib | latest | Plotting |
| seaborn | latest | Statistical visualization |
| scipy | latest | Z-score, statistics |

Run environment: Google Colab with T4 GPU


How to Run

Option 1 — Google Colab (Recommended)

  1. Open 602.p5.SriRammSekarSasirekha.ipynb in Google Colab
  2. Mount Google Drive
  3. Upload Train.csv, Test.csv, and VariableDefinitions.csv to your Drive
  4. Update the BASE path at the top of the notebook:
BASE = "/content/drive/MyDrive/YOUR_FOLDER_NAME"
  5. Set the runtime to T4 GPU: Runtime → Change runtime type → GPU
  6. Run all cells sequentially (Runtime → Run all)

Option 2 — Local

# Clone the repo
git clone https://github.com/SriRammSS/5GQoS-Analysis.git
cd 5GQoS-Analysis

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook 602.p5.SriRammSekarSasirekha.ipynb

Note: The notebook mounts Google Drive for data loading. For local runs, update the BASE variable to a local path containing the CSV files.


Repository Structure

5GQoS-Analysis/
│
├── 602.p5.SriRammSekarSasirekha.ipynb # Full 5-phase project notebook
├── requirements.txt # Python dependencies
├── LICENSE # MIT License
└── README.md # Project documentation

Author

Sri Ramm Sekar Sasirekha UID: 121949820 | Course: 602 — Principles of Data Science

GitHub


Built as part of the 602 Principles of Data Science course project.
