Best Model: Voting Regressor (XGBoost + LightGBM + Random Forest) — R² = 0.96
- Overview
- Research Question & Hypothesis
- Dataset
- ML Pipeline
- Project Phases
- Results
- Key Findings
- Tech Stack
- How to Run
- Repository Structure
- Author
This project builds a complete end-to-end machine learning pipeline to predict downlink throughput (Quality of Service) in a 5G Radio Access Network (RAN).
Using real-world drive-test measurements from the Berlin V2X vehicular scenario, the project progresses through five structured phases — raw data collection, cleaning, exploratory analysis, feature engineering, and model training — culminating in a high-accuracy ensemble model.
| Metric | Value |
|---|---|
| Task | Supervised Regression |
| Target | Downlink data rate (bits/sec) |
| Training samples | Drive-test rows with Carrier Aggregation enabled |
| Final R² | 0.960 |
| Best Model | Voting Regressor (XGB + LGB + RF) |
Can we use 5G radio, mobility, and context features to predict the downlink throughput of a vehicular UE under Carrier Aggregation (CA)?
H1 (Testable): If radio link quality and allocated resources improve — indicated by higher PCell_SNR, RSRP, and Downlink_Num_RBs — then the downlink data rate (target) will increase significantly.
Result: Hypothesis confirmed. Radio signal metrics and resource block allocation are the dominant predictors of 5G throughput.
| Property | Details |
|---|---|
| Source | Kaggle — QoS Prediction Challenge (AI/ML in 5G) |
| Origin | Berlin V2X drive-test measurements |
| Files | Train.csv, Test.csv, VariableDefinitions.csv |
| Target variable | target — Downlink data rate in bits/second (continuous) |
| Execution env | Google Colab (T4 GPU) |
| Group | Features | Strength |
|---|---|---|
| Radio Signal | RSRP, RSRQ, RSSI, SNR (PCell & SCell) | Strong |
| Resource Allocation | Downlink Num_RBs, Bandwidth (MHz) | Strong |
| Mobility & GPS | Latitude, Longitude, Altitude, Speed | Moderate |
| Context / Environment | Area type, Cell Identity | Moderate |
| Temporal | hour_sin, hour_cos (diurnal encoding) | Moderate |
| Weather | Temperature, Humidity, Wind, Precipitation | Weak |
```text
Raw Drive-Test CSV
         │
         ▼
┌─────────────────────┐
│ Phase 1             │ Data ingestion, shape & type exploration
│ Data Collection     │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 2             │ Duplicate removal, KNN + domain-driven imputation,
│ Cleaning & Prep     │ 3GPP-bounded outlier clipping
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 3             │ Distributions, correlation heatmaps,
│ EDA                 │ mutual information, Simpson's paradox, seasonality
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 4             │ PCA (95% variance), LassoCV feature selection,
│ Feature Eng.        │ signal ratios, temporal cyclical encoding
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Phase 5             │ Ridge, SVR, Decision Tree, XGBoost, LightGBM,
│ Modeling            │ Random Forest, Voting Regressor + GridSearchCV
└────────┬────────────┘
         │
         ▼
     R² = 0.960
```
Phase 1 — Data Collection
- Loaded `Train.csv` and `Test.csv` from Google Drive
- Inspected shape, dtypes (`object`, `int64`, `float64`) and variable definitions
- Understood the population: drive-test observations from a metropolitan vehicular network with Carrier Aggregation enabled
- Identified the target variable: continuous downlink throughput in bits/sec
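The Phase 1 first pass amounts to loading the CSVs and inspecting shape and dtype mix. A minimal sketch with a toy in-memory stand-in for `Train.csv` (the column names here are illustrative, not the full schema):

```python
import io
import pandas as pd

# Toy stand-in for Train.csv; the notebook loads the real file from a mounted Drive folder
csv = io.StringIO(
    "PCell_SNR,Downlink_Num_RBs,target\n"
    "12.5,50,2.1e7\n"
    "18.0,75,4.8e7\n"
)
train = pd.read_csv(csv)

# First-pass inspection used in Phase 1: shape and dtype mix
shape = train.shape                       # (rows, columns)
dtype_counts = train.dtypes.value_counts()
```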
Phase 2 — Data Cleaning & Preprocessing
- Duplicates: 0 duplicate rows found
- Missing values: Null patterns across signal and context columns
- Imputation strategy (layered):
- Bandwidth → RB count derivation using domain formula
- Cell-ID grouped median imputation for radio metrics
- KNN Imputer for residual nulls
- Outlier handling: Z-score (threshold = 3) + box plots; clipped to 3GPP standard bounds — extreme values retained as they carry real telecom signal
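The layered cleaning strategy above can be sketched as follows. The toy frame, column names, and clipping bounds are illustrative (the RSRP range matches the 3GPP reporting range in dBm), not the project's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy drive-test frame with nulls
df = pd.DataFrame({
    "cell_id":   [1, 1, 1, 2, 2, 2],
    "PCell_SNR": [10.0, np.nan, 12.0, 20.0, 22.0, np.nan],
    "RSRP":      [-90.0, -92.0, np.nan, -80.0, np.nan, -82.0],
})

radio_cols = ["PCell_SNR", "RSRP"]

# Layer: per-cell median imputation for radio metrics
for col in radio_cols:
    df[col] = df.groupby("cell_id")[col].transform(lambda s: s.fillna(s.median()))

# Layer: KNN imputation for any residual nulls
df[radio_cols] = KNNImputer(n_neighbors=2).fit_transform(df[radio_cols])

# Outlier step: clip to standard bounds instead of dropping rows,
# so extreme-but-real telecom values are retained
df["RSRP"] = df["RSRP"].clip(lower=-140, upper=-44)
```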
Phase 3 — Exploratory Data Analysis
- Descriptive statistics (5-number summaries) across all splits
- Distribution plots for all numeric and categorical features (saved as `postproc_freq_plots/`)
- Correlation heatmap: Strong positive block — SNR, RSRP, RSRQ, RSSI, Num_RBs → `target`
- Mutual information + chi-squared: Radio metrics dominate; weather near-zero
- Simpson's Paradox: Aggregated area-level trends reverse within-group trends for certain features
- Seasonality: Clear diurnal pattern in mean throughput, encoded via `hour_sin` / `hour_cos`
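The diurnal encoding mentioned above is the standard cyclical transform: hour of day is mapped onto the unit circle so 23:00 and 00:00 end up as neighbors rather than 23 apart. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Hours of day (toy values; in the notebook these come from the timestamp column)
hours = pd.Series([0, 6, 12, 18, 23])

# Cyclical encoding: one period per 24 hours
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Distance between encoded 23:00 and 00:00 is small, unlike the raw gap of 23
d = np.hypot(hour_sin[4] - hour_sin[0], hour_cos[4] - hour_cos[0])
```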
Phase 4 — Feature Engineering & Dimensionality Reduction
- Engineered features: signal ratios, clipped radio bounds, temporal cyclical encoding
- PCA → retained 95% variance in significantly fewer components (highest reduction)
- LassoCV → eliminated near-zero weather coefficients
- Evaluated all three dataset variants (Full / PCA / LassoCV)
- Full Features selected as final training set for maximum predictive information
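The two reduction variants compared above can be sketched on synthetic data (scikit-learn only; the synthetic matrix stands in for the project's actual feature set, so the exact component counts will differ):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
Xs = StandardScaler().fit_transform(X)

# PCA variant: keep the smallest number of components explaining 95% of variance
X_pca = PCA(n_components=0.95).fit_transform(Xs)

# LassoCV variant: drop features whose cross-validated coefficients shrink to ~0
lasso = LassoCV(cv=5, random_state=0).fit(Xs, y)
keep = np.abs(lasso.coef_) > 1e-6
X_lasso = Xs[:, keep]
```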
Phase 5 — Model Training & Evaluation
- Trained 4 base models with hyperparameter tuning via GridSearchCV / RandomizedSearchCV
- Added ensemble models (Random Forest, LightGBM, Voting Regressor)
- Evaluated using R², MAE, RMSE, and learning curves
- No significant overfitting — training and validation scores converge well
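A hedged sketch of the ensemble setup: scikit-learn estimators stand in for XGBoost/LightGBM, the grid is deliberately tiny, and the synthetic data replaces the drive-test features:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Tune one base learner via grid search (grid kept tiny for illustration)
rf = GridSearchCV(RandomForestRegressor(random_state=0),
                  {"n_estimators": [50, 100]}, cv=3).fit(X_tr, y_tr).best_estimator_

# Voting Regressor averages the predictions of diverse learners
vote = VotingRegressor([
    ("rf", rf),
    ("gb", GradientBoostingRegressor(random_state=0)),
    ("ridge", Ridge()),
]).fit(X_tr, y_tr)

r2 = r2_score(y_te, vote.predict(X_te))
```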
| Model | Features | R² Score |
|---|---|---|
| Ridge Regression | Full Features | Baseline |
| SVR | Full Features | Moderate |
| Decision Tree | Full Features | Moderate |
| XGBoost | Full Features | 0.935 |
| Random Forest | Full Features | High |
| LightGBM | Full Features | High |
| Voting Regressor (XGB + LGB + RF) | Full Features | **0.960** |
| Method | Effect | Dimensionality Reduction |
|---|---|---|
| PCA | Retained 95% of variance | High (most compact) |
| LassoCV | Eliminated near-zero weather coefficients | Moderate |
- Radio signal quality drives throughput — SNR, RSRP, and allocated resource blocks (Num_RBs) are the top predictors
- Weather is a near-zero predictor — confirmed by both mutual information analysis and LassoCV coefficient shrinkage
- Ensemble beats individuals — Voting Regressor (R²=0.96) outperforms standalone XGBoost (R²=0.935) by combining diverse learners
- No overfitting — learning curves show train/validation convergence across all models
- Simpson's Paradox observed — area-level aggregation can reverse individual feature-target trends, emphasizing the need for stratified analysis
| Library | Version | Purpose |
|---|---|---|
| `pandas` | latest | Data manipulation |
| `numpy` | latest | Numerical computing |
| `scikit-learn` | latest | ML models, PCA, LassoCV, CV |
| `xgboost` | latest | Gradient boosted trees |
| `lightgbm` | latest | Fast gradient boosting |
| `matplotlib` | latest | Plotting |
| `seaborn` | latest | Statistical visualization |
| `scipy` | latest | Z-score, statistics |
Run environment: Google Colab with T4 GPU
- Open `602.p5.SriRammSekarSasirekha.ipynb` in Google Colab
- Mount Google Drive
- Upload `Train.csv`, `Test.csv`, and `VariableDefinitions.csv` to your Drive
- Update the `BASE` path at the top of the notebook:
  ```python
  BASE = "/content/drive/MyDrive/YOUR_FOLDER_NAME"
  ```
- Set runtime to T4 GPU (Runtime → Change runtime type → GPU)
- Run all cells sequentially (Runtime → Run all)
```bash
# Clone the repo
git clone https://github.com/SriRammSS/5GQoS-Analysis.git
cd 5GQoS-Analysis

# Install dependencies
pip install -r requirements.txt

# Launch Jupyter
jupyter notebook 602.p5.SriRammSekarSasirekha.ipynb
```

Note: The notebook mounts Google Drive for data loading. For local runs, update the `BASE` variable to a local path containing the CSV files.
```text
5GQoS-Analysis/
│
├── 602.p5.SriRammSekarSasirekha.ipynb   # Full 5-phase project notebook
├── requirements.txt                     # Python dependencies
├── LICENSE                              # MIT License
└── README.md                            # Project documentation
```
**Sri Ramm Sekar Sasirekha**
UID: 121949820 | Course: 602 — Principles of Data Science