Skip to content

karamperiss/heterogeneous-cluster-simulation

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Heterogeneous Server Cluster Simulation & Load Balancing

Overview

This project focuses on the performance analysis of heterogeneous server clusters and explores the effectiveness of different load balancing strategies. In modern cloud infrastructure, servers often have varying processing capabilities. Standard routing algorithms that ignore this heterogeneity can lead to severe bottlenecks.

Using discrete event simulation (via the Python Ciw library), this project models a network of servers and compares a standard "Shortest Queue" algorithm against a custom "Heterogeneity-Aware" routing policy. The goal is to find the optimal strategy that minimizes average customer wait time while maximizing resource utilization.

System Architecture

The simulated queuing network consists of 12 dynamically configurable nodes defined in config.yml:

  • Node 1 (Dispatcher): The entry point of the system. It performs no processing (Service time = 0) and is strictly responsible for routing incoming requests to the worker nodes. Arrivals follow an Exponential Distribution ($\lambda=1.0$ jobs/sec).
  • Nodes 2-12 (Heterogeneous Workers): 11 worker nodes divided into three speed tiers, with service times following an Exponential Distribution.
    • Slow Workers (Nodes 2-6): Average service time of 12 sec.
    • Medium Workers (Nodes 7-9): Average service time of 8 sec.
    • Fast Workers (Nodes 10-12): Average service time of 4 sec.

Load Balancing Strategies Evaluated

1. Algorithm 1: Shortest Queue

A baseline policy that routes tasks to the worker with the absolute shortest queue (including the currently served job). It completely ignores the processing speed of the nodes. In case of a tie, the node with the lowest ID is selected.

2. Algorithm 2: Heterogeneity-Aware (Optimal)

A smarter approach that attempts to exploit the faster servers. It sorts workers by their current load. If the least-loaded worker is "Fast", it is selected immediately. If the least-loaded worker is "Slow" or "Medium", the algorithm searches for a "Fast" node that satisfies a tolerance parameter $d$:

$$Load_{fast}\le{Load_{minNode}+d}$$

If found, the fast node takes the job, prioritizing speed over a slightly shorter queue. Otherwise, it falls back to the least-loaded node.

Simulation Methodology

To ensure statistical validity, the simulation is fully automated via run_experiments.py:

  • Duration: 24-hour simulation time with an additional 1-hour warm-up period to reach a steady state.
  • Monte Carlo: 20 independent runs per experiment.
  • Metrics: 95% Confidence Intervals ($\alpha=0.05$) are calculated for average waiting time and resource utilization.

Key Results

1. Tuning the Tolerance Parameter ($d$)

Experiments showed that the optimal value for the Heterogeneity-Aware algorithm is $d=0$. Forcing jobs into already busy fast servers ($d > 0$) causes severe bottlenecks and reduces the overall system utilization.


2. Shortest Queue vs. Heterogeneity-Aware Algorithm

Comparing the baseline against our tuned algorithm ($d=0$) yielded massive improvements:

Metric Algorithm 1 (Shortest Queue) Algorithm 2 (Heterogeneity-Aware, $d=0$)
Average Wait Time 0.78 ± 0.008 sec 0.257 ± 0.0005 sec
Average Utilization 75.07% ± 0.08% 62.82% ± 0.14%

Conclusion: Algorithm 2 reduces wait times by ~66%. While Algorithm 1 shows higher resource utilization (75%), this is a negative indicator here; tasks get trapped in "Slow" nodes (12 sec/job) simply because their queues were temporarily short. Algorithm 2 clears the workload much faster by heavily utilizing the 4-second nodes, leading to a highly desirable lower overall utilization state.

How to Run the Project

  1. Install dependencies:
    pip install ciw pyyaml matplotlib numpy
  2. Run the automated experiment script:
    python run_experiments.py
  3. The script will automatically generate the statistical summary (simulation_results.txt) and all relevant plots in the root directory.

About

A Python-based discrete event simulation evaluating load balancing strategies (Shortest Queue vs. Heterogeneity-Aware) in a mixed-speed server cluster.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages