Author: Tan LE (David) - EPITA
This project explores Mixture-of-Experts (MoE) inference in a simulated edge environment, focusing on how distributed or partitioned models can reduce, or in some cases increase, inference latency once network constraints are introduced. By combining DeepSpeed, Python's multiprocessing, and artificial network delays, we gather insights into the trade-offs between scaling performance and managing communication overhead.
1. Evaluate Distributed MoE Inference:
Demonstrate how partitioning or scaling Mixture-of-Experts models affects inference latency and throughput, even in a simulated environment.
2. Investigate Network & Scheduling Overheads:
- Introduce artificial network delays to reflect edge deployment constraints.
- Show how concurrency or scheduling strategies (e.g., distributing load across experts) impact performance.
3. Profile Resource Usage:
- Collect baseline metrics (CPU/GPU usage, memory footprint) for single-process vs. multi-process setups (a minimal profiling sketch follows this list).
- Provide simple charts or tables comparing key performance indicators (KPIs).
4. Show Practical Insight:
Deliver a short, data-driven report outlining findings, trade-offs, and next steps.
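As an illustration of the kind of resource profiling meant in objective 3, the sketch below samples elapsed time, CPU usage, and memory footprint around an arbitrary workload with `psutil`. The function name and the placeholder workload are assumptions for demonstration, not part of the project code.

```python
import time
import psutil

def profile_resources(workload, *args, **kwargs):
    """Run a callable and report elapsed time plus CPU/memory usage of this process."""
    process = psutil.Process()
    process.cpu_percent(interval=None)          # prime the CPU counter
    rss_before = process.memory_info().rss

    start = time.perf_counter()
    result = workload(*args, **kwargs)
    elapsed = time.perf_counter() - start

    rss_after = process.memory_info().rss
    cpu = process.cpu_percent(interval=None)    # average CPU usage since the prime call
    print(f"elapsed: {elapsed:.3f}s, cpu: {cpu:.1f}%, "
          f"rss: {rss_after / 1e6:.1f} MB (delta {(rss_after - rss_before) / 1e6:+.1f} MB)")
    return result

if __name__ == "__main__":
    profile_resources(sum, range(10_000_000))   # placeholder workload
```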
- Python 3.11 or higher
- uv package installer
- (Optional) NVIDIA GPU with CUDA support for faster inference
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd moe_sim_deepspeed
  ```
- Create and activate the virtual environment using uv:

  ```bash
  # Create virtual environment
  uv venv

  # Activate virtual environment
  source .venv/bin/activate   # On Unix/macOS
  # OR
  .venv\Scripts\activate      # On Windows
  ```
- Install dependencies:

  ```bash
  uv pip install -r requirements.txt
  ```
- (Optional) Set environment variables:

  ```bash
  # Disable tokenizer parallelism warnings
  export TOKENIZERS_PARALLELISM=false   # On Unix/macOS
  # OR
  set TOKENIZERS_PARALLELISM=false      # On Windows
  ```
To verify the installation, you can run:

```bash
python -c "import torch; import deepspeed; import wandb; print('Setup successful!')"
```

If you see "Setup successful!" without any errors, your environment is ready for the MoE inference experiments.
- Make sure your virtual environment is activated:

  ```bash
  source .venv/bin/activate   # On Unix/macOS
  # OR
  .venv\Scripts\activate      # On Windows
  ```
- Set up Hugging Face authentication:

  ```bash
  # Login to Hugging Face (you'll need an account and access to Llama models)
  huggingface-cli login
  ```

- Run the base inference script:

  ```bash
  python base_inference.py
  ```
The script will automatically detect and use the best available device (CUDA, MPS, or CPU) for inference.
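The selection logic is essentially the standard PyTorch device-detection pattern sketched below; the actual script may differ in details.

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```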
The distributed inference system supports three main experimentation modes to test different aspects of the MoE (Mixture of Experts) setup:
```bash
python distributed_inference.py --mode single --num_experts 3 --latency 5.0 --num_runs 5
```

This mode:
- Runs with a specified number of expert processes (default: 3)
- Adds configurable artificial latency (default: 5ms)
- Performs multiple inference runs
- Logs results to Weights & Biases
- Shows per-run metrics and averages
Use this mode for:
- Initial testing and verification
- Quick performance checks
- Debugging specific configurations
```bash
python distributed_inference.py --mode latency
```

This mode:
- Tests different latency settings (0ms, 5ms, 10ms, 20ms)
- Uses default 3 experts
- Runs multiple inference runs for each latency setting
- Helps understand how network latency affects distributed inference performance (a minimal delay-injection sketch follows this section)
Use this mode to:
- Study the impact of network conditions
- Optimize communication strategies
- Determine latency thresholds
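For intuition, the artificial delay can be as simple as sleeping for the configured number of milliseconds around each expert call. The sketch below is an assumption about how such a delay might be injected, not the project's exact implementation; `expert_fn` is a hypothetical callable standing in for an expert process.

```python
import time

def call_expert_with_latency(expert_fn, payload, latency_ms: float):
    """Simulate a network round-trip by sleeping before and after the expert call."""
    one_way = (latency_ms / 1000.0) / 2
    time.sleep(one_way)            # request travels to the expert
    result = expert_fn(payload)    # expert does its work
    time.sleep(one_way)            # response travels back
    return result
```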
```bash
python distributed_inference.py --mode experts
```

This mode:
- Tests different numbers of experts (2, 3, 4)
- Uses default 5ms latency
- Runs multiple inference runs for each expert count
- Helps understand how scaling the number of experts affects performance (a minimal multiprocessing sketch follows this section)
Use this mode to:
- Study system scalability
- Optimize expert count
- Balance resource utilization
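A minimal sketch of how work might be fanned out across expert processes with Python's multiprocessing; the function names and placeholder "expert" are hypothetical, and the real `distributed_inference.py` is more involved.

```python
from multiprocessing import Pool

def expert_worker(task):
    """Placeholder expert: in the real system this would run a model shard."""
    expert_id, prompt = task
    return f"expert {expert_id} processed: {prompt}"

def run_experts(prompt: str, num_experts: int = 3):
    """Dispatch the same prompt to a pool of expert processes and collect their outputs."""
    tasks = [(i, prompt) for i in range(num_experts)]
    with Pool(processes=num_experts) as pool:
        return pool.map(expert_worker, tasks)

if __name__ == "__main__":
    for output in run_experts("hello", num_experts=3):
        print(output)
```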
For a comprehensive analysis of all aspects of the MoE system, use the orchestration script:
```bash
python run_experiments.py --mode all --num_runs 5 --expert_counts 2,3,4 --latencies 0,5,10,20
```

This script:
- Runs all experiments (base inference, distributed inference with different expert counts and latencies)
- Collects resource usage metrics
- Generates comprehensive reports with visualizations
- Logs results to Weights & Biases
You can also run specific experiments:
```bash
# Run only the expert count experiment
python run_experiments.py --mode expert_count --num_runs 5 --expert_counts 2,3,4

# Run only the latency experiment
python run_experiments.py --mode latency --num_runs 5 --latencies 0,5,10,20

# Run only the comparison experiment
python run_experiments.py --mode comparison --num_runs 5 --expert_counts 2,3,4 --latencies 0,5,10,20
```

The results will be stored in the specified output directory (default: `experiment_results`), and reports will be generated in the `reports` subdirectory.
Our experiments reveal several important insights about MoE inference in edge environments:
- Latency Thresholds: There's a critical latency threshold beyond which distributed inference becomes slower than single-process inference. This threshold varies based on model size and hardware capabilities.
- Expert Count Optimization: Adding more experts doesn't always improve performance. There's an optimal number of experts that balances parallelism with communication overhead.
- Resource Efficiency: Distributed inference can be more memory-efficient per process, but total system memory usage increases with the number of experts.
- Communication Patterns: The way experts communicate (synchronously vs. asynchronously) significantly impacts overall performance, especially under high latency conditions (see the sketch below).
These findings help guide the design of efficient MoE systems for edge deployment, where network constraints are a significant factor.
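To illustrate the synchronous-vs-asynchronous point: when each expert call includes a simulated network delay, issuing the calls concurrently hides much of that delay, while calling the experts one after another pays it once per expert. The sketch below uses hypothetical helpers and a thread pool purely to show the difference; it is not the project's communication layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def expert_call(expert_id: int, latency_s: float = 0.02) -> int:
    time.sleep(latency_s)          # simulated network delay per expert
    return expert_id

def sequential(n: int) -> float:
    """Call each expert in turn: total delay grows linearly with n."""
    start = time.perf_counter()
    for i in range(n):
        expert_call(i)
    return time.perf_counter() - start

def parallel(n: int) -> float:
    """Call all experts concurrently: delays overlap instead of adding up."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(expert_call, range(n)))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sequential: {sequential(4):.3f}s, parallel: {parallel(4):.3f}s")
```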
For each run, you'll see:
- Per-run metrics:
  - Generated text
  - Processing time
  - Communication time
  - Total time
  - Tokens per second
- Summary statistics:
  - Average processing time
  - Average communication time
  - Average total time
  - Average tokens per second
All results are automatically logged to Weights & Biases for visualization and analysis.
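If you want to log additional metrics yourself, the standard Weights & Biases pattern looks roughly like the sketch below; the project name, config keys, and metric values are illustrative placeholders, not output from the experiments.

```python
import wandb

# Dummy per-run metrics for demonstration only; in practice these come from the inference runs.
results = [
    {"processing_time": 0.0, "tokens_per_second": 0.0},
    {"processing_time": 0.0, "tokens_per_second": 0.0},
]

run = wandb.init(project="moe_sim_deepspeed", config={"num_experts": 3, "latency_ms": 5.0})
for step, metrics in enumerate(results):
    wandb.log(metrics, step=step)   # each call adds one point per metric to the run
run.finish()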
The script supports multiple hardware acceleration options:
- CUDA: For NVIDIA GPUs (fastest option)
- MPS: For Apple Silicon Macs
- CPU: Fallback option for systems without GPU support
To ensure CUDA support, make sure you have the appropriate NVIDIA drivers installed and PyTorch with CUDA support.
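A quick way to check whether PyTorch can see your GPU (prints `True` and the CUDA version when GPU support is available):

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```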