tanle8/moe_sim_deepspeed

Accelerating MoE Inference in a Simulated Edge Environment Using DeepSpeed

Author: Tan LE (David) - EPITA

Project Goal

This project explores Mixture-of-Experts (MoE) inference in a simulated edge environment, focusing on how distributed or partitioned models affect inference latency once network constraints are introduced: parallelism can reduce it, while communication overhead can increase it. By combining DeepSpeed, Python's multiprocessing, and artificial network delays, we gather insights into the trade-offs between scaling performance and managing communication overhead.

Objectives

1. Evaluate Distributed MoE Inference:

Demonstrate how partitioning or scaling Mixture-of-Experts models affects inference latency and throughput, even in a simulated environment.

2. Investigate Network & Scheduling Overheads:

  • Introduce artificial network delays to reflect edge deployment constraints.

  • Show how concurrency or scheduling strategies (e.g., distributing load across experts) impact performance.

3. Profile Resource Usage:

  • Collect baseline metrics (CPU/GPU usage, memory footprint) for single-process vs. multi-process setups.

  • Provide simple charts or tables comparing key performance indicators (KPIs).

4. Show Practical Insight:

Deliver a short, data-driven report outlining findings, trade-offs, and next steps.
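The artificial-delay idea behind Objective 2 can be sketched in a few lines. `simulated_send` below is a hypothetical helper for illustration, not code from this repository:

```python
import time

def simulated_send(payload, delay_ms):
    """Simulate sending `payload` over a constrained edge link by
    sleeping for the configured one-way delay before delivery."""
    time.sleep(delay_ms / 1000.0)  # artificial network delay
    return payload

# Example: one simulated 5 ms hop
start = time.perf_counter()
result = simulated_send({"tokens": [1, 2, 3]}, delay_ms=5.0)
elapsed_ms = (time.perf_counter() - start) * 1000
```

Because the delay is injected at the send boundary, the same pattern can wrap any inter-process call without touching the model code.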

Setup Instructions

Prerequisites

  • Python 3.11 or higher
  • uv package installer
  • (Optional) NVIDIA GPU with CUDA support for faster inference

Environment Setup

  1. Clone the repository:

    git clone <repository-url>
    cd moe_sim_deepspeed
  2. Create and activate virtual environment using uv:

    # Create virtual environment
    uv venv
    
    # Activate virtual environment
    source .venv/bin/activate  # On Unix/macOS
    # OR
    .venv\Scripts\activate     # On Windows
  3. Install dependencies:

    uv pip install -r requirements.txt
  4. (Optional) Set environment variables:

    # Disable tokenizer parallelism warnings
    export TOKENIZERS_PARALLELISM=false  # On Unix/macOS
    # OR
    set TOKENIZERS_PARALLELISM=false     # On Windows

Verification

To verify the installation, you can run:

python -c "import torch; import deepspeed; import wandb; print('Setup successful!')"

If you see "Setup successful!" without any errors, your environment is ready for the MoE inference experiments.

Running the Base Inference Script

  1. Make sure your virtual environment is activated:

    source .venv/bin/activate  # On Unix/macOS
    # OR
    .venv\Scripts\activate     # On Windows
  2. Set up Hugging Face authentication:

    # Login to Hugging Face (you'll need an account and access to Llama models)
    huggingface-cli login
  3. Run the base inference script:

    python base_inference.py

The script will automatically detect and use the best available device (CUDA, MPS, or CPU) for inference.
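The fallback order can be sketched as follows. `pick_device` is a hypothetical helper; the real availability checks (`torch.cuda.is_available()`, `torch.backends.mps.is_available()`) are noted in the docstring so the sketch stays dependency-free:

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return the device string in CUDA -> MPS -> CPU fallback order.
    In the actual script these flags would come from
    torch.cuda.is_available() and torch.backends.mps.is_available()."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```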

Experimentation Modes

The distributed inference system supports three experimentation modes, plus an orchestration script that runs them all, to test different aspects of the MoE setup:

1. Single Run Mode (Basic Testing)

python distributed_inference.py --mode single --num_experts 3 --latency 5.0 --num_runs 5

This mode:

  • Runs with a specified number of expert processes (default: 3)
  • Adds configurable artificial latency (default: 5ms)
  • Performs multiple inference runs
  • Logs results to Weights & Biases
  • Shows per-run metrics and averages

Use this mode for:

  • Initial testing and verification
  • Quick performance checks
  • Debugging specific configurations
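A pool of expert processes with injected latency might look like the sketch below, assuming queue-based dispatch. The names `expert_worker` and `run_distributed` are illustrative, not the repository's API, and the "forward pass" is a stand-in:

```python
import multiprocessing as mp
import time

def expert_worker(expert_id, task_q, result_q, latency_ms):
    """One expert process: pull a request, pay a simulated network hop,
    run a dummy 'forward pass' (here: doubling), and reply."""
    while True:
        task = task_q.get()
        if task is None:                      # poison pill: shut down
            break
        time.sleep(latency_ms / 1000.0)       # artificial network delay
        result_q.put((expert_id, task * 2))   # stand-in for real inference

def run_distributed(num_experts=3, latency_ms=5.0, tasks=range(6)):
    task_q, result_q = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=expert_worker,
                          args=(i, task_q, result_q, latency_ms))
               for i in range(num_experts)]
    for w in workers:
        w.start()
    tasks = list(tasks)
    for t in tasks:
        task_q.put(t)
    results = [result_q.get() for _ in tasks]  # gather all replies
    for _ in workers:
        task_q.put(None)
    for w in workers:
        w.join()
    return sorted(out for _, out in results)

if __name__ == "__main__":
    print(run_distributed())
```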

2. Latency Experiment Mode (Network Impact Study)

python distributed_inference.py --mode latency

This mode:

  • Tests different latency settings (0ms, 5ms, 10ms, 20ms)
  • Uses default 3 experts
  • Runs multiple inference runs for each latency setting
  • Helps understand how network latency affects distributed inference performance

Use this mode to:

  • Study the impact of network conditions
  • Optimize communication strategies
  • Determine latency thresholds
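The sweep this mode performs can be approximated with a toy timing loop. `timed_inference` is a stand-in that only pays simulated network hops, not a real model call:

```python
import time

def timed_inference(latency_ms: float, rounds: int = 4) -> float:
    """Dummy inference that pays one simulated network hop per round;
    returns wall-clock time in milliseconds."""
    start = time.perf_counter()
    for _ in range(rounds):
        time.sleep(latency_ms / 1000.0)
    return (time.perf_counter() - start) * 1000

# Sweep the same settings the latency mode uses
sweep = {lat: timed_inference(lat) for lat in (0, 5, 10, 20)}
```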

3. Expert Count Experiment Mode (Scaling Study)

python distributed_inference.py --mode experts

This mode:

  • Tests different numbers of experts (2, 3, 4)
  • Uses default 5ms latency
  • Runs multiple inference runs for each expert count
  • Helps understand how scaling the number of experts affects performance

Use this mode to:

  • Study system scalability
  • Optimize expert count
  • Balance resource utilization
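The scaling effect this mode measures can be illustrated with a self-contained sketch; threads stand in for the expert processes here, and the per-item sleep stands in for expert compute:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_with_experts(num_experts, work_items=12, item_ms=5.0):
    """Split a fixed batch of dummy work across `num_experts` workers
    and return the wall-clock time in milliseconds."""
    def fake_expert(_):
        time.sleep(item_ms / 1000.0)  # pretend per-item compute
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_experts) as pool:
        list(pool.map(fake_expert, range(work_items)))
    return (time.perf_counter() - start) * 1000

# Same expert counts the experiment sweeps
timings = {n: run_with_experts(n) for n in (2, 3, 4)}
```

With ideal parallelism, doubling the workers halves the batch time; in the real system, communication overhead eats into that gain as the expert count grows.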

4. Orchestrated Experiments (Comprehensive Analysis)

For a comprehensive analysis of all aspects of the MoE system, use the orchestration script:

python run_experiments.py --mode all --num_runs 5 --expert_counts 2,3,4 --latencies 0,5,10,20

This script:

  • Runs all experiments (base inference, distributed inference with different expert counts and latencies)
  • Collects resource usage metrics
  • Generates comprehensive reports with visualizations
  • Logs results to Weights & Biases

You can also run specific experiments:

# Run only the expert count experiment
python run_experiments.py --mode expert_count --num_runs 5 --expert_counts 2,3,4

# Run only the latency experiment
python run_experiments.py --mode latency --num_runs 5 --latencies 0,5,10,20

# Run only the comparison experiment
python run_experiments.py --mode comparison --num_runs 5 --expert_counts 2,3,4 --latencies 0,5,10,20

The results will be stored in the specified output directory (default: experiment_results), and reports will be generated in the reports subdirectory.

Key Findings and Insights

Our experiments reveal several important insights about MoE inference in edge environments:

  1. Latency Thresholds: There's a critical latency threshold beyond which distributed inference becomes slower than single-process inference. This threshold varies based on model size and hardware capabilities.

  2. Expert Count Optimization: Adding more experts doesn't always improve performance. There's an optimal number of experts that balances parallelism with communication overhead.

  3. Resource Efficiency: Distributed inference can be more memory-efficient per process, but the total system memory usage increases with the number of experts.

  4. Communication Patterns: The way experts communicate (synchronously vs. asynchronously) significantly impacts overall performance, especially under high latency conditions.

These findings help guide the design of efficient MoE systems for edge deployment, where network constraints are a significant factor.
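The first two findings can be illustrated with a toy cost model. All numbers below are illustrative, not measured results:

```python
def distributed_time_ms(single_time_ms, num_experts, latency_ms, round_trips=2):
    """Toy cost model: compute shrinks with the number of experts,
    while communication grows with latency and fan-out."""
    compute = single_time_ms / num_experts          # ideal parallel speedup
    comms = round_trips * latency_ms * num_experts  # scatter + gather cost
    return compute + comms

single = 300.0
# At low latency, 3 experts beat single-process: 100 + 30 = 130 ms
assert distributed_time_ms(single, 3, 5.0) < single
# Past the threshold they lose: 100 + 240 = 340 ms
assert distributed_time_ms(single, 3, 40.0) > single
```

The break-even latency is exactly the threshold described in finding 1, and the `num_experts` term in the communication cost is why finding 2 sees diminishing returns.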

Metrics and Analysis

For each run, you'll see:

  • Per-run metrics:
    • Generated text
    • Processing time
    • Communication time
    • Total time
    • Tokens per second
  • Summary statistics:
    • Average processing time
    • Average communication time
    • Average total time
    • Average tokens per second

All results are automatically logged to Weights & Biases for visualization and analysis.
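Deriving the summary statistics from the per-run metrics is straightforward; the tuples below are illustrative numbers, not measured results:

```python
from statistics import mean

runs = [
    # (processing_ms, communication_ms, tokens_generated) - illustrative
    (120.0, 15.0, 32),
    (110.0, 18.0, 32),
    (130.0, 12.0, 32),
]

def summarize(runs):
    totals = [p + c for p, c, _ in runs]  # total = processing + communication
    return {
        "avg_processing_ms": mean(p for p, _, _ in runs),
        "avg_communication_ms": mean(c for _, c, _ in runs),
        "avg_total_ms": mean(totals),
        "avg_tokens_per_s": mean(t / (tot / 1000.0)
                                 for (_, _, t), tot in zip(runs, totals)),
    }

summary = summarize(runs)
```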

Hardware Acceleration

The script supports multiple hardware acceleration options:

  • CUDA: For NVIDIA GPUs (fastest option)
  • MPS: For Apple Silicon Macs
  • CPU: Fallback option for systems without GPU support

To ensure CUDA support, make sure you have the appropriate NVIDIA drivers installed and PyTorch with CUDA support.
