Author: Tan LE (David) - EPITA
This project explores Mixture-of-Experts (MoE) inference in a simulated edge environment, focusing on how distributed or partitioned models can reduce, or in some cases increase, inference latency once network constraints are introduced. By combining DeepSpeed, Python's multiprocessing, and artificial network delays, we gather insights into the trade-offs between scaling performance and managing communication overhead.
1. Evaluate Distributed MoE Inference:
Demonstrate how partitioning or scaling Mixture-of-Experts models affects inference latency and throughput, even in a simulated environment.
2. Investigate Network & Scheduling Overheads:
- Introduce artificial network delays to reflect edge deployment constraints.
- Show how concurrency or scheduling strategies (e.g., distributing load across experts) impact performance.
3. Profile Resource Usage:
- Collect baseline metrics (CPU/GPU usage, memory footprint) for single-process vs. multi-process setups (a minimal profiling sketch follows this list).
- Provide simple charts or tables comparing key performance indicators (KPIs).
4. Show Practical Insight:
Deliver a short, data-driven report outlining findings, trade-offs, and next steps.
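As an illustration of the kind of resource profiling meant in objective 3, the sketch below samples elapsed time, CPU usage, and memory footprint around an arbitrary workload with `psutil`. The function name and the placeholder workload are assumptions for demonstration, not part of the project code.

```python
import time
import psutil

def profile_resources(workload, *args, **kwargs):
    """Run a callable and report elapsed time plus CPU/memory usage of this process."""
    process = psutil.Process()
    process.cpu_percent(interval=None)          # prime the CPU counter
    rss_before = process.memory_info().rss

    start = time.perf_counter()
    result = workload(*args, **kwargs)
    elapsed = time.perf_counter() - start

    rss_after = process.memory_info().rss
    cpu = process.cpu_percent(interval=None)    # average CPU usage since the prime call
    print(f"elapsed: {elapsed:.3f}s, cpu: {cpu:.1f}%, "
          f"rss: {rss_after / 1e6:.1f} MB (delta {(rss_after - rss_before) / 1e6:+.1f} MB)")
    return result

if __name__ == "__main__":
    profile_resources(sum, range(10_000_000))   # placeholder workload
```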
- Python 3.11 or higher
- uv package installer
- (Optional) NVIDIA GPU with CUDA support for faster inference
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd moe_sim_deepspeed
  ```
- Create and activate the virtual environment using uv:

  ```bash
  # Create virtual environment
  uv venv

  # Activate virtual environment
  source .venv/bin/activate   # On Unix/macOS
  # OR
  .venv\Scripts\activate      # On Windows
  ```
- Install dependencies:

  ```bash
  uv pip install -r requirements.txt
  ```
- (Optional) Set environment variables:

  ```bash
  # Disable tokenizer parallelism warnings
  export TOKENIZERS_PARALLELISM=false   # On Unix/macOS
  # OR
  set TOKENIZERS_PARALLELISM=false      # On Windows
  ```
To verify the installation, you can run:

```bash
python -c "import torch; import deepspeed; import wandb; print('Setup successful!')"
```

If you see "Setup successful!" without any errors, your environment is ready for the MoE inference experiments.
- Make sure your virtual environment is activated:

  ```bash
  source .venv/bin/activate   # On Unix/macOS
  # OR
  .venv\Scripts\activate      # On Windows
  ```
- Set up Hugging Face authentication:

  ```bash
  # Login to Hugging Face (you'll need an account and access to Llama models)
  huggingface-cli login
  ```

- Run the base inference script:

  ```bash
  python base_inference.py
  ```
The script will automatically detect and use the best available device (CUDA, MPS, or CPU) for inference.
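The selection logic is essentially the standard PyTorch device-detection pattern sketched below; the actual script may differ in details.

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")
```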
The distributed inference system supports three main experimentation modes to test different aspects of the MoE (Mixture of Experts) setup:
```bash
python distributed_inference.py --mode single --num_experts 3 --latency 5.0 --num_runs 5
```

This mode:
- Runs with a specified number of expert processes (default: 3)
- Adds configurable artificial latency (default: 5ms)
- Performs multiple inference runs
- Logs results to Weights & Biases
- Shows per-run metrics and averages
Use this mode for:
- Initial testing and verification
- Quick performance checks
- Debugging specific configurations
```bash
python distributed_inference.py --mode latency
```

This mode:
- Tests different latency settings (0ms, 5ms, 10ms, 20ms)
- Uses default 3 experts
- Runs multiple inference runs for each latency setting
- Helps understand how network latency affects distributed inference performance (a minimal delay-injection sketch follows this section)
Use this mode to:
- Study the impact of network conditions
- Optimize communication strategies
- Determine latency thresholds
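For intuition, the artificial delay can be as simple as sleeping for the configured number of milliseconds around each expert call. The sketch below is an assumption about how such a delay might be injected, not the project's exact implementation; `expert_fn` is a hypothetical callable standing in for an expert process.

```python
import time

def call_expert_with_latency(expert_fn, payload, latency_ms: float):
    """Simulate a network round-trip by sleeping before and after the expert call."""
    one_way = (latency_ms / 1000.0) / 2
    time.sleep(one_way)            # request travels to the expert
    result = expert_fn(payload)    # expert does its work
    time.sleep(one_way)            # response travels back
    return result
```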
```bash
python distributed_inference.py --mode experts
```

This mode:
- Tests different numbers of experts (2, 3, 4)
- Uses default 5ms latency
- Runs multiple inference runs for each expert count
- Helps understand how scaling the number of experts affects performance (a minimal multiprocessing sketch follows this section)
Use this mode to:
- Study system scalability
- Optimize expert count
- Balance resource utilization
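A minimal sketch of how work might be fanned out across expert processes with Python's multiprocessing; the function names and placeholder "expert" are hypothetical, and the real `distributed_inference.py` is more involved.

```python
from multiprocessing import Pool

def expert_worker(task):
    """Placeholder expert: in the real system this would run a model shard."""
    expert_id, prompt = task
    return f"expert {expert_id} processed: {prompt}"

def run_experts(prompt: str, num_experts: int = 3):
    """Dispatch the same prompt to a pool of expert processes and collect their outputs."""
    tasks = [(i, prompt) for i in range(num_experts)]
    with Pool(processes=num_experts) as pool:
        return pool.map(expert_worker, tasks)

if __name__ == "__main__":
    for output in run_experts("hello", num_experts=3):
        print(output)
```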
For a comprehensive analysis of all aspects of the MoE system, use the orchestration script:
```bash
python run_experiments.py --mode all --num_runs 5 --expert_counts 2,3,4 --latencies 0,5,10,20
```

This script:
- Runs all experiments (base inference, distributed inference with different expert counts and latencies)
- Collects resource usage metrics
- Generates comprehensive reports with visualizations
- Logs results to Weights & Biases
You can also run specific experiments:
```bash
# Run only the expert count experiment
python run_experiments.py --mode expert_count --num_runs 5 --expert_counts 2,3,4

# Run only the latency experiment
python run_experiments.py --mode latency --num_runs 5 --latencies 0,5,10,20

# Run only the comparison experiment
python run_experiments.py --mode comparison --num_runs 5 --expert_counts 2,3,4 --latencies 0,5,10,20
```

The results will be stored in the specified output directory (default: `experiment_results`), and reports will be generated in the `reports` subdirectory.
Our experiments reveal several important insights about MoE inference in edge environments:
- Latency Thresholds: There's a critical latency threshold beyond which distributed inference becomes slower than single-process inference. This threshold varies based on model size and hardware capabilities.
- Expert Count Optimization: Adding more experts doesn't always improve performance. There's an optimal number of experts that balances parallelism with communication overhead.
- Resource Efficiency: Distributed inference can be more memory-efficient per process, but total system memory usage increases with the number of experts.
- Communication Patterns: The way experts communicate (synchronously vs. asynchronously) significantly impacts overall performance, especially under high latency conditions (see the sketch below).
These findings help guide the design of efficient MoE systems for edge deployment, where network constraints are a significant factor.
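To illustrate the synchronous-vs-asynchronous point: when each expert call includes a simulated network delay, issuing the calls concurrently hides much of that delay, while calling the experts one after another pays it once per expert. The sketch below uses hypothetical helpers and a thread pool purely to show the difference; it is not the project's communication layer.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def expert_call(expert_id: int, latency_s: float = 0.02) -> int:
    time.sleep(latency_s)          # simulated network delay per expert
    return expert_id

def sequential(n: int) -> float:
    """Call each expert in turn: total delay grows linearly with n."""
    start = time.perf_counter()
    for i in range(n):
        expert_call(i)
    return time.perf_counter() - start

def parallel(n: int) -> float:
    """Call all experts concurrently: delays overlap instead of adding up."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(expert_call, range(n)))
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"sequential: {sequential(4):.3f}s, parallel: {parallel(4):.3f}s")
```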
For each run, you'll see:
- Per-run metrics:
  - Generated text
  - Processing time
  - Communication time
  - Total time
  - Tokens per second
- Summary statistics:
  - Average processing time
  - Average communication time
  - Average total time
  - Average tokens per second
All results are automatically logged to Weights & Biases for visualization and analysis.
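If you want to log additional metrics yourself, the standard Weights & Biases pattern looks roughly like the sketch below; the project name, config keys, and metric values are illustrative placeholders, not output from the experiments.

```python
import wandb

# Dummy per-run metrics for demonstration only; in practice these come from the inference runs.
results = [
    {"processing_time": 0.0, "tokens_per_second": 0.0},
    {"processing_time": 0.0, "tokens_per_second": 0.0},
]

run = wandb.init(project="moe_sim_deepspeed", config={"num_experts": 3, "latency_ms": 5.0})
for step, metrics in enumerate(results):
    wandb.log(metrics, step=step)   # each call adds one point per metric to the run
run.finish()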
The script supports multiple hardware acceleration options:
- CUDA: For NVIDIA GPUs (fastest option)
- MPS: For Apple Silicon Macs
- CPU: Fallback option for systems without GPU support
To ensure CUDA support, make sure you have the appropriate NVIDIA drivers installed and PyTorch with CUDA support.
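A quick way to check whether PyTorch can see your GPU (prints `True` and the CUDA version when GPU support is available):

```bash
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
```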