This repository contains the code for the CS5229 NS3 Programming Assignment, which is part of the CS5229 course at NUS. The assignment focuses on implementing a distributed machine learning system simulator using NS3.
To run this project, we recommend using Docker to set up the environment. You can find the Dockerfile in the root directory of this repository.
Please ensure that Docker is installed on your machine.
The setup should work on Linux, macOS, and Windows. However, the Desktop version of Docker may be unstable on Linux and Windows, so we recommend using the command-line version instead. (If you are not familiar with Docker, you may need to learn some basic Docker commands first.)
The Astra-sim project has a known issue affecting End-to-End tests (workloads with ML training jobs).
During execution, the program may get stuck at a certain point. In such cases, the fct.txt file will stop recording new flow completion times.
Under baseline configurations, the 32-node End-to-End workload should finish in less than 20 minutes, and the 128-node workload should finish in under 30 minutes. If your run does not complete within 40 minutes, it is likely stuck.
Unfortunately, we have not been able to resolve this issue, as it originates from the internal implementation of Astra-sim, which is maintained upstream.
Workaround:
- Stop the execution manually (Ctrl+C).
- Modify the random seed in the build script (e.g., change RANDOM_SEEDS=(1) to RANDOM_SEEDS=(2)), and re-run the test.
- To save time, you can run multiple seeds concurrently on different threads and keep the runs that complete successfully.
- We recommend using this approach only for final evaluations. For early-stage development and testing, please start with microbenchmarks or smaller workloads.
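The stuck-run symptom described above (fct.txt no longer receiving new entries) can be checked mechanically instead of by eye. The following is a minimal Python sketch, not part of the repository: the function name `is_stalled` and the 10-minute default threshold are illustrative assumptions, and you need to pass the path where your run actually writes fct.txt.

```python
import os
import time


def is_stalled(path, stall_seconds=600, now=None):
    """Return True if `path` has not been modified for `stall_seconds`.

    A missing file counts as "not stalled", because the run may simply
    not have written its first flow completion time yet.
    `now` can be injected for testing; it defaults to the current time.
    """
    if not os.path.exists(path):
        return False
    if now is None:
        now = time.time()
    # Compare the file's last modification time against the threshold.
    return (now - os.path.getmtime(path)) >= stall_seconds
```

Calling this periodically from a separate terminal, together with the 40-minute wall-clock limit mentioned above, gives an early hint that a run should be stopped and restarted with a different seed.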
- Clone this repository to your local machine.
- Navigate to the project directory.
- Build the Docker image by running the following command in the terminal:
    docker build -t cs5229-astra-sim .
- After the image is built, start a container with the following command. Note that the current local directory must be mounted into the container.
    docker run -dit --name cs5229-astra-sim --mount type=bind,source=[the absolute path of the current directory],target=/app/astra-sim cs5229-astra-sim:latest
- After starting the container, enter it with:
    docker exec -it cs5229-astra-sim bash
- To leave the container, type exit in the terminal.
- In the container, navigate to the /app/astra-sim directory, where you can find all the code and scripts.
- To run the experiments, use the scripts under the build_scripts/astra_ns3/ directory, following the instructions in the README.md file under that directory. (Note that re-running the same experiment overwrites the previous output, so back up your results if needed.)
- After running the experiments, you can find the output files under the extern/network_backend/ns-3/scratch/output/ directory. For a detailed explanation of the output files, refer to the extern/network_backend/ns-3/scratch/output_document/ directory.
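Because a re-run overwrites earlier output, it can help to copy the output directory aside before starting a new experiment. The sketch below is one possible way to do that in Python; the helper name `backup_output`, the backup location, and the timestamp naming scheme are assumptions made for illustration and do not exist in the repository.

```python
import shutil
import time
from pathlib import Path


def backup_output(output_dir, backup_root, label=None):
    """Copy `output_dir` into a new folder under `backup_root`.

    The copy is named "<directory name>-<label>", where the label
    defaults to the current local time (YYYYmmdd-HHMMSS).
    Returns the path of the copy.
    """
    src = Path(output_dir)
    if not src.is_dir():
        raise FileNotFoundError(f"output directory not found: {src}")
    stamp = label or time.strftime("%Y%m%d-%H%M%S")
    dest = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree creates missing parent folders and refuses to overwrite
    # an existing destination, so an earlier backup is never clobbered.
    shutil.copytree(src, dest)
    return dest
```

For example, `backup_output("extern/network_backend/ns-3/scratch/output", "results_backup", label="baseline-seed1")` would keep a labelled copy of a baseline run before you change a configuration and run again.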
Since the project involves a lot of things to track, in case you get lost, here is a brief overview of the workflow:
- Set up the environment and make sure the baseline runs.
- Observe the baseline performance from the results in the output directory, and think about the potential problems.
- Verify that each suspected problem really exists (e.g. check the queue occupancy, link utilization, etc.).
- After confirming the problem, think about potential solutions and implement them.
- When you implement a solution, decide which category it belongs to, find the corresponding files listed below, and modify them. (You may also modify other files, but please do not break the fundamental assumptions on infrastructure, e.g. link speed, topology, etc.)
- After implementing the solution, run experiments to see whether it works. When testing, remember to check the config files, build scripts, output files, etc., to make sure everything is set up correctly and the evaluation is fair.
- Based on the results, do more analysis to see whether the problem has really been solved and whether any new problems have appeared.
- When you have finalized your solution and experiment results, prepare your report. Include enough visualizations to make your report convincing.
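Two quantities that come up repeatedly in the workflow above are link utilization and tail flow completion time. The sketch below shows how they can be computed once you have extracted the raw numbers from the output files. It deliberately takes plain numbers as input, since the output file formats are documented separately; the function names are illustrative and not part of the repository.

```python
import math


def link_utilization(bytes_sent, link_bps, interval_s):
    """Fraction of link capacity used during the interval.

    bytes_sent: bytes transmitted on the link in the interval
    link_bps:   link capacity in bits per second
    interval_s: length of the observation interval in seconds
    """
    if link_bps <= 0 or interval_s <= 0:
        raise ValueError("link_bps and interval_s must be positive")
    return (bytes_sent * 8) / (link_bps * interval_s)


def percentile(values, p):
    """Nearest-rank percentile of `values`, with p in (0, 100]."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Nearest-rank method: the smallest value with at least p% of the
    # samples at or below it.
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[rank - 1]
```

Comparing a high percentile (for example the 99th) of flow completion times before and after a change is usually more informative than comparing averages, because congestion problems tend to show up in the tail first.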
We understand that a good IDE can greatly enhance productivity. You can use Visual Studio Code with the Dev Containers extension (formerly named Remote - Containers) to develop inside the Docker container. This lets you use the features of VS Code while working within the containerized environment.
When the container is running, open VS Code and use the "Dev Containers: Attach to Running Container..." command from the Command Palette (Ctrl+Shift+P) to connect to the cs5229-astra-sim container.
Inside the container, you can open the folder /app/astra-sim to access the project files. You can also open a terminal in VS Code, which will be inside the container.
The project is based on CMake, but NS3 is large and not organized in a standard way, and we found that the CMake extension does not work properly with it. Use the terminal to build and run the code instead.
The network functions you can edit, and the corresponding files are listed below:
(The NS3 source files are under extern/ns-3/)
- Load Balancing Algorithm: src/point-to-point/model/load-balancing-customized.* (see load-balancing-ecmp.* for an example)
- Host-side Congestion Control Algorithm: src/point-to-point/model/rdma-hw.*
- Switch-side Congestion Control (e.g. ECN): src/point-to-point/model/switch-mmu.*
- Queue Management / Scheduling Algorithm: src/network/utils/broadcom-egress-queue.* (note that this one is in a different location)
Please note that these are recommendations. If your solution involves modifying other files, that is perfectly fine, as long as you do not break the fundamental assumptions on infrastructure (e.g. link speed, topology, etc.).
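For readers new to load balancing, the idea behind hash-based ECMP can be illustrated in a few lines. The Python sketch below is a generic illustration only: it is not the code in load-balancing-ecmp.*, it does not use the NS3 interfaces, and the function name and the choice of CRC32 as the hash are assumptions made for the example.

```python
import zlib


def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, protocol, next_hops):
    """Pick a next hop by hashing the flow's five-tuple (ECMP-style).

    All packets of one flow hash to the same value, so they follow the
    same path and are not reordered.
    """
    if not next_hops:
        raise ValueError("next_hops must not be empty")
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{protocol}".encode()
    # CRC32 is deterministic across runs, unlike Python's built-in hash().
    return next_hops[zlib.crc32(key) % len(next_hops)]
```

The design trade-off is that the choice ignores load: two large flows can hash onto the same link while other links stay idle. That kind of imbalance is a typical motivation for writing a customized load-balancing algorithm.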
The configuration files are under extern/ns-3/scratch/config/. You can add your own configuration files here, and modify the scripts under build_scripts/astra_ns3/ to run your experiments.
Note on parameter tuning: You can modify the parameters in the configuration files. However, in the report you should clearly state which parameters you modified and why, and you must make sure the comparison is fair. For example, if you are comparing two load balancing algorithms, the network topology and all other parameters must be the same. Only a fair comparison can lead to meaningful conclusions.
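One way to keep a comparison fair is to list exactly which parameters differ between two configuration files before running them. The sketch below assumes each configuration line has the form "KEY VALUE" separated by whitespace, with blank lines and lines starting with # ignored; check this against the actual files, since the format is an assumption here. The function names are illustrative and not part of the repository.

```python
def parse_config(text):
    """Parse 'KEY VALUE...' lines into a dict of strings.

    Blank lines and lines starting with '#' are skipped. Runs of
    whitespace inside a value are collapsed to single spaces.
    """
    params = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        value = parts[1] if len(parts) > 1 else ""
        params[parts[0]] = " ".join(value.split())
    return params


def diff_configs(baseline_text, candidate_text):
    """Return {key: (baseline_value, candidate_value)} for differing keys.

    A key missing from one side is reported with None on that side.
    """
    base = parse_config(baseline_text)
    cand = parse_config(candidate_text)
    return {
        key: (base.get(key), cand.get(key))
        for key in sorted(base.keys() | cand.keys())
        if base.get(key) != cand.get(key)
    }
```

If the result contains anything beyond the parameters you intended to change, the two runs are not directly comparable. The returned dictionary can also be pasted into the report as the list of modified parameters.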
- The network setup steps are defined in extern/ns-3/scratch/common.h.
- The main function is defined in astra-sim/network_frontend/AstraSimNetwork.cc.
- The Astra-sim folder astra-sim/ contains the code that compiles and executes the workload. Since this part is a fundamental assumption of the workload, do not modify it.
- src/point-to-point/model/rdma-* contains the RDMA model in NS3, which simulates RDMA communication. Refer to it if you want to understand how RDMA works.
- src/point-to-point/model/switch-* contains the switch model in NS3, which simulates switch behavior. The high-level switch behavior is defined in switch-node.*, while the more detailed switch behavior is defined in switch-mmu.*.
This assignment is based on the ASTRA-sim 2.0 simulator, which is a distributed machine learning system simulator by Intel, Meta, and Georgia Tech. For more information about ASTRA-sim, please refer to the ASTRA-sim GitHub repository.
If you are interested in learning more about the project, you can visit their website.
Thanks to the ASTRA-sim team for their contributions to the field of distributed machine learning system simulation. Their work makes it possible to experiment with distributed machine learning in a network simulator like NS3, and gives us a great opportunity to learn and practice without setting up a hardware environment.
This project is licensed under the MIT License; see the LICENSE file for details. We have kept the original author information and code ownership statements. We have made some modifications to the original code to adapt it to the assignment requirements. Please refer to the comments in the code for more details.