CS5229 NS3 Programming Assignment

This repository contains the code for the CS5229 NS3 Programming Assignment, which is part of the CS5229 course at NUS. The assignment focuses on implementing a distributed machine learning system simulator using NS3.

Getting Started

To run this project, we recommend using Docker to set up the environment. You can find the Dockerfile in the root directory of this repository.

Please ensure that Docker is installed on your machine.

The setup should work on Linux, macOS, and Windows. However, please note that the Desktop version of Docker may be unstable on Linux and Windows, so we recommend using the command-line version instead. (If you are not familiar with Docker, you may need to learn some basic Docker commands first.)

One Known Issue

The ASTRA-sim project has a known issue affecting End-to-End tests (workloads with ML training jobs).

During execution, the program may hang partway through. When this happens, the fct.txt file stops recording new flow completion times.

Under baseline configurations, the 32-node End-to-End workload should finish in less than 20 minutes, and the 128-node workload should finish in under 30 minutes. If your run does not complete within 40 minutes, it is likely stuck.

Unfortunately, we have not been able to resolve this issue, as it originates from the internal implementation of ASTRA-sim, which is maintained upstream.
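
One way to check for a stall without watching the file by hand is to look at when fct.txt was last modified. The following is a minimal sketch and not part of the repository: the helper name is_stalled and the 10-minute idle threshold are assumptions, and FCT_FILE is a placeholder for the actual path of fct.txt in your output directory.

```bash
# Succeeds (exit status 0) when FILE exists and was last modified
# more than MINUTES minutes ago; fails otherwise.
is_stalled() {
  local file="$1" minutes="$2"
  [ -n "$(find "$file" -maxdepth 0 -mmin +"$minutes" 2>/dev/null)" ]
}

# Example (FCT_FILE is a placeholder for the path to your run's fct.txt):
# if is_stalled "$FCT_FILE" 10; then
#   echo "fct.txt has not changed for over 10 minutes; the run may be stuck"
# fi
```

A long idle period only suggests a hang, so treat the 40-minute guideline as the main criterion before stopping a run.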

Workaround:

Stop the execution manually (Ctrl+C).
Modify the random seed in the build script (e.g., change RANDOM_SEEDS=(1) to RANDOM_SEEDS=(2)), and re-run the test.
To save time, you can run multiple seeds concurrently (for example, one process per seed) and keep the runs that complete successfully.
We recommend using this approach only for final evaluations. For early-stage development and testing, please start with microbenchmarks or smaller workloads.
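
The concurrent-seed idea can be sketched as a small shell helper. This is only an illustration under assumptions: run_seeds is a hypothetical name, and ./run_one_seed.sh stands for a wrapper you would write yourself that takes the seed as its last argument (the build script sets the seed through RANDOM_SEEDS inside the script instead). Each concurrent run would probably also need its own output location so the runs do not overwrite each other. The timeout command used here comes from GNU coreutils.

```bash
# Usage: run_seeds LIMIT "SEED SEED ..." CMD [ARG...]
# Starts `CMD ARG... SEED` once per seed in the background, each under
# `timeout LIMIT`, then reports which seeds exited with status 0.
run_seeds() {
  local limit="$1" seeds="$2" seed entry started=""
  shift 2
  for seed in $seeds; do
    timeout "$limit" "$@" "$seed" &
    started="$started $seed:$!"   # remember the seed together with its PID
  done
  for entry in $started; do
    if wait "${entry#*:}"; then
      echo "seed ${entry%%:*}: completed"
    else
      echo "seed ${entry%%:*}: failed or timed out"
    fi
  done
}

# Example with the hypothetical wrapper script and a 40-minute limit per run:
# run_seeds 40m "1 2 3" ./run_one_seed.sh
```

The 40-minute limit mirrors the guideline above for deciding that a run is stuck; a timed-out run is stopped automatically instead of with Ctrl+C.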

Build and Run

Clone this repository to your local machine.
Navigate to the project directory.
Building the Docker image is simple. Run the following command in the terminal:
```
docker build -t cs5229-astra-sim .
```
After the image is built, you can start a container using the following command. Note that the current local directory must be mounted into the container.
```
docker run -dit --name cs5229-astra-sim --mount type=bind,source=[the absolute path of the current directory],target=/app/astra-sim cs5229-astra-sim:latest
```
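If you would rather not type the absolute path by hand, the mount argument can be generated from the current directory. This is a small sketch, assuming a POSIX-style shell (Linux, macOS, or a similar shell on Windows); mount_arg is a hypothetical helper name, and the target path is the one used in the command above.

```bash
# Print a --mount argument that binds DIR (default: the current directory),
# resolved to an absolute path, to /app/astra-sim in the container.
mount_arg() {
  local dir
  dir=$(cd -- "${1:-.}" >/dev/null 2>&1 && pwd) || return 1
  printf 'type=bind,source=%s,target=/app/astra-sim\n' "$dir"
}

# Example, run from the project directory:
# docker run -dit --name cs5229-astra-sim --mount "$(mount_arg)" cs5229-astra-sim:latest
```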
After starting the container, you can enter it by:
```
docker exec -it cs5229-astra-sim bash
```
To leave the container, just type exit in the terminal.
In the container, you can navigate to the /app/astra-sim directory, where you can find all the code and scripts.
To run the experiments, use the scripts under the build_scripts/astra_ns3/ directory. You can refer to the instructions in the README.md file under that directory. (Please note that re-running the same experiment overwrites the previous output, so back up your results if needed.)
After running the experiments, you can find the output files under the extern/network_backend/ns-3/scratch/output/ directory. For a detailed explanation of the output files, please refer to the extern/network_backend/ns-3/scratch/output_document/ directory.
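
Because a re-run overwrites the previous output, it can help to copy the output directory aside before starting the next experiment. The helper below is a minimal sketch, not something shipped with the repository: backup_output is a hypothetical name, and the backup location in the example is an assumption you should adapt.

```bash
# Usage: backup_output SRC_DIR DEST_ROOT [LABEL]
# Copies SRC_DIR to DEST_ROOT/LABEL_<timestamp> and prints the new path.
backup_output() {
  local src="$1" dest_root="$2" label="${3:-run}"
  local dest
  dest="$dest_root/${label}_$(date +%Y%m%d_%H%M%S)"
  if [ -e "$dest" ]; then
    echo "refusing to overwrite existing backup: $dest" >&2
    return 1
  fi
  mkdir -p "$dest_root" && cp -R "$src" "$dest" && echo "$dest"
}

# Example inside the container (the backups directory is a hypothetical choice):
# backup_output /app/astra-sim/extern/network_backend/ns-3/scratch/output \
#               /app/astra-sim/backups baseline
```

Using a label such as the configuration name makes it easier to match each backup to the experiment that produced it.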

Overall workflow

Because the project has many moving parts, here is a brief overview of the workflow in case you get lost:

Set up the environment and make sure the baseline runs.
Observe the baseline performance from the results in the output directory, and think about potential problems.
Verify that each suspected problem really exists (e.g., check the queue occupancy, link utilization, etc.).
After confirming the problem, think about potential solutions and implement them.
When you implement a solution, identify which category it belongs to, find the corresponding files listed below, and modify them. (You may modify other files as well, but please do not break the fundamental assumptions about the infrastructure, e.g., link speed, topology, etc.)
After implementing the solution, run experiments to see whether it works. When testing, remember to check the config files, build scripts, output files, etc., to make sure everything is set up correctly and the evaluation is fair.
Based on the results, analyze further to see whether the problem has really been solved or whether new problems have appeared.
When you have finalized your solution and experiment results, prepare your report. Include enough visualizations to make your report convincing.

Tips for a smoother workflow

We understand that a good IDE can greatly enhance productivity. You can use Visual Studio Code with the Remote - Containers extension (renamed Dev Containers in newer releases) to develop inside the Docker container. This lets you use the full feature set of VS Code while working within the containerized environment.

When the container is running, open VS Code and use the "Remote - Containers: Attach to Running Container..." command (listed as "Dev Containers: Attach to Running Container..." in newer versions) from the Command Palette (Ctrl+Shift+P) to connect to the cs5229-astra-sim container.

Inside the container, you can open the folder /app/astra-sim to access the project files. You can also open a terminal in VS Code, which will be inside the container.

The project is based on CMake, but because NS3 is large and not organized in a standard way, we found that the CMake extension does not work properly. Use the terminal to build and run the code instead.

Code that you can edit

The network functions you can edit and the corresponding files are listed below:

(The NS3 source files are under extern/ns-3/)

Load Balancing Algorithm: src/point-to-point/model/load-balancing-customized.* (see load-balancing-ecmp.* for an example)
Host-side Congestion Control Algorithm: src/point-to-point/model/rdma-hw.*
Switch-side Congestion Control (e.g. ECN): src/point-to-point/model/switch-mmu.*
Queue Management / Scheduling Algorithm: src/network/utils/broadcom-egress-queue.* (Note this one is in a different location)

Please note that these are recommendations. If your solution involves modifying other files, that is perfectly fine, as long as you do not break the fundamental assumptions about the infrastructure (e.g., link speed, topology, etc.).

The configuration files are under extern/ns-3/scratch/config/. You can add your own configuration files here, and modify the scripts under build_scripts/astra_ns3/ to run your experiments.

Note on parameter tuning: You can modify the parameters in the configuration files. However, your report should clearly state which parameters you have modified and why. Also make sure the comparison is fair. For example, if you are comparing two load balancing algorithms, make sure the network topology and other parameters are the same. Only a fair comparison can lead to meaningful conclusions.
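
To make the list of modified parameters easy to report, you can compare your configuration file against the baseline one line by line. The sketch below makes no assumption about the config file format beyond it being plain text; changed_params is a hypothetical helper name, and the file names in the example are placeholders.

```bash
# Usage: changed_params BASELINE_CONFIG NEW_CONFIG
# Prints only the lines that differ: "<" marks the baseline version,
# ">" marks the new version. Returns non-zero when nothing differs.
changed_params() {
  diff "$1" "$2" | grep '^[<>]'
}

# Example (both file names are placeholders for files under the config directory):
# changed_params baseline_config.txt my_config.txt
```

If the output contains anything beyond the parameters you intended to tune, the two experiments are probably not directly comparable.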

Other code

The network setup steps are defined in extern/ns-3/scratch/common.h.
The main function is defined in astra-sim/network_frontend/AstraSimNetwork.cc.
The ASTRA-sim folder astra-sim/ contains the logic for compiling and executing the workload. Because this part is a fundamental assumption of the workload, do not modify it.
src/point-to-point/model/rdma-* contains the RDMA model in NS3, which is used to simulate the RDMA communication. You can refer to it if you want to understand how RDMA works.
src/point-to-point/model/switch-* contains the switch model in NS3, which is used to simulate the switch behavior. The high-level switch behavior is defined in switch-node.*, while the more detailed behavior is defined in switch-mmu.*.

Reference

This assignment is based on the ASTRA-sim 2.0 simulator, which is a distributed machine learning system simulator by Intel, Meta, and Georgia Tech. For more information about ASTRA-sim, please refer to the ASTRA-sim GitHub repository.

If you are interested in learning more about the project, you can visit their website.

Thanks to the ASTRA-sim team for their contributions to the field of distributed machine learning system simulation. Their work makes it possible to try out distributed machine learning in a simulator like NS3, giving us a great opportunity to learn and practice without the need to set up a hardware environment.

This project is licensed under the MIT License - see the LICENSE file for details. We have kept the author information and code ownership notices. We have made some modifications to the original code to adapt it to the assignment requirements. Please refer to the comments in the code for more details.
