Waris Gill, Ali Anwar, and Muhammad Ali Gulzar. FedDebug: Systematic Debugging for Federated Learning Applications. In Proceedings of the 45th International Conference on Software Engineering (ICSE '23). Association for Computing Machinery, New York, NY, USA.
The arXiv version of the manuscript is available at: FedDebug
FedDebug enables interactive and automated fault localization in Federated Learning (FL) applications. It adapts conventional debugging practices to FL with its breakpoint and fix & replay features, and it offers a novel differential testing technique to automatically identify the precise faulty clients. The FedDebug tool artifact comprises a two-step process:
- Interactive Debugging Constructs integrated with the IBMFL framework via FL simulation in a Docker image
- Automated Faulty Client Localization via Google Colab Notebooks
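The general idea behind differential testing across clients can be illustrated with a small sketch. This is not FedDebug's actual algorithm. It only assumes that each client model can be called as a prediction function and that a faulty client is the one whose outputs disagree most often with the majority on the same inputs. The function `find_suspect_client` and the toy models are hypothetical.

```python
from collections import Counter


def find_suspect_client(client_models, inputs):
    """Return the id of the client that disagrees most often with the majority."""
    disagreements = Counter({cid: 0 for cid in client_models})
    for x in inputs:
        # Run the same input through every client model.
        outputs = {cid: model(x) for cid, model in client_models.items()}
        # The most common output is treated as the reference behavior.
        majority, _ = Counter(outputs.values()).most_common(1)[0]
        for cid, out in outputs.items():
            if out != majority:
                disagreements[cid] += 1
    return max(disagreements, key=disagreements.get)


# Toy setup (hypothetical): four clients agree, one always predicts class 0.
clients = {f"client_{i}": (lambda x: x % 2) for i in range(4)}
clients["client_4"] = lambda x: 0

suspect = find_suspect_client(clients, range(10))
print(suspect)  # client_4
```

The sketch compares final predictions only; a real technique may compare richer signals from inside the models.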
FedDebug's interactive debugging module takes inspiration from traditional debuggers, such as gdb, and enables real-time interactive debugging on a simulation of a live FL application.
Below, we provide a step-by-step walkthrough of FedDebug's interactive debugging features in IBMFL. For ease of installation, we offer a pre-configured Docker image of FedDebug-enabled IBMFL.
To build the Docker image from the Dockerfile, run `docker build -t ibmfl .` in the directory that contains the Dockerfile. To start a container and interact with it, run `docker run -it ibmfl`. Inside the container shell, type `ls` to display the contents of the current directory and check that everything is installed correctly. For more on Docker, see the Docker Tutorial.
In this tutorial, we use tmux to simulate a distributed FL environment in Docker, representing both the FL clients and the aggregator. tmux is a terminal multiplexer that lets you split one terminal screen into several panes, so you can interact with the client and aggregator sides of IBMFL and move seamlessly between the two interfaces. To start tmux, type `tmux` in the terminal. For more on tmux, see the Tmux Quick Tutorial.
To split the screen horizontally, press Ctrl + b followed by Shift + " inside a tmux session. This splits the screen into two terminals (panes), one above the other.
You can move between panes by pressing Ctrl + b followed by the up or down arrow key.
In one of the tmux terminals, type `python sim_agg.py` to run the aggregator.

After running this command, you will see the following output: "Press key to continue." Do not press any key yet; move to the other tmux terminal.
In the second terminal, type `python sim_party.py`. This starts the ten parties and connects them to the aggregator.
After running this command, you will see the following output:

Move back to the aggregator terminal and press the Enter key. This starts training from the aggregator for 10 rounds with 10 clients. The breakpoint is currently set at round 5, so the terminal stops displaying logs after round 5 and you will see the following output:
Note: You can change the breakpoint round ID in `sim_agg.py`.
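The round-level breakpoint described above can be pictured with a minimal sketch. It assumes a simple training loop that hands control to a callback once the breakpoint round finishes. The names `run_rounds`, `train_round`, and `on_break` are hypothetical and are not part of IBMFL or `sim_agg.py`.

```python
def run_rounds(num_rounds, breakpoint_round, train_round, on_break):
    """Run rounds 1..num_rounds and call on_break when the breakpoint round ends."""
    history = []
    for round_id in range(1, num_rounds + 1):
        history.append(train_round(round_id))
        if round_id == breakpoint_round:
            # A real debugger would open an interactive prompt here.
            on_break(round_id, history)
    return history


paused_at = []
history = run_rounds(
    num_rounds=10,
    breakpoint_round=5,
    # Stand-in for real training: returns a fake per-round log entry.
    train_round=lambda r: {"round": r, "accuracy": round(0.5 + 0.04 * r, 2)},
    # Stand-in for the debugger prompt: records where the run paused.
    on_break=lambda r, h: paused_at.append((r, len(h))),
)
print(paused_at)  # [(5, 5)]
```

In this sketch, returning from the callback plays the role of resuming training.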
`help`: Type `help` to see all the commands available at the round level in the debugger.
`ls`: At the round level, `ls` displays general information about the round.
`step next`, `step back`, and `step in`: Use these commands to navigate through the rounds.
After `step in`, you can type `help` again to see the commands available inside a round.
`ls`: Inside a round, `ls` displays all the clients in the round with their accuracies.
`agg`: Use `agg` to partially aggregate the models from a subset of clients and evaluate the result, so you can inspect the performance of that subset.
Note: Inside a round, we have replaced the `step in` command with `agg` to avoid confusion with the round-level `step in` command.
`step out`: Use `step out` to leave a round.
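Partial aggregation over a subset of clients, as offered by the `agg` command, can be sketched as follows. The sketch assumes a FedAvg-style average weighted by each client's number of training samples; the fusion logic that IBMFL actually applies may differ. The function `aggregate` and the toy data are hypothetical.

```python
def aggregate(client_weights, client_sizes, selected):
    """Weighted average of the weight vectors of the selected clients."""
    total = sum(client_sizes[c] for c in selected)
    dim = len(client_weights[selected[0]])
    return [
        sum(client_weights[c][i] * client_sizes[c] for c in selected) / total
        for i in range(dim)
    ]


# Toy model weights and sample counts (hypothetical); c2 looks like an outlier.
weights = {"c0": [1.0, 2.0], "c1": [3.0, 4.0], "c2": [100.0, -100.0]}
sizes = {"c0": 10, "c1": 30, "c2": 10}

# Aggregate only c0 and c1 to inspect the model without c2's contribution.
partial = aggregate(weights, sizes, ["c0", "c1"])
print(partial)  # [2.5, 3.5]
```

Evaluating the partially aggregated model on held-out data, and comparing it with the full aggregate, is one way to judge whether an excluded client was hurting accuracy.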

`remove client <client id>`: Suppose you identify a faulty client in round 8. You can remove its contribution from that round with the `remove client <client id>` command.
Training then resumes from round 9 rather than from the breakpoint at round 5, and the faulty client is excluded from round 8 onward.
**Note: After you remove the client, training runs to completion.**
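The effect of removing a client and replaying can be sketched under simple assumptions: the flagged round is re-aggregated without the faulty client's update, and the remaining rounds then train with the remaining clients only. The functions `average` and `fix_and_replay`, the equal-weight mean, and the stand-in training step are hypothetical and do not reflect IBMFL internals.

```python
def average(updates):
    """Element-wise mean of equally weighted client updates."""
    n = len(updates)
    return [sum(vals) / n for vals in zip(*updates)]


def fix_and_replay(round_id, updates, faulty_client, train_round, last_round):
    """Re-aggregate round_id without the faulty client, then train later rounds."""
    kept = {cid: u for cid, u in updates.items() if cid != faulty_client}
    global_model = average(list(kept.values()))
    trained = []
    for next_round in range(round_id + 1, last_round + 1):
        # Only the remaining clients take part in the replayed rounds.
        global_model = train_round(next_round, global_model, sorted(kept))
        trained.append(next_round)
    return global_model, trained


# Toy round-8 updates (hypothetical); c2 is the faulty client.
updates = {"c0": [1.0, 1.0], "c1": [3.0, 3.0], "c2": [50.0, 50.0]}
# Stand-in training step that leaves the model unchanged.
model, trained = fix_and_replay(8, updates, "c2", lambda r, m, clients: m, 10)
print(model, trained)  # [2.0, 2.0] [9, 10]
```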
To check the `resume` command, restart the aggregator (`python sim_agg.py`) and the clients (`python sim_party.py`) as explained above. Perform some actions (e.g., `step in`, `step out`, `ls`), but do not use `remove client`. Suppose there is no faulty client and you want to resume training without any further debugging. Type `resume`, and the aggregator immediately displays the output of all the rounds without any retraining.