This tutorial will show you how to set up and start a federated learning experiment to train ShuffleNet on the FEMNIST dataset using FedScale.
Check the instructions to set up your environment, then follow the steps below to download the FEMNIST dataset. Here `femnist` is the dataset name to pass as `[dataset_name]`:

```bash
fedscale dataset download [dataset_name]
# Run `fedscale dataset help` for more details
# Or: bash download.sh download [dataset_name]
```
Please make sure you are using the correct environment:

```bash
conda activate fedscale
```
Create experiment profile: Go to the `benchmark/configs/femnist/` directory and modify or create your configuration file to submit your job. Adjust the configurations, such as the number of participants per round, the aggregation algorithm, the client optimizer, and the training model, based on your needs.
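As an illustrative sketch, a job configuration might set these options as follows. Every key name below is an assumption drawn from typical FedScale example configs, not a definitive reference; check the files shipped in `benchmark/configs/femnist/` for the exact parameters your version expects.

```yaml
# Hypothetical excerpt of a FedScale job configuration.
# Key names are illustrative -- verify them against
# benchmark/configs/femnist/conf.yml in your checkout.
job_conf:
  - job_name: femnist_demo        # used for logging and job control
  - model: shufflenet_v2_x2_0     # training model
  - total_worker: 4               # participants per round
  - gradient_policy: fed-avg      # aggregation algorithm
  - learning_rate: 0.05           # client optimizer setting
  - rounds: 500                   # total training rounds
```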
Evaluate on a Local Machine
Follow the instructions below to run FedScale locally (i.e., in standalone mode).
- Submitting to driver: It is more convenient to first test your code without a GPU cluster. First, add the argument `- use_cuda: False` under `job_conf` in your configuration file `benchmark/configs/femnist/conf.yml` if you are training without any GPU. Set `worker_ips` to `localhost:[x]`, where x represents how many executors you want to run on your local machine. Then run the following command to start your federated learning job:

```bash
fedscale driver start benchmark/configs/femnist/conf.yml
# or: python docker/driver.py start benchmark/configs/femnist/conf.yml
```
- Running with Jupyter: We also provide Jupyter notebook examples to run your code locally. Start the server first, then run the client.
Evaluate on a Cluster
Follow the instructions below to run FedScale on a cluster (i.e., in distributed mode or a cross-silo federated learning deployment).
Set up the cluster:

- Coordinator node: Make sure that the coordinator (master node) has access to all other worker nodes.
- All nodes: Follow the installation instructions to install all necessary libraries, and then download the datasets. Please ensure that these paths are consistent across all nodes so that the FedScale simulator can find the right paths.
Set up the job configuration: Change `worker_ips` to the host names of your nodes in the configuration file. For example, using `10.0.0.2:[4,4]` as one of the `worker_ips` means launching four executors on each of the first two GPUs of 10.0.0.2 to train your model in a space/time-sharing fashion. Make sure the node from which you submit the job has access to the other nodes, and that you have synchronized the code across all nodes.
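As a concrete sketch of that syntax, a two-node cluster layout in the configuration file could look like this. The `ps_ip` key for the aggregator address is an assumption; `worker_ips` follows the syntax described above. Verify both against the example configs shipped with FedScale.

```yaml
# Illustrative cluster layout -- verify key names against the example
# configs in FedScale/benchmark/configs. `ps_ip` (aggregator address)
# is an assumption; `worker_ips` uses the host:[GPUs] syntax above.
ps_ip: 10.0.0.1
worker_ips:
  - 10.0.0.2:[4,4]   # 4 executors on each of the first two GPUs
  - 10.0.0.3:[2]     # 2 executors on the first GPU
```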
Submit the job: `fedscale driver submit [conf.yml]` (or `python docker/driver.py submit [conf.yml]`) will submit a job with the parameters specified in `conf.yml` to both the aggregator and worker nodes. We provide example configuration files in `FedScale/benchmark/configs` for each dataset; the comments in these examples will help you quickly understand how to specify the parameters.

Stop the job: `fedscale driver stop [job_name]` (or `python docker/driver.py stop [job_name]`) will terminate the running `job_name` (specified in the yml file) on the used nodes.
Monitor Your Training Progress
You can find the job log `job_name` under the path `log_path` specified in the `conf.yml` file. To check the training loss or test accuracy, you can run:

```bash
cat job_name_logging | grep 'Training loss'
cat job_name_logging | grep 'FL Testing'
```
We have integrated TensorBoard for visualizing experiment results. To track an experiment under its `log_path` (e.g., `./FedScale/benchmark/logs/cifar10/0209_141336`), run `tensorboard --logdir=[log_path] --bind_all`; the results will then be served at TensorBoard's default address (port 6006) on that node.
Meanwhile, all logs are dumped to `log_path` (specified in the config file) on each node. `testing_perf`, located on the master node under this path, can be loaded with `pickle` to check the time-to-accuracy performance.
You can also check `/benchmark/[job_name]_logging` to see whether the job is progressing.