This tutorial will show you how to set up and start a federated learning experiment to train ShuffleNet on the FEMNIST dataset using FedScale.
Check the instructions to set up your environment, then follow the steps below to download the FEMNIST dataset. Here `femnist` is the dataset name to pass as `[dataset_name]`:

```bash
fedscale dataset download [dataset_name]
# Run `fedscale dataset help` for more details
# Or: bash download.sh download [dataset_name]
```
Please make sure you are using the correct environment:

```bash
conda activate fedscale
```
Create experiment profile: Go to the `benchmark/configs/femnist/` directory and modify or create your configuration file to submit your job. Adjust the configurations, such as the number of participants per round, the aggregation algorithm, the client optimizer, and the training model, based on your needs.
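As an illustrative sketch, a job configuration might set these options as follows. Every key name below is an assumption drawn from typical FedScale example configs, not a definitive reference; check the files shipped in `benchmark/configs/femnist/` for the exact parameters your version expects.

```yaml
# Hypothetical excerpt of a FedScale job configuration.
# Key names are illustrative -- verify them against
# benchmark/configs/femnist/conf.yml in your checkout.
job_conf:
  - job_name: femnist_demo        # used for logging and job control
  - model: shufflenet_v2_x2_0     # training model
  - total_worker: 4               # participants per round
  - gradient_policy: fed-avg      # aggregation algorithm
  - learning_rate: 0.05           # client optimizer setting
  - rounds: 500                   # total training rounds
```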
Evaluate on a Local Machine
Follow the instructions below to run FedScale locally (i.e., in standalone mode).
- Submitting to driver: It is more convenient to first test your code without a GPU cluster. First, add the argument `- use_cuda: False` under `job_conf` in your configuration file `benchmark/configs/femnist/conf.yml` if you are training without any GPU. Set `worker_ips` to `localhost:[x]`, where x represents how many executors you want to run on your local machine. Then run the following command to start your federated learning job:

```bash
fedscale driver start benchmark/configs/femnist/conf.yml
# or: python docker/driver.py start benchmark/configs/femnist/conf.yml
```
- Running with Jupyter: We also provide Jupyter notebook examples to run your code locally. Start the server first, then run the client.
Evaluate on a Cluster
Follow the instructions below to run FedScale on a cluster (i.e., in distributed mode or a cross-silo federated learning deployment).
Set up the cluster:

- Coordinator node: Make sure that the coordinator (master node) has access to all other worker nodes.
- All nodes: Follow the installation instructions to install all necessary libraries, and then download the datasets. Please ensure that these paths are consistent across all nodes so that the FedScale simulator can find the right paths.
Set up the job configuration: Change `worker_ips` to the host names of your nodes in the configuration file. For example, using `10.0.0.2:[4,4]` as one of the `worker_ips` means launching four executors on each of the first two GPUs of 10.0.0.2 to train your model in a space/time-sharing fashion. Make sure the node from which you submit the job has access to the other nodes, and that you have synchronized the code across all nodes.
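As a concrete sketch of that syntax, a two-node cluster layout in the configuration file could look like this. The `ps_ip` key for the aggregator address is an assumption; `worker_ips` follows the syntax described above. Verify both against the example configs shipped with FedScale.

```yaml
# Illustrative cluster layout -- verify key names against the example
# configs in FedScale/benchmark/configs. `ps_ip` (aggregator address)
# is an assumption; `worker_ips` uses the host:[GPUs] syntax above.
ps_ip: 10.0.0.1
worker_ips:
  - 10.0.0.2:[4,4]   # 4 executors on each of the first two GPUs
  - 10.0.0.3:[2]     # 2 executors on the first GPU
```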
Submit the job: `fedscale driver submit [conf.yml]` (or `python docker/driver.py submit [conf.yml]`) will submit a job with the parameters specified in `conf.yml` to both the aggregator and worker nodes. We provide example configuration files in `FedScale/benchmark/configs` for each dataset; the comments in these examples will help you quickly understand how to specify the parameters.

Stop the job: `fedscale driver stop [job_name]` (or `python docker/driver.py stop [job_name]`) will terminate the running `job_name` (specified in the yml file) on the used nodes.
Monitor Your Training Progress
You can find the job log `job_name` under the path `log_path` specified in the `conf.yml` file. To check the training loss or test accuracy, you can run:

```bash
cat job_name_logging | grep 'Training loss'
cat job_name_logging | grep 'FL Testing'
```
We have integrated TensorBoard for visualizing experiment results. To track an experiment under its `log_path` (e.g., `./FedScale/benchmark/logs/cifar10/0209_141336`), run `tensorboard --logdir=[log_path] --bind_all`; the results will then be served at TensorBoard's default address (port 6006) on that node.
Meanwhile, all logs are dumped to `log_path` (specified in the config file) on each node. `testing_perf`, located on the master node under this path, can be loaded with `pickle` to check the time-to-accuracy performance.
You can also check `/benchmark/[job_name]_logging` to see whether the job is progressing.