Deployment


This tutorial will show you how to set up and start a federated learning experiment to train ShuffleNet on the FEMNIST dataset using FedScale.

Preliminary

Check the instructions to set up your environment, then use the commands below to download the FEMNIST dataset (here femnist is the dataset_name):

# Run `fedscale dataset help` for more details
fedscale dataset download [dataset_name]   # Or: bash download.sh download [dataset_name]

Please make sure you are using the correct environment.

conda activate fedscale

Create experiment profile: Go to the benchmark/configs/femnist/ directory and modify or create your configuration file for the job you want to submit. Adjust settings such as the number of participants per round, the aggregation algorithm, the client optimizer, and the training model as needed.
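
For orientation, the job_conf section of such a file might look roughly like the sketch below. The key names mirror the example configs shipped in benchmark/configs (check those files if your FedScale version differs), and the specific values (model name, round count, etc.) are placeholders to adapt rather than recommendations:

    job_conf:
        - job_name: femnist
        - log_path: $FEDSCALE_HOME/benchmark          # where training logs are written
        - num_participants: 4                         # clients selected per round
        - data_set: femnist
        - model: shufflenet_v2_x2_0                   # ShuffleNet, as in this tutorial
        - gradient_policy: fed-avg                    # aggregation algorithm
        - learning_rate: 0.05
        - batch_size: 20
        - rounds: 500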


Evaluate on a Local Machine

Follow the instructions below to run FedScale locally (i.e., in standalone mode).

  • Submitting to driver: It is often convenient to test your code without a GPU cluster first. If you are training without a GPU, add the entry - use_cuda: False under job_conf in your configuration file benchmark/configs/femnist/conf.yml. Set ps_ip to localhost and worker_ips to localhost:[x], where x is the number of executors you want to run on your local machine (see the sketch after this list). Then run the following command to start your federated learning job:
    fedscale driver start benchmark/configs/femnist/conf.yml
    # or python docker/driver.py start benchmark/configs/femnist/conf.yml
    
  • Running with Jupyter: We also provide Jupyter notebook examples for running FedScale locally. Start the server notebook first, and then run the client notebook.
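
For the local setup described in the first bullet above, the host-related fields of conf.yml might look like this minimal sketch (the executor count in brackets is up to you; use_cuda goes under job_conf as noted):

    ps_ip: localhost
    worker_ips:
        - localhost:[2]          # run two executors on this machine
    job_conf:
        - use_cuda: False        # CPU-only training
        # ... other job_conf entries as in the example configs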

Evaluate on a Cluster

Follow the instructions below to run FedScale in a cluster (i.e., distributed mode or a cross-silo federated learning deployment).

  • Set up cluster: Please ensure that the dataset and code paths are consistent across all nodes so that the FedScale simulator can find the right path on every node.

    • Coordinator node: Make sure that the coordinator (master node) has SSH access to the worker nodes.

    • All nodes: Follow the environment setup instructions to install all necessary libraries, and then download the datasets.

  • Set up job configuration: Change ps_ip and worker_ips to the hostnames (or IP addresses) of your nodes in the configuration file. For example, using 10.0.0.2:[4,4] as one of the worker_ips means launching four executors on each of the first two GPUs of 10.0.0.2 to train your model in a space/time-sharing fashion (see the sketch after this list). Make sure the node from which you submit the job has access to the other nodes, and that you have synchronized the code across all nodes.

  • Submit/Stop job:

    • fedscale driver submit [conf.yml] (or python docker/driver.py submit [conf.yml]) submits a job, with the parameters specified in conf.yml, to both the aggregator and the worker nodes. We provide example configuration files in FedScale/benchmark/configs for each dataset; the comments in these examples will help you quickly understand how to specify the parameters.

    • fedscale driver stop [job_name] (or python docker/driver.py stop [job_name]) terminates the running job job_name (as specified in the .yml file) on the nodes it uses.
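
As a concrete illustration of the worker_ips syntax described above, the host section of a cluster configuration might look like the following sketch (the IP addresses and GPU/executor counts are placeholders; see the example files in benchmark/configs for the remaining fields):

    ps_ip: 10.0.0.1                  # coordinator / aggregator node
    worker_ips:
        - 10.0.0.2:[4,4]             # 4 executors on each of the first two GPUs of 10.0.0.2
        - 10.0.0.3:[4]               # 4 executors on the first GPU of 10.0.0.3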


Monitor Your Training Progress

You can find the job log [job_name]_logging under the path log_path specified in the conf.yml file. To check the training loss or test accuracy, you can run:

cat job_name_logging | grep 'Training loss'
cat job_name_logging | grep 'FL Testing'

We have integrated TensorBoard for the visualization of experiment results. To track the experiment with [log_path] (e.g., ./FedScale/benchmark/logs/cifar10/0209_141336), please try tensorboard --logdir=[log_path] --bind_all, and all the results will be available at: http://[ip_of_coordinator]:6006/.

Meanwhile, all logs are dumped to log_path (specified in the config file) on each node. testing_perf is located on the master node under this path, and you can load it with pickle to check the time-to-accuracy performance. You can also check /benchmark/[job_name]_logging to see whether the job is making progress.
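
As a rough sketch of how you might inspect testing_perf with pickle (the exact file location under log_path and the structure of the stored object can vary across FedScale versions, so treat the path and keys below as placeholders):

    import pickle

    # Placeholder path: testing_perf sits under the log_path of the master node.
    perf_path = "./benchmark/logs/femnist/0209_141336/testing_perf"

    with open(perf_path, "rb") as f:
        testing_perf = pickle.load(f)

    # Inspect what was recorded, e.g. per-round test accuracy over wall-clock time.
    print(type(testing_perf))
    if isinstance(testing_perf, dict):
        print(list(testing_perf.keys()))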