---
description: OneAI Documentation
tags: Case Study, Clara, Federated Learning, EN
---
[OneAI Documentation](/s/user-guide-en)
# Case Study - Deploy NVIDIA FLARE Federated Learning
[TOC]
## Introduction
Federated Learning (FL) allows multiple participants to jointly build a general and powerful machine learning model without sharing data, thus addressing issues of data privacy, security, regulation, and data locality. By replacing data sharing with model sharing, it obtains the best AI model through decentralized training on large amounts of data from different sources. Federated learning techniques have been applied in many fields, including healthcare, telecommunications, IoT, and pharmaceuticals.
NVIDIA FLARE™ (NVIDIA Federated Learning Application Runtime Environment) is a domain-agnostic, open-source, and extensible SDK for federated learning. The SDK allows researchers and data scientists to adapt existing ML/DL workflows to a federated paradigm, and enables platform developers to build a secure, privacy-preserving offering for distributed multi-party collaboration.
<center><img src="/uploads/upload_68113190912dafd4db7a21f653ff5276.png" ></center>
<center>NVIDIA FLARE federated learning diagram</center>
<center><a href="https://developer.nvidia.com/flare">(Image credit: NVIDIA)</a></center>
<br><br>
## 0. Get Started
TWCC OneAI provides GPU computing resource management and a variety of tools, and comes pre-loaded with commonly used AI training frameworks. By following this case study, you can quickly complete the deployment and connection steps and start using TWCC OneAI to deploy NVIDIA FLARE federated learning and develop AI models.
This tutorial uses OneAI's **Container Service** and **Storage Service** to deploy the NVIDIA FLARE [**Hello PyTorch**](https://github.com/NVIDIA/NVFlare/tree/2.1.3/examples/hello-pt) federated learning environment. In this example, the person in charge of the FL Server rents a project on TWCC OneAI to deploy the FL Server; other hospitals or institutions rent their own projects on OneAI to deploy FL Clients that join the federated learning training (here, two FL Clients of the NVIDIA organization, Site1 and Site2, will be deployed); and the FL Admin, who may be the Chief Data Scientist of the federation, rents another project on OneAI to deploy the FL Admin. Different projects are independent and do not share data with each other. For ease of experience, this example can be run within a single OneAI project.
<center><img src="/uploads/upload_6b28ed6f392c528691093a5560847c72.png" alt="NVIDIA Clara Train Federated Learning Deployment Architecture"></center>
<center>Federated Learning Deployment Architecture</center>
<br><br>
The main steps are as follows:
1. [**Provision**](#1-Provision)
At this stage, the project manager plans the roles and permissions of the participants and creates the installation package.
2. [**Deploy FL Server**](#2-Deploy-FL-Server)
At this stage, we will deploy FL Server, which coordinates federated learning training and is the main hub for FL Client and FL Admin connections.
3. [**Deploy FL Client**](#3-Deploy-FL-Client)
At this stage, we will deploy 2 FL Clients, Site1 and Site2, and prepare the training dataset for each FL Client.
4. [**Deploy FL Admin**](#4-Deploy-FL-Admin)
At this stage, we will deploy FL Admin and connect to FL Server and FL Client through the FL Admin tool for federated learning operations.
5. [**Perform Federated Learning Training**](#5-Perform-federated-learning-training)
At this stage, we will perform federated learning training and obtain the training results of each FL Client and the best model.
:::warning
:warning: **Note:**
1. Please have a basic understanding of [**NVIDIA FLARE**](https://nvflare.readthedocs.io/en/main/index.html) before starting this example.
2. This example is based on the [**NVIDIA FLARE 2.1 Examples**](https://github.com/NVIDIA/NVFlare/tree/2.1.3/examples/hello-pt) sample programs, datasets and models. Please read NVIDIA's license terms before use.
:::
<br>
## 1. Provision
The Provision stage is where the project manager plans the roles and permissions of the participants and creates the installation package. This chapter will use the built-in project.yml of NVIDIA FLARE to generate the installation package of each site through the Provision tool. Please follow the steps below.
### 1.1 Pre-built FL Server Container
Because the fed_learn_port and admin_port service ports of the FL Server must be known at the Provision stage, we first create the FL Server container so that its two external service ports are allocated before provisioning begins.
Select **Container Service** from the OneAI service list, enter the container service management page, and click **+ CREATE** to add a new container service.
1. **Basic Information**
Please enter the **name**, for example: **`federated-learning-server`**, and a **description**; for the **image**, please select `nvidia-official-images:PyTorch-21.02-py3`.
:::info
:bulb:**Tips:** It is recommended to select the image provided by the system according to the development framework you are using.
| Framework| Image |
| ------------ | -------- |
| TensorFlow | nvidia-official-images:TensorFlow-21.02-tf1-py3 |
| TensorFlow 2| nvidia-official-images:TensorFlow-21.02-tf2-py3 |
| PyTorch | nvidia-official-images:PyTorch-21.02-py3 |
| MONAI | nvidia-official-images:PyTorch-21.02-py3 |
| Numpy | nvidia-official-images:PyTorch-21.02-py3 |
:::

2. **Hardware Settings**
Select the hardware specification; there is no need to configure a GPU.
3. **Storage Settings**
No settings are needed in this step for now; you can clear the default settings.
4. **Network Settings**
In this step, we will pre-create two external service ports required by FL Server. If the selected image has a default container port, you can delete it first, and then add two container ports as shown in the following diagram.
* **fed_learn_port**

* **admin_port**

* The setting screen after the port is added is as follows.

5. **Variable Settings**
No settings are needed in this step for now; you can clear the default settings or go directly to the next step.
6. **Review & Create**
Finally, confirm the entered information and click **CREATE**.
After the **FL Server** container is created, it will appear in the container service list. When the status of the container service changes to **`Ready`**, you can see that the system has automatically generated node ports for the container ports `8002` and `8003` added in the **Network Settings** step, for example **`8002:31767`** and **`8003:31768`**. These two node ports will be used in subsequent steps.

### 1.2 Create Provision Installation Package Storage Space
This step will create a space to store the Provision installation package.
1. **Create A Bucket**
Select **Storage Service** from the OneAI service list, enter the storage service management page, and then click **+ CREATE** to add a bucket named **`federated-learning-packages`**.

2. **View Bucket**
After the bucket is created, go back to the Storage Service Management page, and you will see that the bucket has been created.

### 1.3 Create Provision Container
Select **Container Service** from the OneAI service list, enter the container service management page, and click **+ CREATE** to add a new container service.
1. **Basic Information**
Please enter the **name**, for example: **`federated-learning-provision`**, and a **description**; for the **image**, please select an execution environment that supports Python 3.8, for example: `clara:v4.0`.

2. **Hardware Settings**
Select the hardware specification; there is no need to configure a GPU.
3. **Storage Settings**
At this stage, you need to mount the bucket that stores the installation package. Please click **ADD** and set the bucket to store the installation package.
**workspace**: **`federated-learning-packages`**.

4. **Network Settings**
No setup is required for this step.
5. **Variable Settings**
* **Command**
Please paste the following command:
```=
echo "cd /workspace" >> /root/.bashrc &&git clone https://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@github.com/asus-ocis/fl-script.git /fl-script &&sleep 1 &&/bin/bash /fl-script/nvflare21.sh
```

6. **Review & Create**
Finally, confirm the entered information and click **CREATE**.
### 1.4 Modify Provision Configuration File
After the **Provision** container is created, it will appear in the list of container services. When the status of the container service becomes **`Ready`**, please enter the container service details page and click the **Terminal** tab above to connect to the container. Once successfully connected, run the provision command, and then modify the Provision configuration file project.yml according to the environment settings of this example.
When executing the provision command, if project.yml does not exist, you will be asked whether to generate it automatically; please enter `y` to confirm.
```
# provision
Path list (sys.path) for python codes loading: ['/opt/conda/bin', '/workspace', '/opt/nvidia/medical', '/opt/conda/lib/python38.zip', '/opt/conda/lib/python3.8', '/opt/conda/lib/python3.8/lib-dynload', '/opt/conda/lib/python3.8/site-packages', '/opt/monai', '/workspace/.']
No project.yml found in current folder. Is it OK to generate one at /workspace/project.yml for you? (y/N) y
/workspace/project.yml was created. Please edit it to fit your FL configuration.
```
Execute the following command to edit project.yml and modify the content of the configuration file.
```=
# nano project.yml
```
The modifications are as follows:
* **name**: The fully qualified domain name (FQDN) of the FL Server, which is set to **`federated.oneai.twcc.ai`** in this example. Note: Because this example deploys the FL Server on OneAI, it must be set to **`federated.oneai.twcc.ai`**, the FQDN that OneAI provides specifically for federated learning.
* **fed_learn_port**: The default value in the configuration file is **`8002`**; please change it to the external service port corresponding to container port `8002` of your FL Server container service. In this example it is `31767` (as shown in the figure below).
* **admin_port**: The default value in the configuration file is **`8003`**; please change it to the external service port corresponding to container port `8003` of your FL Server container service. In this example it is `31768` (as shown in the figure below). Please note that admin_port cannot be the same as fed_learn_port.
* **enable_byoc**: Because this example uses a custom program, all enable_byoc settings in the configuration file need to be set to true.
* **Disable HA related settings**: There is a second FL Server preset in the configuration file. Because this example will not enable the HA function, you need to modify the HA related settings.
* **Setting to pack the installation package into a zip file**: The configuration file does not have this setting by default, please add a setting to pack the installation package into a zip file for easy download.

:::info
:bulb:**Tips: About FL Server**
The FL Server FQDN used in this example is **`federated.oneai.twcc.ai`**, which is the FL Server service URL provided by OneAI for federated learning. You can follow this example to experience the deployment and execution of federated learning training on OneAI, or set it to other FL Server FQDN.
:::
```yaml=
....
# change overseer.example.com to the FQDN of the overseer
- name: overseer.example.com
type: overseer
org: nvidia
protocol: https
api_root: /api/v1
port: 8443
# change example.com to the FQDN of the server
- name: federated.oneai.twcc.ai <--- Modify the Domain of FL Server
type: server
org: nvidia
fed_learn_port: 31767 <--- Modify fed_learn_port
admin_port: 31768 <--- Modify admin_port, cannot be the same as fed_learn_port
# enable_byoc loads python codes in app. Default is false.
enable_byoc: true
components:
<<: *svr_comps
# - name: example2.com <--- Please comment out the entire second FL server setting, because this example does not use HA
# type: server
# org: nvidia
# fed_learn_port: 8002
# admin_port: 8003
# enable_byoc loads python codes in app. Default is false.
# enable_byoc: true
# components:
# <<: *svr_comps
- name: site-1
type: client
org: nvidia
enable_byoc: true
components:
<<: *cln_comps
- name: site-2
type: client
org: nvidia
enable_byoc: true <--- Changed to true because custom programs need to be executed
components:
<<: *cln_comps
...
...
...
overseer_agent:
path: nvflare.ha.dummy_overseer_agent.DummyOverseerAgent <--- Changed to DummyOverseerAgent to turn off HA function
overseer_exists: false <--- Changed to false to turn off HA function
args:
sp_end_point: federated.oneai.twcc.ai:31767:31768 <--- Set the end point of the FL server
...
...
...
- path: nvflare.lighter.impl.workspace.DistributionBuilder <--- Add these 3 rows to pack the installation package into a zip file
args:
zip_password: true
```
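Optionally, before re-running the provision command, you can confirm that the edited project.yml still parses and that the server ports were entered correctly. The following is a minimal sketch, not part of the official tooling; it assumes PyYAML is available in the container:
```python=
# optional sanity check for the edited project.yml (assumes PyYAML is installed)
import yaml

with open("/workspace/project.yml") as f:
    project = yaml.safe_load(f)

for participant in project["participants"]:
    if participant.get("type") == "server":
        # fed_learn_port and admin_port must be different external service ports
        print(participant["name"], participant["fed_learn_port"], participant["admin_port"])
        assert participant["fed_learn_port"] != participant["admin_port"], "ports must differ"
```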
:::info
:bulb:**Tips: About FL Client And Admin Client**
This example only needs to modify the settings of FL Server. If you want to modify FL Client and Admin Client, you can modify the settings of fl_clients and admin_clients in project.yml by yourself. Please refer to the following instructions. For more information, please refer to the NVIDIA FLARE [**Project yaml file documentation**](https://nvflare.readthedocs.io/en/main/programming_guide/provisioning_system.html?highlight=project.yml#project-yaml-file).
<br>
```yaml=
...
participants: <-- All participant roles and organizations are defined here, such as server, client and admin
# change example.com to the FQDN of the server
- name: federated.oneai.twcc.ai
type: server <-- Define server according to type here
org: nvidia
fed_learn_port: 31767
admin_port: 31768
# enable_byoc loads python codes in app. Default is false.
enable_byoc: true
- name: site1
type: client <-- Define client according to type here
org: nvidia
enable_byoc: true
- name: site2
type: client <-- Define client according to type here
org: nvidia
enable_byoc: true
- name: admin@nvidia.com
type: admin <-- Define admin according to type here
org: nvidia
roles:
- super <-- Define admin permissions here
...
```
:::
:::info
:bulb:**Tips: About Roles and Permissions**
The role permissions of NVIDIA FLARE are defined in the authz_policy field of project.yml. This example uses the default role and permission settings. For more information, please refer to the NVIDIA FLARE [**Federated Learning Authorization documentation**](https://nvflare.readthedocs.io/en/main/user_guide/authorization.html?highlight=right).
:::
### 1.5 Execute Provision Tool to Generate Installation Package
Next, please execute the **`provision`** command to generate the installation package and password.
```
# provision
```
After successful execution, please write down the password provided on the screen.

Next, select **Storage Service** from the OneAI service list, enter the storage service management page, select **`federated-learning-packages`** and you will see the installation package generated by executing the provision command, as shown in the following figure:


This example will use the following installation packages. In real situations, each federation member will only receive their own installation package and unzip password.
* federated.oneai.twcc.ai.zip
* site1.zip
* site2.zip
* admin@nvidia.com.zip
## 2. Deploy FL Server
FL Server will coordinate the federated learning training and become the main hub for all FL Client and Admin connections. This section will describe how to use **OneAI Container Service** to deploy FL Server.
### 2.1 Upload FL Server Installation Package
First, please unzip the installation package federated.oneai.twcc.ai.zip; the password is the one shown next to federated.oneai.twcc.ai.zip in the list displayed after executing the provision command.

The directory structure is as follows after unzipping:
```=
./startup
├── authorization.json
├── fed_server.json
├── log.config
├── readme.txt
├── rootCA.pem
├── server.crt
├── server.key
├── server.pfx
├── signature.json
├── start.sh
├── stop_fl.sh
└── sub_start.sh
```
After unzipping the installation package, please follow the steps below to check the information of the installation package:
1. Confirm FL Server information.
Open **fed_server.json** and make sure that the Domain Name and Port of the FL Server set in **`target`** in this file are correct (an optional verification sketch follows this checklist), for example:
```json=
...
"service": {
"target": "federated.oneai.twcc.ai:31767",
...
```
2. Check the installation package files for abnormalities to prevent getting files from unknown sources.
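Optionally, the `target` value can also be read back programmatically before uploading. The snippet below is an illustrative sketch only; it searches the file for the field rather than assuming a fixed nesting, and the path assumes you run it from the unzipped package directory:
```python=
# optional check: print the FL Server target(s) recorded in the installation package
import json

def find_targets(node, found=None):
    """Collect every 'service' -> 'target' value, wherever it sits in the file."""
    if found is None:
        found = []
    if isinstance(node, dict):
        service = node.get("service")
        if isinstance(service, dict) and "target" in service:
            found.append(service["target"])
        for value in node.values():
            find_targets(value, found)
    elif isinstance(node, list):
        for item in node:
            find_targets(item, found)
    return found

with open("./startup/fed_server.json") as f:
    conf = json.load(f)

# In this example the expected value is "federated.oneai.twcc.ai:31767"
print(find_targets(conf))
```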
After checking, upload the installation package to OneAI's **Storage Service**.
1. **Create A Bucket**
Select **Storage Service** from the OneAI service list, enter the storage service management page, and then click **+ CREATE** to add a bucket named **`federated-learning-server-startkit`** to store the Server installation package.

2. **View Bucket**
After the bucket is created, go back to the Storage Service Management page, and you will see that the bucket has been created.
3. **Upload FL Server Installation Package**
Click on the bucket that has been created, and then upload the entire FL Server installation package directory **startup**. (See [**Storage Service Documentation**](/s/storage-en)).

### 2.2 Modify FL Server Container
Select **Container Service** from the OneAI service list, enter the container service management page, and then click the **`federated-learning-server`** container service that has been just created.

After entering the **Container Service Details** page, click the **Stop** icon above to stop the container service.

Go to the **Container Service Details** page, then click the **Edit** icon above.

Then follow the steps below to modify the container service settings of FL Server.
1. **Basic Information**
No modification is required for this step.
2. **Hardware Settings**
No modification is required for this step.
3. **Storage Settings**
This step will mount the bucket that stores the FL Server installation package. Please click **ADD** and set the bucket to store the installation package.
* **workspace**: **`federated-learning-server-startkit`**.

4. **Network Settings**
Here we need to modify the previously created container port to be the same as the node port automatically allocated by the system.
:::info
:bulb:**Tips:** The port that the FL Server listens on inside the container must be the same as the node port that FL Clients use to connect.
:::
Click the **Edit** icon to the right of the list.

Change the port `8002` to the following **Node Port**, as shown below.

Click **SAVE** after you have made the changes.

Repeat the same steps to change the container port `8003` to the corresponding node port. After the modification, please confirm that each **container port** is the same as its **node port**.

5. **Variable Settings**
* **Environment variables**
This step allows you to choose whether you want to set the following variable according to your needs.
* **START**: Set whether to automatically execute **`start.sh`**. When the value is set to **`AUTO`**, **`start.sh`** will be automatically executed after the container is started to start FL Server. If you do not want to automatically start FL Server, please do not set the START variable.
* **Command**
Please paste the following command:
```=
echo "cd /workspace" >> /root/.bashrc &&git clone https://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@github.com/asus-ocis/fl-script.git /fl-script &&sleep 1 &&/bin/bash /fl-script/nvflare21.sh
```

6. **Review & Create**
Finally, confirm the entered information and click CREATE.
After the modification, please click the **Start** icon above to restart the container service.
### 2.3 Start FL Server
If you have set **START:AUTO** in **Variable Settings**, after the **Container Service** is created, the system will automatically execute **./startup/start.sh** to start FL Server. When the status of the container service is displayed as **`Ready`**, please enter the container service details page, click the **Terminal** tab above, and execute the following command to view the log file.
```=
cat /workspace/log.txt
```
:::info
:bulb:**Tips:**
If you have not set START:AUTO in **Variable Settings**, after entering the **Terminal** page, please first manually execute **`./startup/start.sh`** to start FL Server.
:::
When you see the following messages, it means that FL Server has started successfully.
```=
2022-07-15 04:07:10,061 - FederatedServer - INFO - starting secure server at federated.oneai.twcc.ai:31767
2022-07-15 04:07:10,068 - FederatedServer - INFO - Got the primary sp: federated.oneai.twcc.ai fl_port: 31767 SSID: ebc6125d-0a56-4688-9b08-355fe9e4d61a. Turning to hot.
deployed FL server trainer.
2022-07-15 04:07:10,080 - FedAdminServer - INFO - Starting Admin Server federated.oneai.twcc.ai on Port 31768
2022-07-15 04:07:10,080 - root - INFO - Server started
```
Finally, please make sure that the software required by the training jobs has been installed.
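One quick way to confirm this from the container **Terminal** is to try importing the packages the example depends on. The snippet below is a minimal sketch that assumes the hello-pt example needs nvflare, torch, and torchvision:
```python=
# quick sanity check that the packages used by the example are importable
import importlib

for package in ("nvflare", "torch", "torchvision"):
    try:
        module = importlib.import_module(package)
        print(f"{package} {getattr(module, '__version__', 'unknown version')} OK")
    except ImportError as err:
        print(f"{package} is missing: {err}")
```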
## 3. Deploy FL Client
There are two FL Clients in this example: site1 and site2. The following describes how to deploy the installation package and dataset of site1; afterwards, repeat the same steps to complete the deployment of site2.
### 3.1 Upload FL Client Installation Package
First, unzip the site1 installation package site1.zip on the local side. The directory structure after unzipping is as follows:
```=
./startup
├── client.crt
├── client.key
├── client.pfx
├── fed_client.json
├── log.config
├── readme.txt
├── rootCA.pem
├── signature.json
├── start.sh
├── stop_fl.sh
└── sub_start.sh
```
After unzipping the installation package, please follow the steps below to check the information of the installation package:
1. Confirm FL Server Information.
Open **fed_client.json** and make sure that the Domain Name and Port of the FL Server set in **`sp_end_point`** in this file are correct (an optional reachability check sketch follows this checklist).
```json=
...
"overseer_agent":
"sp_end_point": "federated.oneai.twcc.ai:31767:31768"
...
```
2. Check the installation package files for abnormalities to prevent getting files from unknown sources.
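Because the FL Server from section 2 should already be running at this point, you can optionally verify that the host and ports recorded in `sp_end_point` are reachable before uploading the package. The sketch below is illustrative only; it searches the file for the field rather than assuming a fixed nesting:
```python=
# optional check: confirm the sp_end_point recorded in fed_client.json is reachable
import json
import socket

def find_sp_end_point(node):
    """Recursively locate the 'sp_end_point' value in fed_client.json."""
    if isinstance(node, dict):
        if "sp_end_point" in node:
            return node["sp_end_point"]
        for value in node.values():
            found = find_sp_end_point(value)
            if found:
                return found
    elif isinstance(node, list):
        for item in node:
            found = find_sp_end_point(item)
            if found:
                return found
    return None

with open("./startup/fed_client.json") as f:
    endpoint = find_sp_end_point(json.load(f))   # e.g. "federated.oneai.twcc.ai:31767:31768"

host, fl_port, admin_port = endpoint.split(":")
for port in (fl_port, admin_port):
    with socket.create_connection((host, int(port)), timeout=5):
        print(f"{host}:{port} accepts TCP connections")
```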
After checking, upload the installation package to OneAI's **Storage Service**.
1. **Create A Bucket**
Select **Storage Service** from the OneAI service list, enter the storage service management page, and then click **+ CREATE** to add a bucket named `federated-learning-site1-startkit` to store the site1 installation package.

2. **View Bucket**
After the bucket is created, go back to the Storage Service Management page, and you will see that the bucket has been created.
3. **Upload FL Client Installation Package**
Click on the bucket that has been created, and then upload the entire site1 installation package directory **startup**. (See [**Storage Service Documentation**](/s/storage-en)).

### 3.2 Prepare Training Dataset Bucket
Next, please create a bucket for storing the training dataset, for example: `federated-learning-site1-dataset`. The NVIDIA PyTorch example used in this tutorial will automatically store the CIFAR-10 dataset here.
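The download happens automatically the first time the training job runs, but if you prefer to stage the data in advance, you can pre-populate the bucket from any environment that has torchvision installed. A minimal sketch, assuming the bucket is mounted at `/data` as configured in the next step:
```python=
# optional: pre-download CIFAR-10 into the dataset bucket (mounted at /data in step 3.3)
from torchvision import datasets

datasets.CIFAR10(root="/data", train=True, download=True)
datasets.CIFAR10(root="/data", train=False, download=True)
```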
### 3.3 Create FL Client Container
Select **Container Service** from the OneAI service list, enter the container service management page, and click **+ CREATE** to add a new container service.
1. Basic Information
Please enter the **name**, for example: **`federated-learning-site1`**, and a **description**; for the **image**, please select `nvidia-official-images:PyTorch-21.02-py3`.
:::info
:bulb:**Tips:** It is recommended to select the image provided by the system according to the development framework you are using.
| Framework| Image |
| ------------ | -------- |
| TensorFlow | nvidia-official-images:TensorFlow-21.02-tf1-py3 |
| TensorFlow 2| nvidia-official-images:TensorFlow-21.02-tf2-py3 |
| PyTorch | nvidia-official-images:PyTorch-21.02-py3 |
| MONAI | nvidia-official-images:PyTorch-21.02-py3 |
| Numpy | nvidia-official-images:PyTorch-21.02-py3 |
:::

2. **Hardware Settings**
The FL Client needs to run FL training jobs, so please select a hardware option with at least 1 GPU and 25 GB of memory.
3. **Storage Settings**
There are two buckets to be mounted in this step, please click **ADD** and set the following:
* **workspace**: The bucket for storing the Client installation package, please select **`federated-learning-site1-startkit`**.
* **data**: The bucket for storing the training dataset, please select **`federated-learning-site1-dataset`**.

4. **Network Settings**
No setup is required for this step.
5. **Variable Settings**
* **Environment variables**
* **`START`**
Set whether to automatically execute **`start.sh`**. When the value is set to **`AUTO`**, **`start.sh`** will be automatically executed after the container is started to start the FL Client, which will then automatically connect to the FL Server.
* **Command**
Please paste the following command:
```=
echo "cd /workspace" >> /root/.bashrc &&git clone https://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@github.com/asus-ocis/fl-script.git /fl-script &&sleep 1 &&/bin/bash /fl-script/nvflare21.sh
```

:::info
:bulb:**Tips:**
If your FL Server is not public, you can have the system modify /etc/hosts automatically by setting the **`FLSERVER`** and **`FLSERVERIP`** environment variables.
* **`FLSERVER`**
Set your FL Server Domain Name (non-public) and the system will automatically modify /etc/hosts, for example `myflserver.com` (must match the "sp_end_point" field of **fed_client.json**).
* **`FLSERVERIP`**
Set the IP of the FL Server, which needs to be used together with **`FLSERVER`**, such as `203.145.220.166`.
:::
6. **Review & Create**
Finally, confirm the entered information and click **CREATE**.
### 3.4 Start FL Client
If you have set **START:AUTO** in **Variable Settings**, the system will automatically execute **./startup/start.sh**. When the status of the FL Client container service is displayed as **`Ready`**, please enter the container service details page, click the **Terminal** tab above, and execute the following command to view the log file.
```=
cat /workspace/log.txt
```
:::info
:bulb:**Tips:**
If you have not set START:AUTO in **Variable Settings**, after entering the **Terminal** page, please first manually execute **`./startup/start.sh`** to start the FL Client.
:::
After the FL Client has successfully connected to the FL Server, the following information will appear:
```=
...
2022-07-15 06:50:01,348 - FederatedClient - INFO - Got the new primary SP: federated.oneai.twcc.ai:30718
2022-07-15 06:50:02,514 - FederatedClient - INFO - Successfully registered client:site-1 for project example_project. Token:ac664158-37a7-4169-9a61-32bdb36e2ef3 SSID:ebc6125d-0a56-4688-9b08-355fe9e4d61a
```
### 3.5 Deploy Other FL Client
Please repeat the deployment steps for site1, using the site2.zip installation package to deploy and start.
## 4. Deploy FL Admin
FL Admin is the control center of federated learning, and is responsible for supervising and regulating the entire deep learning model training process. Once FL Server and FL Clients are started, federated learning can be managed and run through FL Admin. This section will explain how to deploy FL Admin and use Admin Tool to manage federated learning.
### 4.1 Upload FL Admin Installation Package
First, unzip the installation package admin@nvidia.com.zip on the local side. The directory structure after unzipping is as follows:
```=
startup/
├── client.crt
├── client.key
├── client.pfx
├── fed_admin.json
├── fl_admin.sh
├── readme.txt
└── rootCA.pem
```
After unzipping the installation package, please follow the steps below to check the information of the installation package:
1. Confirm FL Server information.
Open **fed_admin.json** and make sure that the Domain Name and Port of the FL Server set in **`sp_end_point`** in this file are correct.
```json=
...
"overseer_agent":
"sp_end_point": "federated.oneai.twcc.ai:31767:31768"
...
```
2. Check the installation package files for abnormalities to prevent getting files from unknown sources.
After checking, upload the installation package to OneAI's **Storage Service**.
1. **Create A Bucket**
Select **Storage Service** from the OneAI service list, enter the storage service management page, and then click **+ CREATE** to add a bucket named `federated-learning-admin-startkit` to store the Admin installation package.

2. **View Bucket**
After the bucket is created, go back to the Storage Service Management page, and you will see that the bucket has been created.
3. **Upload FL Admin Installation Package**
Click on the bucket that has been created, and then upload the entire Admin installation package directory **startup**. (See [**Storage Service Documentation**](/s/storage-en)).

### 4.2 Create FL Admin Container
Select **Container Service** from the OneAI service list, enter the container service management page, and click **+ CREATE** to add a new container service.
1. **Basic Information**
Please enter the **name**, for example: **`federated-learning-admin`**, and a **description**; for the **image**, please select an execution environment that supports Python 3.8, for example: `clara:v4.0`.

2. **Hardware Settings**
Select the hardware specification; there is no need to configure a GPU.
3. **Storage Settings**
This step will mount the bucket that stores the installation package. Please click **ADD** and set the bucket to store the installation package.
* **workspace**: The bucket for storing the Admin installation package, please select `federated-learning-admin-startkit`.

4. **Network Settings**
No setup is required for this step.
5. **Variable Settings**
* **Environment variables**
The FL Server used in this example is publicly accessible, so no environment variables need to be set.
* **Command**
Please paste the following command:
```
echo "cd /workspace" >> /root/.bashrc &&git clone https://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@github.com/asus-ocis/fl-script.git /fl-script &&sleep 1 &&/bin/bash /fl-script/nvflare21.sh
```

:::info
:bulb:**Tips:**
If your FL Server is not public, you can have the system modify /etc/hosts automatically by setting the **`FLSERVER`** and **`FLSERVERIP`** environment variables.
* **`FLSERVER`**
Set your FL Server Domain Name (non-public) and the system will automatically modify /etc/hosts. For example **`myflserver.com`**. (must match the "sp_end_point" field of **fed_admin.json**).
* **`FLSERVERIP`**
Set the IP of FL Server, which needs to be used together with **`FLSERVER`**, such as `203.145.220.166`.
:::
6. **Review & Create**
Finally, confirm the entered information and click **CREATE**.
### 4.3 Start FL Admin
After the Admin container service is created, please enter the container service details page, click the **Terminal** tab above, and execute the **`./startup/fl_admin.sh`** command to start the FL Admin tool.
```=
# ./startup/fl_admin.sh
```
When the **`User Name:`** prompt appears on the screen, please enter **`admin@nvidia.com`** and press Enter to log in.
```=
# ./startup/fl_admin.sh
User Name: admin@nvidia.com
Waiting for token from successful login...
Got primary SP federated.oneai.twcc.ai:31767:31768 from overseer. Host: federated.oneai.twcc.ai Admin_port: 31768 SSID: ebc6125d-0a56-4688-9b08-355fe9e4d61a
Type ? to list commands; type "? cmdName" to show usage of a command.
>
```
### 4.4 Confirm Connection Status
Execute the **`check_status server`** command to query the status of the FL Server.

Execute the **`check_status client`** command to query the status of the FL Client. After confirming that the FL Server and FL Client are successfully connected, you can start preparing for federated learning training.

## 5. Perform Federated Learning Training
This section will explain how to prepare a training program and perform federated learning training.
### 5.1 Prepare Training Program
This tutorial will use the Hello PyTorch sample program provided by NVIDIA to perform federated learning training. Please download [**hello-pt**](https://github.com/NVIDIA/NVFlare/tree/2.1.3/examples/hello-pt) first, and modify the program as shown below to match the test environment of this example.
:::info
:bulb:**Tips:** For detailed custom code instructions, please refer to the [**NVIDIA FLARE documentation**](https://nvflare.readthedocs.io/en/main/user_guide/application.html?highlight=custom#custom-code).
:::
1. Please modify part of the file hello-pt/custom/cifar10trainer.py:
```python=
...
def __init__(
self,
data_path="/data", <--Please set "/data" here
...
```
2. Please modify part of the file hello-pt/custom/cifar10validator.py (an illustrative sketch of how data_path is used follows this list):
```python=
...
def __init__(self, data_path="/data", validate_task_name=AppConstants.TASK_VALIDATION): <--Please set data_path="/data" here
...
```
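For reference, the reason `/data` works is that the dataset bucket from section 3.2 is mounted at `/data` in the FL Client container, and the trainer passes `data_path` to torchvision when loading CIFAR-10. The snippet below is an illustrative sketch of that usage, not a verbatim excerpt from the sample:
```python=
# illustrative: how a data_path such as "/data" is typically consumed when loading CIFAR-10
from torchvision import datasets, transforms

train_dataset = datasets.CIFAR10(
    root="/data",                      # dataset bucket mounted into the FL Client container
    train=True,
    download=True,                     # downloads the dataset on first use
    transform=transforms.ToTensor(),
)
```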
After modification, please go to the **transfer** directory of the **`federated-learning-admin-startkit`** bucket of **Storage Service**, and upload **hello-pt** to the **transfer** directory, as shown in the following figure:

### 5.2 Run Training Jobs
Next, we can run training jobs through the submit_job command. Please execute **`submit_job hello-pt`** to start the training. After execution, you will get a training job number. Please write down the training job number for subsequent queries.

Once the training job is started, you can execute the **`check_status server`** and **`check_status client`** commands to check the status of the FL Server and FL Client. If the status is displayed as **`started`**, it means that the training job has started.


### 5.3 Check Training Status
After running training jobs for a period of time, you can execute the **`list_jobs`** command to obtain the status of all training jobs.

### 5.4 Get Training Results
When the training job is displayed as completed, you can execute the **`download_job <job_id>`** command to download the training job results.

The trained model (global model) can be obtained from the download path, for example: 6592e357-0b8b-400e-b04d-275349eb138c/workspace/app_server/FL_global_model.pt.
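Once downloaded, the checkpoint can be inspected with plain PyTorch. The sketch below only prints what it finds, because the exact contents depend on the model persistor used by the job (the path is the example path shown above):
```python=
# inspect the downloaded global model checkpoint with plain PyTorch
import torch

path = "6592e357-0b8b-400e-b04d-275349eb138c/workspace/app_server/FL_global_model.pt"
checkpoint = torch.load(path, map_location="cpu")

if isinstance(checkpoint, dict):
    # persistors commonly store the weights under a "model" entry, but inspect before assuming
    print("top-level keys:", list(checkpoint.keys()))
else:
    print("loaded object of type:", type(checkpoint))
```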