Distributed Training using PyTorch Lightning

PyTorch Lightning provides a simplified API for training machine learning models with PyTorch without the usual boilerplate code. Coupled with PySpark's TorchDistributor, you can launch distributed training tasks using a Spark job in barrier mode. You only need to provide a train() function that runs the single-node training code on a GPU or worker node; the package handles all of the configuration for you.

PyTorch Lightning does not come prebundled with Databricks and needs to be installed. You can install it as a notebook-scoped library if you are just testing on a single node, but it must be installed as a cluster library in order to use it across a cluster.
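
For example, a notebook-scoped install can be done with a %pip magic at the top of the notebook (for cluster-wide use, attach pytorch_lightning as a cluster library instead):

%pip install pytorch_lightning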

Requirements

  • Databricks Runtime ML 13.0 and above
  • (Recommended) GPU instances

Set up the model

The following creates an AutoEncoder using the pl.LightningModule API.
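
A minimal sketch of what such a module could look like, assuming MNIST-style 28x28 inputs; the class name, layer sizes, and learning rate are illustrative, not the notebook's exact code:

import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Encoder/decoder over flattened 28x28 images (sizes are illustrative).
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        # Reconstruct the input and minimize the MSE reconstruction loss.
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_hat = self.decoder(self.encoder(x))
        return F.mse_loss(x_hat, x)

    def test_step(self, batch, batch_idx):
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_hat = self.decoder(self.encoder(x))
        self.log("test_loss", F.mse_loss(x_hat, x))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)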

Set up the DataModule
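
The DataModule wraps dataset preparation and the dataloaders. A minimal sketch, assuming MNIST from torchvision; the class name, data path, batch size, and split are illustrative:

import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = "/tmp/mnist", batch_size: int = 64):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

    def prepare_data(self):
        # Download once per node.
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        transform = transforms.ToTensor()
        full = MNIST(self.data_dir, train=True, transform=transform)
        self.train_set, self.val_set = random_split(full, [55000, 5000])
        self.test_set = MNIST(self.data_dir, train=False, transform=transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=self.batch_size)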

Create the training function

The TorchDistributor API supports single-node multi-GPU training as well as multi-node training. The following pl_train function takes the parameters num_tasks and num_proc_per_task.

For additional clarity:

  • num_tasks (which sets pl.Trainer(num_nodes=num_tasks, **kwargs)) is the number of Spark tasks you want for distributed training.
  • num_proc_per_task (which sets pl.Trainer(devices=num_proc_per_task, **kwargs)) is the number of devices/GPUs you want per Spark task for distributed training.

If you are running single-node multi-GPU training on the driver, set num_tasks to 1 and num_proc_per_task to the number of GPUs that you want to use on the driver.

If you are running multi-node training, set num_tasks to the number of Spark tasks you want to use and num_proc_per_task to the value of spark.task.resource.gpu.amount (which is usually 1).

Therefore, the total number of GPUs used is num_tasks * num_proc_per_task.
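
A sketch of such a pl_train function, reusing the LitAutoEncoder and MNISTDataModule sketches above (the notebook's actual body is not shown here; max_epochs=14 matches the runs below, everything else is illustrative):

import pytorch_lightning as pl


def pl_train(num_tasks: int, num_proc_per_task: int):
    # This is plain single-node training code; TorchDistributor (or a direct
    # call on the driver) is responsible for launching it on each process.
    model = LitAutoEncoder()
    dm = MNISTDataModule()
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=num_proc_per_task,  # GPUs per Spark task
        num_nodes=num_tasks,        # number of Spark tasks
        max_epochs=14,
        # Lightning selects a DDP strategy automatically when more than one
        # device or node is requested; pass strategy="ddp" explicitly if needed.
    )
    trainer.fit(model, datamodule=dm)
    trainer.test(model, datamodule=dm)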

Model training and testing

Train the model locally with 1 GPU

Note that nnodes = 1 and nproc_per_node = 1.
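
A sketch of this local run, calling the training function directly on the driver without TorchDistributor (parameter names follow the pl_train sketch above):

# Single-GPU run on the driver: one node, one process per node.
nnodes = 1
nproc_per_node = 1
pl_train(num_tasks=nnodes, num_proc_per_task=nproc_per_node)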

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 50.4 K
1 | decoder | Sequential | 51.2 K
---------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Training: 0it [00:00, ?it/s]
`Trainer.fit` stopped: `max_epochs=14` reached.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, test_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Testing: 0it [00:00, ?it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
        test_loss           0.404929518699646
────────────────────────────────────────────────

Single node multi-GPU setup

For the distributor API, you want to set num_processes to the total number of GPUs that you plan on using. For single-node multi-GPU training, this is limited by the number of GPUs available on the driver node.

As mentioned before, the single-node multi-GPU setup (with NUM_PROC GPUs) involves setting trainer = pl.Trainer(accelerator='gpu', devices=NUM_PROC, num_nodes=1, **kwargs).
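
A sketch of launching this through TorchDistributor in local mode; the argument order follows the pl_train sketch above, and NUM_PROC = 1 matches the run shown below:

from pyspark.ml.torch.distributor import TorchDistributor

NUM_PROC = 1  # number of driver GPUs to use

# local_mode=True runs all processes on the driver node.
TorchDistributor(num_processes=NUM_PROC, local_mode=True, use_gpu=True).run(
    pl_train, 1, NUM_PROC
)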

INFO:TorchDistributor:Started local training with 1 processes
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
2023-03-23 17:23:37.603507: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:23:37.740776: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2]

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 50.4 K
1 | decoder | Sequential | 51.2 K
---------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0: 16%|█▌ | 156/1000 [00:02<00:12, 69.75it/s, v_num=2e2f]
*** WARNING: max output size exceeded, skipping output. ***
Epoch 13: 100%|██████████| 1000/1000 [00:08<00:00, 111.22it/s, v_num=2e2f]
`Trainer.fit` stopped: `max_epochs=14` reached.
Epoch 13: 100%|██████████| 1000/1000 [00:09<00:00, 106.58it/s, v_num=2e2f]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, test_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
INFO:TorchDistributor:Finished local training with 1 processes
Testing DataLoader 0: 100%|██████████| 313/313 [00:01<00:00, 187.31it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
        test_loss           0.4028000831604004
────────────────────────────────────────────────

Multi-node setup

For the distributor API, you want to set num_processes to the total number of GPUs that you plan on using. For multi-node training, this equals num_spark_tasks * num_gpus_per_spark_task. Additionally, note that num_gpus_per_spark_task usually equals 1 unless you configure that value specifically.

Note that the multi-node setup (with num_proc GPUs) involves setting trainer = pl.Trainer(accelerator='gpu', devices=1, num_nodes=num_proc, **kwargs).
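
A sketch of the multi-node launch; num_spark_tasks = 2 matches the two-executor run shown below, and the pl_train arguments follow the sketch above:

from pyspark.ml.torch.distributor import TorchDistributor

num_spark_tasks = 2          # Spark tasks (one per executor here)
num_gpus_per_spark_task = 1  # spark.task.resource.gpu.amount, usually 1
num_proc = num_spark_tasks * num_gpus_per_spark_task

# local_mode=False distributes the processes across the cluster's workers.
TorchDistributor(num_processes=num_proc, local_mode=False, use_gpu=True).run(
    pl_train, num_spark_tasks, num_gpus_per_spark_task
)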

INFO:TorchDistributor:Started distributed training with 2 executor proceses
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
2023-03-23 17:26:51.508311: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:26:51.588113: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:26:51.647188: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-03-23 17:26:51.739496: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 50.4 K
1 | decoder | Sequential | 51.2 K
---------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0: 3%|▎ | 28/860 [00:02<01:13, 11....(truncated)31d1]
Epoch 13: 100%|██████████| 860/860 [00:09<00:00, 87.66it/s, v_num=31d1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:225: PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`, it is recommended to use `Trainer(devices=1, num_nodes=1)` to ensure each sample/batch gets evaluated exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some samples to make sure all devices have same batch size in case of uneven inputs.
  rank_zero_warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, test_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing DataLoader 0: 34%|███▍ | 54/157 [00:00<...(truncated)t/s]
  warning_cache.warn(
Testing DataLoader 0: 100%|██████████| 157/157 [00:01<00:00, 147.78it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
        test_loss           0.39914393424987793
────────────────────────────────────────────────
INFO:TorchDistributor:Finished distributed training with 2 executor proceses