Distributed Training using PyTorch Lightning

PyTorch Lightning provides a simplified API for training machine learning models with PyTorch without the usual boilerplate code. Coupled with PySpark's TorchDistributor, you can launch distributed training tasks using a Spark job in barrier mode. You only need to provide a train() function that runs the single-node training code on a GPU or worker node; the package handles all of the configuration for you.

PyTorch Lightning does not come prebundled with Databricks and needs to be installed. You can install it as a notebook-scoped library if you are just testing on a single node, but it must be installed as a cluster library in order to use it across a cluster.
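
For example, a notebook-scoped install can be done with a %pip magic at the top of the notebook (for cluster-wide use, attach pytorch_lightning as a cluster library instead):

%pip install pytorch_lightning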

Requirements

  • Databricks Runtime ML 13.0 and above
  • (Recommended) GPU instances

Set up the model

The following creates an AutoEncoder using the pl.LightningModule API.
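
A minimal sketch of what such a module could look like, assuming MNIST-style 28x28 inputs; the class name, layer sizes, and learning rate are illustrative, not the notebook's exact code:

import torch
from torch import nn
import torch.nn.functional as F
import pytorch_lightning as pl


class LitAutoEncoder(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # Encoder/decoder over flattened 28x28 images (sizes are illustrative).
        self.encoder = nn.Sequential(nn.Linear(28 * 28, 64), nn.ReLU(), nn.Linear(64, 3))
        self.decoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 28 * 28))

    def training_step(self, batch, batch_idx):
        # Reconstruct the input and minimize the MSE reconstruction loss.
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_hat = self.decoder(self.encoder(x))
        return F.mse_loss(x_hat, x)

    def test_step(self, batch, batch_idx):
        x, _ = batch
        x = x.view(x.size(0), -1)
        x_hat = self.decoder(self.encoder(x))
        self.log("test_loss", F.mse_loss(x_hat, x))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)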

Set up the DataModule
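
The DataModule wraps dataset preparation and the dataloaders. A minimal sketch, assuming MNIST from torchvision; the class name, data path, batch size, and split are illustrative:

import pytorch_lightning as pl
from torch.utils.data import DataLoader, random_split
from torchvision import transforms
from torchvision.datasets import MNIST


class MNISTDataModule(pl.LightningDataModule):
    def __init__(self, data_dir: str = "/tmp/mnist", batch_size: int = 64):
        super().__init__()
        self.data_dir = data_dir
        self.batch_size = batch_size

    def prepare_data(self):
        # Download once per node.
        MNIST(self.data_dir, train=True, download=True)
        MNIST(self.data_dir, train=False, download=True)

    def setup(self, stage=None):
        transform = transforms.ToTensor()
        full = MNIST(self.data_dir, train=True, transform=transform)
        self.train_set, self.val_set = random_split(full, [55000, 5000])
        self.test_set = MNIST(self.data_dir, train=False, transform=transform)

    def train_dataloader(self):
        return DataLoader(self.train_set, batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self.val_set, batch_size=self.batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_set, batch_size=self.batch_size)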

Create the training function

The TorchDistributor API supports single-node multi-GPU training as well as multi-node training. The following pl_train function takes the parameters num_tasks and num_proc_per_task.

For additional clarity:

  • num_tasks (which sets pl.Trainer(num_nodes=num_tasks, **kwargs)) is the number of Spark tasks you want for distributed training.
  • num_proc_per_task (which sets pl.Trainer(devices=num_proc_per_task, **kwargs)) is the number of devices/GPUs you want per Spark task for distributed training.

If you are running single-node multi-GPU training on the driver, set num_tasks to 1 and num_proc_per_task to the number of GPUs that you want to use on the driver.

If you are running multi-node training, set num_tasks to the number of Spark tasks you want to use and num_proc_per_task to the value of spark.task.resource.gpu.amount (which is usually 1).

Therefore, the total number of GPUs used is num_tasks * num_proc_per_task.
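
A sketch of such a pl_train function, reusing the LitAutoEncoder and MNISTDataModule sketches above (the notebook's actual body is not shown here; max_epochs=14 matches the runs below, everything else is illustrative):

import pytorch_lightning as pl


def pl_train(num_tasks: int, num_proc_per_task: int):
    # This is plain single-node training code; TorchDistributor (or a direct
    # call on the driver) is responsible for launching it on each process.
    model = LitAutoEncoder()
    dm = MNISTDataModule()
    trainer = pl.Trainer(
        accelerator="gpu",
        devices=num_proc_per_task,  # GPUs per Spark task
        num_nodes=num_tasks,        # number of Spark tasks
        max_epochs=14,
        # Lightning selects a DDP strategy automatically when more than one
        # device or node is requested; pass strategy="ddp" explicitly if needed.
    )
    trainer.fit(model, datamodule=dm)
    trainer.test(model, datamodule=dm)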

Model training and testing

Train the model locally with 1 GPU

Note that nnodes = 1 and nproc_per_node = 1.
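
A sketch of this local run, calling the training function directly on the driver without TorchDistributor (parameter names follow the pl_train sketch above):

# Single-GPU run on the driver: one node, one process per node.
nnodes = 1
nproc_per_node = 1
pl_train(num_tasks=nnodes, num_proc_per_task=nproc_per_node)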

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 50.4 K
1 | decoder | Sequential | 51.2 K
---------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Training: 0it [00:00, ?it/s]
`Trainer.fit` stopped: `max_epochs=14` reached.
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, test_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Testing: 0it [00:00, ?it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
        test_loss           0.404929518699646
────────────────────────────────────────────────

Single node multi-GPU setup

For the distributor API, you want to set num_processes to the total number of GPUs that you plan on using. For single-node multi-GPU training, this is limited by the number of GPUs available on the driver node.

As mentioned before, the single-node multi-GPU setup (with NUM_PROC GPUs) involves setting trainer = pl.Trainer(accelerator='gpu', devices=NUM_PROC, num_nodes=1, **kwargs).
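
A sketch of launching this through TorchDistributor in local mode; the argument order follows the pl_train sketch above, and NUM_PROC = 1 matches the run shown below:

from pyspark.ml.torch.distributor import TorchDistributor

NUM_PROC = 1  # number of driver GPUs to use

# local_mode=True runs all processes on the driver node.
TorchDistributor(num_processes=NUM_PROC, local_mode=True, use_gpu=True).run(
    pl_train, 1, NUM_PROC
)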

INFO:TorchDistributor:Started local training with 1 processes
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
2023-03-23 17:23:37.603507: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:23:37.740776: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2]

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 50.4 K
1 | decoder | Sequential | 51.2 K
---------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0: 16%|█▌ | 156/1000 [00:02<00:12, 69.75it/s, v_num=2e2f]
*** WARNING: max output size exceeded, skipping output. ***
Epoch 13: 100%|██████████| 1000/1000 [00:08<00:00, 111.22it/s, v_num=2e2f]
`Trainer.fit` stopped: `max_epochs=14` reached.
Epoch 13: 100%|██████████| 1000/1000 [00:09<00:00, 106.58it/s, v_num=2e2f]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [2]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, test_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 48 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
INFO:TorchDistributor:Finished local training with 1 processes
Testing DataLoader 0: 100%|██████████| 313/313 [00:01<00:00, 187.31it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
        test_loss           0.4028000831604004
────────────────────────────────────────────────

Multi-node setup

For the distributor API, you want to set num_processes to the total number of GPUs that you plan on using. For multi-node training, this equals num_spark_tasks * num_gpus_per_spark_task. Additionally, note that num_gpus_per_spark_task usually equals 1 unless you configure that value specifically.

Note that the multi-node setup (with num_proc GPUs) involves setting trainer = pl.Trainer(accelerator='gpu', devices=1, num_nodes=num_proc, **kwargs).
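
A sketch of the multi-node launch; num_spark_tasks = 2 matches the two-executor run shown below, and the pl_train arguments follow the sketch above:

from pyspark.ml.torch.distributor import TorchDistributor

num_spark_tasks = 2          # Spark tasks (one per executor here)
num_gpus_per_spark_task = 1  # spark.task.resource.gpu.amount, usually 1
num_proc = num_spark_tasks * num_gpus_per_spark_task

# local_mode=False distributes the processes across the cluster's workers.
TorchDistributor(num_processes=num_proc, local_mode=False, use_gpu=True).run(
    pl_train, num_spark_tasks, num_gpus_per_spark_task
)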

INFO:TorchDistributor:Started distributed training with 2 executor proceses
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
2023-03-23 17:26:51.508311: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:26:51.588113: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F AVX512_VNNI FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-03-23 17:26:51.647188: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-03-23 17:26:51.739496: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:70: UserWarning: You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.
  rank_zero_warn("You passed in a `val_dataloader` but have no `validation_step`. Skipping val loop.")
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
/databricks/python/lib/python3.10/site-packages/pkg_resources/__init__.py:123: PkgResourcesDeprecationWarning: 1.1build1 is an invalid version and will not be supported in a future release
  warnings.warn(
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes
----------------------------------------------------------------------------------------------------
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name    | Type       | Params
---------------------------------------
0 | encoder | Sequential | 50.4 K
1 | decoder | Sequential | 51.2 K
---------------------------------------
101 K     Trainable params
0         Non-trainable params
101 K     Total params
0.407     Total estimated model params size (MB)
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, train_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
Epoch 0: 3%|▎ | 28/860 [00:02<01:13, 11....(truncated)31d1]
Epoch 13: 100%|██████████| 860/860 [00:09<00:00, 87.66it/s, v_num=31d1]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:225: PossibleUserWarning: Using `DistributedSampler` with the dataloaders. During `trainer.test()`, it is recommended to use `Trainer(devices=1, num_nodes=1)` to ensure each sample/batch gets evaluated exactly once. Otherwise, multi-device settings use `DistributedSampler` that replicates some samples to make sure all devices have same batch size in case of uneven inputs.
  rank_zero_warn(
/local_disk0/.ephemeral_nfs/envs/pythonEnv-450eb645-ead0-4b5d-b187-49d2ea4b7cc9/lib/python3.10/site-packages/pytorch_lightning/trainer/connectors/data_connector.py:430: PossibleUserWarning: The dataloader, test_dataloader, does not have many workers which may be a bottleneck. Consider increasing the value of the `num_workers` argument` (try 4 which is the number of cpus on this machine) in the `DataLoader` init to improve performance.
  rank_zero_warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Testing DataLoader 0: 34%|███▍ | 54/157 [00:00<...(truncated)t/s]
  warning_cache.warn(
Testing DataLoader 0: 100%|██████████| 157/157 [00:01<00:00, 147.78it/s]
────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────
        test_loss           0.39914393424987793
────────────────────────────────────────────────
INFO:TorchDistributor:Finished distributed training with 2 executor proceses