distributed-data-loading-petastorm (Python)


Distributed data loading with Petastorm for distributed training

Petastorm is an open source data access library. This library enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and datasets that are already loaded as Apache Spark DataFrames.

This example shows how to use Petastorm with TorchDistributor to train on ImageNet data with PyTorch Lightning.

Requirements

  • Databricks Runtime ML 13.0 and above
  • (Recommended) GPU instances
    Output (truncated): pip installs pytorch-lightning, pillow, and deltalake together with their dependencies. Successfully installed deltalake-0.9.0 lightning-utilities-0.8.0 pytorch-lightning-2.0.2 torchmetrics-0.11.4. Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.

    Set up the MLflow experiment

    The following cell captures the Databricks host and token for the notebook so you can reference them later in this guide. It also creates the MLflow experiment manually so that you can get its ID and send it to the worker nodes for scaling.
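    A minimal sketch of that setup, assuming a hypothetical experiment path and that spark and dbutils are available in the notebook; the exact cell in the notebook may differ:

```python
import mlflow

# Hypothetical experiment path; adjust to your workspace.
EXPERIMENT_PATH = "/Users/<your-user>/petastorm-torchdistributor"

# Capture the workspace host and API token on the driver so they can be
# forwarded to the worker processes, which cannot read the notebook context.
db_host = "https://" + spark.conf.get("spark.databricks.workspaceUrl")
db_token = dbutils.notebook.entry_point.getDbutils().notebook().getContext().apiToken().get()

# Create the experiment up front so its ID can be sent to the workers.
experiment = mlflow.get_experiment_by_name(EXPERIMENT_PATH)
experiment_id = (
    experiment.experiment_id
    if experiment is not None
    else mlflow.create_experiment(EXPERIMENT_PATH)
)
```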


    Load the dataset from a Delta table

    The ImageNet dataset is downloaded from the Kaggle challenge: ImageNet Object Localization Challenge. The data is processed and stored as a Delta table with the following schema.


    The training and validation datasets are stored in separate Delta tables. The content column, which contains images in binary format, and the object_id column are used to train the ResNet model.

    There are two ways to load a Delta table into Petastorm:

    • Load the Delta table into a Spark DataFrame and use Petastorm's make_spark_converter. The DataFrame is materialized at the start of training, which can take a considerable amount of time for large datasets.
    • Directly pass a list of Parquet files to Petastorm's make_batch_reader, and Petastorm loads the data directly from those Parquet files without materializing it into another cache location. This batch reader can then be passed to petastorm.pytorch.DataLoader to create data loaders that can be used in the LightningDataModule.

    This example uses the make_batch_reader approach, as sketched below.
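    A minimal sketch of that approach, assuming a hypothetical table location under /dbfs; the deltalake package installed above is used only to list the table's underlying Parquet files:

```python
from deltalake import DeltaTable
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

# Hypothetical location of the training Delta table on DBFS.
TRAIN_DELTA_PATH = "/dbfs/tmp/imagenet/train.delta"

# List the Parquet files that back the Delta table and turn them into file:// URLs.
train_table = DeltaTable(TRAIN_DELTA_PATH)
train_files = ["file://" + TRAIN_DELTA_PATH + "/" + f for f in train_table.files()]

# make_batch_reader streams batches straight from those Parquet files; no extra
# cache location is materialized. num_epochs=None is discussed further below.
with make_batch_reader(train_files, num_epochs=None) as reader:
    with DataLoader(reader, batch_size=32) as loader:
        first_batch = next(iter(loader))
```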


    Set up the model

    The following uses resnet50 from torchvision and wraps it in a pl.LightningModule.
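    A minimal sketch of such a module; the loss, batch keys (features, label), and optimizer below are illustrative assumptions rather than the notebook's exact code:

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torchvision import models

class ImageNetClassifier(pl.LightningModule):
    def __init__(self, num_classes: int = 1000, lr: float = 1e-3):
        super().__init__()
        self.lr = lr
        # Pretrained ResNet-50 backbone from torchvision, with a fresh classification head.
        self.model = models.resnet50(weights="DEFAULT")
        self.model.fc = torch.nn.Linear(self.model.fc.in_features, num_classes)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch["features"], batch["label"]  # hypothetical batch keys
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch["features"], batch["label"]
        self.log("val_loss", F.cross_entropy(self(x), y))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.lr)
```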


    DataLoader Class

    This class holds all the logic for processing and loading the dataset.

    The default value None for num_epochs in the make_batch_reader function is used to generate an infinite stream of data batches, which avoids handling the last, likely incomplete, batch. This is especially important for distributed training, where the number of data records seen by every worker must be identical per step. Because the data shards may not all be the same length, setting num_epochs to any specific number would break that guarantee and could result in an error. Although this matters less when training on a single device, it changes how epochs are controlled: with an infinite dataset, training would otherwise run indefinitely within a single epoch unless another mechanism (for example, the Trainer's limit_train_batches) bounds the epoch duration.

    Using the default value num_epochs=None is also important for the validation process. At the time this notebook was developed, the PyTorch Lightning Trainer runs a sanity check before training unless instructed otherwise. That check initializes the validation data loader and reads num_sanity_val_steps batches from it before the first training epoch. The validation dataset is not reloaded for the actual validation phase of the first epoch, which results in an error. To work around this, skip the check by setting num_sanity_val_steps=0 and use the limit_val_batches parameter of the Trainer class to avoid infinitely running validation.
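    A minimal sketch of such a data module, assuming the hypothetical file lists built earlier and that the module is attached to a Trainer (so self.trainer is available for sharding):

```python
import pytorch_lightning as pl
from petastorm import make_batch_reader
from petastorm.pytorch import DataLoader

class PetastormDataModule(pl.LightningDataModule):
    def __init__(self, train_files, val_files, batch_size=64, workers_count=2):
        super().__init__()
        self.train_files = train_files
        self.val_files = val_files
        self.batch_size = batch_size
        self.workers_count = workers_count

    def _reader(self, files):
        # num_epochs=None yields an infinite stream of batches (see the notes above);
        # cur_shard/shard_count shard the data across the training processes.
        return make_batch_reader(
            files,
            num_epochs=None,
            workers_count=self.workers_count,
            cur_shard=self.trainer.global_rank,
            shard_count=self.trainer.world_size,
        )

    def train_dataloader(self):
        return DataLoader(self._reader(self.train_files), batch_size=self.batch_size)

    def val_dataloader(self):
        return DataLoader(self._reader(self.val_files), batch_size=self.batch_size)
```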


    Create the training function

    The TorchDistributor API has support for single node multi-GPU training as well as multi-node training. The following pl_train function takes the parameters num_tasks and num_proc_per_task.

    For additional clarity:

    • num_tasks (which sets pl.Trainer(num_nodes=num_tasks, **kwargs)) is the number of Spark tasks you want for distributed training.
    • num_proc_per_task (which sets pl.Trainer(devices=num_proc_per_task, **kwargs)) is the number of devices/GPUs you want per Spark task for distributed training.

    If you are running single node multi-GPU training on the driver, set num_tasks to 1 and num_proc_per_task to the number of GPUs that you want to use on the driver.

    If you are running multi-node training, set num_tasks to the number of Spark tasks you want to use and num_proc_per_task to the value of spark.task.resource.gpu.amount (which is usually 1).

    Therefore, the total number of GPUs used is num_tasks * num_proc_per_task.

    Petastorm uses the device ID and device count passed from the main training loop to shard the data for multi-GPU training. It is crucial to specify appropriate values for Petastorm arguments such as workers_count, reader_pool, and result_queue_size to prevent out-of-memory (OOM) errors. For instance, result_queue_size determines the number of row groups loaded into the queue; if the Parquet row groups are large, setting result_queue_size to a high number can easily lead to OOM. Consider a row group with 1000 rows and a row size of 0.1 MB: with the default result_queue_size (50) and workers_count (10), this results in 50 GB of data in memory (10 workers × 50 row groups in the result queue × 1000 rows per row group × 0.1 MB per row).
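    A minimal sketch of such a training function, built on the hypothetical ImageNetClassifier and PetastormDataModule sketched above; the Trainer arguments mirror the points discussed in this section, and the batch limits match the counts seen in the output below:

```python
import pytorch_lightning as pl

def pl_train(num_tasks, num_proc_per_task, max_epochs=2):
    model = ImageNetClassifier()
    datamodule = PetastormDataModule(train_files, val_files)  # hypothetical file lists

    trainer = pl.Trainer(
        accelerator="gpu",
        devices=num_proc_per_task,   # GPUs per Spark task
        num_nodes=num_tasks,         # number of Spark tasks
        strategy="ddp",              # the notebook picks the strategy dynamically; "ddp" covers the distributed case
        max_epochs=max_epochs,
        limit_train_batches=377,     # bound each epoch of the infinite Petastorm stream
        limit_val_batches=14,        # avoid infinitely running validation
        num_sanity_val_steps=0,      # skip the pre-training sanity check
        default_root_dir="/tmp/petastorm_example",  # hypothetical checkpoint location
    )
    trainer.fit(model, datamodule=datamodule)
```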


    Train the model locally with 1 GPU

    Note that nnodes = 1 and nproc_per_node = 1.
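    A minimal sketch of that call, assuming the hypothetical pl_train function above is run directly on the driver:

```python
# Single GPU on the driver: one task with one process.
pl_train(num_tasks=1, num_proc_per_task=1)
```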


    Output (truncated): the pretrained ResNet-50 weights (97.8 MB) are downloaded, the Trainer initializes on a single CUDA device with the ResNet model (25.6 M trainable params), epoch 0 starts, and the run is cancelled.

    Single node multi-GPU setup

    For the distributor API, you want to set num_processes to the total number of GPUs that you plan on using. For single-node multi-GPU training, this is limited by the number of GPUs available on the driver node.

    As mentioned before, a single-node multi-GPU setup (with NUM_PROC GPUs) involves setting trainer = pl.Trainer(accelerator='gpu', devices=NUM_PROC, num_nodes=1, **kwargs).
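    A minimal sketch of that call, assuming NUM_PROC GPUs on the driver and the hypothetical pl_train function above:

```python
from pyspark.ml.torch.distributor import TorchDistributor

NUM_PROC = 4  # hypothetical number of GPUs on the driver

# local_mode=True runs all processes on the driver node.
TorchDistributor(
    num_processes=NUM_PROC,
    local_mode=True,
    use_gpu=True,
).run(pl_train, 1, NUM_PROC)
```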


    INFO:TorchDistributor:Started local training with 4 processes
    Cancelled

    Multi-node setup

    For the distributor API, you want to set num_processes to the total number of GPUs that you plan on using. For multi-node training, this is equal to num_spark_tasks * num_gpus_per_spark_task. Additionally, note that num_gpus_per_spark_task usually equals 1 unless you configure that value specifically.

    A multi-node setup with num_proc GPUs involves setting trainer = pl.Trainer(accelerator='gpu', devices=1, num_nodes=num_proc, **kwargs).
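    A minimal sketch of the multi-node call, assuming 16 Spark tasks with one GPU each (matching the output below) and the hypothetical pl_train function above:

```python
from pyspark.ml.torch.distributor import TorchDistributor

NUM_TASKS = 16          # number of Spark tasks
NUM_PROC_PER_TASK = 1   # usually spark.task.resource.gpu.amount
NUM_PROC = NUM_TASKS * NUM_PROC_PER_TASK

# local_mode=False distributes the processes across the executors.
TorchDistributor(
    num_processes=NUM_PROC,
    local_mode=False,
    use_gpu=True,
).run(pl_train, NUM_TASKS, NUM_PROC_PER_TASK)
```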


    Output (truncated): INFO:TorchDistributor:Started distributed training with 16 executor processes. After repeated PkgResourcesDeprecationWarning and TensorFlow oneDNN startup messages, each of the 16 processes downloads the pretrained ResNet-50 weights, joins the NCCL process group (GLOBAL_RANK 0-15, one GPU each, DDPStrategy), and trains two epochs of 377 training batches plus 14 validation batches (about 15 minutes per epoch). `Trainer.fit` stopped: `max_epochs=2` reached. INFO:TorchDistributor:Finished distributed training with 16 executor processes.